When it comes to building a data warehouse on the Hadoop ecosystem, there are a handful of open source frameworks available. Hive and Impala are the most widely used for building a data warehouse on the Hadoop framework. Hive was developed by Facebook and Impala by Cloudera.
In this article, we will explain Apache Hive performance tuning best practices and the steps to follow to achieve high performance.
Apache Hive Performance Tuning Best Practices
You can adopt a number of steps to tune performance in Hive, including better schema design, the right file format, using a proper execution engine, and so on.
And of course, this list is not exhaustive. Feel free to comment with any other methods you know.
- Partitioning and Bucketing of Hive Tables
- Choose Appropriate File Format for the Data
- Avoid Loading Too Many Small Files
- Appropriate Schema Design
- Performance Considerations for Join Operations on Hive Tables
- Use Tez as an Execution Engine instead of MapReduce
- Cost-Based Optimization
- Use Vectorization
Partitioning and Bucketing of Hive Tables
Partitioning physically divides a Hive table's data based on the distinct values of frequently queried columns. This technique allows queries to skip reading a large percentage of the data in a table, reducing I/O and speeding up overall performance. Bucketing further decomposes each partition into a fixed number of buckets based on the hash of a column, which helps with sampling and efficient map-side joins.
You should partition the table on columns that are frequently used in WHERE conditions. Hive also supports dynamic partitioning.
You must set the Hive property below to enable dynamic partitioning.
set hive.exec.dynamic.partition.mode=nonstrict;
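As a minimal sketch (the sales and sales_staging table names are hypothetical), the statements below create a table that is both partitioned and bucketed, and then load it with a dynamic-partition insert:
-- Hypothetical example: partition by sale date, bucket by customer id.
CREATE TABLE sales (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;

-- Dynamic-partition insert: Hive derives sale_date from the last
-- column of the SELECT list.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE sales PARTITION (sale_date)
SELECT order_id, customer_id, amount, sale_date
FROM sales_staging;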
Choose Appropriate File Format for the Data
Choosing the right file format will improve Hive performance. Hive supports ORCFile, a table storage format that delivers significant speed improvements through techniques such as predicate push-down and compression. ORC files improve response times for your Hive queries.
You can also use the Parquet file format, which is well suited for data warehouse tables.
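For example, a table can be stored as ORC at creation time. This is a minimal sketch; the table name and the ZLIB compression choice are illustrative:
-- Create a table stored as ORC with ZLIB compression.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');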
Avoid Loading Too Many Small Files
Hadoop works best with large files, and this applies to Hive as well. You should avoid any ingestion process that produces a large number of small files. When producing files outside Hive, use the text file format. Once the data is in Hive tables, you can convert it to the ORC or Parquet file format.
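A common pattern, sketched below with hypothetical table names and a hypothetical HDFS path, is to land raw text files in a staging table and then rewrite them into a smaller number of ORC files; for existing ORC tables, ALTER TABLE ... CONCATENATE merges small files in place:
-- Staging table over the raw text files.
CREATE EXTERNAL TABLE logs_staging (
  event_time STRING,
  message    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/logs';

-- Rewrite into ORC; the insert consolidates the data into fewer, larger files.
CREATE TABLE logs STORED AS ORC AS
SELECT * FROM logs_staging;

-- Merge small ORC files of an existing table in place.
ALTER TABLE logs CONCATENATE;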
Appropriate Schema Design
Schema design is one of the most important aspects of improving performance. Choose partition and bucket columns based on your query patterns, and consider denormalizing tables to reduce the number of joins required at query time.
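As an illustrative sketch (the table and column names are hypothetical), a denormalized design stores customer attributes alongside each order, trading extra storage for fewer joins at query time:
-- Denormalized orders table: customer attributes are duplicated per order.
CREATE TABLE orders_denorm (
  order_id      BIGINT,
  amount        DECIMAL(10,2),
  customer_id   BIGINT,
  customer_name STRING,
  customer_city STRING
)
PARTITIONED BY (order_date STRING)
STORED AS ORC;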
Performance Considerations for Join Operations on Hive Tables
Joins are an important aspect of SQL queries. Avoid correlated subqueries and inline views. Create temporary tables and use inner joins wherever possible. You can also use the Hive WITH clause (a common table expression) instead of a temporary table, as shown in the sketch below.
You should also avoid any query that leads to a Cartesian join (cross product).
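Here is a minimal sketch (the orders and customers tables and the date literal are hypothetical) that pre-aggregates in a WITH clause and then uses an inner join:
-- Pre-aggregate in a CTE instead of a correlated subquery.
WITH recent_orders AS (
  SELECT customer_id, SUM(amount) AS total_amount
  FROM orders
  WHERE order_date >= '2021-01-01'
  GROUP BY customer_id
)
SELECT c.customer_name, r.total_amount
FROM customers c
INNER JOIN recent_orders r
  ON c.customer_id = r.customer_id;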
Use Tez as an Execution Engine instead of MapReduce
Hive can use the Apache Tez execution engine instead of the venerable MapReduce engine. Tez executes a query as a single DAG of tasks and avoids the intermediate HDFS writes that a chain of MapReduce jobs incurs. Enable the Tez engine with the statement below:
set hive.execution.engine=tez;
Cost-Based Optimization
Recent releases of Hive support cost-based optimization. The optimizer uses table and column statistics to optimize the logical and physical execution plans.
To use the cost-based optimizer (CBO), set the following Hive variables:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
Once you have set the above variables, use the ANALYZE command on a table to collect statistics. The optimizer uses these statistics to create an optimal execution plan.
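For example (the sales table and its sale_date partition column are hypothetical), the following collects partition-level and column-level statistics:
-- Gather statistics for all partitions of the table.
ANALYZE TABLE sales PARTITION (sale_date) COMPUTE STATISTICS;

-- Gather column-level statistics for the CBO.
ANALYZE TABLE sales PARTITION (sale_date) COMPUTE STATISTICS FOR COLUMNS;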
Use Vectorization
Vectorization improves the performance of HiveQL operations such as scans, aggregations, filters, and joins by performing them in batches of 1,024 rows at a time instead of one row at a time. Note that, depending on your Hive version, vectorized execution may require the data to be stored in the ORC file format.
Set the following variables to enable it:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;