When it comes to building a data warehouse on the Hadoop ecosystem, there are a handful of open source frameworks available. Hive and Impala are the most widely used for building a data warehouse on the Hadoop framework. Hive was developed by Facebook and Impala by Cloudera.
In this article, we will explain Apache Hive performance tuning best practices and the steps to follow to achieve high performance.
Apache Hive Performance Tuning Best Practices
You can adopt a number of steps to tune performance in Hive, including better schema design, the right file format, using a proper execution engine, and so on.
And of course, this list is not exhaustive. Feel free to comment with any other methods you know.
- Partitioning and Bucketing of Hive Tables
- Choose Appropriate File Format for the Data
- Avoid Loading Too Many Small Files
- Appropriate Schema Design
- Performance Considerations for Join Operations on Hive Tables
- Use Tez as an Execution Engine instead of MapReduce
- Cost-Based Optimization
- Use Vectorization
Partitioning and Bucketing of Hive Tables
Partitioning physically divides a Hive table's data based on the distinct values of frequently queried columns. This technique allows queries to skip reading a large percentage of the data in a table, reducing I/O and speeding up overall performance. Bucketing further decomposes each partition into a fixed number of buckets based on the hash of a column, which helps with sampling and efficient map-side joins.
You should partition the table on columns that are frequently used in WHERE conditions. Hive also supports dynamic partitioning.
You must set the Hive property below to enable dynamic partitioning.
set hive.exec.dynamic.partition.mode=nonstrict;
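As a minimal sketch (the sales and sales_staging table names are hypothetical), the statements below create a table that is both partitioned and bucketed, and then load it with a dynamic-partition insert:
-- Hypothetical example: partition by sale date, bucket by customer id.
CREATE TABLE sales (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;

-- Dynamic-partition insert: Hive derives sale_date from the last
-- column of the SELECT list.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE sales PARTITION (sale_date)
SELECT order_id, customer_id, amount, sale_date
FROM sales_staging;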
Choose Appropriate File Format for the Data
Choosing the right file format will improve Hive performance. Hive supports ORCFile, a table storage format that delivers significant speed improvements through techniques such as predicate push-down and compression. ORC files improve response times for your Hive queries.
You can also use the Parquet file format, which is well suited for data warehouse tables.
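For example, a table can be stored as ORC at creation time. This is a minimal sketch; the table name and the ZLIB compression choice are illustrative:
-- Create a table stored as ORC with ZLIB compression.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');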
Avoid Loading Too Many Small Files
Hadoop works best with large files, and this applies to Hive as well. You should avoid any ingestion process that produces a large number of small files. When producing files outside Hive, use the text file format. Once the data is in Hive tables, you can convert it to the ORC or Parquet file format.
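A common pattern, sketched below with hypothetical table names and a hypothetical HDFS path, is to land raw text files in a staging table and then rewrite them into a smaller number of ORC files; for existing ORC tables, ALTER TABLE ... CONCATENATE merges small files in place:
-- Staging table over the raw text files.
CREATE EXTERNAL TABLE logs_staging (
  event_time STRING,
  message    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/logs';

-- Rewrite into ORC; the insert consolidates the data into fewer, larger files.
CREATE TABLE logs STORED AS ORC AS
SELECT * FROM logs_staging;

-- Merge small ORC files of an existing table in place.
ALTER TABLE logs CONCATENATE;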
Appropriate Schema Design
Schema design is one of the most important aspects of improving performance. Choose partition and bucket columns based on your query patterns, and consider denormalizing tables to reduce the number of joins required at query time.
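As an illustrative sketch (the table and column names are hypothetical), a denormalized design stores customer attributes alongside each order, trading extra storage for fewer joins at query time:
-- Denormalized orders table: customer attributes are duplicated per order.
CREATE TABLE orders_denorm (
  order_id      BIGINT,
  amount        DECIMAL(10,2),
  customer_id   BIGINT,
  customer_name STRING,
  customer_city STRING
)
PARTITIONED BY (order_date STRING)
STORED AS ORC;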
Performance Considerations for Join Operations on Hive Tables
Joins are an important aspect of SQL queries. Avoid correlated subqueries and inline views. Create temporary tables and use inner joins wherever possible. You can also use the Hive WITH clause (a common table expression) instead of a temporary table, as shown in the sketch below.
You should also avoid any query that leads to a Cartesian join (cross product).
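Here is a minimal sketch (the orders and customers tables and the date literal are hypothetical) that pre-aggregates in a WITH clause and then uses an inner join:
-- Pre-aggregate in a CTE instead of a correlated subquery.
WITH recent_orders AS (
  SELECT customer_id, SUM(amount) AS total_amount
  FROM orders
  WHERE order_date >= '2021-01-01'
  GROUP BY customer_id
)
SELECT c.customer_name, r.total_amount
FROM customers c
INNER JOIN recent_orders r
  ON c.customer_id = r.customer_id;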
Use Tez as an Execution Engine instead of MapReduce
Hive can use the Apache Tez execution engine instead of the venerable MapReduce engine. Tez executes a query as a single DAG of tasks and avoids the intermediate HDFS writes that a chain of MapReduce jobs incurs. Enable the Tez engine with the statement below:
set hive.execution.engine=tez;
Cost-Based Optimization
Recent releases of Hive support cost-based optimization. The optimizer uses table and column statistics to optimize the logical and physical execution plans.
To use the cost-based optimizer (CBO), set the following Hive variables:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
Once you have set the above variables, use the ANALYZE command on a table to collect statistics. The optimizer uses these statistics to create an optimal execution plan.
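For example (the sales table and its sale_date partition column are hypothetical), the following collects partition-level and column-level statistics:
-- Gather statistics for all partitions of the table.
ANALYZE TABLE sales PARTITION (sale_date) COMPUTE STATISTICS;

-- Gather column-level statistics for the CBO.
ANALYZE TABLE sales PARTITION (sale_date) COMPUTE STATISTICS FOR COLUMNS;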
Use Vectorization
Vectorization improves the performance of HiveQL operations such as scans, aggregations, filters, and joins by performing them in batches of 1,024 rows at a time instead of one row at a time. Note that, depending on your Hive version, vectorized execution may require the data to be stored in the ORC file format.
Set the following variables to enable it:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;