When it comes to SQL-on-Hadoop, there are a handful of frameworks available in the market. Hive and Impala are the most widely used for building a data warehouse on the Hadoop framework. In this article, I will explain Cloudera Impala performance tuning best practices.
Beyond the framework itself, there are a number of choices available in tools, file formats, schema design, and configurations. Making good design choices when you start is the best way to avoid common mistakes later on.
Cloudera Impala Performance Tuning Best Practices
The following sections explain Cloudera Impala performance tuning best practices:
Of course, this is not an exhaustive list; feel free to suggest other methods in the comments section.
Partitioning of Impala Tables
Partitioning physically divides an Impala table's data based on the distinct values of frequently queried columns. This technique allows queries to skip reading a large percentage of the data in a table, reducing I/O and speeding up overall performance.
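For example, here is a minimal sketch of a partitioned table (the table and column names are hypothetical):

```sql
-- Partition the data by year and month so queries can prune partitions.
CREATE TABLE sales (
  id BIGINT,
  cust_id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (sale_year INT, sale_month INT)
STORED AS PARQUET;

-- A query that filters on the partition columns reads only the
-- matching partitions instead of scanning the whole table.
SELECT SUM(amount)
FROM sales
WHERE sale_year = 2016 AND sale_month = 7;
```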
Choose Appropriate File Format for the Data
Choosing the right file format for Impala tables is very important from a performance point of view.
Typically, for large volumes of data, the Parquet file format performs best because of its combination of columnar storage layout, large I/O request size, and compression and encoding.
You should also choose the appropriate block size when creating tables. By default, the Parquet block size is 256MB.
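As a sketch, you can control the Parquet block size with the PARQUET_FILE_SIZE query option when writing data from an impala-shell session (the table names here are hypothetical):

```sql
-- Files written by subsequent INSERT statements use this block size.
SET PARQUET_FILE_SIZE=256m;

-- Create a Parquet copy of an existing table's layout and load it.
CREATE TABLE sales_parquet LIKE sales_text STORED AS PARQUET;
INSERT OVERWRITE TABLE sales_parquet SELECT * FROM sales_text;
```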
Appropriate Schema Design
Schema design is one of the most important aspects of improving performance.
- Read my other post on Hadoop HDFS Schema Design for ETL Process.
Performance Considerations for Join Operations on Impala Tables
Joins are an important aspect of SQL queries. Avoid correlated subqueries and inline views; instead, materialize intermediate results into temporary tables and use inner joins wherever possible. Generate statistics at the table and column level so Impala can pick an efficient join order, as covered in the next section.
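For example, a correlated subquery can often be rewritten as an aggregation joined back to the outer table (a sketch; the sales and customers tables are hypothetical):

```sql
-- Instead of a correlated subquery such as:
--   SELECT c.name,
--          (SELECT SUM(s.amount) FROM sales s WHERE s.cust_id = c.id)
--   FROM customers c;
-- aggregate first, then use an inner join:
SELECT c.name, t.total_amount
FROM customers c
INNER JOIN (
  SELECT cust_id, SUM(amount) AS total_amount
  FROM sales
  GROUP BY cust_id
) t ON t.cust_id = c.id;
```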
Generate Statistics
Use the COMPUTE STATS statement to generate statistics on tables and columns. Up-to-date statistics help Impala automatically optimize query performance, especially when you are joining two or more tables.
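A minimal sketch (the table names are hypothetical):

```sql
-- Gather table- and column-level statistics for the query planner.
COMPUTE STATS sales;
COMPUTE STATS customers;

-- For large partitioned tables, COMPUTE INCREMENTAL STATS (Impala 2.1+)
-- refreshes statistics only for new or changed partitions.

-- Verify what the planner will see.
SHOW TABLE STATS sales;
SHOW COLUMN STATS sales;
```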
Avoid Loading too many Small Files
Hadoop works best with large files, and the same applies to Impala. You should avoid ingestion processes that produce a large number of small files. When producing files outside of Impala, prefer either the text or Avro format. Once you have the data in Impala tables, you can convert it to the Parquet file format.
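One common pattern, sketched below with hypothetical table names, is to land raw data in a text staging table and then compact it into Parquet in a single statement:

```sql
-- Convert many small text files into a few large Parquet files.
CREATE TABLE sales_parquet
STORED AS PARQUET
AS SELECT * FROM sales_text_staging;

-- Verify the resulting file sizes (Impala 2.2 and later).
SHOW FILES IN sales_parquet;
```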
Minimize the Overhead of Transmitting Data Back to the Client Application
Do not transmit entire tables back to the client application as-is. Use techniques such as aggregation, built-in functions, filtering, and the LIMIT clause to reduce the result set before it leaves the cluster.
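For instance, push the work into the query instead of pulling raw rows (a sketch; the table and columns are hypothetical):

```sql
-- Aggregate, filter, and limit on the cluster so only a small
-- result set is sent back to the client.
SELECT region, SUM(amount) AS total_amount
FROM sales
WHERE sale_date >= '2016-01-01'
GROUP BY region
ORDER BY total_amount DESC
LIMIT 10;
```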
Read:
- Impala Architecture
- Import using Apache Sqoop
- Export using Apache Sqoop
- Hadoop Data Warehouse and Design Considerations
- Hadoop HDFS Architecture Introduction and Design