When it comes to SQL-on-Hadoop, there are a handful of frameworks available in the market. Hive and Impala are the most widely used for building a data warehouse on the Hadoop framework. In this article, I will explain Cloudera Impala performance tuning best practices.
Beyond the framework itself, there are a number of choices available in tools, file formats, schema design, and configurations. Making good design choices when you start is the best way to avoid common mistakes later on.
Cloudera Impala Performance Tuning Best Practices
The following sections explain Cloudera Impala performance tuning best practices:
Of course, this is not an exhaustive list; feel free to suggest other methods in the comments section.
Partitioning of Impala Tables
Partitioning physically divides an Impala table's data based on the distinct values of frequently queried columns. This technique allows queries to skip reading a large percentage of the data in a table, reducing I/O and speeding up overall performance.
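For example, here is a minimal sketch of a partitioned table (the table and column names are hypothetical):

```sql
-- Partition the data by year and month so queries can prune partitions.
CREATE TABLE sales (
  id BIGINT,
  cust_id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (sale_year INT, sale_month INT)
STORED AS PARQUET;

-- A query that filters on the partition columns reads only the
-- matching partitions instead of scanning the whole table.
SELECT SUM(amount)
FROM sales
WHERE sale_year = 2016 AND sale_month = 7;
```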
Choose Appropriate File Format for the Data
Choosing the right file format for Impala tables is very important from a performance point of view.
Typically, for large volumes of data, the Parquet file format performs best because of its combination of columnar storage layout, large I/O request size, and compression and encoding.
You should also choose the appropriate block size when creating tables. By default, the Parquet block size is 256MB.
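As a sketch, you can control the Parquet block size with the PARQUET_FILE_SIZE query option when writing data from an impala-shell session (the table names here are hypothetical):

```sql
-- Files written by subsequent INSERT statements use this block size.
SET PARQUET_FILE_SIZE=256m;

-- Create a Parquet copy of an existing table's layout and load it.
CREATE TABLE sales_parquet LIKE sales_text STORED AS PARQUET;
INSERT OVERWRITE TABLE sales_parquet SELECT * FROM sales_text;
```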
Appropriate Schema Design
Schema design is one of the most important aspects of improving performance.
- Read my other post on Hadoop HDFS Schema Design for ETL Process.
Performance Considerations for Join Operations on Impala Tables
Joins are an important aspect of SQL queries. Avoid correlated subqueries and inline views; instead, materialize intermediate results into temporary tables and use inner joins wherever possible. Generate statistics at the table and column level so Impala can pick an efficient join order, as covered in the next section.
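For example, a correlated subquery can often be rewritten as an aggregation joined back to the outer table (a sketch; the sales and customers tables are hypothetical):

```sql
-- Instead of a correlated subquery such as:
--   SELECT c.name,
--          (SELECT SUM(s.amount) FROM sales s WHERE s.cust_id = c.id)
--   FROM customers c;
-- aggregate first, then use an inner join:
SELECT c.name, t.total_amount
FROM customers c
INNER JOIN (
  SELECT cust_id, SUM(amount) AS total_amount
  FROM sales
  GROUP BY cust_id
) t ON t.cust_id = c.id;
```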
Generate Statistics
Use the COMPUTE STATS statement to generate statistics on tables and columns. Up-to-date statistics help Impala automatically optimize query performance, especially when you are joining two or more tables.
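A minimal sketch (the table names are hypothetical):

```sql
-- Gather table- and column-level statistics for the query planner.
COMPUTE STATS sales;
COMPUTE STATS customers;

-- For large partitioned tables, COMPUTE INCREMENTAL STATS (Impala 2.1+)
-- refreshes statistics only for new or changed partitions.

-- Verify what the planner will see.
SHOW TABLE STATS sales;
SHOW COLUMN STATS sales;
```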
Avoid Loading too many Small Files
Hadoop works best with large files, and the same applies to Impala. You should avoid ingestion processes that produce a large number of small files. When producing files outside of Impala, prefer either the text or Avro format. Once you have the data in Impala tables, you can convert it to the Parquet file format.
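One common pattern, sketched below with hypothetical table names, is to land raw data in a text staging table and then compact it into Parquet in a single statement:

```sql
-- Convert many small text files into a few large Parquet files.
CREATE TABLE sales_parquet
STORED AS PARQUET
AS SELECT * FROM sales_text_staging;

-- Verify the resulting file sizes (Impala 2.2 and later).
SHOW FILES IN sales_parquet;
```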
Minimize the Overhead of Transmitting Data Back to the Client Application
Do not transmit entire tables back to the client application as-is. Use techniques such as aggregation, built-in functions, filtering, and the LIMIT clause to reduce the result set before it leaves the cluster.
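For instance, push the work into the query instead of pulling raw rows (a sketch; the table and columns are hypothetical):

```sql
-- Aggregate, filter, and limit on the cluster so only a small
-- result set is sent back to the client.
SELECT region, SUM(amount) AS total_amount
FROM sales
WHERE sale_date >= '2016-01-01'
GROUP BY region
ORDER BY total_amount DESC
LIMIT 10;
```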
Read:
- Impala Architecture
- Import using Apache Sqoop
- Export using Apache Sqoop
- Hadoop Data Warehouse and Design Considerations
- Hadoop HDFS Architecture Introduction and Design