Apache Hive Table Design Best Practices and Considerations

As you plan your database or data warehouse migration to the Hadoop ecosystem, there are key table design decisions that will heavily influence overall Hive query performance. In this article, we will look at Apache Hive table design best practices.

Apache Hive Table Design Best Practices

Table design plays a very important role in Hive query performance. Design choices also have a significant effect on storage requirements, which in turn affect query performance: less data on disk means fewer I/O operations and less memory needed to process Hive queries.

Below are some of the key Apache Hive table design best practices:

  • Choose an appropriate storage format
  • Compression techniques
  • Partition large tables
  • Bucketing Hive tables
  • Data locality

Choose Appropriate Storage Format

Hive uses HDFS as its storage layer; ultimately, all of your Hive tables are stored as HDFS files. Choose a storage format that boosts query performance and improves data retrieval speed.

Avoid the TEXT and SequenceFile formats, and complex storage formats such as JSON. Columnar formats such as RCFile (Record Columnar File) or Parquet are best suited. If you are building a data warehouse on Hive, use the Parquet file format for better performance.

For example, the following statement creates a table stored as Parquet:

-- Parquet defines its own storage layout, so no ROW FORMAT clause is needed
CREATE TABLE IF NOT EXISTS test_table (
  col1 INT,
  col2 STRING
)
STORED AS PARQUET;
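
As a quick sanity check, DESCRIBE FORMATTED shows the storage format Hive recorded for the table; the InputFormat and OutputFormat entries should name the Parquet classes:

-- Inspect table metadata to confirm the storage format
DESCRIBE FORMATTED test_table;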

Compression Techniques

Compressing data on HDFS reduces query execution time. Even though compression and decompression consume CPU, they rarely slow down MapReduce jobs, because those jobs are typically I/O bound rather than CPU bound. It is recommended to use compression wherever possible.

Prefer the splittable compression options provided by Hadoop and Hive; Snappy, for example, remains splittable when used inside container formats such as Parquet or SequenceFile. Avoid Gzip: it is not splittable, so a large Gzip file cannot be processed in parallel, and it is CPU intensive.

CREATE TABLE IF NOT EXISTS test_table (
  col1 INT,
  col2 STRING
)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY");

Also ensure that your final and intermediate Hive output is always compressed by setting the following variables in Hive:

set hive.exec.compress.output=true;
set hive.exec.compress.intermediate=true;
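
These flags only turn compression on; the codec itself comes from the underlying Hadoop and Hive settings. As a sketch, assuming the Snappy codec is installed on every node of the cluster, you could pin it explicitly:

-- Codec for intermediate data passed between MapReduce stages
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Codec for the final job output
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;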

Partition Large Tables

Partitioning significantly reduces table scan time, so you should partition large Hive tables, especially when collecting time-series data. A partitioned table stores its data in subdirectories such as year/month/day. When you query the data, Hive reads only the relevant subdirectories (partition pruning) instead of scanning the entire table.

For example, the following statement creates a table partitioned on col3:

-- The partition column (col3) is declared only in PARTITIONED BY,
-- not in the column list
CREATE TABLE IF NOT EXISTS test_table (
  col1 INT,
  col2 STRING
)
PARTITIONED BY (col3 STRING)
STORED AS PARQUET;
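
To see the pruning in action, here is a minimal sketch of loading and querying a single partition; staging_table is a hypothetical source table:

-- Write one partition (staging_table is hypothetical)
INSERT INTO TABLE test_table PARTITION (col3 = '2018-02')
SELECT col1, col2 FROM staging_table;

-- The filter on the partition column prunes all other subdirectories
SELECT col1, col2 FROM test_table WHERE col3 = '2018-02';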

Bucketing Hive Tables

Bucketing divides a Hive partition (or an unpartitioned table) into a user-defined number of roughly equal clusters, called buckets. Hive assigns each row to a bucket based on the hash of the column named in the CLUSTERED BY clause. Bucketing is useful when it is difficult to partition on a column because it holds a huge variety of values, yet you still want to run queries against that column.

For example, the following statement creates a partitioned table that is bucketed into 32 buckets on col1:

CREATE TABLE bucketed_table (
  col1 INT,
  col2 STRING,
  col3 STRING
)
-- As above, the partition column (col4) must appear only in PARTITIONED BY
PARTITIONED BY (col4 DATE)
CLUSTERED BY (col1) INTO 32 BUCKETS
STORED AS PARQUET;
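
When loading such a table on older Hive releases, bucketing enforcement must be switched on so inserts actually produce the declared number of buckets; a sketch, with source_table as a hypothetical source:

-- Hive 1.x: force inserts to honour the declared bucket count
-- (Hive 2.x enforces bucketing automatically)
set hive.enforce.bucketing = true;

INSERT INTO TABLE bucketed_table PARTITION (col4 = '2018-02-26')
SELECT col1, col2, col3 FROM source_table;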

Data Locality

Hadoop supports Amazon S3 as a storage layer, but use S3 only for backup and restore. Do not use it as your main storage, because reads and writes over the network are slow. For everything else, use Hadoop HDFS to take advantage of data locality: computation runs on the nodes that hold the data, so reads and writes stay within the same network and are much faster.
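
To make the distinction concrete, here is a hedged sketch of the two kinds of table locations; the table names, paths, and bucket name are all hypothetical:

-- Primary table on HDFS: tasks run on the nodes that hold the data blocks
CREATE EXTERNAL TABLE events (
  col1 INT,
  col2 STRING
)
STORED AS PARQUET
LOCATION 'hdfs:///warehouse/events';

-- Backup copy on S3 via the s3a:// connector, used only for backup and restore
CREATE EXTERNAL TABLE events_backup (
  col1 INT,
  col2 STRING
)
STORED AS PARQUET
LOCATION 's3a://my-backup-bucket/events';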