Hive CREATE INDEX to Optimize and Improve Query Performance

  • Post author:
  • Post last modified:February 26, 2018
  • Post category:BigData
  • Reading time:4 mins read

The main goal of creating INDEX on Hive table is to improve the data retrieval speed and optimize query performance. For example, let us say you are executing Hive query with filter condition WHERE col1 = 100, without index hive will load entire table or partition to process records and with index on col1 would load part of HDFS file to process records.

Hive Create Index

But be informed that Index on hive table is not recommended. The create index will help if you are migrating your existing data warehouse to Hive and you have transformed the query with index as it is.

Read:

In this article, we will learn Hive CREATE INDEX on table table to optimize and improve query performance with an example.

Hive CREATE INDEX Syntax

You can create INDEX on particular column of the table by using CREATE INDEX statement. Below is the syntax:

CREATE INDEX index_name
 ON TABLE base_table_name (col_name, ...)
 AS index_type
 [WITH DEFERRED REBUILD]
 [IDXPROPERTIES (property_name=property_value, ...)]
 [IN TABLE index_table_name]
 [ [ ROW FORMAT ...] STORED AS ...
 | STORED BY ... ]
 [LOCATION hdfs_path]
 [TBLPROPERTIES (...)]
 [COMMENT "index comment"];

Hive CREATE INDEX Example

Below example shows how to create index on Hive tables:

hive> CREATE INDEX index_students ON TABLE students(id) 
 > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' 
 > WITH DEFERRED REBUILD ; 
OK

Time taken: 0.493 seconds

Hive ALTER INDEX

ALTER INDEX … REBUILD builds an index that was created using the WITH DEFERRED REBUILD clause, or rebuilds a previously built index on the table. You should provide PARTITION details if the table is partitioned.

Hive ALTER INDEX Syntax

ALTER INDEX index_name ON table_name [PARTITION partition_spec] REBUILD;

Hive ALTER INDEX Examples

Now let us build the index that we have created on students tables:

hive> ALTER INDEX index_students ON students REBUILD; 
Query ID = cloudera_20171111093030_6a37b92b-bae8-4fd1-91bb-d13e9d411513 
Total jobs = 1 
...
Total MapReduce CPU Time Spent: 4 seconds 180 msec 
OK

Hive DROP INDEX

DROP INDEX statement drops the index and delete index table.

Hive DROP INDEX Syntax

DROP INDEX [IF EXISTS] index_name ON table_name;

Hive DROP INDEX Examples

hive> DROP INDEX IF EXISTS index_students ON students; 
OK

Time taken: 0.27 seconds

Disadvantage of Hive Index

Below are some of disadvantages of index on hive tables:

  • The improvement in query speed that an index can provide comes at the cost of additional processing to create the index and disk space to store the index references.
  • Indexes are advised to build on the columns which you use in filter conditions. Index on table may degrade your performance in case you are not using them.
  • Building more number of index on same table will degrade the performance of the query