The main goal of creating INDEX on Hive table is to improve the data retrieval speed and optimize query performance. For example, let us say you are executing Hive query with filter condition WHERE col1 = 100, without index hive will load entire table or partition to process records and with index on col1 would load part of HDFS file to process records.
But be informed that Index on hive table is not recommended. The create index will help if you are migrating your existing data warehouse to Hive and you have transformed the query with index as it is.
Read:
- Hadoop Hive Bucket Concept and Bucketing Examples
- Hive Create Table Command and Examples
- Apache Hive Create External Tables and Examples
- Different Hive Join Types and Examples
- Hive Create View Syntax and Examples
In this article, we will learn Hive CREATE INDEX on table table to optimize and improve query performance with an example.
Hive CREATE INDEX Syntax
You can create INDEX on particular column of the table by using CREATE INDEX statement. Below is the syntax:
CREATE INDEX index_name ON TABLE base_table_name (col_name, ...) AS index_type [WITH DEFERRED REBUILD] [IDXPROPERTIES (property_name=property_value, ...)] [IN TABLE index_table_name] [ [ ROW FORMAT ...] STORED AS ... | STORED BY ... ] [LOCATION hdfs_path] [TBLPROPERTIES (...)] [COMMENT "index comment"];
Hive CREATE INDEX Example
Below example shows how to create index on Hive tables:
hive> CREATE INDEX index_students ON TABLE students(id) > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' > WITH DEFERRED REBUILD ; OK Time taken: 0.493 seconds
Hive ALTER INDEX
ALTER INDEX … REBUILD builds an index that was created using the WITH DEFERRED REBUILD clause, or rebuilds a previously built index on the table. You should provide PARTITION details if the table is partitioned.
Hive ALTER INDEX Syntax
ALTER INDEX index_name ON table_name [PARTITION partition_spec] REBUILD;
Hive ALTER INDEX Examples
Now let us build the index that we have created on students tables:
hive> ALTER INDEX index_students ON students REBUILD; Query ID = cloudera_20171111093030_6a37b92b-bae8-4fd1-91bb-d13e9d411513 Total jobs = 1 ... Total MapReduce CPU Time Spent: 4 seconds 180 msec OK
Hive DROP INDEX
DROP INDEX statement drops the index and delete index table.
Hive DROP INDEX Syntax
DROP INDEX [IF EXISTS] index_name ON table_name;
Hive DROP INDEX Examples
hive> DROP INDEX IF EXISTS index_students ON students; OK Time taken: 0.27 seconds
Disadvantage of Hive Index
Below are some of disadvantages of index on hive tables:
- The improvement in query speed that an index can provide comes at the cost of additional processing to create the index and disk space to store the index references.
- Indexes are advised to build on the columns which you use in filter conditions. Index on table may degrade your performance in case you are not using them.
- Building more number of index on same table will degrade the performance of the query