Hive uses cost based optimizer. Statistics serve as the input to the cost functions of the Hive optimizer so that it can compare different plans and choose best among them. Hive uses the statistics such as number of rows in tables or table partition to generate an optimal query plan. Other than optimizer, hive uses mentioned statistics in many other ways. In this post, we will check Apache Hive table statistics – Hive ANALYZE TABLE command and some examples.
Uses of Hive Table or Partition Statistics
There are many ways statistics can be useful.
- Hive cost based optimizer uses the statistics to generate an optimal query plan.
- Users can quickly get the answers for some of their queries by only querying stored statistics rather than firing long-running execution plans.
- Use can get data distribution details as well, such as top 10 product sold, age distribution in person table etc.
What Statistics are collected?
The following statistics currently supported for table and partitions:
- Number of rows
- Number of files
- Size in Bytes
- Number of partition if the table is partitioned
Hive ANALYZE TABLE Command
Apache Hive uses ANALYZE TABLE command to collect statistics on a given table. This command collects statistics and stored in Hive metastore.
Hive ANALYZE TABLE Command Syntax
Below is the syntax to collect statistics:
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
COMPUTE STATISTICS [FOR COLUMNS]
[NOSCAN];
For examples, below command will compute statistics for test1 table
ANALYZE TABLE test1 COMPUTE STATISTICS;
View Hive Table Statistics
You can view Hive table statistics using DESCRIBE command.
DESCRIBE EXTENDED TABLE1;
For example;
DESCRIBE EXTENDED test1;
You should see the basic table statistics in parameter tag.
parameters:{totalSize=0, numRows=0, rawDataSize=0...
When Hive Table Statistics are Computed?
There are two ways Hive table statistics are computed.
- Automatic Hive Table Statistics: For newly created tables and/or partition, utomatically computed by default.
The user has to explicitly set the boolean variable hive.stats.autogather to false so that statistics are not automatically computed and stored into Hive MetaStore.
set hive.stats.autogather=false;
- Use ANALYZE to collect statistics for existing tables: For existing tables and/or partitions, the user can issue the ANALYZE command to gather statistics and write them into Hive MetaStore.
Related reading: Apache Hive EXPLAIN Command and Example
Hope this helps 🙂