Latest version of Hive uses Cost Based Optimizer (CBO) to increase the Hive query performance. Hive uses a cost-based optimizer to determine the best method for scan and join operations, join order, and aggregate operations. You can use the Apache Hive EXPLAIN command to display the actual execution plan that Hive query engine generates and uses while executing any query in the Hadoop ecosystem.
Read:
Apache Hive Cost Based Optimizer
Latest version of Apache Hive uses the cost based optimizer to determine the best methods for the query to be executed in the Hadoop ecosystem.
Optimizer uses the statistics to determine the optimal and best execution plan for Hive queries that involves complex logic and multiple table joins.
The statistics include the following:
- The number of rows
- It also includes the number files in the Hadoop directory
- File size in bytes
Apache Hive EXPLAIN command
Hive EXPLAIN command is very useful and comes handy when you are trying to optimize the Hive query. You can use the EXPLAIN command to show a statement execution plan.
Apache Hive EXPLAIN command Syntax
Below is the syntax to use the EXPLAIN command:
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query;
The EXTENDED in the EXPLAIN statement produces extra information about the operators in the plan. This is typically physical information like file names. The EXPLAIN command also provides dependencies between the different stages of the plan. The description of each of the stages.
Apache Hive EXPLAIN command Examples
Below is example on how execution plan looks like:
hive> EXPLAIN select * from stud; OK Plan not optimized by CBO. Stage-0 Fetch Operator limit:-1 Select Operator [SEL_1] outputColumnNames:["_col0","_col1","_col2"] TableScan [TS_0] alias:stud Time taken: 0.033 seconds, Fetched: 10 row(s) hive>
Collect Hive Statistics using Hive ANALYZE command
You can collect the statistics on the table by using Hive ANALAYZE command. Hive cost based optimizer make use of these statistics to create optimal execution plan.
Below is the example of computing statistics on Hive tables:
hive> ANALYZE TABLE stud COMPUTE STATISTICS; Query ID = impadmin_20171115185549_a73662c3-5332-42c9-bb42-d8ccf21b7221 Total jobs = 1 Launching Job 1 out of 1 … Table training_db.stud stats: [numFiles=5, numRows=5, totalSize=50, rawDataSize=45] OK Time taken: 8.202 seconds
Understand Apache Hive EXPLAIN plan
Below are some of the features that are available in Hive EXPLAIN plan:
Feature Name | Description |
TableScan | If one or more tables are small enough to fit in memory, the mapper scans the large table and do the joins. This table scan operation performs large table scan. |
Select | Identifies the Select operators in the given query. This operator projects only columns that are given in the select clause of query. Query may have multiple select based on the complexity of the query. |
GroupBy | This feature identifies grouping on records during computations. Feature may also vary on the complexity of the query. |
Map | This is the map step that Hive uses to execute query. The map phase reads the tables and output the join key-value pairs into an intermediate file. |
Reduce | This is the reduce phase that Hive uses. The reduce gets the sorted data and does the join. |
MapJoin | If there are multiple tables joins that involve small and big table MapJoin phase is used. Small table (dimension table) joins big table (fact table). It is very fast since it saves shuffle and reduce stage. |
Filter | This operator identifies the filter conditions such as WHERE clause. |
Broadcast | This operator identifies the broadcast phase. If two tables are joined together, if joining table is small, Hive sends copy of table to all nodes to make tables as collocated. |
Reducer | Reduces phases combines the results from reduce phase |
Drop | This operator identifies drop table clause. |
Create | This operator identifies any create table or create table as clause in the query. |
Alter | This operator identifies any alter table clause in the query. Alter table could be adding new column, renaming table, changing datatype of column, dropping columns etc. |