Apache Hive EXPLAIN Command and Example

Latest version of Hive uses Cost Based Optimizer (CBO) to increase the Hive query performance. Hive uses a cost-based optimizer to determine the best method for scan and join operations, join order, and aggregate operations. You can use the Apache Hive EXPLAIN command to display the actual execution plan that Hive query engine generates and uses while executing any query in the Hadoop ecosystem.

Read:

Apache Hive EXPLAIN Command

Apache Hive Cost Based Optimizer

Latest version of Apache Hive uses the cost based optimizer to determine the best methods for the query to be executed in the Hadoop ecosystem.

Optimizer uses the statistics to determine the optimal and best execution plan for Hive queries that involves complex logic and multiple table joins.

The statistics include the following:

The number of rows
It also includes the number files in the Hadoop directory
File size in bytes

Apache Hive EXPLAIN command

Hive EXPLAIN command is very useful and comes handy when you are trying to optimize the Hive query. You can use the EXPLAIN command to show a statement execution plan.

Apache Hive EXPLAIN command Syntax

Below is the syntax to use the EXPLAIN command:

EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query;

The EXTENDED in the EXPLAIN statement produces extra information about the operators in the plan. This is typically physical information like file names. The EXPLAIN command also provides dependencies between the different stages of the plan. The description of each of the stages.

Apache Hive EXPLAIN command Examples

Below is example on how execution plan looks like:

hive> EXPLAIN select * from stud;
OK
Plan not optimized by CBO.

Stage-0
 Fetch Operator
 limit:-1
 Select Operator [SEL_1]
 outputColumnNames:["_col0","_col1","_col2"]
 TableScan [TS_0]
 alias:stud

Time taken: 0.033 seconds, Fetched: 10 row(s)
hive>

Collect Hive Statistics using Hive ANALYZE command

You can collect the statistics on the table by using Hive ANALAYZE command. Hive cost based optimizer make use of these statistics to create optimal execution plan.

Below is the example of computing statistics on Hive tables:

hive> ANALYZE TABLE stud COMPUTE STATISTICS;
 Query ID = impadmin_20171115185549_a73662c3-5332-42c9-bb42-d8ccf21b7221
 Total jobs = 1
 Launching Job 1 out of 1
 …
 Table training_db.stud stats: [numFiles=5, numRows=5, totalSize=50, rawDataSize=45]
 OK
 Time taken: 8.202 seconds

Understand Apache Hive EXPLAIN plan

Below are some of the features that are available in Hive EXPLAIN plan:

Feature Name	Description
TableScan	If one or more tables are small enough to fit in memory, the mapper scans the large table and do the joins. This table scan operation performs large table scan.
Select	Identifies the Select operators in the given query. This operator projects only columns that are given in the select clause of query. Query may have multiple select based on the complexity of the query.
GroupBy	This feature identifies grouping on records during computations. Feature may also vary on the complexity of the query.
Map	This is the map step that Hive uses to execute query. The map phase reads the tables and output the join key-value pairs into an intermediate file.
Reduce	This is the reduce phase that Hive uses. The reduce gets the sorted data and does the join.
MapJoin	If there are multiple tables joins that involve small and big table MapJoin phase is used. Small table (dimension table) joins big table (fact table). It is very fast since it saves shuffle and reduce stage.
Filter	This operator identifies the filter conditions such as WHERE clause.
Broadcast	This operator identifies the broadcast phase. If two tables are joined together, if joining table is small, Hive sends copy of table to all nodes to make tables as collocated.
Reducer	Reduces phases combines the results from reduce phase
Drop	This operator identifies drop table clause.
Create	This operator identifies any create table or create table as clause in the query.
Alter	This operator identifies any alter table clause in the query. Alter table could be adding new column, renaming table, changing datatype of column, dropping columns etc.