As we have seen in my other post Steps to Optimize SQL Query Performance, we can improve the performance of back-end SQL by adding simple improvement while writing SQL queries. Apache Hive architecture behaves differently with data and type of HQL query you write. In this post, we will check best practices to optimize Hive query performance with some examples.
In data warehouse environment, we write lot of queries and pay very little attention to the optimization part. Tuning performance of Hive query is one of important step and require lot of SQL and domain knowledge.
Best Practices to Optimize Hive Query Performance
Below are some of the simple steps that can improve HQL query performance on Apache Hive:
Use Column Names instead of * in SELECT Clause
This seems to be odd but it will definitely improve the performance of Hive query on TEXT file format. Hive file formats such as PARQUET, ORC replaces * with actual column names.
In case if you are using TEXT file format, do not use * in your HQL queries, instead, use actual column names that you want to return.
Original Query:
Select * from table_name;
Optimized Query:
select col1, col2, col3 from table_name;
Use SORT BY instead of ORDER BY Clause
Hive supports both ORDER BY and SORT BY causes. ORDER BY works on single reducer and it causes performance bottleneck. Hive provides an alternative, SORT BY clause, that orders the data only within each reducer and performs a local ordering where each reducer’s output will be sorted.
Better performance is guaranteed if you use SORT BY instead of ORDER BY clause.
Original Query:
SELECT I_CLASS, I_CATEGORY
FROM ITEM
GROUP BY I_CLASS, I_CATEGORY
ORDER BY I_CLASS ASC, I_CATEGORY DESC ;
Optimized Query:
SELECT I_CLASS, I_CATEGORY
FROM ITEM
GROUP BY I_CLASS, I_CATEGORY
SORT BY I_CLASS ASC, I_CATEGORY DESC;
Use Hive Cost Based Optimizer (CBO) and Update Stats
Apache Hive provides cost based optimizer to improve the performance. You should enable the CBO and update statistics regularly using Apache Hive ANALYZE command.
Apache Hive uses table statistics to generate an optimal execution plan. Tables stats play important role in improving performance of HQL query execution.
Hive Command to Enable CBO
You can enable CBO on Hadoop Hive cluster using below command:
hive.cbo.enable=true
Related article: Hive ANALYZE TABLE Command – Table Statistics
Use WHERE instead of HAVING to Define Filters on non-aggregate Columns
HAVING clause will work on filtering aggregated records. Avoid using the HAVING clause on non-aggregate columns. You can pull only records that are required to perform given task. HAVING clause will filter out records only after grouping them, instead, we can filter out these in initial step using WHERE clause.
Original Query:
SELECT C.CustomerID, C.Name, Count(S.SalesID)
FROM Customers as C
INNER JOIN Sales as S
ON C.CustomerID = S.CustomerID
GROUP BY C.CustomerID, C.Name
HAVING S.LastSaleDate BETWEEN ‘1/1/2019’ AND ‘12/31/2019’;
Optimized Query:
SELECT C.CustomerID, C.Name, Count(S.SalesID)
FROM Customers as C
INNER JOIN Sales as S
ON C.CustomerID = S.CustomerID
WHERE S.LastSaleDate BETWEEN ‘1/1/2019’ AND ‘12/31/2019’
GROUP BY C.CustomerID, C.Name;
Avoid using Functions in Predicates
Another simple and important optimization technique is to avoid using function in predicates. Execution will be slower in case of any function on columns which are used in WHERE clause.
For example, avoid queries something like below:\
select t1.col1, t2.col2 from table1 as t1
join table2 as t2
on upper(t1.col1) = t2.col2;
In case if functions are required, create CTE to handle that functionality.
with CTE1 as (select upper(col1) as C1, col1 FROM table1)
select t1.col1, t2.col2 from CTE1 as t1
join table2 as t2
on t1.C1 = t2.col2;
Avoid Calculated Fields in JOIN and WHERE Clause
Just like functions, you should also avoid using any calculated fields in JOIN and WHERE clause. Create CTE to handle those functionalities and use that CTE inside your queries.
Original query:
select t1.col1, t2.col2 from table1 as t1
join table2 as t2
on (t1.col1 + 100 = t2.col2);
Optimized query:
with CTE1 as (select t1.col1 + 100 as C1, col1 FROM table1)
select t1.col1, t2.col2 from CTE1 as t1
join table2 as t2
on (t1.C1 = t2.col2);
I will keep on updating this list. Hope this helps 🙂
Good work vithal, please update more steps, I shall also provide some. Let me know your email id
Thanks Pradeep,
Please add them to this comment chain if you have more optimization opportunities. I will keep on updating this post.
Thanks