It is common to write queries that use the GROUP BY and HAVING clauses to group records or rows. The GROUP BY clause groups the rows of a Hive or relational database table by the values of the columns it lists. However, GROUP BY and DISTINCT operations are costly, in both Hive and relational databases. In some cases, you can rewrite a query to remove the GROUP BY clause. In this article, we will look at the GROUP BY alternatives available in Hive and SQL.
SQL and Hive GROUP BY Alternative
As mentioned in the previous section, Hive and SQL use the GROUP BY clause to group records in a table.
Following are the alternative methods that you can use to replace GROUP BY in your queries.
SQL RANK Analytic Function as GROUP BY Alternative
You can use the RANK or ROW_NUMBER analytic function if your Hive or SQL query uses the MIN or MAX aggregate function.
For example, the following query returns the maximum salary for each department ID.
select deptID, max(salary) from TEST2 group by DEPTID;
+--------+-------------+
| DEPTID | MAX(SALARY) |
|--------+-------------|
|     10 |        1100 |
|     11 |        1200 |
|     12 |        1000 |
+--------+-------------+
In the above example, we have defined the DEPTID as a group by column.
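The contents of TEST2 are not shown in this article. The sketch below creates a small table to experiment with. The rows and the column types are assumptions, chosen only so that they reproduce the outputs shown here.

```sql
-- Hypothetical sample data: these rows are not the original table contents,
-- they are chosen to match the result sets shown in this article.
CREATE TABLE test2 (
  deptid INT,
  salary INT
);

INSERT INTO test2 VALUES
  (10, 1100),
  (10, 1000),
  (11, 1200),
  (12, 1000);
```

The multi-row INSERT works in most relational databases. In Hive, the exact INSERT syntax may depend on your version.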
You can get the same results with the RANK or ROW_NUMBER window function, as in the following query.
SELECT deptid,
       salary
FROM   (SELECT Rank()
                 OVER(
                   PARTITION BY deptid
                   ORDER BY salary DESC) AS rk,
               deptid,
               salary
        FROM   test2) AS tmp
WHERE  rk = 1
ORDER  BY deptid;
+--------+--------+
| DEPTID | SALARY |
|--------+--------|
|     10 |   1100 |
|     11 |   1200 |
|     12 |   1000 |
+--------+--------+
As you can see, both queries return the same results.
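RANK assigns the same rank to tied rows, so a department where two employees share the top salary would appear twice in the result. A minimal sketch of the ROW_NUMBER variant, assuming the same test2 table, returns exactly one row per department. The aliases max_salary, rn and tmp are illustrative.

```sql
-- ROW_NUMBER numbers the rows of each department from the highest salary down,
-- so rn = 1 keeps a single row per department even when salaries tie.
SELECT deptid,
       salary AS max_salary
FROM   (SELECT deptid,
               salary,
               ROW_NUMBER() OVER (PARTITION BY deptid
                                  ORDER BY salary DESC) AS rn
        FROM   test2) AS tmp
WHERE  rn = 1
ORDER  BY deptid;
```

To replace MIN instead of MAX, order by salary ASC inside the window.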
SQL Sub-query as a GROUP BY and HAVING Alternative
You can use a subquery to remove GROUP BY from a query that uses the SUM aggregate function. There are many types of subqueries, but here you can use a correlated subquery to calculate the sum.
For example, consider the query below, which calculates the sum of salaries for each department and returns the deptid values whose total salary is more than 1100.
select deptID, sum(salary) from test2 group by deptID having sum(salary) > 1100;
+--------+-------------+
| DEPTID | SUM(SALARY) |
|--------+-------------|
|     10 |        2100 |
|     11 |        1200 |
+--------+-------------+
Now, rewrite the query using a correlated subquery. For example:
SELECT A.deptid,
       A.total_sal
FROM   (SELECT DISTINCT t1.deptid,
                        (SELECT Sum(salary)
                         FROM   test2 t2
                         WHERE  t1.deptid = t2.deptid) total_sal
        FROM   test2 t1) AS A
WHERE  total_sal > 1100;
+--------+-----------+
| DEPTID | TOTAL_SAL |
|--------+-----------|
|     10 |      2100 |
|     11 |      1200 |
+--------+-----------+
Note that this method works only with an RDBMS. Hive does not support inline subqueries in the SELECT clause.
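For Hive, one possible way to avoid both GROUP BY and the correlated subquery is a windowed SUM. The sketch below is an assumption rather than part of the method described above: it computes the department total on every row with SUM() OVER, then keeps one row per department with ROW_NUMBER. The aliases total_sal, rn and tmp are illustrative.

```sql
-- SUM() OVER repeats the department total on every row of that department;
-- rn = 1 then keeps one row per department, and the filter on total_sal
-- plays the role of the HAVING clause.
SELECT deptid,
       total_sal
FROM   (SELECT deptid,
               SUM(salary) OVER (PARTITION BY deptid) AS total_sal,
               ROW_NUMBER() OVER (PARTITION BY deptid
                                  ORDER BY salary DESC) AS rn
        FROM   test2) tmp
WHERE  rn = 1
  AND  total_sal > 1100
ORDER  BY deptid;
```

Whether this is cheaper than the GROUP BY version depends on the engine and the data, so it is worth comparing the execution plans before adopting it.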
Hope this helps 🙂