SQL and Hive GROUP BY Alternative-Example

It is common to write queries that use the GROUP BY and HAVING clauses to group records or rows. The GROUP BY clause groups the rows of a Hive or relational database table by the values of the columns listed in it. However, GROUP BY and DISTINCT operations can be costly in both Hive and relational databases, because the rows have to be sorted or hashed before they can be grouped. In some cases, you can rewrite a query to remove the GROUP BY clause. In this article, we will check the GROUP BY alternative methods available in Hive and SQL.

SQL and Hive GROUP BY Alternative

As mentioned in the previous section, Hive and SQL use the GROUP BY clause to group the records in a table.

Following are the alternative methods that you can use to replace GROUP BY in your queries.

SQL RANK Analytic Function as GROUP BY Alternative

You can use the RANK or ROW_NUMBER analytic function if your Hive or SQL query uses the MIN or MAX aggregate function.

For example, the following query returns the MAX salary for each department ID.

select deptID,  max(salary) from TEST2 group by DEPTID;
+--------+-------------+
| DEPTID | MAX(SALARY) |
|--------+-------------|
|     10 |        1100 |
|     11 |        1200 |
|     12 |        1000 |
+--------+-------------+

In the above example, we have defined the DEPTID as a group by column.

However, you can get the same results with the RANK or ROW_NUMBER window function.

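-- Rank the rows within each department, highest salary first,
-- then keep only the top-ranked row(s) of each department.
SELECT deptid,
       salary
FROM   (SELECT Rank()
                 OVER(
                   partition BY deptid
                   ORDER BY salary DESC) AS rk,
               deptid,
               salary
        FROM   test2) AS tmp
WHERE  rk = 1
ORDER  BY deptid;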

+--------+--------+
| DEPTID | SALARY |
|--------+--------|
|     10 |   1100 |
|     11 |   1200 |
|     12 |   1000 |
+--------+--------+

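As you can see, both queries return the same results. Note that RANK returns more than one row for a department when several rows tie for the highest salary, whereas ROW_NUMBER always returns exactly one row per department.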
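The ROW_NUMBER variant mentioned above is not shown in the article, so here is a minimal sketch of it. The contents of TEST2 are not given either: the table definition and the four rows below are assumptions, chosen only so that they are consistent with the results shown in this article. Skip the CREATE and INSERT statements if you already have your own table.

```sql
-- Hypothetical sample data (an assumption, not the author's actual table).
CREATE TABLE test2 (deptid INT, salary INT);

INSERT INTO test2 VALUES
  (10, 1000),
  (10, 1100),
  (11, 1200),
  (12, 1000);

-- Number the rows of each department from the highest salary down,
-- then keep only the first row of each department.
SELECT deptid,
       salary
FROM   (SELECT deptid,
               salary,
               ROW_NUMBER()
                 OVER(
                   PARTITION BY deptid
                   ORDER BY salary DESC) AS rn
        FROM   test2) AS tmp
WHERE  rn = 1
ORDER  BY deptid;
```

To replace MIN instead of MAX, the same sketch should work with ORDER BY salary ASC inside the OVER clause.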

SQL Sub-query as a GROUP BY and HAVING Alternative

You can use a subquery to remove the GROUP BY from a query that uses the SUM aggregate function. There are many types of subqueries, but here you can use a correlated subquery to calculate the sum.

For example, consider the query below, which calculates the SUM of salary for each department and returns the departments whose total salary is more than 1100.

select deptID, sum(salary)  from test2 group by deptID having sum(salary) > 1100;
+--------+-------------+
| DEPTID | SUM(SALARY) |
|--------+-------------|
|     10 |        2100 |
|     11 |        1200 |
+--------+-------------+

Now, rewrite the query using a correlated subquery. For example:

SELECT A.deptid, 
       A.total_sal 
FROM   (SELECT DISTINCT t1.deptid, 
                        (SELECT Sum(salary) 
                         FROM   test2 t2 
                         WHERE  t1.deptid = t2.deptid) total_sal 
        FROM   test2 t1) AS A 
WHERE  total_sal > 1100; 

+--------+-----------+
| DEPTID | TOTAL_SAL |
|--------+-----------|
|     10 |      2100 |
|     11 |      1200 |
+--------+-----------+

Note that this method works only with an RDBMS. Hive, at least in its older releases, does not support inline subqueries in the SELECT clause.
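For Hive, one possible way to avoid both the GROUP BY and the SELECT-clause subquery is a windowed SUM. The query below is a sketch of that idea rather than a method taken from this article; it assumes a Hive version with windowing functions and the same test2 table with deptid and salary columns. The alias total_sal is reused from the query above, and the derived-table alias t is arbitrary.

```sql
-- Compute the department total on every row with a window function,
-- filter on that total, then collapse the duplicate rows with DISTINCT.
SELECT DISTINCT deptid,
                total_sal
FROM   (SELECT deptid,
               SUM(salary)
                 OVER(
                   PARTITION BY deptid) AS total_sal
        FROM   test2) t
WHERE  total_sal > 1100;
```

This should return the same departments as the GROUP BY and HAVING query. It still relies on DISTINCT, so it is worth comparing the execution plans before assuming it is cheaper.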