Apache Hive Grouping Function, Alternative and Examples

  • Post author:
  • Post last modified:June 4, 2018
  • Post category:BigData
  • Reading time:3 mins read

Most of the relational databases supports Grouping function to segregate super-aggregated rows.  Apache Hive Support for SQL grouping function was added in Hive 2.3.0. But who are using lower version of Hive will have difficult time in porting SQL queries that are written using grouping functions. In this article, we will check Apache Hive Grouping function alternative and examples.

Grouping Function

In general, the grouping function indicates whether an expression in a GROUP BY clause is aggregated or not for a given row. The value 0 represents a column that is part of the grouping set, while the value 1 represents a column that is not part of the grouping set.

In various relational databases, GROUP BY extensions such as ROLLUP and CUBE produce super-aggregate rows where the set of all values is represented by null. GROUPING function distinguishes super-aggregate rows from regular grouped rows.

Below is the example of relational database grouping function:

select
item_id,
state,
grouping(state) grp_state,
avg(quantity) agg1
from store, item
where store.item_id = item.item_id
group by rollup (item_id, state)
order by item_id ,state
limit 100;

Apache Hive Grouping Function Alternative

Apache Hive Support for SQL grouping function was added in Hive 2.3.0. You can follow below method if you are using Hive other than Hive 2.2.0. This method is not perfect, you can improve this method 🙂

Apache Hive Grouping__ID Function

You can alternatively use Grouping__ID function to distinguish super-aggregated rows from regular grouped rows.

This function returns a bitvector corresponding to whether each column is present or not. For each column, a value of “1” is produced for a row in the result set if that column has been aggregated in that row, otherwise the value is “0”. This can be used to differentiate when there are nulls in the data.

SELECT item_id,
state,
(case when (CAST (GROUPING__ID AS INT) & 1) = 0 then 1 else 0 end) as grp_state
from store, item
where store.item_id = item.item_id
GROUP BY state, item_id WITH ROLLUP
ORDER BY item_id, state
limit 100;

The value after the & is 2^ position of the expression in the group by.

Hope this article helped you. Please let me know if you know any other alternative methods to use GROUPING function in Apache Hive.

Read: