Hadoop Hive Analytic Functions and Examples

Hadoop Hive analytic functions compute an aggregate value that is based on a group of rows. A Hadoop Hive HQL analytic function works on the group of rows and ignores the NULL in the data if you specify.

Hadoop Hive analytic functions

Latest Hive version includes many useful functions that can perform day to day aggregation. Note that, Hive is batch query processing engine and hence take more time to execute.

Read:

Hadoop Hive COUNT Analytic Function

Returns number of rows in query or group of rows.

Syntax:

COUNT(column reference | value expression | *) over(window_spec)

For Example;

Below example explains the Hive count analytic function:

select pat_id, 
dept_id, 
count(*) over (partition by dept_id order by dept_id asc) as pat_cnt 
from patient;

pat_id	dept_id	pat_cnt
6	111	4
2	111	4
5	111	4
1	111	4
4	222	3
5	222	3
3	222	3
7	333	1
8	444	1

Hadoop Hive SUM Analytic Function

Just like count function, sum Hive analytic function is used to compute the sum of columns or expression. Sum analytic function is used to compute the sum of all rows of table or rows within the groups.

Syntax:

SUM(column | expression) OVER( window_spec)

For example:

Calculate sum insured amount of all patients within each department. Query and output as follows:

select pat_id, 
dept_id, 
ins_amt, 
sum(ins_amt) over (partition by dept_id order by dept_id asc) as total_ins_amt 
from patient ;

pat_id	dept_id	ins_amt	total_ins_amt
6	111	90000	390000
2	111	150000	390000
5	111	50000	390000
1	111	100000	390000
4	222	250000	1290000
5	222	890000	1290000
3	222	150000	1290000
7	333	110000	110000
8	444	10000	10000

Hadoop Hive MIN and MAX Analytic Function

Like the Hive HQL MIN and MAX functions, Hadoop Hive analytic MIN and MAX functions are used to compute the MIN and MAX of the rows in the column or expression and on rows within group.

Syntax:

MIN(column | expression) OVER( window_spec)
MAX(column | expression) OVER( window_spec)

For example:

Calculate Min and Max of insured amount of all patients within each department. Query and output as follows:

select pat_id, 
dept_id, 
ins_amt, 
min(ins_amt) over (partition by dept_id order by dept_id asc) as min_ins_amt, 
max(ins_amt) over (partition by dept_id order by dept_id asc) as max_ins_amt 
from patient ;

pat_id	dept_id	ins_amt	min_ins_amt	max_ins_amt
6	111	90000	50000	150000
2	111	150000	50000	150000
5	111	50000	50000	150000
1	111	100000	50000	150000
4	222	250000	150000	890000
5	222	890000	150000	890000
3	222	150000	150000	890000
7	333	110000	110000	110000
8	444	10000	10000	10000

Hadoop Hive LEAD and LAG Analytic Function

Lead and Lag Hadoop Hive analytic functions used to compare different rows of a table by specifying an offset from the current row. You can use these functions to analyze change and variation in the data.

Syntax:

LEAD(column, offset, default) OVER( window_spec)LAG(column, offset, default) OVER( window_spec)

The default value of offset is 1. Offset is the relative position of the row to be accessed. If there is no row next/prior to access the LEAD/LAG function returns NULL, You can change this NULL value by specifying the “default” values.

For example;

Get the insured amount of the patient later and prior than the current rows in each department. Query and output as follows:

select pat_id, 
dept_id, 
ins_amt, 
lead(ins_amt,1,0) over (partition by dept_id order by dept_id asc ) as lead_ins_amt, 
lag(ins_amt,1,0) over (partition by dept_id order by dept_id asc ) as lag_ins_amt 
from patient;

pat_id	dept_id	ins_amt	lead_ins_amt	lag_ins_amt
6	111	90000	150000	0
2	111	150000	50000	90000
5	111	50000	100000	150000
1	111	100000	0	50000
4	222	250000	890000	0
5	222	890000	150000	250000
3	222	150000	0	890000
7	333	110000	0	0
8	444	10000	0	0

Hadoop Hive FIRST_VALUE and LAST_VALUE Analytic Function

You can use the Hadoop Hive first_value and last_value analytic functions to find the first value and last value in a column or expression or within group of rows. You must specify the sort criteria to determine the first and last values.

Syntax:

FIRST_VALUE(column | expression) OVER( window_spec)LAST_VALUE(column | expression) OVER( window_spec)

For example;

Compute the lowest and highest insured patients in each department. Query and output as follows:

select pat_id, 
dept_id, 
ins_amt, 
first_value(ins_amt) over (partition by dept_id order by ins_amt ) as low_ins_amt, 
last_value(ins_amt) over (partition by dept_id order by ins_amt ) as high_ins_amt 
from patient;

pat_id	dept_id	ins_amt	low_ins_amt	high_ins_amt
5	111	50000	50000	50000
6	111	90000	50000	90000
1	111	100000	50000	100000
2	111	150000	50000	150000
3	222	150000	150000	150000
4	222	250000	150000	250000
5	222	890000	150000	890000
7	333	110000	110000	110000
8	444	10000	10000	10000

Hadoop Hive ROW_NUMBER, RANK and DENSE_RANK Analytical Functions

The row_number Hive analytic function is used to assign unique values to each row or rows within group based on the column values used in OVER clause.

The Rank Hive analytic function is used to get rank of the rows in column or within group. Rows with equal values receive the same rank with next rank value skipped. The rank analytic function is used in top n analysis.

The Dense rank Hive function returns the rank of a value in a group. Rows with equal values for ranking criteria receive the same rank and assign rank in sequential order i.e. no rank values are skipped. The rank analytic function is used in top n analysis

Syntax:

ROW_NUMBER() OVER( window_spec),RANK() OVER( window_spec),DENSE_RANK() OVER( window_spec)

For example;

Assign row number, rank on insured amount using Hadoop Hive analytic functions. Query and output as follows:

select pat_id, 
dept_id, 
ins_amt, 
row_number() over (order by ins_amt) as rn, 
rank() over (order by ins_amt ) as rk, 
dense_rank() over (order by ins_amt ) as dense_rk 
from patient;

pat_id	dept_id	ins_amt	rn	rk	dense_rk
8	444	10000	1	1	1
5	111	50000	2	2	2
6	111	90000	3	3	3
1	111	100000	4	4	4
7	333	110000	5	5	5
2	111	150000	6	6	6
3	222	150000	7	6	6
4	222	250000	8	8	7
5	222	890000	9	9	8

Read:

This Post Has 5 Comments

Rahul Das June 25, 2018

Can I have this data set which used in this post.
1. Vithal S June 26, 2018
  
  Hi Rahul,
  
  You can find test data in another post. here is the link.
  
  Thanks
  1. Adnan Ahmed September 29, 2018
    
    Great stuff. I couldn’t find the download link of dataset. Can you please provide download link?
    1. Vithal S October 1, 2018
      
      Hi,
      
      Data set is not available with me. I have mocked up data to execute Hive functions. You can also mock up data set to check our of functions.
      
      Thanks
Ganesh August 6, 2018

Thanks for useful stuff 🙂

Comments are closed.