Spark SQL Analytic Functions and Examples

Spark SQL analytic functions, sometimes called Spark SQL window functions, compute an aggregate value based on a group of rows. These functions can optionally partition the rows based on a partition column given in the window specification. Like other analytic functions, such as Hive, Netezza, and Teradata analytic functions, Spark SQL analytic functions work on groups of rows, and they can optionally ignore NULL values in the data.
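
In general, a window function call in Spark SQL has the shape sketched below; the PARTITION BY, ORDER BY, and frame clauses inside OVER are each optional parts of the window specification:

function_name([arguments]) OVER (
    [PARTITION BY partition_columns]
    [ORDER BY sort_columns [ASC | DESC]]
    [ROWS | RANGE frame_clause]
)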

Spark SQL Analytic Functions

There are two types of Spark SQL window functions: ranking functions and analytic functions.

Spark SQL Ranking functions

Below is the list of functions that can be used to rank rows:

  • rank
  • dense_rank
  • percent_rank
  • ntile
  • row_number

Analytic functions

Below is the list of functions that can be used as analytic functions:

  • cume_dist
  • first_value
  • last_value
  • lag
  • lead

Now let us check the syntax and usage of these functions.
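
The examples in the following sections all run against a small patient table with pat_id, dept_id, and ins_amt columns. Reconstructed from the query outputs (note that pat_id 5 appears twice), the table might be created as in the sketch below; the column types are assumed:

-- column types assumed; values reconstructed from the outputs shown below
CREATE TABLE patient (
    pat_id  INT,
    dept_id INT,
    ins_amt INT
);

INSERT INTO patient VALUES
    (1, 111, 100000),
    (2, 111, 150000),
    (3, 222, 150000),
    (4, 222, 250000),
    (5, 111, 50000),
    (5, 222, 890000),
    (6, 111, 90000),
    (7, 333, 110000),
    (8, 444, 10000);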

Spark SQL Rank Analytic Function

The Spark SQL rank analytic function is used to get the rank of rows within a column or within a group. Rows with equal values receive the same rank, and the next rank value is skipped, leaving a gap. The rank analytic function is commonly used in top-N analysis.

Syntax:

RANK() OVER (window_spec)

Example:

The example below demonstrates the usage of the RANK analytic function in Spark SQL:

SELECT pat_id,
       dept_id,
       ins_amt,
       RANK() OVER (ORDER BY ins_amt) AS rk
FROM patient
ORDER BY rk;

Output:

8 444 10000 1
5 111 50000 2
6 111 90000 3
1 111 100000 4
7 333 110000 5
2 111 150000 6
3 222 150000 6
4 222 250000 8
5 222 890000 9
Time taken: 0.192 seconds, Fetched 9 row(s)
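
Because the rank is computed in the SELECT clause, a top-N query typically wraps the ranking in a subquery and filters on the rank. Below is a sketch that also adds PARTITION BY to rank within each department, returning the two highest insurance amounts per dept_id:

SELECT pat_id,
       dept_id,
       ins_amt
FROM (
    SELECT pat_id,
           dept_id,
           ins_amt,
           -- rank rows per department, highest amount first
           RANK() OVER (PARTITION BY dept_id ORDER BY ins_amt DESC) AS rk
    FROM patient
) ranked
WHERE rk <= 2;  -- keep the top 2 per department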

Spark SQL dense_rank Analytic Function

The Spark SQL dense_rank analytic function returns the rank of a value within a group. Rows with equal values for the ranking criteria receive the same rank, and ranks are assigned in sequential order, i.e. no rank values are skipped. The dense_rank analytic function is also used in top-N analysis.

Syntax:

DENSE_RANK() OVER (window_spec)

Example:

The example below demonstrates the usage of the DENSE_RANK analytic function in Spark SQL:

SELECT pat_id,
       dept_id,
       ins_amt,
       DENSE_RANK() OVER (ORDER BY ins_amt) AS rk
FROM patient
ORDER BY rk;

Output:

8 444 10000 1
5 111 50000 2
6 111 90000 3
1 111 100000 4
7 333 110000 5
2 111 150000 6
3 222 150000 6
4 222 250000 7
5 222 890000 8
Time taken: 0.662 seconds, Fetched 9 row(s)

Spark SQL percent_rank Analytic Function

The Spark SQL percent_rank analytic function returns the relative rank of a value within a group of values, computed as (rank - 1) / (number of rows in the partition - 1). The first row therefore always gets 0.0.

Syntax:

PERCENT_RANK() OVER (window_spec)

Example:

The example below demonstrates the usage of the PERCENT_RANK analytic function in Spark SQL:

SELECT pat_id,
       dept_id,
       ins_amt,
       PERCENT_RANK() OVER (ORDER BY ins_amt) AS rk
FROM patient
ORDER BY rk;

Output:

8 444 10000 0.0
5 111 50000 0.125
6 111 90000 0.25
1 111 100000 0.375
7 333 110000 0.5
2 111 150000 0.625
3 222 150000 0.625
4 222 250000 0.875
5 222 890000 1.0

Spark SQL NTILE Analytic Function

NTILE is a Spark SQL analytic function. It divides an ordered data set into the number of buckets indicated by expr and assigns the appropriate bucket number to each row. The buckets are numbered 1 through expr, and the expr value must resolve to a positive constant for each partition. If the rows do not divide evenly, the lower-numbered buckets receive one extra row each.

Syntax:

NTILE(expr) OVER (window_spec)

Example:

The example below demonstrates the usage of the NTILE analytic function in Spark SQL:

SELECT pat_id,
       dept_id,
       ins_amt,
       NTILE(3) OVER (ORDER BY ins_amt) AS rk
FROM patient
ORDER BY rk;

Output:

8 444 10000 1
5 111 50000 1
6 111 90000 1
1 111 100000 2
7 333 110000 2
2 111 150000 2
3 222 150000 3
4 222 250000 3
5 222 890000 3
Time taken: 0.323 seconds, Fetched 9 row(s)

Spark SQL row_number Analytic Function

The row_number analytic function assigns a unique sequential number to each row, or to each row within a group, based on the ordering specified in the OVER clause. Unlike rank, it never assigns the same number to two rows, even when their values tie.

Syntax:

ROW_NUMBER() OVER (window_spec)

Example:

The example below demonstrates the usage of the ROW_NUMBER analytic function in Spark SQL:

SELECT pat_id,
       dept_id,
       ins_amt,
       ROW_NUMBER() OVER (ORDER BY ins_amt) AS rk
FROM patient
ORDER BY rk;

Output:

8 444 10000 1
5 111 50000 2
6 111 90000 3
1 111 100000 4
7 333 110000 5
2 111 150000 6
3 222 150000 7
4 222 250000 8
5 222 890000 9
Time taken: 0.315 seconds, Fetched 9 row(s)
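
A common use of row_number is de-duplication: number the rows within each duplicate group and keep only the first. Below is a sketch that keeps a single row per pat_id, preferring the highest insurance amount; the choice of key and tie-breaker is illustrative:

SELECT pat_id,
       dept_id,
       ins_amt
FROM (
    SELECT pat_id,
           dept_id,
           ins_amt,
           -- number duplicates within each pat_id, highest amount first
           ROW_NUMBER() OVER (PARTITION BY pat_id ORDER BY ins_amt DESC) AS rn
    FROM patient
) numbered
WHERE rn = 1;  -- keep only the first row of each group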

Spark SQL cume_dist Analytic Function

The CUME_DIST function calculates the cumulative distribution of a value in a group of values, i.e. the number of rows with values less than or equal to the current row's value divided by the total number of rows. The range of values returned by CUME_DIST is greater than 0 and at most 1. Tied values always evaluate to the same cumulative distribution value.

Syntax:

CUME_DIST() OVER (window_spec)

Example:

The example below demonstrates the usage of the CUME_DIST analytic function in Spark SQL:

SELECT pat_id,
       dept_id,
       ins_amt,
       CUME_DIST() OVER (ORDER BY ins_amt) AS rk
FROM patient
ORDER BY rk;

Output:

8 444 10000 0.1111111111111111
5 111 50000 0.2222222222222222
6 111 90000 0.3333333333333333
1 111 100000 0.4444444444444444
7 333 110000 0.5555555555555556
2 111 150000 0.7777777777777778
3 222 150000 0.7777777777777778
4 222 250000 0.8888888888888888
5 222 890000 1.0
Time taken: 0.363 seconds, Fetched 9 row(s)

Spark SQL FIRST_VALUE and LAST_VALUE Analytic Function

You can use the Spark SQL first_value and last_value analytic functions to find the first and last value in a column, an expression, or a group of rows. You must specify the sort criteria to determine the first and last values. Note that with an ORDER BY clause the default window frame runs from the start of the partition to the current row, so last_value returns the current row's value rather than the overall last value in the partition; the output below reflects this.

Syntax:

FIRST_VALUE(column | expression) OVER (window_spec)
LAST_VALUE(column | expression) OVER (window_spec)

Example:

The example below demonstrates the usage of the FIRST_VALUE and LAST_VALUE analytic functions in Spark SQL:

SELECT pat_id,
       dept_id,
       ins_amt,
       FIRST_VALUE(ins_amt) OVER (PARTITION BY dept_id ORDER BY ins_amt) AS low_ins_amt,
       LAST_VALUE(ins_amt) OVER (PARTITION BY dept_id ORDER BY ins_amt) AS high_ins_amt
FROM patient;

Output:

7 333 110000 110000 110000
3 222 150000 150000 150000
4 222 250000 150000 250000
5 222 890000 150000 890000
5 111 50000 50000 50000
6 111 90000 50000 90000
1 111 100000 50000 100000
2 111 150000 50000 150000
8 444 10000 10000 10000
Time taken: 0.965 seconds, Fetched 9 row(s)
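
To make last_value return the true highest value in each department, extend the window frame to cover the whole partition with an explicit ROWS clause, as sketched below:

SELECT pat_id,
       dept_id,
       ins_amt,
       LAST_VALUE(ins_amt) OVER (
           PARTITION BY dept_id
           ORDER BY ins_amt
           -- frame spans the entire partition, not just up to the current row
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS high_ins_amt
FROM patient;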

Spark SQL LEAD and LAG Analytic Function

The LEAD and LAG Spark SQL analytic functions are used to compare different rows of a table by specifying an offset from the current row. You can use these functions to analyze change and variation in the data.

Syntax:

LEAD(column, offset, default) OVER (window_spec)
LAG(column, offset, default) OVER (window_spec)

The default value of offset is 1. Offset is the relative position of the row to be accessed. If there is no next/prior row to access, the LEAD/LAG function returns NULL; you can replace this NULL with another value by specifying the default argument.

Example:

The example below demonstrates the usage of the LEAD and LAG analytic functions in Spark SQL:

SELECT pat_id,
       dept_id,
       ins_amt,
       LEAD(ins_amt, 1, 0) OVER (PARTITION BY dept_id ORDER BY dept_id ASC) AS lead_ins_amt,
       LAG(ins_amt, 1, 0) OVER (PARTITION BY dept_id ORDER BY dept_id ASC) AS lag_ins_amt
FROM patient;

Output:

7 333 110000 0 0
4 222 250000 890000 0
5 222 890000 150000 250000
3 222 150000 0 890000
6 111 90000 150000 0
2 111 150000 50000 90000
5 111 50000 100000 150000
1 111 100000 0 50000
8 444 10000 0 0
Time taken: 0.708 seconds, Fetched 9 row(s)
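
Note that the query above orders by dept_id inside a partition that is itself defined by dept_id, so the row order within each department is arbitrary. To analyze change in a meaningful order, sort on another column. Below is a sketch that computes the increase over the previous insurance amount within each department; ordering by ins_amt is an illustrative choice:

SELECT pat_id,
       dept_id,
       ins_amt,
       -- difference from the previous amount; 0 is used for the first row
       ins_amt - LAG(ins_amt, 1, 0) OVER (
           PARTITION BY dept_id
           ORDER BY ins_amt
       ) AS amt_increase
FROM patient;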

Hope this helps 🙂