Spark SQL Analytic Functions and Examples

Spark SQL analytic functions, sometimes called Spark SQL window functions, compute an aggregate value based on a group of rows. These functions can optionally partition the rows based on a partition column given in the window specification. Like other analytic functions, such as Hive, Netezza, and Teradata analytic functions, Spark SQL analytic functions work on groups of rows, and they can optionally ignore NULL values in the data.
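
In general, a window function call in Spark SQL has the shape sketched below; the PARTITION BY, ORDER BY, and frame clauses inside OVER are each optional parts of the window specification:

function_name([arguments]) OVER (
    [PARTITION BY partition_columns]
    [ORDER BY sort_columns [ASC | DESC]]
    [ROWS | RANGE frame_clause]
)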

Spark SQL Analytic Functions

There are two types of Spark SQL window functions: ranking functions and analytic functions.

Spark SQL Ranking functions

Below is the list of functions that can be used to rank rows:

  • rank
  • dense_rank
  • percent_rank
  • ntile
  • row_number

Analytic functions

Below is the list of functions that can be used as analytic functions:

  • cume_dist
  • first_value
  • last_value
  • lag
  • lead

Now let us check the syntax and usage of these functions.
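
The examples in the following sections all run against a small patient table with pat_id, dept_id, and ins_amt columns. Reconstructed from the query outputs (note that pat_id 5 appears twice), the table might be created as in the sketch below; the column types are assumed:

-- column types assumed; values reconstructed from the outputs shown below
CREATE TABLE patient (
    pat_id  INT,
    dept_id INT,
    ins_amt INT
);

INSERT INTO patient VALUES
    (1, 111, 100000),
    (2, 111, 150000),
    (3, 222, 150000),
    (4, 222, 250000),
    (5, 111, 50000),
    (5, 222, 890000),
    (6, 111, 90000),
    (7, 333, 110000),
    (8, 444, 10000);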

Spark SQL Rank Analytic Function

The Spark SQL rank analytic function is used to get the rank of rows within a column or within a group. Rows with equal values receive the same rank, and the next rank value is skipped, leaving a gap. The rank analytic function is commonly used in top-N analysis.

Syntax:

RANK() OVER (window_spec)

Example:

The example below demonstrates the usage of the RANK analytic function in Spark SQL:

SELECT pat_id,
       dept_id,
       ins_amt,
       RANK() OVER (ORDER BY ins_amt) AS rk
FROM patient
ORDER BY rk;

Output:

8 444 10000 1
5 111 50000 2
6 111 90000 3
1 111 100000 4
7 333 110000 5
2 111 150000 6
3 222 150000 6
4 222 250000 8
5 222 890000 9
Time taken: 0.192 seconds, Fetched 9 row(s)
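
Because the rank is computed in the SELECT clause, a top-N query typically wraps the ranking in a subquery and filters on the rank. Below is a sketch that also adds PARTITION BY to rank within each department, returning the two highest insurance amounts per dept_id:

SELECT pat_id,
       dept_id,
       ins_amt
FROM (
    SELECT pat_id,
           dept_id,
           ins_amt,
           -- rank rows per department, highest amount first
           RANK() OVER (PARTITION BY dept_id ORDER BY ins_amt DESC) AS rk
    FROM patient
) ranked
WHERE rk <= 2;  -- keep the top 2 per department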

Spark SQL dense_rank Analytic Function

The Spark SQL dense_rank analytic function returns the rank of a value within a group. Rows with equal values for the ranking criteria receive the same rank, and ranks are assigned in sequential order, i.e. no rank values are skipped. The dense_rank analytic function is also used in top-N analysis.

Syntax:

DENSE_RANK() OVER (window_spec)

Example:

The example below demonstrates the usage of the DENSE_RANK analytic function in Spark SQL:

SELECT pat_id,
       dept_id,
       ins_amt,
       DENSE_RANK() OVER (ORDER BY ins_amt) AS rk
FROM patient
ORDER BY rk;

Output:

8 444 10000 1
5 111 50000 2
6 111 90000 3
1 111 100000 4
7 333 110000 5
2 111 150000 6
3 222 150000 6
4 222 250000 7
5 222 890000 8
Time taken: 0.662 seconds, Fetched 9 row(s)

Spark SQL percent_rank Analytic Function

The Spark SQL percent_rank analytic function returns the relative rank of a value within a group of values, computed as (rank - 1) / (number of rows in the partition - 1). The first row therefore always gets 0.0.

Syntax:

PERCENT_RANK() OVER (window_spec)

Example:

The example below demonstrates the usage of the PERCENT_RANK analytic function in Spark SQL:

SELECT pat_id,
       dept_id,
       ins_amt,
       PERCENT_RANK() OVER (ORDER BY ins_amt) AS rk
FROM patient
ORDER BY rk;

Output:

8 444 10000 0.0
5 111 50000 0.125
6 111 90000 0.25
1 111 100000 0.375
7 333 110000 0.5
2 111 150000 0.625
3 222 150000 0.625
4 222 250000 0.875
5 222 890000 1.0

Spark SQL NTILE Analytic Function

NTILE is a Spark SQL analytic function. It divides an ordered data set into the number of buckets indicated by expr and assigns the appropriate bucket number to each row. The buckets are numbered 1 through expr, and the expr value must resolve to a positive constant for each partition. If the rows do not divide evenly, the lower-numbered buckets receive one extra row each.

Syntax:

NTILE(expr) OVER (window_spec)

Example:

The example below demonstrates the usage of the NTILE analytic function in Spark SQL:

SELECT pat_id,
       dept_id,
       ins_amt,
       NTILE(3) OVER (ORDER BY ins_amt) AS rk
FROM patient
ORDER BY rk;

Output:

8 444 10000 1
5 111 50000 1
6 111 90000 1
1 111 100000 2
7 333 110000 2
2 111 150000 2
3 222 150000 3
4 222 250000 3
5 222 890000 3
Time taken: 0.323 seconds, Fetched 9 row(s)

Spark SQL row_number Analytic Function

The row_number analytic function assigns a unique sequential number to each row, or to each row within a group, based on the ordering specified in the OVER clause. Unlike rank, it never assigns the same number to two rows, even when their values tie.

Syntax:

ROW_NUMBER() OVER (window_spec)

Example:

The example below demonstrates the usage of the ROW_NUMBER analytic function in Spark SQL:

SELECT pat_id,
       dept_id,
       ins_amt,
       ROW_NUMBER() OVER (ORDER BY ins_amt) AS rk
FROM patient
ORDER BY rk;

Output:

8 444 10000 1
5 111 50000 2
6 111 90000 3
1 111 100000 4
7 333 110000 5
2 111 150000 6
3 222 150000 7
4 222 250000 8
5 222 890000 9
Time taken: 0.315 seconds, Fetched 9 row(s)
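
A common use of row_number is de-duplication: number the rows within each duplicate group and keep only the first. Below is a sketch that keeps a single row per pat_id, preferring the highest insurance amount; the choice of key and tie-breaker is illustrative:

SELECT pat_id,
       dept_id,
       ins_amt
FROM (
    SELECT pat_id,
           dept_id,
           ins_amt,
           -- number duplicates within each pat_id, highest amount first
           ROW_NUMBER() OVER (PARTITION BY pat_id ORDER BY ins_amt DESC) AS rn
    FROM patient
) numbered
WHERE rn = 1;  -- keep only the first row of each group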

Spark SQL cume_dist Analytic Function

The CUME_DIST function calculates the cumulative distribution of a value in a group of values, i.e. the number of rows with values less than or equal to the current row's value divided by the total number of rows. The range of values returned by CUME_DIST is greater than 0 and at most 1. Tied values always evaluate to the same cumulative distribution value.

Syntax:

CUME_DIST() OVER (window_spec)

Example:

The example below demonstrates the usage of the CUME_DIST analytic function in Spark SQL:

SELECT pat_id,
       dept_id,
       ins_amt,
       CUME_DIST() OVER (ORDER BY ins_amt) AS rk
FROM patient
ORDER BY rk;

Output:

8 444 10000 0.1111111111111111
5 111 50000 0.2222222222222222
6 111 90000 0.3333333333333333
1 111 100000 0.4444444444444444
7 333 110000 0.5555555555555556
2 111 150000 0.7777777777777778
3 222 150000 0.7777777777777778
4 222 250000 0.8888888888888888
5 222 890000 1.0
Time taken: 0.363 seconds, Fetched 9 row(s)

Spark SQL FIRST_VALUE and LAST_VALUE Analytic Function

You can use the Spark SQL first_value and last_value analytic functions to find the first and last value in a column, an expression, or a group of rows. You must specify the sort criteria to determine the first and last values. Note that with an ORDER BY clause the default window frame runs from the start of the partition to the current row, so last_value returns the current row's value rather than the overall last value in the partition; the output below reflects this.

Syntax:

FIRST_VALUE(column | expression) OVER (window_spec)
LAST_VALUE(column | expression) OVER (window_spec)

Example:

The example below demonstrates the usage of the FIRST_VALUE and LAST_VALUE analytic functions in Spark SQL:

SELECT pat_id,
       dept_id,
       ins_amt,
       FIRST_VALUE(ins_amt) OVER (PARTITION BY dept_id ORDER BY ins_amt) AS low_ins_amt,
       LAST_VALUE(ins_amt) OVER (PARTITION BY dept_id ORDER BY ins_amt) AS high_ins_amt
FROM patient;

Output:

7 333 110000 110000 110000
3 222 150000 150000 150000
4 222 250000 150000 250000
5 222 890000 150000 890000
5 111 50000 50000 50000
6 111 90000 50000 90000
1 111 100000 50000 100000
2 111 150000 50000 150000
8 444 10000 10000 10000
Time taken: 0.965 seconds, Fetched 9 row(s)
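
To make last_value return the true highest value in each department, extend the window frame to cover the whole partition with an explicit ROWS clause, as sketched below:

SELECT pat_id,
       dept_id,
       ins_amt,
       LAST_VALUE(ins_amt) OVER (
           PARTITION BY dept_id
           ORDER BY ins_amt
           -- frame spans the entire partition, not just up to the current row
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS high_ins_amt
FROM patient;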

Spark SQL LEAD and LAG Analytic Function

The LEAD and LAG Spark SQL analytic functions are used to compare different rows of a table by specifying an offset from the current row. You can use these functions to analyze change and variation in the data.

Syntax:

LEAD(column, offset, default) OVER (window_spec)
LAG(column, offset, default) OVER (window_spec)

The default value of offset is 1. Offset is the relative position of the row to be accessed. If there is no next/prior row to access, the LEAD/LAG function returns NULL; you can replace this NULL with another value by specifying the default argument.

Example:

The example below demonstrates the usage of the LEAD and LAG analytic functions in Spark SQL:

SELECT pat_id,
       dept_id,
       ins_amt,
       LEAD(ins_amt, 1, 0) OVER (PARTITION BY dept_id ORDER BY dept_id ASC) AS lead_ins_amt,
       LAG(ins_amt, 1, 0) OVER (PARTITION BY dept_id ORDER BY dept_id ASC) AS lag_ins_amt
FROM patient;

Output:

7 333 110000 0 0
4 222 250000 890000 0
5 222 890000 150000 250000
3 222 150000 0 890000
6 111 90000 150000 0
2 111 150000 50000 90000
5 111 50000 100000 150000
1 111 100000 0 50000
8 444 10000 0 0
Time taken: 0.708 seconds, Fetched 9 row(s)
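
Note that the query above orders by dept_id inside a partition that is itself defined by dept_id, so the row order within each department is arbitrary. To analyze change in a meaningful order, sort on another column. Below is a sketch that computes the increase over the previous insurance amount within each department; ordering by ins_amt is an illustrative choice:

SELECT pat_id,
       dept_id,
       ins_amt,
       -- difference from the previous amount; 0 is used for the first row
       ins_amt - LAG(ins_amt, 1, 0) OVER (
           PARTITION BY dept_id
           ORDER BY ins_amt
       ) AS amt_increase
FROM patient;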

Hope this helps 🙂