Window functions are commonly used analytical functions in Spark SQL queries. COUNT is one such window function: it lets you count rows over a defined window. Many relational databases, such as Oracle, support the COUNT window function with the DISTINCT keyword. However, Spark SQL does not support a count distinct window function.
Spark SQL Count Distinct Window Function
Spark SQL supports the COUNT window function. However, the COUNT window function with the DISTINCT keyword is not supported as of now.
Following is an example of an Oracle COUNT window function with the DISTINCT keyword. To get all of the data plus the number of distinct items per day, you might use this window function:
SELECT
  count(DISTINCT item) OVER (PARTITION BY purchase_dt) AS distinct_cnt
FROM
  test_count_distinct
ORDER BY 1;
|DISTINCT_CNT|
|------------|
|           1|
|           3|
|           3|
|           3|
|           5|
|           5|
|           5|
|           5|
|           5|
However, if you try to execute the COUNT analytic function with the DISTINCT keyword in Spark SQL, you will end up with the following error:
spark.sql("select count( distinct item) over (partition by purchase_dt order by null) distinct_cnt from test_count_distinct").show()
org.apache.spark.sql.AnalysisException: Distinct window functions are not supported
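To make the goal concrete, here is what the window function is supposed to compute, sketched in plain Python. The two-column schema (purchase_dt, item) and the sample values are assumptions inferred from the queries and outputs in this article, not the actual test data:

```python
from collections import defaultdict

# Hypothetical sample rows: (purchase_dt, item), schema inferred from the queries
rows = [
    ("2023-01-01", "A"),
    ("2023-01-02", "A"), ("2023-01-02", "B"), ("2023-01-02", "C"),
    ("2023-01-03", "A"), ("2023-01-03", "B"), ("2023-01-03", "C"),
    ("2023-01-03", "D"), ("2023-01-03", "E"),
]

# Collect the distinct items seen in each partition (purchase_dt)
items_per_day = defaultdict(set)
for dt, item in rows:
    items_per_day[dt].add(item)

# A count distinct window function attaches the partition's distinct
# count to every row of that partition
distinct_cnt = sorted(len(items_per_day[dt]) for dt, _ in rows)
print(distinct_cnt)  # [1, 3, 3, 3, 5, 5, 5, 5, 5]
```

Every row in a partition carries the same value, which is why the output columns in this article repeat 1, 3, and 5.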
Spark SQL Count Distinct Window Function Alternative Methods
There are a couple of methods that you can use as alternatives to a Spark SQL count distinct window function:
- Spark SQL approx_count_distinct Window Function as a Count Distinct Alternative
- Spark SQL DENSE_RANK() Window function as a Count Distinct Alternative
Now, let us check these with examples.
Spark SQL approx_count_distinct Window Function as a Count Distinct Alternative
The approx_count_distinct window function returns the estimated number of distinct values in a column within the group. The following Spark SQL example uses the approx_count_distinct window function to return the distinct count.
SELECT
  approx_count_distinct(item) OVER (PARTITION BY purchase_dt) AS distinct_cnt
FROM
  test_count_distinct
ORDER BY 1;
+------------+
|distinct_cnt|
+------------+
|           1|
|           3|
|           3|
|           3|
|           5|
|           5|
|           5|
|           5|
|           5|
+------------+
Note that, as the name suggests, approx_count_distinct returns only an approximate distinct count, which may differ from the actual count. The estimate is computed with the HyperLogLog++ algorithm, and you can tune the maximum allowed estimation error through the function's optional relativeSD argument.
Spark SQL DENSE_RANK() Window function as a Count Distinct Alternative
The Spark SQL DENSE_RANK analytic function returns the rank of rows within a group. Rows with equal values receive the same rank, and, unlike RANK, no rank values are skipped. Because of this, the maximum dense rank within a group equals the number of distinct values in that group.
The following example takes the maximum dense_rank value per partition and returns it as the count of distinct values within the group.
WITH CTE AS (
  SELECT
    purchase_dt,
    DENSE_RANK() OVER (PARTITION BY purchase_dt ORDER BY item) AS dense_rank
  FROM
    test_count_distinct)
SELECT
  max(dense_rank) OVER (PARTITION BY purchase_dt) AS distinct_count
FROM
  CTE
ORDER BY 1;
+--------------+
|distinct_count|
+--------------+
| 1|
| 3|
| 3|
| 3|
| 5|
| 5|
| 5|
| 5|
| 5|
+--------------+
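Outside Spark, the dense-rank trick reduces to ranking the distinct items within each partition and taking the maximum rank. A minimal pure-Python sketch, using a hypothetical (purchase_dt, item) sample inferred from the queries in this article:

```python
from collections import defaultdict

# Hypothetical sample rows: (purchase_dt, item)
rows = [
    ("2023-01-01", "A"),
    ("2023-01-02", "A"), ("2023-01-02", "B"), ("2023-01-02", "C"),
]

# Group items by partition key, as PARTITION BY purchase_dt would
by_day = defaultdict(list)
for dt, item in rows:
    by_day[dt].append(item)

# DENSE_RANK() OVER (PARTITION BY purchase_dt ORDER BY item): consecutive
# ranks over the sorted distinct items, with no gaps
ranks = {}
for dt, items in by_day.items():
    for rank, item in enumerate(sorted(set(items)), start=1):
        ranks[(dt, item)] = rank

# max(dense_rank) per partition equals the partition's distinct item count
distinct_count = {dt: max(ranks[(dt, i)] for i in items)
                  for dt, items in by_day.items()}
print(distinct_count)  # {'2023-01-01': 1, '2023-01-02': 3}
```

Because DENSE_RANK leaves no gaps, the highest rank in a partition is exactly the number of distinct ORDER BY values, which is what makes this an exact (not approximate) alternative.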
Related Articles,
- Spark SQL Cumulative Average Function and Examples
- How to Remove Duplicate Records from Spark DataFrame – Pyspark and Scala
- Cumulative Sum Function in Spark SQL and Examples
Hope this helps 🙂