Window functions are commonly used analytical functions in Spark SQL queries. COUNT is one such window function: it lets you count rows over a defined window. Many relational databases, such as Oracle, support the COUNT window function with the DISTINCT keyword. However, Spark SQL does not support a count distinct window function.
Spark SQL Count Distinct Window Function
Spark SQL supports the COUNT window function. However, the COUNT window function with the DISTINCT keyword is not supported as of now.
Following is an example of an Oracle COUNT window function with the DISTINCT keyword. To get all of the data plus the number of distinct items per day, you might use this window function:
SELECT
  count(DISTINCT item) OVER (PARTITION BY purchase_dt) AS distinct_cnt
FROM
  test_count_distinct
ORDER BY 1;
|DISTINCT_CNT|
|------------|
|           1|
|           3|
|           3|
|           3|
|           5|
|           5|
|           5|
|           5|
|           5|
However, if you try to execute the COUNT analytic function with the DISTINCT keyword in Spark SQL, you will end up with the following error:
spark.sql("select count( distinct item) over (partition by purchase_dt order by null) distinct_cnt from test_count_distinct").show()
org.apache.spark.sql.AnalysisException: Distinct window functions are not supported
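To make the goal concrete, here is what the window function is supposed to compute, sketched in plain Python. The two-column schema (purchase_dt, item) and the sample values are assumptions inferred from the queries and outputs in this article, not the actual test data:

```python
from collections import defaultdict

# Hypothetical sample rows: (purchase_dt, item), schema inferred from the queries
rows = [
    ("2023-01-01", "A"),
    ("2023-01-02", "A"), ("2023-01-02", "B"), ("2023-01-02", "C"),
    ("2023-01-03", "A"), ("2023-01-03", "B"), ("2023-01-03", "C"),
    ("2023-01-03", "D"), ("2023-01-03", "E"),
]

# Collect the distinct items seen in each partition (purchase_dt)
items_per_day = defaultdict(set)
for dt, item in rows:
    items_per_day[dt].add(item)

# A count distinct window function attaches the partition's distinct
# count to every row of that partition
distinct_cnt = sorted(len(items_per_day[dt]) for dt, _ in rows)
print(distinct_cnt)  # [1, 3, 3, 3, 5, 5, 5, 5, 5]
```

Every row in a partition carries the same value, which is why the output columns in this article repeat 1, 3, and 5.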
Spark SQL Count Distinct Window Function Alternative Methods
There are a couple of methods that you can use as alternatives to a Spark SQL count distinct window function:
- Spark SQL approx_count_distinct Window Function as a Count Distinct Alternative
- Spark SQL DENSE_RANK() Window function as a Count Distinct Alternative
Now, let us check these with examples.
Spark SQL approx_count_distinct Window Function as a Count Distinct Alternative
The approx_count_distinct window function returns the estimated number of distinct values in a column within the group. The following Spark SQL example uses the approx_count_distinct window function to return the distinct count.
SELECT
  approx_count_distinct(item) OVER (PARTITION BY purchase_dt) AS distinct_cnt
FROM
  test_count_distinct
ORDER BY 1;
+------------+
|distinct_cnt|
+------------+
|           1|
|           3|
|           3|
|           3|
|           5|
|           5|
|           5|
|           5|
|           5|
+------------+
Note that, as the name suggests, approx_count_distinct returns only an approximate distinct count, which may differ from the actual count. The estimate is computed with the HyperLogLog++ algorithm, and you can tune the maximum allowed estimation error through the function's optional relativeSD argument.
Spark SQL DENSE_RANK() Window function as a Count Distinct Alternative
The Spark SQL DENSE_RANK analytic function returns the rank of rows within a group. Rows with equal values receive the same rank, and, unlike RANK, no rank values are skipped. Because of this, the maximum dense rank within a group equals the number of distinct values in that group.
The following example takes the maximum dense_rank value per partition and returns it as the count of distinct values within the group.
WITH CTE AS (
  SELECT
    purchase_dt,
    DENSE_RANK() OVER (PARTITION BY purchase_dt ORDER BY item) AS dense_rank
  FROM
    test_count_distinct)
SELECT
  max(dense_rank) OVER (PARTITION BY purchase_dt) AS distinct_count
FROM
  CTE
ORDER BY 1;
+--------------+
|distinct_count|
+--------------+
| 1|
| 3|
| 3|
| 3|
| 5|
| 5|
| 5|
| 5|
| 5|
+--------------+
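Outside Spark, the dense-rank trick reduces to ranking the distinct items within each partition and taking the maximum rank. A minimal pure-Python sketch, using a hypothetical (purchase_dt, item) sample inferred from the queries in this article:

```python
from collections import defaultdict

# Hypothetical sample rows: (purchase_dt, item)
rows = [
    ("2023-01-01", "A"),
    ("2023-01-02", "A"), ("2023-01-02", "B"), ("2023-01-02", "C"),
]

# Group items by partition key, as PARTITION BY purchase_dt would
by_day = defaultdict(list)
for dt, item in rows:
    by_day[dt].append(item)

# DENSE_RANK() OVER (PARTITION BY purchase_dt ORDER BY item): consecutive
# ranks over the sorted distinct items, with no gaps
ranks = {}
for dt, items in by_day.items():
    for rank, item in enumerate(sorted(set(items)), start=1):
        ranks[(dt, item)] = rank

# max(dense_rank) per partition equals the partition's distinct item count
distinct_count = {dt: max(ranks[(dt, i)] for i in items)
                  for dt, items in by_day.items()}
print(distinct_count)  # {'2023-01-01': 1, '2023-01-02': 3}
```

Because DENSE_RANK leaves no gaps, the highest rank in a partition is exactly the number of distinct ORDER BY values, which is what makes this an exact (not approximate) alternative.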
Related Articles,
- Spark SQL Cumulative Average Function and Examples
- How to Remove Duplicate Records from Spark DataFrame – Pyspark and Scala
- Cumulative Sum Function in Spark SQL and Examples
Hope this helps 🙂