Spark SQL Correlated Subquery and Usage Restrictions

The Correlated subquery in a Spark SQL is a query within a query that refer the columns from the parent or outer query table. These kind of subquery contains one or more correlations between its columns and the columns produced by the outer query. Spark SQL supports the regular and correlated subqueries. You can use the subqueries to improve the performance of the Spark SQL queries such as limiting the number of records returned by the subquery.

Spark SQL Correlated Subquery

Spark SQL supports many types of subqueries. However, it supports only two types of correlated subqueries.

Correlated Subquery in WHERE Clause
Correlated Scalar Subquery

These two types of Spark SQL subqueries server different purpose. You can use them as per your requirements. For instance, use a scalar correlated subquery to when your requirement is to correlate two tables in SELECT clause.

Correlated Subquery in WHERE Clause

The correlated subqueries in WHERE clause are very common subquery in Spark SQL or any other relational databases. In these kind of subqueries, you can use equality or non-equality condition, EXISTS or NOT EXISTS operator and IN or NOT IN conditions.

Spark SQL Correlated Subquery with EXIST and NOT EXISTS Operator

The following example demonstrates a correlated subquery in the WHERE clause with EXISTS and NOT EXISTS operators.

-- Correlated subquery with EXISTS operator
SELECT *
FROM   t1 AS t1
WHERE  EXISTS (SELECT 1
               FROM   t2 AS t2
               WHERE  t1.col1 = t2.col1);
|col1|col2|
|----|----|
|1   |1   |
|2   |2   |
|3   |3   |

-- Correlated subquery with NOT EXISTS operator
SELECT *
FROM   t1 AS t1
WHERE NOT EXISTS (SELECT 1
               FROM   t2 AS t2
               WHERE  t1.col1 = t2.col1);
|col1|col2|
|----|----|
|5   |5   |

As you can see, subquery refers col1 from outer query i.e. t1.col1 hence correlated subqueries.

Spark SQL Correlated Subquery with IN and NOT IN Condition Operators

The following example demonstrates a correlated subquery in the WHERE clause with IN and NOT IN operators.

-- Correlated subquery with IN operator
SELECT *
FROM   t1 AS t1
WHERE t1.col1 in (SELECT t2.col1
               FROM   t2 AS t2
               WHERE  t1.col1 = t2.col1);
|col1|col2|
|----|----|
|1   |1   |
|2   |2   |
|3   |3   |

-- Correlated subquery with NOT IN operator
SELECT *
FROM   t1 AS t1
WHERE t1.col1 not in (SELECT t2.col1
               FROM   t2 AS t2
               WHERE  t1.col1 = t2.col1);
|col1|col2|
|----|----|
|5   |5   |

Spark SQL Correlated Subquery with equality or non-equality condition

The correlated subqueries with equality and non-equality conditions are also very common in Spark SQL as well as relational databases.

The following example demonstrates a correlated subquery in the WHERE clause with equality and non-equality condition.

-- Correlated subquery with equality condition
SELECT *
FROM   t1 AS t1
WHERE t1.col1 = (SELECT FIRST (t2.col1)
               FROM   t2 AS t2
               WHERE  t1.col1 = t2.col1);
|col1|col2|
|----|----|
|1   |1   |
|2   |2   |
|3   |3   |

-- Correlated subquery with non-equality condition
SELECT *
FROM   t1 AS t1
WHERE t1.col1 != (SELECT FIRST (t2.col1)
               FROM   t2 AS t2
               WHERE  t1.col1 = t2.col1);

Note that, it is mandatory to use aggregate functions such as FIRST, LAST, MAX, MIN, AVG on the column returned by subquery to make sure single row is returned. Otherwise, you will end up Spark error such as “Correlated scalar subqueries must be aggregated”

Correlated Scalar Subquery

A correlated scalar subquery is a subquery that returns exactly one column value from one row and refers column from the parent table. Similar to the correlated subqueries with equality and non-equality condition, correlated scalar subqueries must be aggregated. You should use aggregate function, for example, MAX in our following example.

The following example demonstrates a correlated scalar subquery.

-- correlated scalar subquery
SELECT col1,
       COALESCE ((SELECT Max(col2)
                  FROM   t1
                  WHERE  t1.col1 = t2.col1), 0) AS col2
FROM   t2;
|col1|col2|
|----|----|
|4   |0   |
|1   |1   |
|2   |2   |
|3   |3   |

As you can see, subquery refers col1 from outer query i.e. t1.col1 hence correlated subqueries.

Spark SQL Correlated Subquery Restrictions

If you choose to use the correlated subqueries in Spark SQL, you should keep in mind following restrictions in your mind.

You cannot use correlated subqueries in joins.
It is mandatory to use aggregate functions while using correlated subqueries with equality and non-equality conditions.
It is mandatory to use aggregate functions while using correlated scalar subqueries.
CASE/WHEN expressions don’t support correlated subqueries.