The Correlated subquery in a Spark SQL is a query within a query that refer the columns from the parent or outer query table. These kind of subquery contains one or more correlations between its columns and the columns produced by the outer query. Spark SQL supports the regular and correlated subqueries. You can use the subqueries to improve the performance of the Spark SQL queries such as limiting the number of records returned by the subquery.
Spark SQL Correlated Subquery
Spark SQL supports many types of subqueries. However, it supports only two types of correlated subqueries.
These two types of Spark SQL subqueries server different purpose. You can use them as per your requirements. For instance, use a scalar correlated subquery to when your requirement is to correlate two tables in SELECT clause.
Correlated Subquery in WHERE Clause
The correlated subqueries in WHERE clause are very common subquery in Spark SQL or any other relational databases. In these kind of subqueries, you can use equality or non-equality condition, EXISTS or NOT EXISTS operator and IN or NOT IN conditions.
Spark SQL Correlated Subquery with EXIST and NOT EXISTS Operator
The following example demonstrates a correlated subquery in the WHERE clause with EXISTS and NOT EXISTS operators.
-- Correlated subquery with EXISTS operator
SELECT *
FROM t1 AS t1
WHERE EXISTS (SELECT 1
FROM t2 AS t2
WHERE t1.col1 = t2.col1);
|col1|col2|
|----|----|
|1 |1 |
|2 |2 |
|3 |3 |
-- Correlated subquery with NOT EXISTS operator
SELECT *
FROM t1 AS t1
WHERE NOT EXISTS (SELECT 1
FROM t2 AS t2
WHERE t1.col1 = t2.col1);
|col1|col2|
|----|----|
|5 |5 |
As you can see, subquery refers col1
from outer query i.e. t1.col1
hence correlated subqueries.
Spark SQL Correlated Subquery with IN and NOT IN Condition Operators
The following example demonstrates a correlated subquery in the WHERE clause with IN and NOT IN operators.
-- Correlated subquery with IN operator
SELECT *
FROM t1 AS t1
WHERE t1.col1 in (SELECT t2.col1
FROM t2 AS t2
WHERE t1.col1 = t2.col1);
|col1|col2|
|----|----|
|1 |1 |
|2 |2 |
|3 |3 |
-- Correlated subquery with NOT IN operator
SELECT *
FROM t1 AS t1
WHERE t1.col1 not in (SELECT t2.col1
FROM t2 AS t2
WHERE t1.col1 = t2.col1);
|col1|col2|
|----|----|
|5 |5 |
Spark SQL Correlated Subquery with equality or non-equality condition
The correlated subqueries with equality and non-equality conditions are also very common in Spark SQL as well as relational databases.
The following example demonstrates a correlated subquery in the WHERE clause with equality and non-equality condition.
-- Correlated subquery with equality condition
SELECT *
FROM t1 AS t1
WHERE t1.col1 = (SELECT FIRST (t2.col1)
FROM t2 AS t2
WHERE t1.col1 = t2.col1);
|col1|col2|
|----|----|
|1 |1 |
|2 |2 |
|3 |3 |
-- Correlated subquery with non-equality condition
SELECT *
FROM t1 AS t1
WHERE t1.col1 != (SELECT FIRST (t2.col1)
FROM t2 AS t2
WHERE t1.col1 = t2.col1);
Note that, it is mandatory to use aggregate functions such as FIRST, LAST, MAX, MIN, AVG on the column returned by subquery to make sure single row is returned. Otherwise, you will end up Spark error such as “Correlated scalar subqueries must be aggregated”
Correlated Scalar Subquery
A correlated scalar subquery is a subquery that returns exactly one column value from one row and refers column from the parent table. Similar to the correlated subqueries with equality and non-equality condition, correlated scalar subqueries must be aggregated. You should use aggregate function, for example, MAX in our following example.
The following example demonstrates a correlated scalar subquery.
-- correlated scalar subquery
SELECT col1,
COALESCE ((SELECT Max(col2)
FROM t1
WHERE t1.col1 = t2.col1), 0) AS col2
FROM t2;
|col1|col2|
|----|----|
|4 |0 |
|1 |1 |
|2 |2 |
|3 |3 |
As you can see, subquery refers col1
from outer query i.e. t1.col1
hence correlated subqueries.
Spark SQL Correlated Subquery Restrictions
If you choose to use the correlated subqueries in Spark SQL, you should keep in mind following restrictions in your mind.
- You cannot use correlated subqueries in joins.
- It is mandatory to use aggregate functions while using correlated subqueries with equality and non-equality conditions.
- It is mandatory to use aggregate functions while using correlated scalar subqueries.
- CASE/WHEN expressions don’t support correlated subqueries.
Related Articles,
- Apache Spark SQL Supported Subqueries and Examples
- Apache Hive Correlated Subquery and it’s Restrictions
hope this helps 🙂