Apache Spark SQL Supported Subqueries and Examples

A subquery in Spark SQL is a select expression that is enclosed in parentheses as a nested query block in a query statement. The subquery in Apache Spark SQL is similar to subquery in other relational databases that may return zero to one or more values to its upper select statements. In this article, we will check Apache Spark SQL supported subqueries and some examples.

Spark SQL Supported Subqueries

Spark SQL subqueries are another select statement or expression enclosed in parenthesis as a nested query block.

You can use these nested query blocks in any of the following Spark SQL:

SELECT
CREATE TABLE AS
INSERT INTO

The upper query or parent query that contains the subquery is called a super query or outer query. You can use subqueries to calculate or derive values that will be used in outer SELECT statements. For example, parent query can use a subquery in the WHERE clause to restrict rows.

Types of Spark SQL Subqueries

Spark SQL does not support all types of subqueries that are available in relational databases such as Oracle, Teradata, etc.

Following are the subqueries that supported in Spark SQL.

Table Subquery
Scalar Subquery Expressions
Subquery in WHERE clause
Correlated Subquery
Correlated Scalar Subquery

Table Subquery

You can write the subquery in place of table name. Consider following example which uses subquery in place of table name.

SELECT *
FROM   (SELECT *
        FROM   t1) AS tab; 
|col1|col2|
|----|----|
|1   |1   |
|2   |2   |
|3   |3   |

Scalar Subquery Expressions

A scalar subquery expression is a subquery that returns exactly one column value from one row. The value of the scalar subquery expression is the value of the select list item of the subquery. If the subquery returns 0 rows, then the value of the scalar subquery expression is NULL.

Consider following example of Scalar subquery expression in Spark SQL.

SELECT col2,
       (SELECT Max(col1)
        FROM   t1) AS col1
FROM   t2; 
|col2|col1|
|----|----|
|10  |3   |
|20  |3   |
|30  |3   |

As you can see, the scalar subquery returns exactly one row and one column. If the subquery returns more than one row, then Spark SQL returns an error “Scalar subquery must return only one column”.

Subquery in WHERE Clause

Spark SQL supports writing a subquery in a WHERE clause. These types of subqueries are very common in query statements. The relational databases such as Oracle, Teradata return the single value or multiple values from the query in a WHERE clause. But, as of now, Spark SQL supports single value from the query in a WHERE clause.

Consider following query with a subquery in a WHERE clause.

SELECT *
FROM   t1
WHERE  col1 IN (SELECT col1
                FROM   t2); 
|col1|col2|
|----|----|
|1   |1   |
|2   |2   |
|3   |3   |

Note, if you are returning single values, use equality operator otherwise IN operator.

These types of subqueries are sometimes called predicate subqueries. Predicate subqueries are predicates in which the operand is a subquery

Correlated Subquery

Spark SQL provides limited support to correlated sub queries. Correlated subqueries are queries in which subquery refers to the column from parent table clause.

Consider following example, in which subquery refers column col1 from the parent query in its WHERE clause.

SELECT *
FROM   t1 AS t1
WHERE  EXISTS (SELECT 1
               FROM   t2 AS t2
               WHERE  t1.col1 = t2.col1);
|col1|col2|
|----|----|
|1   |1   |
|2   |2   |
|3   |3   |

Correlated Scalar Subquery

A correlated scalar subquery is a subquery that returns exactly one column value from one row and refers column from the parent table. The Correlated scalar subqueries must be aggregated. You should use aggregate function, for example, MAX in our following example.

SELECT col1,
       COALESCE ((SELECT Max(col2)
                  FROM   t1
                  WHERE  t1.col1 = t2.col1), 0) AS col2
FROM   t2;
|col1|col2|
|----|----|
|1   |1   |
|2   |2   |
|3   |3   |

Spark supports only equi-join when using correlated subqueries. Otherwise, you will end up getting “Correlated column is not allowed in a non-equality predicate” error.