A subquery in Spark SQL is a select expression that is enclosed in parentheses as a nested query block in a query statement. The subquery in Apache Spark SQL is similar to subquery in other relational databases that may return zero to one or more values to its upper select statements. In this article, we will check Apache Spark SQL supported subqueries and some examples.
Spark SQL Supported Subqueries
Spark SQL subqueries are another select statement or expression enclosed in parenthesis as a nested query block.
You can use these nested query blocks in any of the following Spark SQL:
- SELECT
- CREATE TABLE AS
- INSERT INTO
The upper query or parent query that contains the subquery is called a super query or outer query. You can use subqueries to calculate or derive values that will be used in outer SELECT statements. For example, parent query can use a subquery in the WHERE clause to restrict rows.
Types of Spark SQL Subqueries
Spark SQL does not support all types of subqueries that are available in relational databases such as Oracle, Teradata, etc.
Following are the subqueries that supported in Spark SQL.
- Table Subquery
- Scalar Subquery Expressions
- Subquery in WHERE clause
- Correlated Subquery
- Correlated Scalar Subquery
Table Subquery
You can write the subquery in place of table name. Consider following example which uses subquery in place of table name.
SELECT *
FROM (SELECT *
FROM t1) AS tab;
|col1|col2|
|----|----|
|1 |1 |
|2 |2 |
|3 |3 |
Scalar Subquery Expressions
A scalar subquery expression is a subquery that returns exactly one column value from one row. The value of the scalar subquery expression is the value of the select list item of the subquery. If the subquery returns 0 rows, then the value of the scalar subquery expression is NULL
.
Consider following example of Scalar subquery expression in Spark SQL.
SELECT col2,
(SELECT Max(col1)
FROM t1) AS col1
FROM t2;
|col2|col1|
|----|----|
|10 |3 |
|20 |3 |
|30 |3 |
As you can see, the scalar subquery returns exactly one row and one column. If the subquery returns more than one row, then Spark SQL returns an error “Scalar subquery must return only one column”.
Subquery in WHERE Clause
Spark SQL supports writing a subquery in a WHERE clause. These types of subqueries are very common in query statements. The relational databases such as Oracle, Teradata return the single value or multiple values from the query in a WHERE clause. But, as of now, Spark SQL supports single value from the query in a WHERE clause.
Consider following query with a subquery in a WHERE clause.
SELECT *
FROM t1
WHERE col1 IN (SELECT col1
FROM t2);
|col1|col2|
|----|----|
|1 |1 |
|2 |2 |
|3 |3 |
Note, if you are returning single values, use equality operator otherwise IN operator.
These types of subqueries are sometimes called predicate subqueries. Predicate subqueries are predicates in which the operand is a subquery
Correlated Subquery
Spark SQL provides limited support to correlated sub queries. Correlated subqueries are queries in which subquery refers to the column from parent table clause.
Consider following example, in which subquery refers column col1 from the parent query in its WHERE clause.
SELECT *
FROM t1 AS t1
WHERE EXISTS (SELECT 1
FROM t2 AS t2
WHERE t1.col1 = t2.col1);
|col1|col2|
|----|----|
|1 |1 |
|2 |2 |
|3 |3 |
Correlated Scalar Subquery
A correlated scalar subquery is a subquery that returns exactly one column value from one row and refers column from the parent table. The Correlated scalar subqueries must be aggregated. You should use aggregate function, for example, MAX in our following example.
SELECT col1,
COALESCE ((SELECT Max(col2)
FROM t1
WHERE t1.col1 = t2.col1), 0) AS col2
FROM t2;
|col1|col2|
|----|----|
|1 |1 |
|2 |2 |
|3 |3 |
Spark supports only equi-join when using correlated subqueries. Otherwise, you will end up getting “Correlated column is not allowed in a non-equality predicate” error.
Related Articles,
- Apache Hive Supported Subqueries and Examples
- Spark SQL and Dataset Hints Types- Usage and Examples
- Spark SQL Date and Timestamp Functions and Examples
Hope this helps 🙂