AWS Redshift Correlated Subquery and its Restrictions

Redshift Correlated subquery is a query within a query that refer the columns from the parent or outer query. This kind of subquery contains one or more correlations between its columns and the columns produced by the outer query. Redshift does support the regular and correlated subqueries. In most cases, the Redshift correlated subqueries are used to improve the SQL query performance.

netezza correlated subquery, Redshift Correlated Subquery

Page Contents

Introduction to Redshift Correlated Subquery

What is Redshift Correlated Subquery?

Examples of Redshift Correlated Subquery

Redshift Correlated Subquery Performance Considerations

Redshift Correlated Subquery Restrictions

Conclusion

Introduction to Redshift Correlated Subquery

A correlated subquery is a type of subquery in Amazon Redshift where the inner(child) query references a column from the outer (parent) query. In the correlated subquery the inner query is evaluated for each row of the outer query, making the subquery dependent on the outer query.

Correlated subqueries are used to perform complex tasks that cannot be accomplished with a single query. They are important in SQL because they allow for the retrieval of data from multiple tables and help to simplify queries that would otherwise require multiple joins. In this post, we will focus on Redshift correlated subqueries and explore their definition, differences from other subqueries, benefits, and performance considerations.

What is Redshift Correlated Subquery?

Correlated Subquery in Amazon Redshift is similar to other correlated subqueries in SQL, in that it references a column from the outer query in the inner query. However, Redshift correlated subqueries are optimized for performance in large-scale data warehousing.

Redshift correlated subqueries are useful for complex tasks with multiple tables and complex filtering conditions. It allows the user to reference data from an outer query within an inner query, providing a more efficient and streamlined way to perform complex data analysis.

In Redshift, you can use these subqueries in SELECT, UPDATE, DELETE, and INSERT statements. Redshift also allows for the use of scalar subqueries within correlated.

Examples of Redshift Correlated Subquery

The following example contains a correlated subquery in the WHERE clause of the query; In our examples, the correlation is where p.id = h.p_id. For each row that the outer query produces, the subquery is executed to qualify or disqualify the row.

select name, h_id 
from patients p 
where id = (select p_id from pat_hospital h where p.id = h.p_id);
 name | h_id
------+------
 B    |  222
 D    |  444
 E    |  555
 A    |  111
 C    |  333
(5 rows)

You can also use EXISTS clause in the outer query WHERE predicates. For example;

select name, h_id 
from patients p 
where exists (select 1 from pat_hospital h where p.id = h.p_id);
 name | h_id
------+------
 B    |  222
 D    |  444
 E    |  555
 A    |  111
 C    |  333
(5 rows)

As you can see inner subquery in WHERE clause refers the id from the outer or parent query, hence it is a correlated subquery.

Redshift Correlated Scalar Subquery Example

Redshift correlated scalar subquery is a subquery that returns exactly one column value from one row and refers column from the parent table.

The following example demonstrates a correlated scalar subquery.

-- correlated scalar subquery
SELECT col1,
       COALESCE ((SELECT Max(col2)
                  FROM   t1
                  WHERE  t1.col1 = t2.col1), 0) AS col2
FROM   t2;
|col1|col2|
|----|----|
|4   |0   |
|1   |1   |
|2   |2   |
|3   |3   |

Redshift Correlated Subquery Performance Considerations

When working with Redshift correlated subqueries, there are several best practices to keep in mind to ensure optimal performance and efficiency:

Limit the number of correlated subqueries: Correlated subqueries may slow down query execution time. Try to limit the number of correlated subqueries in your queries, if possible.
Use proper sorting key: Ensure that the tables are sorted using proper sorting column to improve query performance.
Optimize table distribution: Ensure that tables are distributed and sorted properly to minimize data movement during query execution.
Use EXPLAIN to evaluate query performance: Use the EXPLAIN command to evaluate query execution plans and identify areas for optimization.

You can improve the performance by including these simple techniques.

Redshift Correlated Subquery Restrictions

If you choose to use the correlated subqueries in Redshift, you should keep in mind following restrictions as some of correlated subqueries are not supported in Redshift.

You can use correlated subqueries in WHERE clause.
You can use the correlated subqueries in inner join conditions and only with equal condition operators.
Correlation references from within a subquery that contains a window function is not supported.
Cannot refer results of a correlated subquery in GROUP BY clause.
CASE/WHEN expressions doesn’t support correlated subqueries.
You cannot use correlated subqueries in IN conditions.

Conclusion

In conclusion, the Redshift Correlated Subquery is a one of the type of subqueries that allows users to efficiently execute complex queries with multiple levels of nesting. By utilizing subqueries that reference outer query values, you can filter and aggregate data as per your requirement with simple syntax. Overall, the Redshift Correlated Subquery can increase the performance of the queries when you follow certain best practices and know the limitations.

Hope this helps 🙂