Apache Hive Correlated subquery is a query within a query that refer the columns from the outer query. Hive does support some of subqueris such as table subquery, WHERE clause subquery etc, and correlated subqueries. In most cases, the Hive correlated subqueries are used to improve the Hive query performance.
Above diagram clearly explains the correlated subqueries in case of relational databases and Apache Hive.
Read:
Apache Hive Correlated Subquery Examples
For example, consider query, “check if student id is already exists in the department table before selecting records from students table”. You can achieve this by correlating both tables and write HiveQL query.
Below is the example of Hive correlated subqueries:
select *
from stud
where exists (select 1
from dept
where dept.stud_id = stud.id);
This type of queries are called correlated subquery, because inner subquery in WHERE clause refers the student id from the outer or parent query.
Restriction on Apache Hive Correlated Subquery
If you choose to use the correlated subqueries in Hive for your requirements, you should keep in mind following restrictions.
- Use of correlated subqueries in Hive SET operations are not supported
- You cannot correlate the queries in aggregations with GROUP By and HAVING clauses
- Conditional statements such as CASE/WHEN expressions doesn’t support correlated subqueries
Inline Correlated Subquery
As of current version Apache Hive doesn’t support the inline correlated queries, that is, query written in an SELECT clause of parent or outer query.
Note that, Apache Hive community is actively working on adding this feature. This feature may avaiable in future release.
For example:
Look at the following example which is trying to get dept_name from dept table based on student id:
SELECT stud.name,
(SELECTdept_name
FROM dept
WHERE stud.id = dept.stud_id
) AS dep_name
FROM stud;
You will get below error if you try to execute above query.
FAILED: ParseExceptionline 2:5 cannot recognize input near '(' 'SELECT' 'dept_name' in expression specification
These types of correlated subqueries are supported in database like Oracle. However, Hive does not support as of current version.
You can rewrite the same query as follows. There are many ways to rewrite the correlated sub-queries, using JOIN is one approach.
Related Articles,
Hope this helps 🙂
SELECT stud.name,
(SELECTdept_name
FROM dept
WHERE stud.id = dept.stud_id
) AS dep_name
FROM stud;
Per your post, the code is not working. Could you please post the working codes? Thanks.
Hi,
Those type of correlated sub-queries are not supported in Hive. Alternate query is already provided in post.
Here is the same rewritten query:
hive> SELECT stud.name,dept.dep_name
> FROM stud
> LEFT JOIN dept
> on (stud.id = dept.stud_id)
> ;
Thanks