Apache Hive Correlated Subquery and it’s Restrictions

  • Post author:
  • Post last modified:June 7, 2021
  • Post category:BigData
  • Reading time:5 mins read

Apache Hive Correlated subquery is a query within a query that refer the columns from the outer query. Hive does support some of subqueris such as table subquery, WHERE clause subquery etc, and correlated subqueries. In most cases, the Hive correlated subqueries are used to improve the Hive query performance.

Apache Hive Correlated Subquery

Above diagram clearly explains the correlated subqueries in case of relational databases and Apache Hive.

Read:

Apache Hive Correlated Subquery Examples

For example, consider query, “check if student id is already exists in the department table before selecting records from students table”. You can achieve this by correlating both tables and write HiveQL query.

Below is the example of Hive correlated subqueries:

select * 
from stud 
where exists (select 1 
 from dept 
 where dept.stud_id = stud.id);
https://gist.github.com/71f582b82684faaf6705efb855e80ae4

This type of queries are called correlated subquery, because inner subquery in WHERE clause refers the student id from the outer or parent query.

Restriction on Apache Hive Correlated Subquery

If you choose to use the correlated subqueries in Hive for your requirements, you should keep in mind following restrictions.

  • Use of correlated subqueries in Hive SET operations are not supported
  • You cannot correlate the queries in aggregations with GROUP By and HAVING clauses
  • Conditional statements such as CASE/WHEN expressions doesn’t support correlated subqueries

Inline Correlated Subquery

As of current version Apache Hive doesn’t support the inline correlated queries, that is, query written in an SELECT clause of parent or outer query.

Note that, Apache Hive community is actively working on adding this feature. This feature may avaiable in future release.

For example:

Look at the following example which is trying to get dept_name from dept table based on student id:

SELECT stud.name,
(SELECTdept_name
FROM dept
WHERE stud.id = dept.stud_id
) AS dep_name
FROM stud;

You will get below error if you try to execute above query.

FAILED: ParseExceptionline 2:5 cannot recognize input near '(' 'SELECT' 'dept_name' in expression specification

These types of correlated subqueries are supported in database like Oracle. However, Hive does not support as of current version.

You can rewrite the same query as follows. There are many ways to rewrite the correlated sub-queries, using JOIN is one approach. 

https://gist.github.com/fbbf1576b29ae48d76de97a1fd8e80c4

Related Articles,

Hope this helps 🙂

This Post Has 2 Comments

  1. Sophia

    SELECT stud.name,
    (SELECTdept_name
    FROM dept
    WHERE stud.id = dept.stud_id
    ) AS dep_name
    FROM stud;

    Per your post, the code is not working. Could you please post the working codes? Thanks.

    1. Vithal S

      Hi,

      Those type of correlated sub-queries are not supported in Hive. Alternate query is already provided in post.
      Here is the same rewritten query:

      hive> SELECT stud.name,dept.dep_name
      > FROM stud
      > LEFT JOIN dept
      > on (stud.id = dept.stud_id)
      > ;

      Thanks

Comments are closed.