Hive Self Join Query, Performance and Optimization

  • Post author:
  • Post last modified:November 14, 2019
  • Post category:BigData
  • Reading time:5 mins read

By definition, self join is a join in which a table is joined itself. Self joins are usually used only when there is a parent child relationship in the given data. In this article, we will check how to write self join query in the Hive, its performance issues and how to optimize it.

Hive Self Join Query, Performance and Optimization

Hive Self Join Query

As mentioned earlier, self join is used when there is parent-child relation between your data.

For example, consider an employee table. an employee table contains details about the employees and an employee can be manager too. You can use the self join to identify the employee and supervisor.

+------+-------+-------+--+
| eid  | Name  |  mid  |
+------+-------+-------+--+
| 100  | AAA   | 101   |
| 101  | BBB   | NULL  |
| 102  | CCC   | 101   |
| 103  | DDD   | 102   |
| 104  | EEE   | 103   |
| 10   | FFF   | 101   |
+------+-------+-------+--+

Now, let us write a query to get employee and his/her manager name. Following is the query that you can write on Hive or Impala.

select EMPL.EID, EMPL.NAME, MANAGER.NAME 
from  employee_manager EMPL, employee_manager MANAGER
where EMPL.eid = MANAGER.mid;
+------+-------+---------------+--+
| EID  | NAME  | MANAGER_NAME  |
+------+-------+---------------+--+
| 101  | BBB   | FFF           |
| 101  | BBB   | CCC           |
| 101  | BBB   | AAA           |
| 102  | CCC   | DDD           |
| 103  | DDD   | EEE           |
+------+-------+---------------+--+

Hive Self Join Using Result of Query

I had one of the requirements to self join the result of the query. You can use the Hive with clause to create CTE and use that CTE in your self join query.

WITH sub_query AS(
    SELECT * FROM employee_manager
)
select EMPL.EID, EMPL.NAME, MANAGER.NAME as MANAGER_NAME
from  sub_query EMPL, sub_query MANAGER
where EMPL.eid = MANAGER.mid;

Instead of using CTE, you can also create Hive temporary tables. You may see gain in performance if you use temp tables. You can read more about temp tables in m other post: Hadoop Hive Create Temp Table, Syntax and Examples

Related Articles

Self Join Performance and Optimization

The self joins in Hive affects the performance of the query if you are joining big tables. Follow below basic optimization techniques may help when working with self joins. These performance improvement techniques applies to SQL queries as well.

  • Avoiding using a self join on the big table
  • Create temp table with fewer records that you want to join
  • Filter out unnecessary data before joining
  • If possible, create index on the joining column

Related Articles

Hope this helps 🙂