Hive Incremental Load Options and Examples

Post author:Vithal S
Post last modified:November 2, 2019
Post category:BigData
Reading time:5 mins read

The incremental load is very common in a data warehouse environment. Incremental load is commonly used to implement slowly changing dimensions. When you migrate your data to the Hadoop Hive, you might usually keep the slowly changing tables to sync up tables with the latest data. In this article, we will check Hadoop Hive incremental load options and some examples.

Hive Incremental Load Options

There are many methods you can use. Apache Hive introduced to ACID supports since Hive 0.14. Following are the couple of methods that you can use to implement incremental update Apache Hive table.

Use MERGE Statement
Use FULL OUTER JOIN

Note that, you need to set Hive shell properties in order to updated using ACID mode. You will find details about properties in my other post – Apache Hive update table using ACID and Transaction tables.

Hive Incremental Load using MERGE Statement

The Hive MERGE statement supports SQL 2003 standard. It is available starting in Hive 2.2. If your Hive version is anything but above version 2.2, you can use MERGE statement to perform incremental load.

The MERGE statement will first check if row is available in Hive table. It will be updated if it is available, otherwise new record will be inserted.

For examples, consider to tables ’employee’ and ’empl’ tables. Check if data from ’empl’ table is available in ’employee’ table. Update it will latest value if present, otherwise insert that record.

MERGE INTO employee
USING empl z 
ON employee1.ID=z.ID
WHEN MATCHED THEN 
UPDATE SET id=z.ID
WHEN NOT MATCHED THEN
INSERT VALUES ( z.ID, 1);

FULL OUTER JOIN to Perform Incremental Update

If you are using Hive version lower than 2.2 then you cannot update Hive table in ACID mode using MERGE. You can use FULL OUTER JOIN to update records. To find all entries that will be updated you need to join increment data with old data.

It will work same as MERGE statement. You can use this as an alternative method to MERGE statement.

For example, consider same example that we have used in case of MERGE statement. Following is the example.

insert overwrite employee 
SELECT
  case when i.E_EmpNo is not null then i.E_EmpNo   else t.E_EmpNo   end as E_EmpNo
  -- all other column from employee table
  FROM 
      employee t 
      FULL JOIN empl i on (t.E_EmpNo=i.E_EmpNo);

Hope this helps 🙂

Tags: Hive, HiveQL, SCD