The incremental load is very common in a data warehouse environment. Incremental load is commonly used to implement slowly changing dimensions. When you migrate your data to the Hadoop Hive, you might usually keep the slowly changing tables to sync up tables with the latest data. In this article, we will check Hadoop Hive incremental load options and some examples.
Hive Incremental Load Options
There are many methods you can use. Apache Hive introduced to ACID supports since Hive 0.14. Following are the couple of methods that you can use to implement incremental update Apache Hive table.
- Use MERGE Statement
- Use FULL OUTER JOIN
Note that, you need to set Hive shell properties in order to updated using ACID mode. You will find details about properties in my other post – Apache Hive update table using ACID and Transaction tables.
Hive Incremental Load using MERGE Statement
The Hive MERGE statement supports SQL 2003 standard. It is available starting in Hive 2.2. If your Hive version is anything but above version 2.2, you can use MERGE statement to perform incremental load.
The MERGE statement will first check if row is available in Hive table. It will be updated if it is available, otherwise new record will be inserted.
For examples, consider to tables ’employee’ and ’empl’ tables. Check if data from ’empl’ table is available in ’employee’ table. Update it will latest value if present, otherwise insert that record.
MERGE INTO employee
USING empl z
ON employee1.ID=z.ID
WHEN MATCHED THEN
UPDATE SET id=z.ID
WHEN NOT MATCHED THEN
INSERT VALUES ( z.ID, 1);
FULL OUTER JOIN to Perform Incremental Update
If you are using Hive version lower than 2.2 then you cannot update Hive table in ACID mode using MERGE. You can use FULL OUTER JOIN to update records. To find all entries that will be updated you need to join increment data with old data.
It will work same as MERGE statement. You can use this as an alternative method to MERGE statement.
For example, consider same example that we have used in case of MERGE statement. Following is the example.
insert overwrite employee
SELECT
case when i.E_EmpNo is not null then i.E_EmpNo else t.E_EmpNo end as E_EmpNo
-- all other column from employee table
FROM
employee t
FULL JOIN empl i on (t.E_EmpNo=i.E_EmpNo);
Related Articles
- Hive MERGE Statement Alternative and Example
- Hive Join Types and Examples
- Hadoop Hive Transactional Table Update join and Example
Hope this helps 🙂