Cloudera Impala and Apache Hive provide a better way to manage structured and semi-structured data on Hadoop ecosystem. Both frameworks make use of HDFS as a storage mechanism to store data. The HDFS architecture is not intended to update files, it is designed for batch processing. i.e. process huge amount of data. But most of the organizations are maintaining a data warehouse on traditional relation databases like Netezza, Teradata, Oracle, etc. When they migrate their data warehouse to Hadoop ecosystem, they might want to have a design similar to that of RDBMS. In this article, we will check how to handle update Impala table so that you can directly migrate your data warehouse with little changes to your existing SQL queries.
Related Article
How to Update Impala Table?
There are many methods that you can use to update Impala tables. As Impala uses Hive metastore, you can consider a method to update Hive table that we discussed in another post.
You can consider below approaches to update Impala Table:
- Impala Update Command on Kudu Tables
- Update Impala Table using Intermediate or Temporary Tables
Impala Update Command on Kudu Tables
There are many advantages when you create tables in Impala using Apache Kudu as a storage format. You can use Impala Update command to update an arbitrary number of rows in a Kudu table. This statement only works for Impala tables that use the Kudu storage engine.
Impala Update Command Syntax
Below is the syntax of Impala update statements:
UPDATE [database_name.]table_name SET col = val [, col = val ... ]
[ FROM joined_table_refs ]
[ WHERE where_conditions ];
For more information refer Cloudera Impala official documentation.
If you are using any other file format then you might want to go through below alternate approach.
Update Impala Table using Temporary Tables
Below example explain steps to update Impala table using temporary or intermediate tables:
Let us consider you want to update col2 of table1 by taking data from staging table2. Below is the sample Teradata SQL update statement:
update a from tabl1 a, table2 b
set col2 = b.col2
where a.col1=b.col1;
Step 1: Drop temporary table if it is already exists
Use DROP IF EXISTS command to drop temporary table if it is already exists in the Impala:
DROP TABLE IF EXISTS table1Temp;
Step 2: Create intermediate table structure same as original table (i.e. table1)
Use Impala LIKE statement along with CREATE TABLE:
CREATE TABLE table1Temp LIKE table1;
Step 3: Insert data into temporary table with updated records
Join table2 along with table1 to get updated records and insert data into temporary table that you create in step2:
INSERT INTO TABLE table1Temp SELECT a.col1, COALESCE( b.col2 , a.col2) AS col2
FROM table1 a
LEFT OUTER JOIN table2 b
ON ( a.col1 = b.col1);
Step 4: DROP original managed table
In this step, drop original managed table that holds old data.
DROP TABLE table1;
Step 5: Rename temporary table to original table
Now the final step is to rename temporary table to original table
ALTER TABLE table1Temp RENAME TO tabl1e;
Instead of dropping original table, you can use INSERT OVERWRITE to INSERT data into original table and then drop intermediate table after cross validation.
Hope this helps 🙂