Cloudera Impala Merge Statement – UPSERT Command

The MERGE query or statement in SQL is used to perform incremental load. With the help of SQL MERGE statement, you can perform UPDATE and INSERT simultaneously based on the condition. i.e. you can update old values and insert new records. The merge command is widely used in incremental load where you have to update old records and insert new records if any. In this article, we will check Cloudera Impala Merge Statement along with some Impala native UPSERT command.

In SQL world, the merge statement is also referred to as UPSERT command.

SQL Merge Statement

You can use the SQL MERGE command or statement to insert, update, or delete rows in a target table using data from a source such as a table, view, or sub-query and based on rules specified in a matching condition in the merge statements.

You have to specify the rules what merge statement should do when condition is MATCHED or NOT MATCHED.

For examples, let us check sample SQL merge statement that we will be converting to Impala equivalent set of queries later.

MERGE INTO merge_table1 A 
using merge_table2 B 
ON ( A.id = b.id ) 
WHEN matched THEN 
  UPDATE SET A.lastname = B.lastname 
WHEN NOT matched THEN 
  INSERT (id, 
          firstname, 
          lastname) 
  VALUES (B.id, 
          B.firstname, 
          B.lastname);

As you can see in the above example, Merge statement performs an update and insert operations on a target table based on the joining condition and matching rules. The join can match a source row with only one target row.

SQL MERGE command is not available in Cloudera impala. We have to identify the alternative method to handle both MATCHED and NOT MATCHED conditions.

How to Update Impala Table?

Test data

Create sample tables on Impala.

--This table holds the records
CREATE TABLE merge_table1
(
ID INT,
FirstName string,
LastName string
);

-- Reference table
CREATE TABLE merge_table2
(
ID INT,
FirstName string,
LastName string
);

Insert sample records to tables.

--Sample records inserted to table
INSERT INTO merge_table11 VALUES (1, 'aaaa', 'bbbb');
INSERT INTO merge_table11 VALUES (2, 'cccc', 'dddd');
INSERT INTO merge_table11 VALUES (3, 'eeee', 'ffff');
INSERT INTO merge_table11 VALUES (4, 'gggg', 'hhhh');
INSERT INTO merge_table11 VALUES (5, 'iiii', 'jjjj');

--Let us say there is last name change for 'cccc' and 'eeee'
INSERT INTO merge_table2 VALUES (2, 'cccc', 'kkkk');
INSERT INTO merge_table2 VALUES (3, 'eeee', 'llll');

Merge Statement in Impala

There are couple of way you can re-implement the SQL merge statement in Cloudera Impala.

Merge Statement Alternative Approach
Impala UPSERT Statement

Let us check these two options in details.

Impala Merge Statement Alternative Examples

In this section, we will check SQL merge statement alternative in Impala. Merge statement in Impala is rewritten into multiple steps to handle both MATCHED and NOT MATCHED conditions:

For example,

-- Drop temp table if exists
DROP TABLE IF EXISTS merge_table1wmmergeupdate; 

-- Create temporary tables to hold merge records
CREATE TABLE merge_table1wmmergeupdate LIKE merge_table1; 

-- Insert records when condition is MATCHED
INSERT INTO table merge_table1WMMergeUpdate 
SELECT A.id        AS ID, 
       A.firstname AS FirstName, 
       CASE 
         WHEN B.id IS NOT NULL THEN B.lastname 
         ELSE A.lastname 
       end         AS LastName 
FROM   merge_table1 AS A 
       LEFT OUTER JOIN merge_table2 AS B 
                    ON A.id = B.id; 

-- Insert records when condition is NOT MATCHED
INSERT INTO merge_table1wmmergeupdate 
SELECT B.id        AS ID, 
       B.firstname AS FirstName, 
       B.lastname  AS LastName 
FROM   merge_table2 AS B 
       LEFT OUTER JOIN merge_table1wmmergeupdate AS A 
                    ON A.id = B.id 
WHERE  A.id IS NULL; 

-- Drop origianal table
DROP TABLE IF EXISTS merge_table1; 

-- Rename temp table to origianal table
ALTER TABLE merge_table1wmmergeupdate 
  RENAME TO merge_table1;

Results

Original Table:

+----+-----------+----------+
| id | firstname | lastname |
+----+-----------+----------+
| 1  | aaaa      | bbbb     |
| 2  | cccc      | dddd     |
| 3  | eeee      | ffff     |
| 4  | gggg      | hhhh     |
| 5  | iiii      | jjjj     |
+----+-----------+----------+

Merged Tables:

+----+-----------+----------+
| id | firstname | lastname |
+----+-----------+----------+
| 1  | aaaa      | bbbb     |
| 2  | cccc      | kkkk     |
| 3  | eeee      | llll     |
| 4  | gggg      | hhhh     |
| 5  | iiii      | jjjj     |
+----+-----------+----------+

Impala UPSERT Statement

The CDH 5.10 and above supports UPSERT statement. UPSERT statement is Cloudera implementation of SQL basic merge statement. You can insert and update records using UPSERT but delete is not yet supported. UPSERT statement will work only on the kudu tables. Because Kudu tables can efficiently handle small incremental changes, the VALUES clause is more practical to use with Kudu tables than with HDFS-based tables. You can’t use it in normal Impala or Hive tables.

Impala UPSERT Statement Syntax

Below is the CDH UPSERT statement:

UPSERT INTO [TABLE] [db_name.]table_name
  [(column_list)]
{
    select_statement
  | VALUES (value [, value ...]) [, (value [, value ...]) ...]
}

Impala UPSERT Statement Example

Below is the example of Impala UPSER statement:

UPSERT INTO merge_table1 values (1, 'aaaa', 'bbbb'),(1, 'aaab', 'bbba');

You can use any of the above methods based on your requirements.

Hope this helps ?