Impala Delete from Tables and Alternative Steps

  • Post author:
  • Post last modified:June 18, 2019
  • Post category:BigData
  • Reading time:8 mins read

Data warehouse stores the information in the form of tables. You may have to delete out-dated data and update the table’s values in order to keep data up-to-date. These performance critical operations are critical to keep the data warehouse on bigdata also when you migrate data from relational database systems.  In this article, we will check Impala delete from tables and alternative examples.

Impala Delete from Table Command

Cloudera Impala version 5.10 and above supports DELETE FROM table command on kudu storage. This command deletes an arbitrary number of rows from a Kudu table. This statement only works for Impala tables that use the Kudu storage engine.

Syntax

Below is the delete from table syntax:

DELETE [FROM] [database_name.]table_name [ WHERE where_conditions ];

As mentioned, Impala Delete command will works only on the Kudu storage types. In a bigdata warehouse it is unlikely to use kudu as storage format for all tables. You may choose different tables storage formats based on your requirements and data.

Related Articles

Impala Delete from Tables Alternative

Any storage format other then Kudu are not designed for online transaction processing and does not offer real-time queries and row level updates and deletes. However, if you access Hive table from Impala, the latest version of Apache Hive supports ACID transaction, but using ACID transaction on table with huge amount of data may kill the performance of Hive server. To use ACID transaction, one must create a table with ACID transaction properties set. Delete can be performed on the table that supports ACID. Instead, you can follow other easy steps such as create Impala temporary table and select records from the original table by excluding data that you want to delete from table. Sounds easy!

Related articles:

Below are some of the DELETE FROM table Equivalents:

  • Impala NOT IN to exclude records to be deleted
  • Impala NOT EXISTS to exclude records to be deleted

Now let us verify above methods using some examples.

SQL Delete Query:

DELETE FROM delete_test_demo 
WHERE  ID = 1;

We will write Impala equivalent statement for above SQL delete query.

Impala NOT IN as Workaround to Delete Records from Impala Table

You can use NOT IN condition to exclude the records that you want to delete from table.

Below are the steps to delete records from Impala table using not in condition:

  • Create temp table same as target table:
Create table delete_test_demo_temp like delete_test_demo;
  • Load data into temp table. This table holds all records except records to be deleted:
INSERT INTO delete_test_demo_temp
select * FROM delete_test_demo 
WHERE  id NOT IN (SELECT id 
              FROM   delete_test_demo 
              WHERE  ID = 1 ) ;
  • Insert records from temp table into target table
INSERT OVERWRITE TABLE delete_test_demo select * from delete_test_demo_temp;
  • Drop temp table
Drop table delete_test_demo_temp;

Impala NOT EXISTS as Workaround to Delete Records from Impala Table

Cloudera Impala supports EXISTS and NOT EXISTS clauses. You can make use of these keywords as a workaround to delete records from impala tables.

Let us re-write previous SQL delete statement using NOT EXISTS.

  • Create temp table same as target table
 Create table delete_test_demo_temp like delete_test_demo; 
  • Load data into temp table. This table holds all records except records to be deleted
INSERT INTO delete_test_demo_temp 
SELECT * 
 FROM   delete_test_demo aa 
 WHERE  NOT EXISTS (SELECT 1 
                    FROM   delete_test_demo bb 
                    WHERE  ( aa.id = bb.id) 
                           AND bb.id = 2);
  • Insert records from temp table into target table
INSERT OVERWRITE TABLE delete_test_demo select * from delete_test_demo_temp;
  • Drop temp table
Drop table delete_test_demo_temp;

End Note

There are many methods that you can follow. Above mentioned methods are simple to understand and use. Try to use NOT EXISTS clause as it is faster and safer to use as it uses correlated sub-query.

Hope this helps :-). Let me know if you know different method .