Hive DELETE FROM Table Alternative– Easy Steps

  • Post author:
  • Post last modified:June 2, 2019
  • Post category:BigData
  • Reading time:7 mins read

By definition, Data Warehouse is mechanism to store historical data in an easy accessible manner. Data may be updated to keep tables with up-to date records. This performance critical operation holds good when you plan to migrate your data warehouse to bigdata world. In this article, we will check one of the method to remove outdated records from Hive table i.e. Hive DELETE FROM table Alternative.  

Hive DELETE FROM Table Equivalents

Hive DELETE FROM Table Alternative

Apache Hive is not designed for online transaction processing and does not offer real-time queries and row level updates and deletes. However, the latest version of Apache Hive supports ACID transaction, but using ACID transaction on table with huge amount of data may kill the performance of Hive server. To use ACID transaction, one must create a table with ACID transaction properties set. Delete can be performed on the table that supports ACID. Instead, you can follow other easy steps such as create hive temporary table and select records from the original table by excluding data that you want to delete from table. Sounds easy!

Below are some of DELETE FROM table Equivalents:

  • Hive NOT IN to exclude records to be deleted
  • Hive NOT EXISTS to exclude records to be deleted

Related Articles:

Now let us verify above methods using some examples.

Hive LEFT JOIN as Workaround to Delete Records from Hive Table

Using Hive LEFT JOIN is one of the widely used work round to delete records from Hive tables. Hive LEFT JOIN will return all the records in the left table that do not match any records in the right table.

SQL Delete Query:

DELETE FROM pat_dtls_load 
 WHERE  sk IN (SELECT sk 
               FROM   new_pat_dtls_load 
               WHERE  NAME = 'Stuart') ;

Hive Equivalent Delete statement for IN conditions and subquery:

  • Create temp table same as target table:
Create table pat_dtls_load_temp like pat_dtls_load
  • Load data into temp table. This table holds all records except records to be deleted:
INSERT INTO pat_dtls_load_temp
select * FROM pat_dtls_load 
WHERE  sk NOT IN (SELECT sk 
              FROM   new_pat_dtls_load 
              WHERE  NAME = 'Jhoney') ;
  • Insert records from temp table into target table
INSERT OVERWRITE TABLE pat_dtls_load select * from pat_dtls_load_temp;
  • Drop temp table
Drop table pat_dtls_load_temp;

Hive NOT EXISTS as Workaround to Delete Records from Hive Table

Apache Hive supports EXISTS and NOT EXISTS clauses. You can make use of these keywords as a workaround to delete records from Hive tables.

Let us re-write previous SQL delete statement using NOT EXISTS.

SQL Delete Query:

DELETE FROM pat_dtls_load 
 WHERE  sk IN (SELECT sk 
               FROM   new_pat_dtls_load 
               WHERE  NAME = 'Stuart') ;

Hive Equivalent Delete statement using NOT EXISTS Clause:

  • Create temp table same as target table:
Create table pat_dtls_load_temp like pat_dtls_load;
  • Load data into temp table. This table holds all records except records to be deleted
INSERT INTO pat_dtls_load_temp 
SELECT * 
 FROM   pat_dtls_load aa 
 WHERE  NOT EXISTS (SELECT 1 
                    FROM   new_pat_dtls_load bb 
                    WHERE  ( aa.sk = bb.sk ) 
                           AND bb.NAME = 'ADDA'); 
  • Insert records from temp table into target table
INSERT OVERWRITE TABLE pat_dtls_load select * from pat_dtls_load_temp;
  • Drop temp table
Drop table pat_dtls_load_temp;

End Note

There are many methods that you can follow. Above mentioned methods are simple to understand and use. Try to use NOT EXISTS clause as it is faster and safer to use.

Hope this helps? Let me know if you know different method.