Identify and Remove Duplicate Records from Hive Table

Apache Hive being batch processing engine, does not support primary, foreign or unique key constraints. You can insert the duplicate records in the Hive table. There are no constraints to ensure uniqueness or primary key, but if you have a table and have loaded data twice, then you can de-duplicate in several ways. Below methods explain you how to identify and Remove duplicate records or rows from Hive table.

Remove Duplicate Records from Hive Table

Apache Hive does not provide support to many functions or internal columns that are supported in modern day relations database systems such as Netezza, Vertica, etc. You have to use different methods to identify and delete duplicate rows from Hive table. Below are some of the methods that you can use.

Use Insert Overwrite and DISTINCT Keyword
GROUP BY Clause to Remove Duplicate
Use Insert Overwrite with row_number() analytics functions

Test Data

We will be using below test data in all our examples.

select * from dup_demo;
+--------------+----------------+--+
| dup_demo.id  | dup_demo.name  |
+--------------+----------------+--+
| 1            | abc            |
| 2            | bcd            |
| 1            | abc            |
| 3            | cde            |
| 2            | bcd            |
+--------------+----------------+--+
5 rows selected (0.31 seconds)

Let us discuss these methods with an example.

Insert Overwrite using DISTINCT Keyword

This is one of the easiest methods and many SQL developers uses this to filter out duplicate values from a table. This operation is a bit heavy to Hive cluster, but does its job.

To remove duplicate values, you can use insert overwrite table in Hive using the DISTINCT keyword while selecting from the original table. The DISTINCT keyword returns unique records from the table.

For example, consider following example to insert overwrite the hive table from original table using the DISTINCT keyword in SELECT clause.

INSERT OVERWRITE TABLE dup_demo 
SELECT DISTINCT * FROM dup_demo;

GROUP BY Clause to Remove Duplicate

You can use the GROUP BY clause to remove duplicate records from a table.

For example, consider below Hive query to get unique records from a table.

INSERT OVERWRITE TABLE dup_demo 
SELECT dup_demo.id, 
       dup_demo.NAME 
FROM   dup_demo 
GROUP  BY dup_demo.id, 
          dup_demo.NAME; 

select * from dup_demo;
+--------------+----------------+--+
| dup_demo.id  | dup_demo.name  |
+--------------+----------------+--+
| 1            | abc            |
| 2            | bcd            |
| 3            | cde            |
+--------------+----------------+--+
3 rows selected (12.332 seconds)

You can use INSERT OVERWRITE to insert unique records into table.

Use Insert Overwrite with row_number() Analytics Functions

The row_number Hive analytic function is used to rank or number the rows. Here we use the row_number function to rank the rows for each group of records and then select only record from that group.

For example, consider below example to insert overwrite table using analytical functions to remove duplicate rows.

INSERT OVERWRITE TABLE dup_demo 
SELECT tmp.id, 
       tmp.NAME 
FROM   (SELECT id, 
               NAME, 
               Row_number() 
                 OVER ( partition by ID
                   ORDER BY id, NAME) AS rno 
        FROM   dup_demo) as tmp
WHERE  tmp.rno = 1;

select * from dup_demo;
+--------------+----------------+--+
| dup_demo.id  | dup_demo.name  |
+--------------+----------------+--+
| 1            | abc            |
| 2            | bcd            |
| 3            | cde            |
+--------------+----------------+--+
3 rows selected (0.112 seconds)

Hope this helps 🙂