Apache Hive being batch processing engine, does not support primary, foreign or unique key constraints. You can insert the duplicate records in the Hive table. There are no constraints to ensure uniqueness or primary key, but if you have a table and have loaded data twice, then you can de-duplicate in several ways. Below methods explain you how to identify and Remove duplicate records or rows from Hive table.
Remove Duplicate Records from Hive Table
Apache Hive does not provide support to many functions or internal columns that are supported in modern day relations database systems such as Netezza, Vertica, etc. You have to use different methods to identify and delete duplicate rows from Hive table. Below are some of the methods that you can use.
- Use Insert Overwrite and DISTINCT Keyword
- GROUP BY Clause to Remove Duplicate
- Use Insert Overwrite with row_number() analytics functions
Test Data
We will be using below test data in all our examples.
select * from dup_demo;
+--------------+----------------+--+
| dup_demo.id | dup_demo.name |
+--------------+----------------+--+
| 1 | abc |
| 2 | bcd |
| 1 | abc |
| 3 | cde |
| 2 | bcd |
+--------------+----------------+--+
5 rows selected (0.31 seconds)
Let us discuss these methods with an example.
Insert Overwrite using DISTINCT Keyword
This is one of the easiest methods and many SQL developers uses this to filter out duplicate values from a table. This operation is a bit heavy to Hive cluster, but does its job.
To remove duplicate values, you can use insert overwrite table in Hive using the DISTINCT keyword while selecting from the original table. The DISTINCT keyword returns unique records from the table.
For example, consider following example to insert overwrite the hive table from original table using the DISTINCT keyword in SELECT clause.
INSERT OVERWRITE TABLE dup_demo
SELECT DISTINCT * FROM dup_demo;
GROUP BY Clause to Remove Duplicate
You can use the GROUP BY clause to remove duplicate records from a table.
For example, consider below Hive query to get unique records from a table.
INSERT OVERWRITE TABLE dup_demo
SELECT dup_demo.id,
dup_demo.NAME
FROM dup_demo
GROUP BY dup_demo.id,
dup_demo.NAME;
select * from dup_demo;
+--------------+----------------+--+
| dup_demo.id | dup_demo.name |
+--------------+----------------+--+
| 1 | abc |
| 2 | bcd |
| 3 | cde |
+--------------+----------------+--+
3 rows selected (12.332 seconds)
You can use INSERT OVERWRITE to insert unique records into table.
Use Insert Overwrite with row_number() Analytics Functions
The row_number Hive analytic function is used to rank or number the rows. Here we use the row_number function to rank the rows for each group of records and then select only record from that group.
For example, consider below example to insert overwrite table using analytical functions to remove duplicate rows.
INSERT OVERWRITE TABLE dup_demo
SELECT tmp.id,
tmp.NAME
FROM (SELECT id,
NAME,
Row_number()
OVER ( partition by ID
ORDER BY id, NAME) AS rno
FROM dup_demo) as tmp
WHERE tmp.rno = 1;
select * from dup_demo;
+--------------+----------------+--+
| dup_demo.id | dup_demo.name |
+--------------+----------------+--+
| 1 | abc |
| 2 | bcd |
| 3 | cde |
+--------------+----------------+--+
3 rows selected (0.112 seconds)
Related Articles
Hope this helps 🙂