Identify and Remove Duplicate Records from Redshift Table

Redshift does not enforce primary or unique keys. You can define primary key, foreign key, or unique constraints, but they are informational only and Redshift will not enforce them, so nothing stops you from inserting duplicate records into a Redshift table. If, for example, you have loaded the same data twice, you can de-duplicate the table in several ways. The methods below explain how to identify and remove duplicate records from a Redshift table.

Remove Duplicate Records from Redshift Table

There are several methods you can use to remove duplicate records from a table. Many relational databases provide an internal column such as ROWID that you can use to identify and remove duplicate values. Amazon Redshift does not provide such a column for this purpose, so you have to use alternative methods to identify and remove duplicate values from your Redshift table.
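Before removing anything, it helps to see which rows are actually duplicated. The query below is a minimal sketch of that identification step; it assumes a table named base_table with columns col1 and col2, so substitute your own table and list every column that defines a duplicate.

```sql
-- Assumed names: base_table(col1, col2). Replace with your own table and columns.
-- List every (col1, col2) combination that occurs more than once,
-- together with the number of copies found.
select col1,
       col2,
       count(*) as copies
from   base_table
group  by col1,
          col2
having count(*) > 1
order  by copies desc;
```

If the query returns no rows, the table has no duplicates on those columns and no clean-up is needed.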

Use an Intermediate Table and the DISTINCT Keyword

This is one of the easiest methods, and many SQL developers use it to remove duplicate values. The operation is somewhat heavy on the system because it rewrites the whole table, but it does the job.

You can remove duplicate records in Redshift by creating another table with the DISTINCT keyword while selecting from the original table.

For example, the following statements create an intermediate table from the original table using the DISTINCT keyword in the SELECT clause.

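begin;
-- Build a de-duplicated copy of the original table.
create table int_table as select distinct * from base_table;
-- Swap the copy in under the original name, then drop the old data.
alter table base_table rename to base_table_old;
alter table int_table rename to base_table;
drop table base_table_old;
commit;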

As the above example shows, you create an intermediate table and then rename it to the name of the original table in your Redshift database.
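One thing to keep in mind with the swap above is that a table created with CREATE TABLE AS does not carry over every attribute of the original, such as column defaults. The variant below is a sketch under the same assumed table names (base_table, int_table): it creates the intermediate table with LIKE, which copies the column definitions along with the distribution style and sort keys, and then fills it with the distinct rows.

```sql
begin;
-- LIKE copies column definitions, distribution style and sort keys
-- from base_table (primary and foreign key constraints are not copied).
create table int_table (like base_table);
-- Load only one copy of each row.
insert into int_table select distinct * from base_table;
-- Swap the tables and drop the old data.
alter table base_table rename to base_table_old;
alter table int_table rename to base_table;
drop table base_table_old;
commit;
```

Whether this is worth the extra statement depends on how much of the original table definition you need to keep.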

Use an Intermediate Table with the row_number() Analytic Function

The Redshift row_number analytic function is used to number rows. Here we use row_number to number the rows within each group of records and then keep only the first record from each group.

For example, the following statement creates an intermediate table using the analytic function to remove duplicate rows.

Create table int_table As 
Select col1, col2 
From 
( 
 Select col1, col2, 
 -- number the rows within each col1 group, ordered by col2
 row_number() over (partition by col1 order by col2) as rno 
 From table1 
) t 
Where rno = 1;

You can use either of the above methods to identify and remove duplicate records from a Redshift table.

Use the GROUP BY Clause to Remove Duplicates

You can also use the GROUP BY clause to return only the unique records from a table.

For example, the following Redshift query returns the unique records from a table.

SELECT id, 
       NAME 
FROM   dup_demo 
GROUP  BY id, 
          NAME;
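The query above only returns the unique rows; it does not change dup_demo. To actually remove the duplicates, one option is to store the result in a new table and swap it in, in the same way as the DISTINCT method. The sketch below assumes hypothetical table names dup_demo_dedup and dup_demo_old that do not already exist.

```sql
begin;
-- Store one row per (id, name) combination in a new table.
create table dup_demo_dedup as
select id,
       name
from   dup_demo
group  by id,
          name;
-- Swap the de-duplicated table in under the original name.
alter table dup_demo rename to dup_demo_old;
alter table dup_demo_dedup rename to dup_demo;
drop table dup_demo_old;
commit;
```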

Hope this helps 🙂