Identify and Remove Duplicate Records from Snowflake Table

Snowflake supports primary, foreign key(s) and unique key(s), but, does not enforce them. Snowflake table allows you to insert duplicate rows. There are chances that some application may insert the records multiple times. There are several methods you can use to de-duplicate the snowflake tables. In this article, we will check how to identify and remove duplicate records from Snowflake table.

Remove Duplicate Records from Snowflake Table

There are many methods that you can use to remove the duplicate records from the Snowflake table. For example, use the DISTINCT keyword to remove duplicate while retrieving rows.

The following methods can be used to remove duplicate records Snowflake table.

Use DISTINCT Keyword
ALTER TABLE using SWAP WITH method
Use ROW_NUMBER Analytic function
Use GROUP BY Clause to Remove Duplicate Records

Now, let us check these methods in brief.

Use Intermediate and Snowflake DISTINCT Keyword

This is one of the easiest methods to remove the duplicate records from the table. This method works perfectly when your table is populated twice.

The idea here is to create another intermediate table using the DISTINCT keyword while selecting from the original table.

For example, below query creates an intermediate table using DISTINCT keyword.

CREATE TABLE int_sample_table3 AS SELECT DISTINCT * FROM sample_table3;

ALTER TABLE sample_table3 RENAME TO sample_table3_old;

ALTER TABLE int_sample_table3 RENAME TO sample_table3;

DROP TABLE sample_table3_old;

As you can see, this method is pretty simple and worked perfectly fine if table size is small.

ALTER TABLE using SWAP – Used Along with DISTINCT

The SWAP WITH method swaps all content and metadata between two specified tables. The SWAP will rename two tables in a single transaction.

For example, Following example demonstrates usage of SWAP in alter table.

CREATE OR REPLACE TABLE sample_table3_dedup LIKE sample_table3;

INSERT INTO sample_table3_dedup SELECT DISTINCT * FROM sample_table3;

ALTER TABLE sample_table3_dedup SWAP WITH sample_table3;

DROP TABLE sample_table3_dedup;

Use ROW_NUMBER Analytic Function To Remove Dulicate

The row_number Snowflake analytic function is used to rank or number the rows. Here, we use the row_number function to rank the rows for each group of records and then select only one record from that group.

For example, consider below example to create intermediate table using analytical functions to remove duplicate rows.

SELECT col1, col2, col3
FROM 
( 
 SELECT col1, col2, col3, 
 ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col1, col2, col3) RNO 
 FROM sample_table3 
) 
WHERE RNO = 1;

You can previous methods to create and alter intermediate table.

GROUP BY Clause to Remove Duplicate

You can use the GROUP BY clause as an alternative to DISTINCT keyword.

For example, consider below Snowflake query to get unique records from a table using GROUP BY.

SELECT col1, col2, col3 
FROM SAMPLE_TABLE3
GROUP BY col1, col2, col3;

This method is usually faster on small tables.