Snowflake cloud data warehouse produces create clustered tables by default. However, as the table size grows and DML occurs on the table, the data in some table rows may no longer cluster optimally on desired dimensions. In this article, we will check how to create Snowflake clustered tables to improve the DML query performance.
Snowflake Clustered Tables
When you create a table and insert records into the Snowflake tables, inserted rows are grouped into continuous storage such as micro-partitions. But, as the table size grows, data in some column may no longer cluster optimally. Clustering is not intended for all tables, you will add cluster key to improve the clustering of the underlying table micro-partitions.
The cluster columns can be a subset of the table columns, or an expression of the tables. A table with a clustering key defined is considered to be clustered table.
What is Snowflake Clustering Key?
A clustering key is a subset of columns in a table or an expression that are explicitly used to co-locate the data in the table in the same micro-partition.
This is very useful for the large tables (multi-TB) where ordering of the column was not optimal or extensive DML operation on the table has caused the table’s natural clustering to degrade.
So, which column(s) should be clustering key?
Following are the recommended columns as a clustering key.
- Cluster columns that are most actively used in selective filters. For example, many fact tables involves the date column as a filter condition (for example “WHERE invoice_date = X” ), choosing the date column is a good idea.
- Consider columns frequently used in join predicates.
The number of distinct values in a column/expression is a critical aspect of selecting it as a clustering key.
You should identify a clustering key that has:
- A large enough number of distinct values to enable effective pruning on the table.
- A small enough number of distinct values to allow Snowflake to effectively group rows in the same micro-partitions.
Note that, if you define two or more columns/expressions as the clustering key for a table, the order has an impact on how the data is clustered in micro-partitions. Snowflake recommends ordering the columns from lowest cardinality to highest cardinality.
Snowflake Clustered Table Syntax
You can use following syntax to create clustered tables in Snowflake.
CREATE TABLE <name> ... CLUSTER BY ( <expr1> [ , <expr2> ... ] )
Snowflake Clustered Table Examples
Consider following example to create table with clustering key.
create or replace table sn_clustered_table
(c1 date, c2 string, c3 number)
cluster by (c1, c2);
Alter Snowflake Table to Add Clustering Key
You can add the clustering key while creating table or use ALTER TABLE syntax to add a clustering key to existing tables.
Following is the syntax to add a clustering key to existing Snowflake table.
ALTER TABLE <name> CLUSTER BY ( <expr1> [ , <expr2> ... ] )
For example,
alter table sn_clustered_table2 cluster by (c1, c3);
Dropping the Clustering Keys in Snowflake
If the query performance is not as per expectation, you can drop the clustering key.
Following is the syntax to drop the clustering key on Snowflake table.
ALTER TABLE <name> DROP CLUSTERING KEY
For example,
ALTER TABLE sn_clustered_table2 DROP CLUSTERING KEY
Reclustering in Snowflake
Reclustering in Snowflake is automatic; no maintenance is needed. During reclustering, Snowflake uses the clustering key for a clustered table to reorganize the column data, so that related records are relocated to the same micro-partition. This DML operation deletes the affected records and re-inserts them, grouped according to the clustering key.
Limitation of Clustering in Snowflake
Following are some of the limitations on clustered Tables.
- An existing Clustering key is not propagated when using CREATE TABLE … LIKE operation.
- An existing clustering key is not supported when a table is created using CREATE TABLE … AS SELECT; however, you can use the ALTER TABLE command to define a clustering key after the table is created.
- You cannot use a VARIANT column in clustering key. However, you can specify the expression to extract value in clustering key.
- CREATE TABLE … CLONE will retain the clustering key.
Related Articles,
- Snowflake Cloud Data Warehouse Best Practices
- Working with Snowflake External Tables and S3 Examples
- What are Different Methods to Create Snowflake Tables?
Hope this helps 🙂