The performance of the Snowflake cloud data warehouse is directly dependent on the optimal table structure and design. Optimizing Snowflake table structure is one of the important aspect to improve the performance of query, data loading and unloading process. In this article, we will check how to optimize the Snowflake table structure to improve query performance.
Optimize Snowflake Table Structure
There are no specific best practices that you can apply to optimize the table structure. Creating optimal table structure that uses right data type and length is one of the important steps.
However, here are some of the methods that Snowflake recommends while creating the table structure for better performance.
- Use Data/Time data type whenever possible
- Use CTAS to Recreate Table
- Define Constraints – Primary, Foreign Keys and NOT NULL
- Whenever possible Set column length
- Whenever possible Set Clustering Key
- Use VARIANT for semi-structured data
- Whenever possible use Temporary and Transient Table
Now, let use check these methods in brief.
Use Data/Time data type whenever possible
When defining columns to store dates or timestamps, Snowflake recommends date or timestamp data types rather than a character data type such as VARCHAR.
Snowflake stores DATE and TIMESTAMP data more efficiently than VARCHAR, resulting in better query performance.
Recreate Snowflake Table using CTAS
Use “Create Table AS” method whenever you are re-creating tables. The CTAS method is always faster when recreating a table with data.
Define Constraints – Primary, Foreign Keys and NOT NULL
Referential integrity constraints in Snowflake are informational and, with the exception of NOT NULL, not enforced. Constraints other than NOT NULL are created as disabled.
However, constraints provide valuable metadata. The primary keys and foreign keys enable the developer to orient themselves to the schema design and provide more information on how the tables relate with one another.
Whenever Possible Set Column Length
It is always best practice to provide column length for VARCHAR and CHAR columns. Don’t make it a practice to use the maximum column size for convenience. Instead, consider the largest values you are likely to store in a VARCHAR column, for example, and size your columns accordingly.
Whenever possible Set Clustering Key
Cluster key used to organize the micro-partitions. If your table is small, then specifying cluster column does not improve the performance.
So, when to set a cluster key? You can add cluster key when the order in which the data loaded does not match the dimension by which it is most commonly queried.
For example, the data is loaded by date, but reports filter the data by ID.
In this case, you can gain the performance if you set multi-column clustering key on a date and ID columns.
Use VARIANT for Semi-structured Data
If you aren’t sure what types of operations you will perform on your semi-structured data, we recommend storing it in a VARIANT column. You can later change the type as per the data.
However, use of VARIANT type comes with an advantage. You can use the FLATTEN function to extract the objects and keys you plan to query into a separate table.
Whenever possible use Temporary and Transient table
If your requirement is to use the table within the current session or you are working on transitory data, then you can use temporary and transient tables respectively.
For more information, read Temporary and Transient tables in Snowflake.
Related Articles,
Hope this helps 🙂