Amazon Redshift is a fully managed, distributed, petabyte-scale relational database system on Amazon Web Services (AWS). Redshift is a columnar database that can run complex queries efficiently over petabytes of data. You can scale the database to match your needs and handle growing datasets. Redshift comes with built-in query optimization features. As a performance enhancement step, you can distribute table data on particular column(s). Another important feature is the Redshift sort key, which speeds up data access. In this article, we will check how to select the best sort key in Redshift.
Introduction to Redshift
Amazon Redshift is a fully managed, cloud-based data warehouse service offered by Amazon Web Services (AWS). Redshift is based on open-source PostgreSQL and is modified to handle large-scale data sets and complex queries for data analytics and business intelligence purposes. AWS Redshift is scalable, flexible, and cost-effective, making it a popular choice for large scale data warehouse solutions.
The key features of Amazon Redshift include:
- Columnar storage: Redshift is based on a columnar storage architecture that allows for faster query performance and better compression of data.
- Massively parallel processing (MPP): Amazon Redshift is designed to handle parallel processing of data, allowing it to scale up or down based on resource demand.
- Integration with popular BI and analytics tools: Amazon Redshift integrates seamlessly with popular BI and analytics tools, making it easy to analyze and visualize data from different sources.
- Security: Redshift provides industry standard security features such as encryption, access control, and audit logging to ensure that data is secure and compliant with regulatory requirements.
- Semistructured data: Amazon Redshift provides the SUPER data type to store semistructured data and JSON functions to manipulate it.
What is the Sort Key in Redshift Tables?
A sort key is a column (or set of columns) in your Redshift table that determines the order in which rows are physically stored on disk within each data slice. A sort key helps optimize query performance by minimizing the amount of data that must be scanned by queries that sort or filter on it.
The rows on each slice are stored in SORTKEY order, according to the sort key that you specify when creating the Redshift table.
For example, consider a store_sale table. You are more likely to retrieve data based on purchase date. If you choose the purchase_date column as the sort key, data will be ordered from the oldest to the newest purchases.
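As a rough sketch of the store_sale example, the table and a typical date-range query might look like the following. Only purchase_date comes from the text above; the other column names and types are assumptions made for illustration.

```sql
-- Hypothetical definition of store_sale; every column except purchase_date is assumed.
CREATE TABLE store_sale (
    sale_id       BIGINT        NOT NULL,
    store_id      INTEGER       NOT NULL,
    purchase_date DATE          NOT NULL,
    amount        DECIMAL(12,2)
)
SORTKEY (purchase_date);

-- A range filter on the sort key column lets Redshift skip blocks
-- whose purchase_date values fall entirely outside January 2023.
SELECT store_id, SUM(amount) AS total_sales
FROM store_sale
WHERE purchase_date BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY store_id;
```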
What are the Types of Sort Keys in Redshift?
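Amazon Redshift sort keys are divided into two types:
- Compound sort key
- Interleaved sort key
Now, let us check these two in detail, starting with the simplest case: a sort key on a single column.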
Amazon Redshift Single Column Sort Key
A single column sort key in Amazon Redshift consists of one column that is sorted in ascending order. A single column sort key is useful when you have queries that frequently use a specific column for sorting or filtering.
When a table is distributed and sorted on the same column, queries that join or filter on that column can run significantly faster.
The following Redshift CREATE TABLE statement sorts the table on a single column:
create table myevent(
eventid int,
eventname varchar(200),
eventcity varchar(30))
distkey (eventid)
sortkey(eventid);
Amazon Redshift Compound Sort Key
A compound sort key in Amazon Redshift consists of one or more columns that are sorted in order of importance. The first column in the sort key is the most important, followed by the second column, and so on. Redshift stores the data on disk in the order defined by the sort key.
A compound sort key is useful when you have queries that frequently use a specific column or a set of columns for sorting or filtering, especially when the filters include the leading sort key column. Because the data is organized on disk by the sort key, Redshift can scan only the relevant data blocks. Reducing the amount of data scanned is one of the important steps in performance tuning.
The following CREATE TABLE statement uses a compound sort key:
CREATE TABLE customer (
customer_id INTEGER NOT NULL,
name VARCHAR(100) NOT NULL,
country_id INTEGER NOT NULL
) COMPOUND SORTKEY(customer_id, country_id);
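To make the column order concrete, here is a small sketch of queries against the customer table above. The literal values are made up, and the comments describe the generally expected behavior rather than measured results.

```sql
-- Filter on the leading sort key column (customer_id):
-- this pattern benefits most from the compound sort key.
SELECT name
FROM customer
WHERE customer_id BETWEEN 1000 AND 2000;

-- Filter on the leading column plus the second column:
-- the predicate still follows the sort order.
SELECT name
FROM customer
WHERE customer_id = 1500
  AND country_id = 44;

-- Filter only on the second column (country_id):
-- this pattern typically gains little from this compound sort key.
SELECT name
FROM customer
WHERE country_id = 44;
```

If most of your queries look like the third one, it may be worth making country_id the leading column instead.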
Amazon Redshift Interleaved Sort Key
An interleaved sort key is similar to a compound sort key in that it can contain multiple columns. The difference is that Redshift gives equal weight to each column, or subset of columns, in the sort key, rather than prioritizing the first column over the others.
If multiple queries use different columns for filters, then you can often improve performance for those queries by using an interleaved sort style.
The following Redshift CREATE TABLE statement uses an interleaved sort key:
create table customer_interleaved (
c_custkey integer not null,
c_name varchar(25) not null,
c_address varchar(25) not null,
c_city varchar(10) not null,
c_nation varchar(15) not null,
c_region varchar(12) not null,
c_phone varchar(15) not null,
c_mktsegment varchar(10) not null)
diststyle all
interleaved sortkey (c_custkey, c_city, c_mktsegment);
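As an illustration, each of the queries below filters on a different column of the interleaved sort key defined above. The filter values are invented for this sketch.

```sql
-- Each query restricts a different sort key column; with an interleaved
-- sort key, none of the columns is favored over the others.
SELECT c_name
FROM customer_interleaved
WHERE c_custkey = 12345;

SELECT c_name
FROM customer_interleaved
WHERE c_city = 'Boston';

SELECT c_name
FROM customer_interleaved
WHERE c_mktsegment = 'BUILDING';
```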
How to Select an Appropriate Sort Key in Redshift?
Choosing appropriate sort keys for your table is one of the Redshift storage optimization techniques that improve query performance. When selecting a sort key, consider the types of queries that will run against the data, the size of the data, and the distribution of the data across the slices.
When you create a table, you can specify one or more columns as the sort key. Amazon Redshift stores your data on disk in sorted order according to the sort keys.
The following column types are the best candidates for Redshift sort keys: timestamp columns, columns used in range or equality filters, and join columns. The criteria for each are described below.
Timestamp Columns
If your data contains a timestamp column and your application most frequently queries recent data, specify the timestamp column as the leading column of the sort key.
Redshift can then skip entire blocks that fall outside the requested time range during query execution.
Range Columns
If your application's queries perform range filtering or equality filtering on one particular column, that column is a good candidate for the sort key.
Redshift tracks the minimum and maximum values stored in each sorted block and can skip an entire block if it contains no values in the requested range.
Join Column
If you frequently join a table, specify the join column as both the sort key and the distribution key.
Choosing the join column as the sort key enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
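The join case can be sketched as follows. The orders and customer_dim tables are hypothetical and exist only for this example. Whether the optimizer actually picks a merge join depends on the state of the tables (for instance, how much of the data is unsorted), so treat the EXPLAIN output as something to check, not a guarantee.

```sql
-- Hypothetical fact and dimension tables, both distributed and sorted
-- on the join column customer_id.
CREATE TABLE orders (
    order_id    BIGINT        NOT NULL,
    customer_id INTEGER       NOT NULL,
    order_total DECIMAL(12,2)
)
DISTKEY (customer_id)
SORTKEY (customer_id);

CREATE TABLE customer_dim (
    customer_id INTEGER      NOT NULL,
    name        VARCHAR(100) NOT NULL
)
DISTKEY (customer_id)
SORTKEY (customer_id);

-- Inspect the plan: look for a merge join step instead of a hash join.
EXPLAIN
SELECT c.name, SUM(o.order_total) AS total_spent
FROM orders o
JOIN customer_dim c ON c.customer_id = o.customer_id
GROUP BY c.name;
```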
How to Specify a Sort Key?
When you create a table, you can specify one or more columns as the sort key.
For example, consider the CREATE TABLE statement below.
CREATE TABLE customer (
customer_id INTEGER NOT NULL,
name VARCHAR(100) NOT NULL,
country_id INTEGER NOT NULL
) COMPOUND SORTKEY(customer_id, country_id);
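Besides the table-level clause shown above, a sort key can also be declared on a single column or changed on an existing table. The sketch below uses a hypothetical customer_by_country table, and the PG_TABLE_DEF check assumes the table's schema is on your search_path.

```sql
-- Column-level SORTKEY attribute (hypothetical table for illustration).
CREATE TABLE customer_by_country (
    customer_id INTEGER      NOT NULL,
    name        VARCHAR(100) NOT NULL,
    country_id  INTEGER      NOT NULL SORTKEY
);

-- Change the sort key of an existing table.
ALTER TABLE customer ALTER COMPOUND SORTKEY (country_id, customer_id);

-- Verify the result: sortkey holds the column's position in the sort key
-- (0 means the column is not part of it).
SELECT "column", type, distkey, sortkey
FROM pg_table_def
WHERE tablename = 'customer';
```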
Amazon Redshift Sort Key Best Practices
The following are some best practices for using sort keys in Redshift:
- Choose the Right Sort Key Columns: Columns that are frequently used in join and filter operations are best suited for the sort key. For tables that are joined frequently, consider using the join column as both the sort key and the distribution key.
- Use Compound Sort Keys: A compound sort key is more efficient when query predicates use a prefix, which is a subset of the sort key columns in order.
- Use Interleaved Sort Keys: An interleaved sort is more effective with large tables. An interleaved sort key gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order. Don’t use an interleaved sort key on columns with monotonically increasing attributes, such as identity columns, dates, or timestamps.
- Run VACUUM REINDEX: For tables with interleaved sort keys, run VACUUM REINDEX periodically, preferably during non-business hours. A VACUUM restores the sort order, but the operation can take longer for interleaved tables because merging new interleaved data might involve modifying every data block.
- Avoid Too Many Columns in Sort Key: If you overload the sort key with too many columns, it can negatively impact query performance. Only include the columns that are necessary for sorting and avoid including unnecessary columns in the sort key.
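The vacuum practice above can be sketched as follows, using the customer_interleaved and customer tables from earlier in this article. How often you need to reindex depends on your load pattern, so the comments are guidance for reading the output, not fixed rules.

```sql
-- Check interleaved sort key skew; a high interleaved_skew value
-- suggests that a reindex may help.
SELECT tbl AS tbl_id, col, interleaved_skew, last_reindex
FROM svv_interleaved_columns;

-- Re-analyze the interleaved sort key columns and then fully vacuum the table.
VACUUM REINDEX customer_interleaved;

-- For a table with a compound sort key, re-sorting alone is often enough.
VACUUM SORT ONLY customer;
```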
Conclusion
In conclusion, the Redshift sort key is one of the critical features of Amazon Redshift’s columnar storage architecture, enabling faster query performance and better compression of data. Choosing the right sort key can have a significant impact on query performance, and it is important to consider factors such as data distribution, data type, and query patterns when making this decision.
Lastly, monitor sort key performance and adjust the sort key as needed so that Redshift keeps delivering good query performance as the data grows.
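One possible way to monitor this is to query the SVV_TABLE_INFO system view for tables with a large unsorted region. The 20 percent threshold below is an arbitrary assumption for this sketch; pick a value that suits your workload.

```sql
-- List tables whose unsorted percentage exceeds an assumed threshold of 20.
-- Note that SVV_TABLE_INFO does not return rows for empty tables.
SELECT "schema", "table", sortkey1, sortkey_num, unsorted, stats_off, tbl_rows
FROM svv_table_info
WHERE unsorted > 20
ORDER BY unsorted DESC;
```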
Hope this helps 🙂