How Does Redshift Distribute Table Data? Importance of the Right Distribution Key

  • Post last modified:February 3, 2023

Amazon Redshift uses a massively parallel processing (MPP) architecture: it automatically distributes data and query load across all nodes available in the cluster. In this article, we will look at how Redshift distributes table data and why choosing the right distribution key matters.


How Does Redshift Distribute Table Data?

Amazon Redshift supports four distribution styles: AUTO, EVEN, KEY and ALL. When you create a table, you can tell the system which distribution style it should use.


If you specify a DISTKEY, Amazon Redshift hashes the value of the distribution key (DISTKEY) column and uses that hash to assign each row to a node slice. Rows with the same key value always land on the same slice.
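To make the idea concrete, here is a minimal Python sketch of hash-based row placement. It is only an illustration: Redshift's internal hash function is not shown in this article, so MD5 is used as a stand-in, and the `slice_for` helper, the slice count and the sample rows are all hypothetical.

```python
import hashlib

def slice_for(key, num_slices):
    """Map a distribution-key value to a slice index using a stable hash.

    MD5 is only a stand-in here; it is NOT Redshift's real hash function.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slices

NUM_SLICES = 4  # hypothetical cluster, e.g. 2 nodes with 2 slices each

# (customer_id, payload) rows; customer_id plays the role of the DISTKEY
rows = [(101, "alice"), (102, "bob"), (101, "alice-second-order"), (103, "carol")]

placement = {}
for customer_id, payload in rows:
    placement.setdefault(slice_for(customer_id, NUM_SLICES), []).append(payload)

for slice_id in sorted(placement):
    print(slice_id, placement[slice_id])
```

Both rows for customer 101 end up in the same list, because the slice depends only on the key value and not on when the row was inserted.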

A distribution method that spreads data evenly across all node slices is one of the most important factors in overall query performance. A bad distribution key results in an uneven distribution of the table across slices. This skew hampers system performance, because the slices holding the most rows take the longest to finish their share of each query. It is therefore very important to identify the right distribution key when you write the table definition, and that requires a very good knowledge of the data.


For example, consider a Redshift table that is distributed on a column containing only flags such as Y or N. When data is inserted into the table, the hashed values can point to at most two slices, so all rows are placed on those slices and the remaining slices stay empty, thus causing skew.
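The flag example can be simulated with a short Python sketch. As an assumption, MD5 again stands in for Redshift's internal hash, and the slice count, row count and helper names are made up for illustration.

```python
import hashlib
from collections import Counter

def slice_for(key, num_slices):
    # MD5 is a stand-in hash, not Redshift's actual function.
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slices

def rows_per_slice(keys, num_slices):
    """Count how many rows each slice would receive for the given key values."""
    counts = Counter(slice_for(k, num_slices) for k in keys)
    return [counts.get(s, 0) for s in range(num_slices)]

NUM_SLICES = 8   # hypothetical slice count
NUM_ROWS = 12000

flags = ["Y" if i % 3 else "N" for i in range(NUM_ROWS)]  # low-cardinality column
ids = range(NUM_ROWS)                                     # high-cardinality column

flag_counts = rows_per_slice(flags, NUM_SLICES)
id_counts = rows_per_slice(ids, NUM_SLICES)

print("flag column:", flag_counts)
print("id column:  ", id_counts)
```

With the flag column, no more than two of the eight slices can receive rows, while the id column spreads the same number of rows over all of them.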

Leaving out DISTKEY in Redshift distribution

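If you do not specify a distribution style, current Amazon Redshift releases use AUTO distribution, in which Redshift picks the style based on the size of the table. In older releases the default was EVEN. With the EVEN distribution style, rows are distributed across the slices in a round-robin fashion, regardless of the values they contain.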
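Round-robin placement is simple enough to sketch in a few lines of Python. This is an illustrative model only, with a hypothetical `distribute_even` helper and a made-up slice count; it is not how Redshift is implemented internally.

```python
from itertools import cycle

def distribute_even(rows, num_slices):
    """Assign rows to slices in round-robin order, ignoring their values."""
    slices = [[] for _ in range(num_slices)]
    for row, target in zip(rows, cycle(range(num_slices))):
        slices[target].append(row)
    return slices

# Even a heavily repeated flag column spreads out, because values are ignored.
slices = distribute_even(["Y", "N", "Y", "Y", "N", "Y", "Y", "Y", "N", "Y"], 4)
sizes = [len(s) for s in slices]
print(sizes)
```

The slice sizes never differ by more than one row, which is why EVEN avoids skew. The trade-off is that matching rows from two tables are not guaranteed to sit on the same slice.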

If you are not sure which column to use as the DISTKEY, the EVEN distribution style is probably your best choice.

What factors matter when choosing the best Redshift DISTKEY?

When choosing which column should be the DISTKEY for a Redshift table, your goals should be a uniform distribution of the rows and optimal access to the data. The main purpose of the DISTKEY is to co-locate rows that are joined together.

Consider the following factors when choosing the best distribution key:

  • Choose a column with high cardinality: The more distinct the DISTKEY values, the better.
  • Parallel processing is more efficient when you have distributed table rows evenly across the slices.
  • Tables that are joined together should share a common column as their DISTKEY. For example, in an order system application, use the Customer ID as the DISTKEY for both the customer table and the order table.
  • If a particular key is used heavily in equi-join clauses, then that key is a good choice for the distribution key.
  • For small tables (like a date dimension), use the ALL distribution style to store a copy of the table on every node. The table is then co-located with the other tables, which improves join performance.
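The co-location point from the customer and order example can be checked with a small Python sketch. As assumptions, MD5 stands in for Redshift's internal hash, and the tables, ids and helper names are hypothetical.

```python
import hashlib

def slice_for(key, num_slices):
    # MD5 is a stand-in hash, not Redshift's actual function.
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slices

NUM_SLICES = 4  # hypothetical slice count

customers = [(1, "Ann"), (2, "Raj"), (3, "Mei")]        # (customer_id, name)
orders = [(9001, 1), (9002, 3), (9003, 1), (9004, 2)]   # (order_id, customer_id)

# The customer table is distributed on customer_id.
customer_slice = {cid: slice_for(cid, NUM_SLICES) for cid, _ in customers}

# Option A: orders also distributed on customer_id (shared DISTKEY).
orders_by_customer = {oid: slice_for(cid, NUM_SLICES) for oid, cid in orders}
# Option B: orders distributed on their own order_id.
orders_by_order = {oid: slice_for(oid, NUM_SLICES) for oid, _ in orders}

def rows_needing_movement(order_slices):
    """Count orders stored on a different slice than their customer row."""
    return sum(1 for oid, cid in orders if order_slices[oid] != customer_slice[cid])

print("shared DISTKEY:   ", rows_needing_movement(orders_by_customer))
print("different DISTKEY:", rows_needing_movement(orders_by_order))
```

With the shared key, every order sits on the same slice as its customer, so a join on customer_id needs no rows moved between slices. With a different key, that guarantee disappears.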


Hope this helps 🙂