How to Choose Correct Compression Encode in Redshift?

Data is a key aspect of any company’s decision making process. It helps organizations to make smart decision using trends such as purchase, growth area and other high-level functionalities. Organizations grow by making smart decisions so does data. As a typical company’s size of data has grown exponentially, it’s become even more critical to optimize data storage. The size of data not only affects storage and cost, it also affects the performance of your query. The typical query performance is always directly proportional to the size of data because smaller the data size the less that has to be processed during expensive disk I/O operations. In this article, we will check how to choose correct compression encode in Amazon Redshift.

Compression Encode in Redshift

Amazon Redshift cloud data warehouse provides many features that allow you to optimize your storage and increase query performance. One of such a features is “compression encoding“. The compression encoding reduces the size of your data by applying compression algorithm. You can let the Redshift to choose compression encoding automatically or you can specify a compression encoding algorithm in your Redshift table DDL.

A compression encoding specifies the type of compression algorithm that is applied to a column of data values as rows are added to a table. Compression is a column-level operation that reduces the size of data when it is stored. The encoding algorithm converts the data into an internal format that occupy less space. Compression conserves storage space and reduces the size of data that is read from storage, which reduces the amount of disk I/O and therefore improves query performance.

The following list identifies the supported compression encodings.

RAW Encoding
AZ64 encoding
Byte-dictionary encoding
Delta encoding
LZO encoding
Mostly encoding
Run length encoding
Text255 and Text32k encodings
Zstandard encoding

Now, let us check these compression encoding in brief:

RAW Encoding

With raw encoding, data is stored in raw, uncompressed form. Raw encoding is a default encoding for a SORTKEY columns. and columns with data type BOOLEAN, REAL, or DOUBLE PRECISION.

AZ64 encoding

AZ64 is Amazon’s proprietary compression encoding algorithm to achieve a high compression ratio and improved query processing. You can use AZ64 to achieve significant storage savings and high performance for numeric, date, and time data types.

Byte-dictionary encoding

Byte-dictionary encoding is very effective when a column contains a limited number of unique values. This encoding is optimal when the data domain of a column is fewer than 256 unique values. Byte-dictionary encoding is space-efficient if a CHAR column holds long character strings.

In byte dictionary encoding, a separate dictionary of up to 256 unique values is created for each block of column values on disk.

Delta encoding

Delta encodings are very useful for date, time columns. It is one of the commonly used compression encoding for date, time columns in a Redshift.

LZO encoding

LZO encoding provides a very high compression ratio with good performance on CHAR and VARCHAR columns with long character strings. You can use this encoding for free-form text, such as product descriptions, user comments, or JSON strings.

Mostly encoding

The Mostly encoding works well on SMALLINT, INT and BIGINT columns. In general, the mostly encodings work with the following data types:

SMALLINT/INT2 – MOSTLY8
INT/INTEGER/INT4 – MOSTLY16
BIGINT/INT8 – MOSTLY32

Run length encoding

Run length encoding replaces a value that is repeated consecutively with a token that consists of the value and a count of the number of consecutive occurrences.

For example, if ONE is repeated 3 time, then the compressed value would be {3, ONE}. Similarly, if TWO is repeated twice, the compressed vlaue would be {2, TWO}, and so on.

Text255 and Text32k encodings

Text255 and text32k encodings are useful for compressing VARCHAR columns in which the same words recur often.

Zstandard encoding

Zstandard (ZSTD) encoding provides a high compression ratio with very good performance across diverse datasets. ZSTD works especially well with CHAR and VARCHAR columns that store a wide range of long and short strings, such as product descriptions, user comments, logs, and JSON strings.

Redshift Compression Encode Example

For example, following create table syntax uses best compression encoding algorithms.

CREATE TABLE tmp_encode 
(
	col0 TEXT encode  RAW,
	col1 BOOLEAN encode  RAW,
	col2 BOOL encode  RAW,
	col3 REAL encode  ZSTD,
	col4 FLOAT4 encode  ZSTD,
	col5 DOUBLE PRECISION encode  ZSTD,
	col6 FLOAT8 encode  ZSTD,
	col7 FLOAT encode  ZSTD,
	col8 DECIMAL encode  AZ64,
	col9 NUMERIC encode  AZ64,
	col10 CHARACTER encode  LZO,
	col11 CHAR encode  LZO,
	col12 CHARACTER VARYING encode  LZO,
	col13 TEXT encode  LZO,
	col14 VARCHAR encode  LZO,
	col15 VARCHAR(100) encode  ZSTD,
	col16 SMALLINT encode  MOSTLY8,
	col17 INT2 encode  MOSTLY8,
	col18 INTEGER encode  MOSTLY16,
	col19 INT encode  MOSTLY16,
	col20 INT4 encode  MOSTLY16,
	col21 BIGINT encode  MOSTLY32,
	col22 INT8 encode  MOSTLY32,
	col23 DATE encode  DELTA32K,
	col24 TIME encode  DELTA,
	col25 TIMETZ encode  DELTA,
	col26 TIMESTAMP encode  DELTA32K,
	col27 TIMESTAMPTZ encode  DELTA32K,
	col28 TIME encode  DELTA 
) DISTKEY(col0) SORTKEY(col0);

Redshift Compression ENCODE AUTO

Choosing right compression encoding is depends upon the type of column and data you are inserting into the Redshift table. To make this process easy for the database administrator and developer, Redshift allows you to enable auto encode to automatically choose best compression encode.

Redshift compression ENCODE AUTO is the default for tables. When you specify the auto encoding, Amazon Redshift automatically manages compression encoding for all columns in the table.

For example, following Redshift DDL enable ENCODE AUTO on the table.

create table tmp_encode_auto (
ID int,
name text,
dept int
) ENCODE auto;

However, if you specify compression encoding for any column in the table, the Redshift table is no longer set to ENCODE AUTO. Amazon Redshift no longer automatically manages compression encoding for all columns in the table.

How to Choose Correct Compression Encode in Redshift?

You don’t need to understand all the different algorithms to select the best one for your data in Amazon Redshift. The AWS Redshift provides a very useful command to determine the best encoding for each column in your table. The ANALYZE COMPRESSION command performs compression analysis and produces a report with the suggested compression encoding for the Redshift tables analyzed.

For example,

ANALYZE COMPRESSION tmp_encode_auto;
|Table          |Column|Encoding|Est_reduction_pct|
|---------------|------|--------|-----------------|
|tmp_encode_auto|id    |raw     |0.00             |
|tmp_encode_auto|name  |raw     |0.00             |
|tmp_encode_auto|dept  |raw     |0.00             |

Hope this helps 🙂