Data is a key aspect of any company’s decision making process. It helps organizations to make smart decision using trends such as purchase, growth area and other high-level functionalities. Organizations grow by making smart decisions so does data. As a typical company’s size of data has grown exponentially, it’s become even more critical to optimize data storage. The size of data not only affects storage and cost, it also affects the performance of your query. The typical query performance is always directly proportional to the size of data because smaller the data size the less that has to be processed during expensive disk I/O operations. In this article, we will check how to choose correct compression encode in Amazon Redshift.
Also Read:
Compression Encode in Redshift
Amazon Redshift cloud data warehouse provides many features that allow you to optimize your storage and increase query performance. One of such a features is “compression encoding“. The compression encoding reduces the size of your data by applying compression algorithm. You can let the Redshift to choose compression encoding automatically or you can specify a compression encoding algorithm in your Redshift table DDL.
A compression encoding specifies the type of compression algorithm that is applied to a column of data values as rows are added to a table. Compression is a column-level operation that reduces the size of data when it is stored. The encoding algorithm converts the data into an internal format that occupy less space. Compression conserves storage space and reduces the size of data that is read from storage, which reduces the amount of disk I/O and therefore improves query performance.
The following list identifies the supported compression encodings.
- RAW Encoding
- AZ64 encoding
- Byte-dictionary encoding
- Delta encoding
- LZO encoding
- Mostly encoding
- Run length encoding
- Text255 and Text32k encodings
- Zstandard encoding
Now, let us check these compression encoding in brief:
RAW Encoding
With raw encoding, data is stored in raw, uncompressed form. Raw encoding is a default encoding for a SORTKEY columns. and columns with data type BOOLEAN, REAL, or DOUBLE PRECISION.
AZ64 encoding
AZ64 is Amazon’s proprietary compression encoding algorithm to achieve a high compression ratio and improved query processing. You can use AZ64 to achieve significant storage savings and high performance for numeric, date, and time data types.
Byte-dictionary encoding
Byte-dictionary encoding is very effective when a column contains a limited number of unique values. This encoding is optimal when the data domain of a column is fewer than 256 unique values. Byte-dictionary encoding is space-efficient if a CHAR column holds long character strings.
In byte dictionary encoding, a separate dictionary of up to 256 unique values is created for each block of column values on disk.
Delta encoding
Delta encodings are very useful for date, time columns. It is one of the commonly used compression encoding for date, time columns in a Redshift.
LZO encoding
LZO encoding provides a very high compression ratio with good performance on CHAR and VARCHAR columns with long character strings. You can use this encoding for free-form text, such as product descriptions, user comments, or JSON strings.
Mostly encoding
The Mostly encoding works well on SMALLINT, INT and BIGINT columns. In general, the mostly encodings work with the following data types:
- SMALLINT/INT2 – MOSTLY8
- INT/INTEGER/INT4 – MOSTLY16
- BIGINT/INT8 – MOSTLY32
Run length encoding
Run length encoding replaces a value that is repeated consecutively with a token that consists of the value and a count of the number of consecutive occurrences.
For example, if ONE is repeated 3 time, then the compressed value would be {3, ONE}. Similarly, if TWO is repeated twice, the compressed vlaue would be {2, TWO}, and so on.
Text255 and Text32k encodings
Text255 and text32k encodings are useful for compressing VARCHAR columns in which the same words recur often.
Zstandard encoding
Zstandard (ZSTD) encoding provides a high compression ratio with very good performance across diverse datasets. ZSTD works especially well with CHAR and VARCHAR columns that store a wide range of long and short strings, such as product descriptions, user comments, logs, and JSON strings.
Redshift Compression Encode Example
For example, following create table syntax uses best compression encoding algorithms.
CREATE TABLE tmp_encode
(
col0 TEXT encode RAW,
col1 BOOLEAN encode RAW,
col2 BOOL encode RAW,
col3 REAL encode ZSTD,
col4 FLOAT4 encode ZSTD,
col5 DOUBLE PRECISION encode ZSTD,
col6 FLOAT8 encode ZSTD,
col7 FLOAT encode ZSTD,
col8 DECIMAL encode AZ64,
col9 NUMERIC encode AZ64,
col10 CHARACTER encode LZO,
col11 CHAR encode LZO,
col12 CHARACTER VARYING encode LZO,
col13 TEXT encode LZO,
col14 VARCHAR encode LZO,
col15 VARCHAR(100) encode ZSTD,
col16 SMALLINT encode MOSTLY8,
col17 INT2 encode MOSTLY8,
col18 INTEGER encode MOSTLY16,
col19 INT encode MOSTLY16,
col20 INT4 encode MOSTLY16,
col21 BIGINT encode MOSTLY32,
col22 INT8 encode MOSTLY32,
col23 DATE encode DELTA32K,
col24 TIME encode DELTA,
col25 TIMETZ encode DELTA,
col26 TIMESTAMP encode DELTA32K,
col27 TIMESTAMPTZ encode DELTA32K,
col28 TIME encode DELTA
) DISTKEY(col0) SORTKEY(col0);
Redshift Compression ENCODE AUTO
Choosing right compression encoding is depends upon the type of column and data you are inserting into the Redshift table. To make this process easy for the database administrator and developer, Redshift allows you to enable auto encode to automatically choose best compression encode.
Redshift compression ENCODE AUTO is the default for tables. When you specify the auto encoding, Amazon Redshift automatically manages compression encoding for all columns in the table.
For example, following Redshift DDL enable ENCODE AUTO on the table.
create table tmp_encode_auto (
ID int,
name text,
dept int
) ENCODE auto;
However, if you specify compression encoding for any column in the table, the Redshift table is no longer set to ENCODE AUTO. Amazon Redshift no longer automatically manages compression encoding for all columns in the table.
How to Choose Correct Compression Encode in Redshift?
You don’t need to understand all the different algorithms to select the best one for your data in Amazon Redshift. The AWS Redshift provides a very useful command to determine the best encoding for each column in your table. The ANALYZE COMPRESSION command performs compression analysis and produces a report with the suggested compression encoding for the Redshift tables analyzed.
For example,
ANALYZE COMPRESSION tmp_encode_auto;
|Table |Column|Encoding|Est_reduction_pct|
|---------------|------|--------|-----------------|
|tmp_encode_auto|id |raw |0.00 |
|tmp_encode_auto|name |raw |0.00 |
|tmp_encode_auto|dept |raw |0.00 |
Hope this helps 🙂