Redshift Table Data Skew and How to avoid it

  • Post author:
  • Post last modified:February 6, 2023
  • Post category:Redshift
  • Reading time:6 mins read

You will hear a lot about “Data Skew” if you are developing data warehouse on Redshift, Netezza, Teradata, hive or Impala database.

In the MPP database, performance of the system is directly linked to uniform distribution of the user data across all data node in the system. When you create a table and then load the data into the system, the rows of the table should be distributed uniformly among all the data nodes. If some data node slices have more rows of a table than others, this scenarios is called skew. This can lead to longer query runtimes and suboptimal performance because some nodes are heavily loaded while others are underutilized.

Redshift Table Data Skew

When Redshift Table Data skew occurs?

Redshift Skew can happen while distributing or loading the data into the following types of tables:

  • Base tables : Database administrators define the schema and create tables. Developer deploy the required table DDL into the Redshift database. If tables does not distributed on proper DIST KEY then table data skew occurs.
  • Intra-session tables : Applications or SQL users create temp tables that are required for the application. If tables does not distributed on proper DIST KEY then table data skew occurs.

On Redshift database, data in the table should be evenly distributed among all the data node slices in the Redshift cluster. If the data node slices with more row and its associated data node will have to work hard, longer and need more resource to process the data that is required for client application. These data nodes become the performance bottleneck for queries that are being processed on the Redshift cluster. This type of the skew is caused by the bad distribution and is called table skew.

How to identify Redshift Table Data skew?

You can query the Redshift system tables to identify the table with data skew. The following query identifies tables that have uneven data distribution (data skew) rows.

select trim(pgn.nspname) as schema,
trim(a.name) as table,
id as tableid,
decode(pgc.reldiststyle,0, 'even',1,det.distkey ,8,'all') as distkey,
dist_ratio.ratio::decimal(10,4) as skew,
det.head_sort as "sortkey",
det.n_sortkeys as "#sks", b.mbytes,
decode(b.mbytes,0,0,((b.mbytes/part.total::decimal)*100)::decimal(5,2)) as pct_of_total,
decode(det.max_enc,0,'n','y') as enc, a.rows,
decode( det.n_sortkeys, 0, null, a.unsorted_rows ) as unsorted_rows ,
decode( det.n_sortkeys, 0, null, decode( a.rows,0,0, (a.unsorted_rows::decimal(32)/a.rows)*100) )::decimal(5,2) as pct_unsorted
from (
select db_id,
id,
name,
sum(rows) as rows,
sum(rows)-sum(sorted_rows) as unsorted_rows
from stv_tbl_perm a
group by db_id, id, name
) as a
join pg_class as pgc
on pgc.oid = a.id
join pg_namespace as pgn
on pgn.oid = pgc.relnamespace
left outer join (select tbl, count(*) as mbytes
from stv_blocklist group by tbl) b
on a.id=b.tbl
inner join (select attrelid,
min(case attisdistkey when 't' then attname else null end) as "distkey",
min(case attsortkeyord when 1 then attname else null end ) as head_sort ,
max(attsortkeyord) as n_sortkeys,
max(attencodingtype) as max_enc
from pg_attribute group by 1) as det
on det.attrelid = a.id
inner join ( select tbl, max(mbytes)::decimal(32)/min(mbytes) as ratio
from (select tbl, trim(name) as name, slice, count(*) as mbytes
from svv_diskusage group by tbl, name, slice )
group by tbl, name ) as dist_ratio
on a.id = dist_ratio.tbl
join ( select sum(capacity) as total
from stv_partitions where part_begin=0 ) as part on 1=1
where mbytes is not null
order by mbytes desc;

If a table has a skew value of 4.00 or higher, consider modifying its data distribution style.

How to Avoid Redshift Table Data Skew?

You can avoid the Redshift table skew by following below steps:

Follow best table design best approach to design table based on your requirements.

You can follow: Amazon Redshift Distribution Types and Examples

  • Identify the column which is unique and use that in DIST KEY.
  • If any key is used largely in equi-join, use that in DIST KEY.

Read:

Hope this helps 🙂