You will hear a lot about “Data Skew” if you are developing data warehouse on Redshift, Netezza, Teradata, hive or Impala database.
In the MPP database, performance of the system is directly linked to uniform distribution of the user data across all data node in the system. When you create a table and then load the data into the system, the rows of the table should be distributed uniformly among all the data nodes. If some data node slices have more rows of a table than others, this scenarios is called skew. This can lead to longer query runtimes and suboptimal performance because some nodes are heavily loaded while others are underutilized.
When Redshift Table Data skew occurs?
Redshift Skew can happen while distributing or loading the data into the following types of tables:
- Base tables : Database administrators define the schema and create tables. Developer deploy the required table DDL into the Redshift database. If tables does not distributed on proper DIST KEY then table data skew occurs.
- Intra-session tables : Applications or SQL users create temp tables that are required for the application. If tables does not distributed on proper DIST KEY then table data skew occurs.
On Redshift database, data in the table should be evenly distributed among all the data node slices in the Redshift cluster. If the data node slices with more row and its associated data node will have to work hard, longer and need more resource to process the data that is required for client application. These data nodes become the performance bottleneck for queries that are being processed on the Redshift cluster. This type of the skew is caused by the bad distribution and is called table skew.
How to identify Redshift Table Data skew?
You can query the Redshift system tables to identify the table with data skew. The following query identifies tables that have uneven data distribution (data skew) rows.
select trim(pgn.nspname) as schema, trim(a.name) as table, id as tableid, decode(pgc.reldiststyle,0, 'even',1,det.distkey ,8,'all') as distkey, dist_ratio.ratio::decimal(10,4) as skew, det.head_sort as "sortkey", det.n_sortkeys as "#sks", b.mbytes, decode(b.mbytes,0,0,((b.mbytes/part.total::decimal)*100)::decimal(5,2)) as pct_of_total, decode(det.max_enc,0,'n','y') as enc, a.rows, decode( det.n_sortkeys, 0, null, a.unsorted_rows ) as unsorted_rows , decode( det.n_sortkeys, 0, null, decode( a.rows,0,0, (a.unsorted_rows::decimal(32)/a.rows)*100) )::decimal(5,2) as pct_unsorted from ( select db_id, id, name, sum(rows) as rows, sum(rows)-sum(sorted_rows) as unsorted_rows from stv_tbl_perm a group by db_id, id, name ) as a join pg_class as pgc on pgc.oid = a.id join pg_namespace as pgn on pgn.oid = pgc.relnamespace left outer join (select tbl, count(*) as mbytes from stv_blocklist group by tbl) b on a.id=b.tbl inner join (select attrelid, min(case attisdistkey when 't' then attname else null end) as "distkey", min(case attsortkeyord when 1 then attname else null end ) as head_sort , max(attsortkeyord) as n_sortkeys, max(attencodingtype) as max_enc from pg_attribute group by 1) as det on det.attrelid = a.id inner join ( select tbl, max(mbytes)::decimal(32)/min(mbytes) as ratio from (select tbl, trim(name) as name, slice, count(*) as mbytes from svv_diskusage group by tbl, name, slice ) group by tbl, name ) as dist_ratio on a.id = dist_ratio.tbl join ( select sum(capacity) as total from stv_partitions where part_begin=0 ) as part on 1=1 where mbytes is not null order by mbytes desc;
If a table has a skew value of 4.00 or higher, consider modifying its data distribution style.
How to Avoid Redshift Table Data Skew?
You can avoid the Redshift table skew by following below steps:
Follow best table design best approach to design table based on your requirements.
- Use Redshift table distribution style based on your requirement.
You can follow: Amazon Redshift Distribution Types and Examples
- Identify the column which is unique and use that in DIST KEY.
- If any key is used largely in equi-join, use that in DIST KEY.
Read:
- Redshift ANALYZE Command to Collect Statistics and Best Practices
- Amazon Redshift Data Types and Best Practices
Hope this helps 🙂