Netezza Skew and How to avoid it

You will hear a lot about “Netezza Skew” if you are developing data warehouse on Netezza, Redshift, Teradata, hive or Impala database.

The performance of the system is directly linked to uniform distribution of the user data across all of the data slices in the system. When you create a table and then load the data into the system, the rows of the table should be distributed uniformly among all the data slices. If some data slices have more rows of a table than others this scenarios is called skew.

Netezza Skew can happen while distributing or loading the data into the following types of tables:

Base tables — Database administrators define the schema and create tables.
Intra-session tables — Applications or SQL users create temp tables.

On Netezza server, data in the table should be evenly distributed among all the data slices. If the data slices with more row and its associated SPU and FPGA will have to work hard, longer and need more resource to process the data. These data slices and SPUs become the performance bottleneck for queries that are being processed on the NPS. This type of the skew is caused by the bad distribution and is called table or explicit skew.

Below is how skewed table looks like when checked in Administrator tool

netezza skew table

Specifying Distribution Keys

Netezza uses the table’s distribution key to determine how to distribute (or stripe) the table’s data across all active data slices in the system. The Netezza system requires that all tables have a distribution method, either hash or random.

When you use the commands CREATE TABLE or CREATE TABLE AS, you can either specify the method or allow the Netezza to select one.

With the DISTRIBUTE ON (hash) command, you can specify up to four columns as the distribution key.

If there is no obvious group of columns that can be combined as the distribution key, you can specify random distribution. Random distribution means that the Netezza distributes the data evenly (in a round-robin format) across all the data slices.

Random distribution results in the following:

Avoiding skew when loading data.
Eliminating the need to pick a distribution key when loading a large database that has many tables with a small number of rows. In such cases picking a good distribution key may have little performance benefit, but it gains the advantage of equal distribution of data.
Allowing you to verify a good distribution key by first loading the data round-robin, then using the GENERATE STATISTICS command and running selects on the database columns to get the min/max and counts. With this information, you can better choose which columns to use for the distribution key.
If you do not specify a distribution when you create a table, the system chooses a distribution key and there is no guarantee what that choice will be. If you use particular key largely in equi-join clauses, then that key is a good choice for the distribution key.

A good distribution key should always distribute data uniformly. Good distributed table looks like below when checked in Admin tool

Right distribution key without Netezza skew

Identify the Skew using utility Scripts

IBM has provided utility to identify the skew in the tables or records. Below are the scripts that you can use to identify skew in Netezza system:

nz_skew – Identify any issues with data skew in the system.
nz_record_skew – This script can be used to check record skew for a given table

Read: