This post is all about how data is distributed (Netezza distribution key) in Netezza server. Feel free to make comments or suggestions to improve it, or pass it on if you like.
Let’s first understand how NPS stores the data on disk drives. Each Snippet Processor in the Snippet Processing Unit (SPU) has a dedicated hard drive has its separate CPU, FPGA, separate RAM memory, hard disks and the data on stored on drive is called a data slice.
Read:
- Changing Netezza Table Distribution Key
- Cluster Based Tables (CBT) in Netezza
- Working with Netezza Zone Maps
- Netezza Skew and How to Avoid it
Netezza Partitions
Before going to Netezza distribution you may want to learn about SPU. Each snippet processing unit (SPU) disk is divided into three partitions.
Primary Partition: This partition contains its own set of data.
Mirror Partition: This partition contains another SPU’s primary partition data. The mirror partition automatically mirrors another SPU’s primary data slice. This process is called mirroring. In multi-rack configuration, a SPU mirror is located in another rack. This enables the Netezza to handle the fault tolerance. In case of racks fails, data is mirrored in other rack and can be recovered. This process is also called replication.
Swap Partition: Swap partition is used for aggregating, sorting and other query operations. Swap partition is not mirrored in other word it is called temp or intermediate processing data. Netezza makes use of this temp partition to perform the SQL operations.
Tables are split across multiple SPUs, data slices and the data is stored in groups in same or nearby data slices according to rows, while data is compressed according to identical column values i.e. columnar compression.
Now the question is, what is Netezza Distribution key and how the data is distributed?
The actual distribution of data across disks is determined by the Netezza distribution key(s) provided in table definition.
In fact, in Netezza there are two types of distribution methods available, a Hash and Random distribution. There should be a distribution column(s) specified, if the DISTRIBUTE ON clause is not specified, the system uses first column as the default Distribution column using the hash algorithm. The maximum number of columns that can specify in distribution on clause is four. When the system creates record in the Netezza data base it assign those records to logical slices based on their distribution key(s) and then pushed to physical stored identified in logical slices. Netezza creates zone maps for the loaded records
Data should uniformly distribute across all the data slices, the processing performance is directly dependent on it. The even distribution of the data over all data slices is directly related to key column(s) used in distribution on clause. Bad distribution key(s) can cause the Netezza to place data on same slices and that will cause the data skew, a major performance bottleneck.
For example, consider table is distributed on the column that contains only flags such as Y or N. When data is inserted into table, hashed value will be pointing to only two data slices and data will be place on those slices.
A distribution method that distributes data evenly across all data slices is the single most important factor that can increase overall system performance. Bad distribution key(s) can result in uneven distribution of a table across data slices and SPUs that will cause skew, causing data to be redistributed or broadcasted, of course that will hamper the system performance. It is very important to identify the correct and proper distribution key when creating table definition and that require the extremely good knowledge on data and system.
Keep the table structure simple that helps to attain maximum co-locations so that Netezza can use its MPP power to give you result as fast as possible.
Distribution on multiple keys
The maximum number of columns that can specify in distribution on clause is four. Keep in mind that you should not use the multiple keys to distribute data if the chosen key provides good data distribution. However, distribution keys should be used in join condition in order to achieve co-location. If we use multiple distribution keys, then those columns must be used in joining conditions. That is very rare case so it is better to stick to single distribution key that provides good data distribution. The distribution is not an index – it is based on hash algorithm. In multi-key distributions, the join does not look at one column then the next; it looks at all three combinations at once because they are hashed together for the distribution.
Leaving out ‘Distribution on’ clause in Netezza distribution
If Netezza distribution key is not specified, system will, by default distribute the data on first column (just like Teradata) using hash partition method. This process sometime can cause the skew if the first column is something like flag. The best method is to distribute data on column or on random if you are not able identify the best column.
What are factors when choosing Netezza distribution key?
When choosing which columns should be the distribution key for a table, your goal should be uniform distribution of the rows and optimal access to the data. The main goal of the distribution is to get data co-located.
Consider the following factors when choosing distribution keys:
- The more distinct the distribution key values, the better.
- The system distributes rows with the same distribution key value to the same data slice.
- Parallel processing is more efficient when you have distributed table rows evenly across the data slices.
- Tables used together should use the same columns for their distribution key. For example, in an order system application, use the Customer ID as the distribution key for both the customer table and the order table.
- If a particular key is used largely in Equi-join clauses, then that key is a good choice for the distribution key.