When you have a Distribution Key by Hash and the values in that column are unique, the data will spread evenly evenly across all segments in Greenplum system. The Greenplum system distributes the rows with same distribution value to the same segment. This is because the data values in the hash key use a hashing algorithm.
How Hash Algorithm Works in Distributed systems?
Data is stored based on selected field (s) which are used for distribution. When you have a Distribution Key by Hash the values of the Distribution Key are run through a Hash Formula. Then, a map is used to distribute the row to the correct segment. The formula is designed to be consistent so that all like values go to the same segment.
==Data(A) => Hash Function(B) => Logical Segment list(C) => Physical Segment list(D) => Storage(E).
When data arrives at the Greenplum, it is hashed based on field(s) and a hash function (B) is used for this purpose.
For example, Consider 3 node system, logical segment list has 3 unique entries. If there are 24 hashed data items from (B), there are 24 entries in (C), then all having only 4 segment entries. For example (C) has values [1,2,3,1,2,3,1,2,3,1,2,3,1,2,3…..]. Then, a map is used to distribute the row to the correct segment. The formula is designed to be consistent so that all like values go to the same segment.
Perfect Distribution Key
Consider the Student_marks table. Let us assume data in the table are distributed on student_id which is a primary key. In this example, Greenplum system distributes the data evenly across all the segments as depicted in figure.
CREATE TABLE Students_marks ( Student_ID integer, Sub_code char(3), Name text, Marks integer ) DISTRIBUTED BY (Student_ID);
Read:
Poor Distribution Key
Consider the Student_marks table. Now let us store which are distributed on Sub_code. In this example, Greenplum system distributes the data unevenly across all the segments as depicted in figure. Uneven distribution causes data skew for instance node 1 in below figure
CREATE TABLE Students_marks ( Student_ID integer, Sub_code char(3), Name text, Marks integer ) DISTRIBUTED BY (Sub_code);