Greenplum Skew and How to Avoid it

  • Post author:
  • Post last modified:February 28, 2018
  • Post category:Greenplum
  • Reading time:4 mins read

Greenplum is a MPP shared nothing environment. Data is spread across the many segments located on the multiple segment hosts. If the data is distributed properly, no two segments in the system have same data. The even distribution of the data is determined by the column(s) provided in the DISTRIBUTED BY clause. Greenplum skew is the table situation that degrade the performance.

System distributes the rows with same distribution values to same segment. Hence, the more the unique value in the distribution column, the better. In case if the data is distributed on the non-unique column, some segments end up having more data and workload than others. This situation is called the skew.

For Example, consider we have the patient table in Greenplum, and say, table has the distribution key patient_id. Let us say the Greenplum system has 8 segments. Below is how the data will actually be stored in the Greenplum.

Patient_id Segmentid
1 0
2 1
3 2
4 3
5 4
6 5
7 6
8 7
9 0
10 1
11 2
12 3
13 4
14 5
15 6
16 7

How the Greenplum process the query in shared nothing environment?

When you execute the query, master will create the processes and distribute across all the segments to execute.

Say, you executed query like ‘select count(1) from patient’, master will create the 8 processes for each segments. These processes will run in parallel across all the segments to count rows.

Overall response time for a query is measured by the process completion time for all segments. If the data is skewed, the segments with more data will have a longer completion time.

Now let’s see what Skew is,

Skew In Greenplum

In our patient table, data is evenly distributed across all the segments. That is, each segment in cluster has same number of rows. Now let’s say, we redistribute data on region_code. All patients from the south region would go into single segment. This situation is called skew.

Greenplum Skew

When you execute the query ‘select count(1) from patient’ again, segment which has the south region patient will take more time as it has lot of data in it. Overall response time for a query will increase.

Check Data Distribution

To see the data distribution of a Greenplum table’s rows, you can run a query such as:

=# SELECT gp_segment_id, count(*)  
FROM patient 
GROUP BY gp_segment_id;

This query will give the number of rows on each segment. Data distribution check is not required when you have table distributed on unique or primary key. Such columns will give you great distribution of the data.

Data Distribution Best Practices

To know the best practices on tables distribution in Greenplum;

Read: