Greenplum is a massive parallel processing data store, and data is distributed across segments as per the definition of the distribution strategy. Greenplum Table Distribution uses the two types of distribution, Hash and Random. When you create or alter tables you will have to tell the system which distribution it should use. By default, Greenplum database data distribution uses the hash algorithm.
Types of Greenplum Data Distribution
Greenplum database distributes data using two methods
- Column Oriented/Hash Distribution: Distributes data evenly across all segment using the column specified in DISTRIBUTED BY (Column_Name)
See: How Greenplum Hash Distribution Works?
- Random Distribution: Distributes data randomly across all segments using round-robin distribution. In this type of distribution, there won’t be any skew on the segments. It is specified by DISTRIBUTED RANDOMLY.
Table will never be collocated in case of random distribution. Either redistribute or broadcast operations require performing a table join.
Random distribution should be used for small tables and when a Hash distribution method is not feasible due to significant data skew
Table distribution can be modified at any given point of the time. If the table has unique key or primary key, that key is must be used to distribute the data. In case if you miss to provide the distribution key, Greenplum database uses the default strategy. By default, it distributes data on first column using hash strategy. You should avoid using default distribution as it may cause the skew.
Choosing the Greenplum Table Distribution Key
Consider the following best practices when deciding on a distribution strategy:
- Commonly used joined tables should be distributed on same key. This allows the Greenplum to perform the location operations hence increasing performance.
- An Integer column when possible. Hash joins works better on the integers hence integer is better choice for distribution key.
- A column that does not allow NULL, or minimum NULL values. Null values will hash to the same distribution and potentially cause skew.
- Explicitly define a column or random distribution for all tables. Do not use the default as it may cause skew on segments.
- You should use a single best suited column that will distribute data across all segments evenly. You should distribute data on unique or primary key columns.
- Do not distribute on columns that will be used in the WHERE clause of a query. You should partition the data on the column that is used in WHERE clause.
- Do not distribute on dates or timestamps. Distributing data on date or timestamp may case poor system performance.
- The distribution key column data should contain unique values or very high cardinality.
- If a single column cannot achieve an even distribution, use a multi-column distribution key, but no more than two columns. Additional columns in distribution require additional time in the hashing process.
- In case if a two-column distribution key cannot achieve an even distribution of data, use a random distribution. Multi-column distribution keys in most cases require motion operations to join tables, so they offer no advantages over a random distribution.
Redistribute and Broadcast Motions
When two tables are distributed on same key and that key is used in the join operation than system performs all the operations locally as required data is available within segment. Such tables are called co-located tables.
Whenever system cannot perform the local join, then it performs either redistribute or broadcast motion to co-locate table.
Redistribute Motions: In order to perform a local join, rows must be located together on the same segment and in case if absence of which, a dynamic redistribution of the needed rows from one of the segment instance to another segment instance will be performed. This operation might be quite expensive in case of large fact tables.
Broadcast Motions: In the broadcast motion, every segment will sends the copy of table to all segment instances. Optimizer always picks only small table for broadcast motions.
Greenplum Table Distribution: Syntax.
CREATE TABLE tablename ( Col1 data_type, Col2 data_type, Col3 data_type …) [DISTRIBUTE BY (column_name)] -- Used for Hash distribution [DISTRIBUTED RANDOMLY] -- Used for Random distribution, distributes on round robin fashion
View Greenplum Table Distribution
To see the data distribution of a table’s rows, you can run a query such as:
training=# SELECT gp_segment_id, count(*) FROM table_name GROUP BY gp_segment_id;
This query will return the number of rows on each segment.