Distribution
Overview of distribution policies
Distribution is one of the most important concepts underlying ADB. It means storing table rows on different segments of the cluster. The more uniform the distribution of data across the segments, the higher the performance of the whole cluster. All segments should contain approximately equal portions of data.
All ADB tables are distributed. When you create or alter a table, you can optionally specify one of the available policies that determines how table rows are distributed. The distribution policies are listed below.
Type | Description
---|---
`DISTRIBUTED BY (<column(s)>)` | A hash distribution. The target segment is chosen based on a hash calculated for the fields specified in parentheses. Recommended for tables that have primary keys or unique columns, since these columns can be used as the distribution key (the field that is used to choose the segment)
`DISTRIBUTED REPLICATED` | A full distribution. A copy of the table is stored on every cluster segment. Recommended only for small tables (e.g. reference tables). Can be useful to avoid any motion when joining tables
`DISTRIBUTED RANDOMLY` | A random distribution based on the round-robin algorithm. Since the system chooses the target segment randomly, uniformity is not guaranteed. Recommended for the remaining cases: when a table has no unique columns and is not small
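For illustration, the sketch below shows how each policy can be specified in a `CREATE TABLE` statement. The table and column names are hypothetical:

```sql
-- Hash distribution: the segment is chosen by the hash of the id column
CREATE TABLE customers (id INT PRIMARY KEY, name TEXT) DISTRIBUTED BY (id);

-- Full distribution: every segment stores a complete copy of the table
CREATE TABLE countries (code CHAR(2), name TEXT) DISTRIBUTED REPLICATED;

-- Random distribution: rows are spread using the round-robin algorithm
CREATE TABLE raw_events (payload TEXT) DISTRIBUTED RANDOMLY;
```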
The image below shows (in a simplified way) how to choose the appropriate distribution policy.
Recommendations on avoiding data skew
For the best possible performance, all cluster segments should contain equal portions of data. Otherwise, data skew can occur: the data becomes unbalanced, and the segments that store more data have to work harder to perform their part of the query processing. To avoid skew and make the data distribution more efficient, follow the recommendations listed below:
- Always explicitly specify the distribution policy when creating a table. If the `DISTRIBUTED` clause is missing, the table distribution policy is controlled by the `gp_create_table_random_default_distribution` configuration parameter. That can lead to choosing the first suitable column as the distribution key, unless the parameter is explicitly set to `ON`. See the Greenplum documentation for more information. You can check the current value of this parameter as shown below.
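  A quick way to inspect the current value (a minimal sketch):

  ```sql
  -- Shows whether tables created without a DISTRIBUTED clause
  -- get a random distribution by default
  SHOW gp_create_table_random_default_distribution;
  ```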
- Avoid selecting the following non-uniformly distributed data fields as distribution keys:

  - dates and timestamps;
  - fields that contain a lot of `NULL` values;
  - custom data type fields with an unknown subsequent distribution.

  Instead, try to choose a distribution key that is unique for each record, such as the primary key or any unique column without `NULL` values (see the sketch below).
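  For illustration, a minimal sketch of the `NULL` problem, using a hypothetical table. All rows with a `NULL` distribution key receive the same hash and therefore land on the same segment:

  ```sql
  CREATE TABLE events (happened_at TIMESTAMP, note TEXT) DISTRIBUTED BY (happened_at);

  -- 1000 rows with an unknown date: all of them hash to one segment
  INSERT INTO events (happened_at, note)
  SELECT NULL, 'missing date' FROM generate_series(1, 1000);

  -- The row counts per segment reveal the skew
  SELECT gp_segment_id, COUNT(*) FROM events GROUP BY gp_segment_id;
  ```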
- Make sure that the data is uniformly distributed after the initial and incremental loads. To obtain the segment where the current table row is stored, you can run the following query:

  ```sql
  SELECT gp_segment_id, * FROM test_distr_key;
  ```

  The result can be similar to:

  ```
   gp_segment_id | a |   b
  ---------------+---+-------
               4 | 1 | first
  (1 row)
  ```

  The `gp_segment_id` system field corresponds to the `content` column of the `pg_catalog.gp_segment_configuration` table that contains information about the cluster segments. This field is hidden by default. It can be obtained for all tables except the tables with the `DISTRIBUTED REPLICATED` distribution policy (as in this case the data is duplicated on all segments).

  CAUTION: It is not recommended to request the `gp_segment_id` field for large tables on a regular basis.
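  To check the overall uniformity rather than individual rows, you can aggregate by segment (a sketch; `test_distr_key` is the table from the query above):

  ```sql
  -- Row counts per segment: large differences indicate skew
  SELECT gp_segment_id, COUNT(*) AS rows_on_segment
  FROM test_distr_key
  GROUP BY gp_segment_id
  ORDER BY rows_on_segment DESC;
  ```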
- If necessary, you can change the distribution type in one of the following ways.

  - Changing the distribution policy:

    ```sql
    ALTER TABLE test_distr_replicated SET DISTRIBUTED RANDOMLY;
    ```

  - Redistribution without changing the distribution key:

    ```sql
    ALTER TABLE test_distr_key SET WITH (REORGANIZE=TRUE);
    ```
- The correct data distribution also affects the speed of query execution, especially for `JOIN` queries. The main recommendations on data querying are:

  - Do not choose the fields used in `WHERE` clauses as distribution keys. For the best performance, all segments should handle an equal share of the query workload. But if the `WHERE` predicate in a query references a distribution key, the query processing work is concentrated on the single segment that stores all the rows with the requested distribution key value.

  - Remember that joining tables by their distribution keys is the fastest way to join tables, because identical values are always stored on the same segment. When tables share a common distribution key, the work of joining or sorting on this key is done locally, without moving data tuples across the segments. Local operations are always faster than distributed ones (see the sketch after this list).

  - Use the `DISTRIBUTED REPLICATED` policy for small, frequently used tables. This improves query performance by making it unnecessary to broadcast a table to all segments (the so-called Broadcast Motion).

  - Choose the same data types for the fields that are used to join tables.
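For illustration, a minimal sketch of a co-located join, assuming hypothetical `clients` and `sales` tables. Both tables are distributed by `client_id`, and the join columns have the same data type, so each segment joins its own rows locally:

```sql
CREATE TABLE clients (client_id INT, name TEXT)      DISTRIBUTED BY (client_id);
CREATE TABLE sales   (client_id INT, amount NUMERIC) DISTRIBUTED BY (client_id);

-- Matching distribution keys of the same data type: the join itself
-- needs no Broadcast or Redistribute Motion and runs on each segment
EXPLAIN SELECT c.name, s.amount
FROM sales s
JOIN clients c ON s.client_id = c.client_id;
```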
Examples
- Create two tables as shown below:

  ```sql
  CREATE TABLE t1 (id INT, msg TEXT) DISTRIBUTED BY(id);
  CREATE TABLE t2 (id INT, msg TEXT) DISTRIBUTED BY(id);
  ```
- Insert some data into the tables:

  ```sql
  INSERT INTO t1 (id, msg) VALUES (1, 'text1'), (2, 'text2');
  INSERT INTO t2 (id, msg) VALUES (1, 'text3'), (2, 'text4');
  ```
- Check that in both tables the rows with the same distribution key value are stored on the same segment:

  ```sql
  SELECT gp_segment_id, * FROM t1;
  ```

  The result is:

  ```
   gp_segment_id | id |  msg
  ---------------+----+-------
               4 |  1 | text1
               9 |  2 | text2
  (2 rows)
  ```

  Then run the same query against `t2`:

  ```sql
  SELECT gp_segment_id, * FROM t2;
  ```

  The result is shown below. You can see that the segment identifiers match those listed above:

  ```
   gp_segment_id | id |  msg
  ---------------+----+-------
               9 |  2 | text4
               4 |  1 | text3
  (2 rows)
  ```
- Use the `EXPLAIN` statement to see the plan of joining the two tables:

  ```sql
  EXPLAIN SELECT * FROM t1 JOIN t2 ON t1.id = t2.id;
  ```

  The result is:

  ```
                                      QUERY PLAN
  ---------------------------------------------------------------------------------
   Gather Motion 16:1  (slice1; segments: 16)  (cost=0.00..862.00 rows=2 width=20)
     ->  Hash Join  (cost=0.00..862.00 rows=1 width=20)
           Hash Cond: (t1.id = t2.id)
           ->  Seq Scan on t1  (cost=0.00..431.00 rows=1 width=10)
           ->  Hash  (cost=431.00..431.00 rows=1 width=10)
                 ->  Seq Scan on t2  (cost=0.00..431.00 rows=1 width=10)
   Optimizer: Pivotal Optimizer (GPORCA)
  (7 rows)
  ```

  The command output shows that no data motion is needed for this query, except Gather Motion, which is the standard process of sending the data from the segments to the master.
Different distribution keys
- Create one additional table as shown below:

  ```sql
  CREATE TABLE t3 (id INT, msg TEXT) DISTRIBUTED BY(msg);
  ```
- Insert some data into the table:

  ```sql
  INSERT INTO t3 (id, msg) VALUES (1, 'text5'), (2, 'text6');
  ```
- List the segments that store the table data:

  ```sql
  SELECT gp_segment_id, * FROM t3;
  ```

  The result is:

  ```
   gp_segment_id | id |  msg
  ---------------+----+-------
               2 |  1 | text5
              10 |  2 | text6
  (2 rows)
  ```

  You can see that the segments differ from those shown for the `t1` table in the previous example.
Use the
EXPLAIN
statement to see the plan of joining thet3
andt1
tables.EXPLAIN SELECT * FROM t1 JOIN t3 ON t1.id = t3.id;
The result is:
QUERY PLAN ------------------------------------------------------------------------------------------------------- Gather Motion 16:1 (slice2; segments: 16) (cost=0.00..862.00 rows=2 width=20) -> Hash Join (cost=0.00..862.00 rows=1 width=20) Hash Cond: (t1.id = t3.id) -> Seq Scan on t1 (cost=0.00..431.00 rows=1 width=10) -> Hash (cost=431.00..431.00 rows=2 width=10) -> Broadcast Motion 16:16 (slice1; segments: 16) (cost=0.00..431.00 rows=2 width=10) -> Seq Scan on t3 (cost=0.00..431.00 rows=1 width=10) Optimizer: Pivotal Optimizer (GPORCA) (8 rows)
In comparison with the previous example, the Broadcast Motion is added. This operation means that the table copy is sent to all segments — as they have no necessary data for joining. This is not critical for small tables, but in case of joining large tables this can lead to significant performance problems. To fix it in our example, change the distribution key in the
t3
table frommsg
toid
.
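  A minimal sketch of this fix (the statement also redistributes the existing rows across the segments):

  ```sql
  -- Make the distribution keys of t1 and t3 match,
  -- so that subsequent joins on id run locally
  ALTER TABLE t3 SET DISTRIBUTED BY (id);
  ```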
Non-unique field as a key
- Create one additional table as shown below:

  ```sql
  CREATE TABLE t4 (id INT, msg TEXT) DISTRIBUTED BY(id);
  ```
- Insert some data into the table. Notice that the first column values are not unique:

  ```sql
  INSERT INTO t4 (id, msg) VALUES (1, 'text7'), (1, 'text8');
  ```
- List the segments that store the table data:

  ```sql
  SELECT gp_segment_id, * FROM t4;
  ```

  The result is:

  ```
   gp_segment_id | id |  msg
  ---------------+----+-------
               4 |  1 | text7
               4 |  1 | text8
  (2 rows)
  ```

  You can see that one segment stores both rows, as they have the same value of the distribution key. You should avoid such situations, as they increase the workload on single segments. To fix it in our example, choose a unique field as the distribution key. If that is impossible, change the distribution policy to `DISTRIBUTED RANDOMLY`, as shown in the sketch below.
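  Either fix can be applied with `ALTER TABLE` (a sketch; in this toy table the `msg` column happens to be unique):

  ```sql
  -- Option 1: switch to a unique column as the distribution key
  ALTER TABLE t4 SET DISTRIBUTED BY (msg);

  -- Option 2: fall back to random distribution
  ALTER TABLE t4 SET DISTRIBUTED RANDOMLY;
  ```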