Конференция Arenadata
Новое время — новый Greenplum
Мы приглашаем вас принять участие в конференции, посвященной будущему Open-Source Greenplum 19 сентября в 18:00:00 UTC +3. Встреча будет проходить в гибридном формате — и офлайн, и онлайн. Онлайн-трансляция будет доступна для всех желающих.
Внезапное закрытие Greenplum его владельцем — компанией Broadcom - стало неприятным сюрпризом для всех, кто использует или планирует начать использовать решения на базе этой технологии. Многие ожидают выхода стабильной версии Greenplum 7 и надеются на её дальнейшее активное развитие.
Arenadata не могла допустить, чтобы разрабатываемый годами Open-Source проект Greenplum прекратил своё существование, поэтому 19 сентября мы представим наш ответ на данное решение Broadcom, а участники сообщества получат исчерпывающие разъяснения на все вопросы о дальнейшей судьбе этой технологии.

На конференции вас ждёт обсуждение следующих тем:

  • План возрождения Greenplum;
  • Дорожная карта;
  • Экспертное обсуждение и консультации.
Осталось до события

Distribution

Overview of distribution policies

Distribution is one of the most important concepts underlying ADB. It means storing table rows on different segments of the cluster. The more uniform the distribution of the data between segments is — the higher the whole cluster performance is. All segments should contain approximately equal portions of data.

All ADB tables are distributed. When you create or alter a table, you optionally specify one of the available policies — to determine the table row distribution. Distribution policies are listed below.

Distribution policies
Type Description Example

DISTRIBUTED BY (<column(-s)>)

A hash distribution. The particular segment is chosen on the basis of hashes. These hashes are calculated for the fields specified in brackets. Recommended for tables that have primary keys or unique columns — these columns can be used as distribution key (the field that is used to choose the segment)

CREATE TABLE test_distr_key (a INT PRIMARY KEY, b TEXT)
DISTRIBUTED BY (a);

DISTRIBUTED REPLICATED

A full distribution. A copy of the table is stored on every cluster segment. Recommended only for small tables (e.g. reference tables). Can be useful to avoid any motion when joining tables

CREATE TABLE test_distr_replicated (a TEXT, b TEXT)
DISTRIBUTED REPLICATED;

DISTRIBUTED RANDOMLY

A random distribution that is based on the round-robin algorithm. Since the system chooses the particular segment randomly, uniformity is not guaranteed. Recommended for other cases — when tables do not have unique columns and the table size is not small

CREATE TABLE test_distr_random (a TEXT, b INT)
DISTRIBUTED RANDOMLY;

The image below shows (in a simplified way) how to choose the appropriate distribution policy.

Choosing the distribution policy
Choosing the distribution policy
Choosing the distribution policy
Choosing the distribution policy

Recommendations on avoiding data skew

For the best possible performance, all cluster segments should contain equal portions of data. Otherwise, the skew can occur. The skew means the data becomes unbalanced and the segments with more data start to work harder to perform their part of the query processing. To avoid the skew and make the data distribution more efficient, follow the recommendations listed below:

  1. Always explicitly specify the distribution policy when creating a table. If the DISTRIBUTED clause is missed, the table distribution policy is controlled by the gp_create_table_random_default_distribution configuration parameter. That can lead to choosing the first suitable column as the distribution key, unless the parameter is explicitly set to ON. See Greenplum documentation for more information.

  2. Avoid selecting the following non-uniformly distributed data fields as distribution keys:

    • Dates and timestamps.

    • The fields that contain a lot of the NULL values.

    • Custom data type fields with the unknown subsequent distribution.

    Instead, try to choose a distribution key that is unique for each record, such as the primary key or any unique column without NULL values.

    You can also specify custom hash functions for a hash distribution. To do this, fill in the <opclass> argument in the DISTRIBUTED BY clause. It defines an operator class name. For more details on syntax, see Tables.

  3. Make sure that the data is uniformly distributed after the initial and incremental loads. To obtain the segment where the current table row is stored, you can run the following query:

    SELECT gp_segment_id, * FROM test_distr_key;

    The result can be similar to:

     gp_segment_id | a |   b
    ---------------+---+-------
                 4 | 1 | first
    (1 row)

    The gp_segment_id system field corresponds to the content column of the pg_catalog.gp_segment_configuration table that contains information about the cluster segments. This field is hidden by default. It can be obtained for all tables except the tables with the DISTRIBUTED REPLICATED distribution policy (as in this case the data is duplicated on all segments).

    CAUTION
    It is not recommended to request the gp_segment_id field for large tables on a regular basis.
  4. If necessary, you can change the distribution type using one of the following ways.

    • Changing the distribution policy:

      ALTER TABLE test_distr_replicated SET DISTRIBUTED RANDOMLY;
    • Redistribution without changing the distribution key:

       ALTER TABLE test_distr_key SET WITH (REORGANIZE=TRUE);
  5. The correct data distribution also affects the speed of queries execution, especially the JOIN queries. The main recommendations on data querying are:

    • Do not choose the fields used in the WHERE clauses as distribution keys. For the best performance, all segments should handle an equal share of the query workload. But if the WHERE predicate in a query references a distribution key, the query processing work is concentrated on only one segment that stores all the rows with the requested distribution key value.

    • Remember that joining tables by their distribution keys is the fastest way to join tables due to identical values being always stored on a single segment. When tables share a common distribution key, the work of joining or sorting on this key is done locally — without distributing data tuples across the segments. Local operations are always faster than distributed ones.

    • Use the DISTRIBUTED REPLICATED policy for small frequently used tables. That improves queries performance by making it unnecessary to broadcast a table to all segments (so-called Broadcast Motion).

    • Choose the same data types for the fields that are used to join tables.

Examples

Correct distribution

 

  1. Create two tables as shown below:

    CREATE TABLE t1 (id INT, msg TEXT) DISTRIBUTED BY(id);
    CREATE TABLE t2 (id INT, msg TEXT) DISTRIBUTED BY(id);
  2. Insert some data into the tables:

    INSERT INTO t1 (id, msg) VALUES (1, 'text1'), (2, 'text2');
    INSERT INTO t2 (id, msg) VALUES (1, 'text3'), (2, 'text4');
  3. Check that in both tables the rows with the same distribution key value are stored on one segment:

    SELECT gp_segment_id, * FROM t1;

    The result is:

     gp_segment_id | id |  msg
    ---------------+----+-------
                 4 |  1 | text1
                 9 |  2 | text2
    (2 rows)
    SELECT gp_segment_id, * FROM t2;

    The result is shown below. You can see that segment identifiers are similar to the listed above:

     gp_segment_id | id |  msg
    ---------------+----+-------
                 9 |  2 | text4
                 4 |  1 | text3
    (2 rows)
  4. Use the EXPLAIN statement to see the plan of joining two tables.

    EXPLAIN SELECT * FROM t1 JOIN t2 ON t1.id = t2.id;

    The result is:

                                       QUERY PLAN
    ---------------------------------------------------------------------------------
     Gather Motion 16:1  (slice1; segments: 16)  (cost=0.00..862.00 rows=2 width=20)
       ->  Hash Join  (cost=0.00..862.00 rows=1 width=20)
             Hash Cond: (t1.id = t2.id)
             ->  Seq Scan on t1  (cost=0.00..431.00 rows=1 width=10)
             ->  Hash  (cost=431.00..431.00 rows=1 width=10)
                   ->  Seq Scan on t2  (cost=0.00..431.00 rows=1 width=10)
     Optimizer: Pivotal Optimizer (GPORCA)
    (7 rows)

    The command output shows that no data motions are needed for this query, except Gather Motion — which means a standard process of sending the data from Segments to Master.

Wrong distribution

 

     Different data types of keys

  1. Create one additional table as shown below:

    CREATE TABLE t3 (id INT, msg TEXT) DISTRIBUTED BY(msg);
  2. Insert some data into the table:

    INSERT INTO t3 (id, msg) VALUES (1, 'text5'), (2, 'text6');
  3. List the segments that store the table data:

    SELECT gp_segment_id, * FROM t3;

    The result is:

     gp_segment_id | id |  msg
    ---------------+----+-------
                 2 |  1 | text5
                10 |  2 | text6
    (2 rows)

    You can see that the segments are not similar to those that were shown for the t1 table in the previous example.

  4. Use the EXPLAIN statement to see the plan of joining the t3 and t1 tables.

    EXPLAIN SELECT * FROM t1 JOIN t3 ON t1.id = t3.id;

    The result is:

                                                  QUERY PLAN
    -------------------------------------------------------------------------------------------------------
     Gather Motion 16:1  (slice2; segments: 16)  (cost=0.00..862.00 rows=2 width=20)
       ->  Hash Join  (cost=0.00..862.00 rows=1 width=20)
             Hash Cond: (t1.id = t3.id)
             ->  Seq Scan on t1  (cost=0.00..431.00 rows=1 width=10)
             ->  Hash  (cost=431.00..431.00 rows=2 width=10)
                   ->  Broadcast Motion 16:16  (slice1; segments: 16)  (cost=0.00..431.00 rows=2 width=10)
                         ->  Seq Scan on t3  (cost=0.00..431.00 rows=1 width=10)
     Optimizer: Pivotal Optimizer (GPORCA)
    (8 rows)

    In comparison with the previous example, the Broadcast Motion is added. This operation means that the table copy is sent to all segments — as they have no necessary data for joining. This is not critical for small tables, but in case of joining large tables this can lead to significant performance problems. To fix it in our example, change the distribution key in the t3 table from msg to id.

 
     Not unique field as a key

  1. Create one additional table as shown below:

    CREATE TABLE t4 (id INT, msg TEXT) DISTRIBUTED BY(id);
  2. Insert some data into the table. Notice that the first column values are not unique:

    INSERT INTO t4 (id, msg) VALUES (1, 'text7'), (1, 'text8');
  3. List the segments that store the table data:

    SELECT gp_segment_id, * FROM t4;

    The result is:

     gp_segment_id | id |  msg
    ---------------+----+-------
                 4 |  1 | text7
                 4 |  1 | text8
    (2 rows)

    You can see that one segment is used for two rows as they have the same value of the distribution key. You should avoid such situations as they cause the workload to increase on single segments. To fix it in our example, choose the unique field as the distribution key. If this option is impossible, change the distribution policy to the DISTRIBUTED RANDOMLY value.

Found a mistake? Seleсt text and press Ctrl+Enter to report it