Конференция Arenadata
Новое время — новый Greenplum
Мы приглашаем вас принять участие в конференции, посвященной будущему Open-Source Greenplum 19 сентября в 18:00:00 UTC +3. Встреча будет проходить в гибридном формате — и офлайн, и онлайн. Онлайн-трансляция будет доступна для всех желающих.
Внезапное закрытие Greenplum его владельцем — компанией Broadcom - стало неприятным сюрпризом для всех, кто использует или планирует начать использовать решения на базе этой технологии. Многие ожидают выхода стабильной версии Greenplum 7 и надеются на её дальнейшее активное развитие.
Arenadata не могла допустить, чтобы разрабатываемый годами Open-Source проект Greenplum прекратил своё существование, поэтому 19 сентября мы представим наш ответ на данное решение Broadcom, а участники сообщества получат исчерпывающие разъяснения на все вопросы о дальнейшей судьбе этой технологии.

На конференции вас ждёт обсуждение следующих тем:

  • План возрождения Greenplum;
  • Дорожная карта;
  • Экспертное обсуждение и консультации.
Осталось до события

Analyze queries

PostgreSQL creates an execution plan for each query. Since the resulting plan is critical to performance, PostgreSQL includes a complex planner that chooses the best plan based on an estimated cost of each alternative option. You can use the EXPLAIN command to see what plan the planner produces for a query.

Create a table for test examples:

CREATE TABLE test_table (field1 integer, field2 text);

Add data to the table and execute the ANALYZE command to create table statistics after data update:

INSERT INTO test_table
  SELECT i, md5(random()::text)
  FROM generate_series(1, 1000000) AS i;
ANALYZE test_table;

Call the EXPLAIN command to see the query plan:

EXPLAIN SELECT * FROM test_table;

The result:

 QUERY PLAN
---------------------------------------------------------------------
Seq Scan on test_table (cost=0.00..18334.00 rows=1000000 width=37)

PostgreSQL can read data from a table in several ways. Seq Scan reports that PostgreSQL reads data sequentially, block by block.

The cost option contains two values: estimated start-up and total costs of the query execution. The start-up cost has the 0.00 value. This is the cost value before the output phase begins, for example, the cost value required to sort in a sort node. The estimated total cost (18334.00) shows the cost that is necessary for retrieving all available rows. The cost option does not take into account resources taken to send result rows to the client, because the planner cannot change this value by altering the plan. By default, the cost is measured in units of disk page fetches (seq_page_cost). The cost value most often indicates the consumption of resources, and not the required time, although these values correlate with each other. See Planner Cost Constants for details.

The rows option is the estimated number of rows that Seq Scan returns.

The width option is the estimated average size of an output row in bytes.

When you call EXPLAIN, PostgreSQL does not execute the query, instead it creates an estimated execution plan based on available statistics. The actual plan can differ. Use the ANALYZE parameter to execute the query and get actual statistics.

EXPLAIN (ANALYZE) SELECT * FROM test_table;

The result:

                      QUERY PLAN
------------------------------------------------------------------
Seq Scan on test_table
    (cost=0.00..18334.00 rows=1000000 width=37)
    (actual time=0.005..62.035 rows=1000000 loops=1)
Planning Time: 0.056 ms
Execution Time: 96.757 ms

The actual time option is the time expended to get the first row and all rows, in milliseconds.

The rows option is the actual number of rows that Seq Scan returns.

The loops option shows how many times PostgreSQL scans the entire table, i.e. executes Seq Scan on all table data.

Execution Time is the total time to complete the request.

The EXPLAIN ANALYZE command executes queries. If you need to get statistics on the modifying INSERT, DELETE, or UPDATE statements and do not change data, use the BEGIN and ROLLBACK commands to execute EXPLAIN ANALYZE inside a transaction block with a rollback:

BEGIN;

EXPLAIN (ANALYZE) INSERT INTO test_table
  SELECT i, md5(random()::text)
  FROM generate_series(1, 100) AS i;

ROLLBACK;

Cache usage

The EXPLAIN command allows you to get information on buffer usage. Utilize the BUFFERS parameter for this purpose.

To see correct results, restart ADPG to clear the buffer cache before testing BUFFERS. You can use the Reconfigure & Restart action for this.

Execute the EXPLAIN command with the ANALYZE and BUFFERS parameters:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test_table;

The result:

                      QUERY PLAN
------------------------------------------------------------------
Seq Scan on test_table
    (cost=0.00..18334.00 rows=1000000 width=37)
    (actual time=1.400..626.291 rows=1000000 loops=1)
   Buffers: shared read=8334
Planning:
   Buffers: shared hit=47 read=14
Planning Time: 35.535 ms
Execution Time: 663.332 ms

PostgreSQL reads the table in parts called blocks. The Buffers: shared read option is the number of blocks read from the disk. The cache is empty and the table is completely read from the disk, 8334 blocks are read.

Run the same query:

EXPLAIN (ANALYZE,BUFFERS) SELECT * FROM test_table;

The result:

                      QUERY PLAN
-------------------------------------------------------------------
Seq Scan on test_table
    (cost=0.00..18334.00 rows=1000000 width=37)
    (actual time=0.037..73.519 rows=1000000 loops=1)
  Buffers: shared hit=32 read=8302
Planning Time: 0.033 ms
Execution Time: 108.673 ms

PostgreSQL adds the Buffers: shared hit parameter to the result. Buffers: shared hit is the number of blocks read from the PostgreSQL cache. If you repeat this query multiple times, you will see that PostgreSQL takes more and more data from the cache each time. PostgreSQL adds data to the cache on each request. It reads from the cache faster than from the disk. PostgreSQL increases Buffers: shared hit and reduces Execution Time on each query execution.

The shared_buffers configuration parameter determines the cache size. You can change it via ADCM, see Configuration parameters.

Index usage

Add the WHERE clause to the query:

EXPLAIN SELECT * FROM test_table WHERE field1 > 750;

The result:

                             QUERY PLAN
---------------------------------------------------------------------
Seq Scan on test_table (cost=0.00..20834.00 rows=999288 width=37)
  Filter: (field1 > 750)

The table has no indexes, and PostgreSQL reads each record sequentially (Seq Scan).

Filter: (field1 > 750) means that each entry is compared with the condition field1 > 750. If a field1 value meets the condition, PostgreSQL inserts the corresponding row into the result.

The cost values is increased. The expected number of result rows (rows) is reduced because of the WHERE clause.

Create an index on field1 and execute the same query:

CREATE INDEX ON test_table(field1);
EXPLAIN (ANALYZE) SELECT * FROM test_table WHERE field1 > 750;

The result:

                             QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on test_table
    (cost=0.00..20834.00 rows=999279 width=37)
    (actual time=0.132..89.711 rows=999250 loops=1)
   Filter: (field1 > 750)
   Rows Removed by Filter: 750
Planning Time: 0.083 ms
Execution Time: 121.256 ms

Only 750 rows were filtered out, query returns more than 99.9% of the table. Despite an index, the planner chose Seq Scan.

Utilize the enable_seqscan setting to disable Seq Scan and force the query planner to use the index:

SET enable_seqscan TO off;
EXPLAIN (ANALYZE) SELECT * FROM test_table WHERE field1 > 750;

The result:

                             QUERY PLAN
---------------------------------------------------------------------
Index Scan using test_table_field1_idx on test_table
    (cost=0.42..36794.81 rows=999279 width=37)
    (actual time=0.054..162.518 rows=999250 loops=1)
   Index Cond: (field1 > 750)
Planning Time: 0.087 ms
Execution Time: 195.522 ms

Index Scan means that PostgreSQL scans an index. Index Cond contains the condition used to find rows.

PostgreSQL uses the index, but it increases the execution time. The planner chose the best plan without index in the previous example.

Enable the enable_seqscan setting and modify the query as follows:

SET enable_seqscan TO on;
EXPLAIN (ANALYZE) SELECT * FROM test_table WHERE field1 < 750;

The result:

                             QUERY PLAN
---------------------------------------------------------------------
Index Scan using test_table_field1_idx on test_table
    (cost=0.42..31.02 rows=720 width=37)
    (actual time=0.031..0.148 rows=749 loops=1)
   Index Cond: (field1 < 750)
Planning Time: 0.074 ms
Execution Time: 0.182 ms

The planner uses the test_table_field1_idx index to get the best result.

Add an index on field2. Since our database stores text fields in UTF8, create the index with the text_pattern_ops operator class.

CREATE INDEX ON test_table(field2 text_pattern_ops);

Run the following query with the condition for the text field:

EXPLAIN (ANALYZE) SELECT * FROM test_table WHERE field2 LIKE 'a%';

The result:

                             QUERY PLAN
---------------------------------------------------------------------
Bitmap Heap Scan on test_table
        (cost=2711.24..11912.67 rows=70714 width=37)
        (actual time=7.672..25.626 rows=62382 loops=1)
   Filter: (field2 ~~ 'a%'::text)
   Heap Blocks: exact=8332
   ->  Bitmap Index Scan on test_table_field2_idx1
        (cost=0.00..2693.57 rows=69314 width=0)
        (actual time=6.343..6.344 rows=62382 loops=1)
         Index Cond: ((field2 ~>=~ 'a'::text) AND (field2 ~<~ 'b'::text))
Planning Time: 0.097 ms
Execution Time: 28.520 ms

In the EXPLAIN command output, the -> sign and indentation reflect the hierarchical plan tree structure.

Bitmap Index Scan collects a bitmap with the positions of records that match the condition. After that, PostgreSQL sequentially scans the target table for these positions within Bitmap Heap Scan.

Bitmap Heap Scan works effectively with a large number of records that correspond to a condition.

PostgreSQL also supports Index Only Scan operation that is used if a query only references indexed columns. Execute a query that returns a single indexed field:

EXPLAIN SELECT field1 FROM test_table WHERE field1 < 750;

The result:

                             QUERY PLAN
---------------------------------------------------------------------
Index Only Scan using test_table_field1_idx on test_table
   (cost=0.42..29.69 rows=758 width=4)
  Index Cond: (field1 < 750)

Index Only Scan is faster than Index Scan, because Index Only Scan does not need to read the entire row (width=4), except checking the visibility of an index row.

Indexes improve performance of different operations, including sorting. Consider the ORDER BY clause.

Drop the index for field1:

DROP INDEX test_table_field1_idx;

Sort the table by field1:

EXPLAIN (ANALYZE) SELECT * FROM test_table ORDER BY field1;

The result:

                             QUERY PLAN
---------------------------------------------------------------------
Gather Merge  (cost=63795.04..161033.70 rows=833416 width=37)
              (actual time=94.311..273.651 rows=1000000 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Sort  (cost=62795.02..63836.79 rows=416708 width=37)
            (actual time=91103..127.215 rows=333367 loops=3)
        Sort Key: field1
        Sort Method: external merge  Disk: 13568kB
        Worker 0:  Sort Method: external merge  Disk: 12040kB
        Worker 1:  Sort Method: external merge  Disk: 20504kB
        ->  Parallel Seq Scan on test_table
            (cost=0.00..12502.08 rows=416708 width=37)
            (actual time=0.008..29.574 rows=333367 loops=3)
Planning Time: 0.099 ms
Execution Time: 316.971 ms

When the planner determines that a parallel query is the fastest execution strategy for a particular query, it creates a plan that includes a Gather or Gather Merge node (in our case, the Gather Merge node). This node includes the execution plan subtree, which is processed in parallel.

The first operation is Parallel Seq Scan on test_table. The table blocks are divided among the cooperating processes (Worker 0 and Worker 1). Worker 0 and Worker 1 scan different consecutive blocks of the table in parallel.

Then, PostgreSQL sorts the result. Sort Key contains the sort condition (field1). Sort Method is external merge Disk: the system creates temporary files on disk.

The query execution plan above contains several operations and looks complex.

Add an index to the field1 and execute the same query:

CREATE INDEX ON test_table(field1);
EXPLAIN (ANALYZE) SELECT * FROM test_table ORDER BY field1;

The result:

                             QUERY PLAN
---------------------------------------------------------------------
Index Scan using test_table_field1_idx on test_table
    (cost=0.42..34319.93 rows=1000100 width=37)
    (actual time=0.018..106.251 rows=1000100 loops=1)
Planning Time: 0.140 ms
Execution Time: 148.662 ms

With an index, the plan contains only the Index Scan operation, and as a consequence — the execution time is reduced.

Found a mistake? Seleсt text and press Ctrl+Enter to report it