What’s new in Greenplum 7. Part 2


In the previous part of the overview, we discussed the migration of the Append-Optimized (AO) table engine to table access methods, the optimization of adding columns, and changes related to index support.

Today we are going to talk about an index type that is new to Greenplum, and more.

BRIN indexes

All of the new index features mentioned earlier hardly look like a "must have" for analytical workloads. How often do you need to retrieve individual, let alone unique, values from tables with millions or billions of records? Exactly. Those improvements will be most beneficial for auxiliary tables or dictionaries.

A more popular scenario is retrieving a set of rows matching certain criteria, which the user then plans to join, aggregate, and so on. In this case, the most valuable techniques are those that narrow down the search and exclude unnecessary blocks: data skipping, metadata pruning. Previously, users could rely on the following approaches:

  • Table partitioning. Data is physically split across tables and, consequently, files. This makes it possible to scan only the partitions that contain the necessary data, and to maintain, back up, and update a table in parts. The price is the rigidity of the structure, an increased load on the database catalog, and stronger requirements for the planner, which has to be aware of the table topology to build an efficient plan.

  • Bitmap indexes, which work well for columns with low cardinality (otherwise, their size outweighs all their advantages). They allow combining several indexes on different columns within one query plan via bitmask operations.

  • B-tree indexes, compared to bitmap indexes, can handle columns with a large number of unique values, but they impose considerable maintenance and usage overhead.

Starting with Postgres 9.5 and Greenplum 7, users have one more option: BRIN (Block Range Index) indexes. The idea is that the whole address space of a table is split into ranges of equal size. For heap tables, a range is a number of pages (configurable for each index), and the page numbers make up the first bytes of row IDs. Meta-information describing the contents of each range is stored in the index: in the simplest case, for types with a linear sort order, the minimum and maximum values; in more complex cases, bounding values, such as an area that contains all the points of a block range. Since only a few values are stored per range, such an index is much more compact than a B-tree.
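As a quick illustration, creating such an index and choosing the range size looks as follows (the table and column names here are hypothetical):

    -- A BRIN index stores only a small summary (e.g. min/max) per block range,
    -- so it stays tiny even for very large tables.
    CREATE TABLE events (id bigint, created_at timestamp, payload text);

    -- pages_per_range controls how many pages one summary covers
    -- (the Postgres default is 128).
    CREATE INDEX events_created_brin ON events
        USING brin (created_at) WITH (pages_per_range = 32);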

By consulting such an index, we can get the IDs of all the blocks that may contain the requested values and spend resources on scanning only those blocks. Obviously, every retrieved value still needs to be rechecked, because a block range selected for scanning may also hold values outside the requested range. The tighter and more accurate the range summaries are for the same range size, the more effective an index scan will be: fewer blocks will be selected for scanning, and they will contain fewer "unwanted" values.

Hence, good candidates for such indexes are columns for which the order of loading data into the table correlates with the way the data is queried. It works best when values are loaded chronologically, say day by day, and then requested for large time frames (a week, a month, etc.). It is often recommended to pick columns with a direct or inverse correlation between a value and its position in the table. Such a correlation can be enforced by rewriting the table in the required order: with the CLUSTER command if a suitable index is available, or, if it is not, with the ALTER TABLE REPACK BY COLUMNS command implemented in Greenplum 7 (8ee44e8). Supporting the latter required an additional table access method callback, table_relation_copy_for_repack.
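If the physical order has drifted from the desired one, it can be restored by rewriting the table. A sketch with the same hypothetical schema (the REPACK syntax is given as described above; consult the Greenplum 7 reference for its exact form):

    -- Rewrite the table in index order (requires a suitable B-tree index).
    CREATE INDEX events_created_btree ON events (created_at);
    CLUSTER events USING events_created_btree;

    -- Greenplum 7 alternative that does not require an index (8ee44e8).
    ALTER TABLE events REPACK BY COLUMNS (created_at);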

Let’s see how a search using a BRIN index is handled (see the implementation of the amgetbitmap index access method, bringetbitmap) to understand what challenges arose while adapting it for AO tables; a small inspection sketch follows the list:

  1. Request the number of pages for a table.

  2. The index stores a description not for each page but for a range of pages. Hence, we iterate over ranges of pages and use the index to determine whether a range may contain the data we want.

  3. The first pages of a BRIN index always hold the reverse range map (revmap). It occupies a continuous range of pages and, for simplicity, can be viewed as an array of pointers that lets us get the summary for each range of pages.

  4. If no summary was found for a block range (e.g. the information about the range has not been summarized yet), all of its pages are marked for scanning.

  5. Otherwise, a support function is called. It interprets the summarized information and decides whether the whole block range needs to be scanned.

  6. As a result, a bitmap is built in which all the pages that may contain the requested rows are marked.
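In stock Postgres, the structures mentioned above can be inspected with the pageinspect extension. The snippet below is purely illustrative (the index name comes from the earlier hypothetical example; whether pageinspect ships with a particular Greenplum build should be checked separately):

    CREATE EXTENSION IF NOT EXISTS pageinspect;

    -- Page 0 is the meta page: it records pages_per_range and where the revmap ends.
    SELECT * FROM brin_metapage_info(get_raw_page('events_created_brin', 0));

    -- Page 1 starts the reverse range map: pointers from block ranges to summary tuples.
    SELECT * FROM brin_revmap_data(get_raw_page('events_created_brin', 1)) LIMIT 5;

    -- Regular pages hold the summaries themselves (min/max per range for ordered types).
    SELECT itemoffset, blknum, attnum, value
    FROM brin_page_items(get_raw_page('events_created_brin', 2), 'events_created_brin');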

Now, let’s focus on the third step and recall what a row ID is for an AO table. There is no fixed page size here, and a row number has no connection to a physical page number. However, the first seven bits contain the number of the segment file that stores a given row version. Thus, if at least one row is stored in the segment file with the greatest number out of the 127 available, we need a reverse range map that can address at least 0xFE000000 pages. In Greenplum, a single 32 KB BRIN index page fits 5454 revmap pointers, so such a map would require more than 781 thousand pages, or over 23 GB. For comparison, in Postgres with its 8 KB pages, a page with such a number would only appear in a table occupying more than 32 TB, whereas for an AO table a single row in the whole table is enough to produce such an ID.
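The numbers above are easy to reproduce with a back-of-the-envelope query (the 7-bit segment file prefix and the figure of 5454 pointers per 32 KB page are taken from the text):

    SELECT (127::bigint << 25)               AS pages_to_address,  -- 4,261,412,864 (0xFE000000)
           ceil((127::bigint << 25) / 5454.0) AS revmap_pages,      -- 781,338
           pg_size_pretty(ceil((127::bigint << 25) / 5454.0)::bigint * 32 * 1024)
                                              AS revmap_size;       -- roughly 24 GB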

This is why BRIN indexes were substantially reworked to support AO tables (63d916b). The whole range of available row version IDs can be split, via a table access method, into sequences (BlockSequences) without significant gaps inside each of them. This required yet another table access method callback, relation_get_block_sequences. For heap tables, it returns exactly one sequence, from the first page to the last. For AO tables, the number of sequences matches the number of segment files. For each segment file, the first page is the one whose number consists of the segment file number (the first seven bits) followed by 25 zero bits, and the last page corresponds to the current (!) value of the row version number generator for this segment file (FastSequence). Since the generator value is never reset during the table's lifetime, intensive updates followed by garbage collection lead to the reverse range map storing information about pages that no longer exist. The pages themselves are purely logical: each covers a range of 32768 row version IDs, which corresponds to the last two bytes of the ctid. Such a logical page is the base unit for which the summary information is stored in the index. This means that the minimal amount of data read from disk depends on the table's row width, or on the set of requested columns for columnar storage, unlike heap tables, whose page size is fixed. This is an important factor to consider when choosing the pages_per_range value at index creation.
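To make the layout easier to picture, here is an illustrative decomposition of the block-number part of an AO ctid (the table name is hypothetical, and the bit widths follow the description above):

    -- Top 7 bits of the block number: segment file; low 25 bits: logical page,
    -- each logical page covering a range of 32768 row version IDs.
    SELECT ctid,
           ((ctid::text::point)[0])::bigint >> 25                     AS segfile_no,
           ((ctid::text::point)[0])::bigint & ((1::bigint << 25) - 1) AS logical_page
    FROM my_ao_table
    LIMIT 5;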

So, scanning an AO table with a BRIN index is handled as follows:

  1. For each segment file of a table, we form a sequence of logical pages (Block Sequence).

  2. For each page sequence, we retrieve from a meta-page the number of the first page of the reverse range map.

  3. For each range of pages, starting with the range that corresponds to the segment file number:

    • We obtain the number of the reverse range map page in the chain by discarding the most significant bits that correspond to the segment file number and dividing the result by the maximum number of ranges per map page.

    • If the number of the current map page is lower than the required one, we move on to the next page of the reverse range map.

    • We obtain the line pointer within the map page as the remainder of dividing the page number by the maximum number of ranges it can hold.

    • Using the obtained offset, we get the ID (page plus record number) of the record that describes the current range.

    • We retrieve that page and the record describing the current range.

    • We call the support function to check whether the summarized information matches the requested condition. If the range may contain the requested data, or the summary is missing (e.g. has not been collected yet), every logical page of the range is marked in the bitmap.

  4. Once the bitmap is constructed (the Bitmap Index Scan operator), we sequentially retrieve the row versions whose IDs fall into the marked logical pages (Bitmap Heap Scan) and recheck the requested condition for them. The block map (Block Directory) helps us translate the beginning of each logical page into the offsets of the physical blocks in the files, which reduces the amount of data that has to be read from disk.

As an example, let's consider scanning a three-column table that has both a B-tree and a BRIN index on the b integer column, retrieving a range that contains about 0.7% of the records. Below are the query plans and their execution times.
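The exact table definition is not shown here, but a comparable setup could look roughly like this (all names and storage options are assumptions made for illustration):

    CREATE TABLE brin_demo (a int, b int, c text)
        WITH (appendoptimized = true, orientation = column)
        DISTRIBUTED BY (a);

    CREATE INDEX brin_demo_b_btree ON brin_demo USING btree (b);
    CREATE INDEX brin_demo_b_brin  ON brin_demo USING brin  (b);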

In Greenplum 7.1, ORCA misjudges the cost of index scans and, by default, chooses a sequential scan.

[Image: ORCA evaluation]
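To still get the index plan, the usual workaround is to fall back to the Postgres planner or to discourage sequential scans for the session (the GUCs below are standard Greenplum/Postgres settings; the predicate is just an example against the hypothetical table above):

    -- Fall back from ORCA to the Postgres planner for this session...
    SET optimizer = off;
    -- ...and/or make sequential scans unattractive to the planner.
    SET enable_seqscan = off;

    EXPLAIN ANALYZE
    SELECT * FROM brin_demo WHERE b BETWEEN 100000 AND 110000;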

With a BRIN index, the segment that contains most of the requested rows needs to scan only two logical pages (Heap Blocks) to get the same result. Those blocks contain only 23540 of the necessary rows; the remaining 42105 are filtered out after the condition is rechecked.

[Image: BRIN comparison]

A similar result can be achieved with a B-tree index. The only difference is that this access method and the number of retrieved values allow building an exact bitmap (down to individual row version IDs) and avoiding the recheck of the condition. The price is an index that is two orders of magnitude larger.

[Image: B-tree comparison]
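The size difference is easy to verify on one's own tables (the index names follow the hypothetical setup above):

    SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS index_size
    FROM pg_class
    WHERE relname IN ('brin_demo_b_btree', 'brin_demo_b_brin');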

Parallel operations within a segment

Postgres 9.6 introduced the ability to run many operations within a query in parallel. However, the Greenplum developers could not find a way to make queries parallel within a cluster or an individual segment, so parallelism within a segment was simply disabled to avoid errors.

[Image: Parallelism comment]

To give you an idea of the potential implementation challenges, here is a parallel plan for Postgres.

[Image: Postgres parallel plan]
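For reference, this is how such a plan is requested in stock Postgres; a plan of that kind typically has a Gather node on top of parallel-aware scans and joins (the setting and the query below are illustrative, not the exact ones from the screenshot):

    -- Allow up to 4 worker processes per Gather node for this session.
    SET max_parallel_workers_per_gather = 4;

    EXPLAIN ANALYZE
    SELECT count(*) FROM big_table WHERE val > 100;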

Now, imagine a distributed version of such a plan.

[Image: Distributed version of the plan]

Here, X is the number of running processes, not the number of segments in the cluster. Replacing Gather with Gather Motion looks natural. Redistribute Motion is trickier: each of the processes running on a segment should receive only the rows belonging to that segment, evenly distributed among those processes.

Moreover, the question of parallel scanning of AO tables is still open: the approach that exists for heap tables relies on a fixed page size and a page count known in advance.

On the other hand, such parallelism could significantly simplify cluster management and per-host resource distribution by regulating the number of running processes, provided that a single Postgres instance can utilize the resources of a whole physical machine.

Extending the Foreign tables functionality

Postgres provides developers with a foreign tables API (FDW) for retrieving and modifying data in arbitrary external sources. Such a source can be a file, S3 storage, Postgres, Kafka, or even another Greenplum cluster.

The process of getting rows from an external storage is hidden behind the Foreign Scan plan node. Suppose there are two tables in an external Postgres, and we want to use the result of joining them in Greenplum.
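A typical way to wire this up is postgres_fdw, which ships with Postgres (whether it is available in a particular Greenplum build should be checked separately; the server address, credentials, and table names below are made up):

    CREATE EXTENSION postgres_fdw;

    CREATE SERVER remote_pg
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'pg.example.com', port '5432', dbname 'sales');

    CREATE USER MAPPING FOR CURRENT_USER
        SERVER remote_pg
        OPTIONS (user 'reader', password 'secret');

    -- Pull in the definitions of the two remote tables we want to join.
    IMPORT FOREIGN SCHEMA public LIMIT TO (orders, customers)
        FROM SERVER remote_pg INTO public;

    SELECT c.name, sum(o.amount)
    FROM orders o JOIN customers c ON c.id = o.customer_id
    GROUP BY c.name;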

[Image: Using joined Postgres tables in Greenplum]

In Postgres 9.4 and Greenplum 6, the FDW capabilities were limited to pushing predicates down to the external system. The join and its subsequent processing were performed on the query initiator (in this case, on the coordinator).

In the Postgres world, life moved on. Postgres 9.6 brought the ability to push down joins, sorts, and UPDATE/DELETE operations. Postgres 10 introduced aggregate pushdown. Postgres 12 extended the sort pushdown and implemented pushdown of LIMIT. With these changes, we can get the following query plan, in which the result is computed entirely on the remote system.

[Image: Compute on a remote system]

Now this arsenal of capabilities can be used in connectors for various systems. However, to take advantage of these features, the second planner available in Greenplum (ORCA) also needs refactoring: its current versions build a plan similar to the one we saw in Greenplum 6.

Monitoring the DB processes

A query takes too long. EXPLAIN ANALYZE takes forever. All that the standard tools in Greenplum 6 could offer was the wait_reason column of the pg_stat_activity view with its lock, replication, and resgroup values. Not much, is it? One had to inspect the cluster for bottlenecks, attach a tracer, a debugger, and so on. In Postgres 9.6 the situation started to change (53be0b1), and it improved further in Postgres 10 (6f3bd98 and 249cf07). Now users have a wide range of wait events for analyzing process behavior.
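A minimal query to see what backends are currently waiting on (the columns are the standard ones available since Postgres 9.6, and therefore in Greenplum 7):

    SELECT pid, state, wait_event_type, wait_event, query
    FROM pg_stat_activity
    WHERE state <> 'idle';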

[Image: Wait events]

In this example, we can see that the coordinator has dispatched a query to the segments and is waiting for it to finish. Two out of three segments were blocked on IO: they were waiting for the table files to be extended in order to load new data. The third one is waiting for the synchronous replica (the mirror) to catch up, after which it will continue the insertion. The Greenplum documentation does not yet contain a full description of the wait events; for that, you can use the Postgres documentation or look into the source code for the Greenplum-specific ones.

It is also worth noting the new gp_stat_activity view, which collects this information from the whole cluster; previously, we had to write such a wrapper ourselves.
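A cluster-wide variant of the previous query, assuming gp_stat_activity exposes the pg_stat_activity columns plus a segment identifier:

    SELECT gp_segment_id, pid, state, wait_event_type, wait_event
    FROM gp_stat_activity
    WHERE state <> 'idle';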

Also, with the Postgres 10 patches, we got the ability to monitor the background system processes. In the example below, the WAL writer process was captured at the moment it was waiting on IO while preemptively flushing the WAL buffers.
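Background processes can be singled out by the backend_type column, which appeared in Postgres 10 (the filter value matches the standard name of the WAL writer process):

    SELECT backend_type, wait_event_type, wait_event
    FROM pg_stat_activity
    WHERE backend_type = 'walwriter';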

[Image: Background process wait events]

That’s not all

We have looked at how several more Postgres changes affected the new Greenplum version. In the third and final part, we will continue to examine such improvements and talk about how the Greenplum-specific capabilities have changed. Stay tuned, and thank you for your attention!
