What’s new in Greenplum 7. Part 1
Hi! I’m Vasiliy Ivanov and I work for Arenadata. One of the company’s products, part of our big data processing platform, is Arenadata DB (ADB). At its core lies the open-source Greenplum MPP DBMS. Our tasks quickly escalated from "How do we build and launch it?" to "Why doesn’t it work as expected?" and "We need it to work this way". We created a dedicated development team and contributed bug fixes and improvements to the community. Our expertise grew along with the complexity of the tasks we could solve ourselves, which allowed us to maintain our own open-source fork and deliver updates to our customers without waiting for the community’s approval (which, to be honest, sometimes took forever).
Having our own fork has helped us greatly. It prepared us for the refusal to accept changes from the Russian community in September 2023, and for the archiving of the public repos and the complete halt of public development in May 2024. All of this happened without a word of explanation from Broadcom, which had previously acquired VMware. We have long used our own infrastructure to run automated tests and builds.
Last September, after four years of development, a new major version was released — Greenplum 7, and not without our contribution. Now, as we prepare to release ADB 7, we want to discuss the changes this version brings to users.
The most popular response to "What do you expect from Greenplum 7?" is "Postgres 12!". Greenplum is one of the Postgres forks, and its codebase has finally caught up with the still supported 12th version (whose support ends on November 14, 2024). As a reminder, the previous stable Greenplum release was based on Postgres 9.4. Working on an up-to-date code base is a significant benefit for a developer. However, the presence of a feature in Postgres doesn’t guarantee that it will work in a distributed setting. In this article, I will begin reviewing the new features, talk about the effort it took to make them work, and discuss the remaining restrictions and where they come from.
Table access methods interface
Long ago, Postgres users could use only one data storage engine — heap. Previous Greenplum versions had to make intrusive changes to the Postgres codebase to provide storage formats optimized for analytical workloads — the row-oriented and column-oriented Append-Optimized storage. With Postgres 12, the abstraction of the storage layer began: the table access method interface was introduced, and the TYPE parameter of the CREATE ACCESS METHOD command got a new possible value — TABLE (8586bf7).
This interface has 41 functions (44 in current Postgres versions), by implementing which you can try to teach Postgres to work with your own storage format. Why "try"? Because if your format differs greatly from heap, it has a good chance of not fitting into heap’s procrustean bed. In fact, that’s exactly what happened during the refactoring of Append-Optimized tables, but more on that later. As a result, in Greenplum the number of functions grew to 53. And now one of the most popular questions from new users — "How do we change the data compression level or the storage format of a table without recreating it?" — may get an answer that is short and not discouraging.
Admittedly, this doesn’t change the fact that the table will be completely rewritten under the hood. The old table-creation syntax still works as well.
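What this looks like in Greenplum 7 (a sketch: I assume the ALTER TABLE ... SET ACCESS METHOD and ALTER TABLE ... SET (...) syntax here; the table name and storage options are purely illustrative):

```sql
-- The storage engine is now chosen with the standard USING clause
CREATE TABLE metrics (ts timestamp, value float8)
USING ao_column
WITH (compresstype = zstd, compresslevel = 3)
DISTRIBUTED BY (ts);

-- ...and can be changed later without dropping the table
-- (the data is still rewritten into the new format under the hood)
ALTER TABLE metrics SET ACCESS METHOD ao_row;
ALTER TABLE metrics SET (compresslevel = 7);

-- The legacy syntax keeps working
CREATE TABLE metrics_legacy (ts timestamp, value float8)
WITH (appendonly = true, orientation = column)
DISTRIBUTED BY (ts);
```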
Unique indexes for AO tables
Another common problem for users who see Greenplum as a big Postgres is the specifics of unique indexes. First, they work at the segment level (the data shards) and know nothing about the rows stored on other segments. This is why a unique index must be built on a list of columns that starts with the ones used to distribute the table. Second, Append-Optimized tables didn’t support unique indexes at all. The first issue still remains; the second one has been thoroughly addressed.
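For example (a sketch with hypothetical names; as noted above, the unique key has to start with the distribution key columns):

```sql
-- Distribution and uniqueness are both defined on id, so each segment
-- can enforce uniqueness locally for the rows it owns
CREATE TABLE customers (id bigint, email text)
USING ao_row
DISTRIBUTED BY (id);

CREATE UNIQUE INDEX customers_id_idx ON customers (id);

-- This, by contrast, would be rejected: rows with equal emails may live
-- on different segments, so no segment can guarantee global uniqueness
-- CREATE UNIQUE INDEX customers_email_idx ON customers (email);
```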
Let’s recall, with some simplification, how Postgres checks and guarantees the uniqueness of rows inserted into a table; the only index implementation that enforces key uniqueness is based on a B-tree. (A short two-session illustration follows the list.)
1. A row is inserted into the heap table and gets an identifier that is unique within the table on this Postgres instance — ctid (the tuple_insert function of the table access method). For heap tables, it encodes the page number and the line pointer inside the page. Several transactions may be inserting rows with duplicate keys in parallel at this point.
2. Next, the insertion of an entry into the index starts (the aminsert function of the index access method). A search for the target node of the B-tree begins. At this step, only the first transaction obtains an exclusive lock on that node; the subsequent ones queue up behind it.
3. Since the index is unique, once the transaction holds the lock on the node, it looks up the key on the page. If the key is not found, the key corresponding to our row version is unique, so we add a pointer to that row version to the index and release the exclusive lock on the node.
4. If at least one matching key is found on the page, the relevance of the referenced row versions must be checked, because Postgres indexes do not carry definite information about the visibility of the row versions they point to. To do this, a series of table access method functions is called: scan initiation (index_fetch_begin), fetching a row version by its identifier (index_fetch_tuple), and finishing the scan (index_fetch_end). The visibility snapshot used here allows "dirty reads": row versions created by uncommitted transactions are returned as well.
5. If all the row versions found in the table are "dead", our key is unique. If there is a "live" committed row version, our row is a duplicate and the transaction must be aborted with an error.
6. It gets slightly more complicated if the transaction that inserted or deleted a row version has not yet committed. That is exactly the case for the transactions from step 2 that load duplicate records in parallel. In this situation, the transaction releases the previously obtained exclusive lock on the target node of the tree, waits for the competing transaction to complete, and repeats everything starting from step 2.
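Seen from the outside, the same dance looks like this (a sketch with a hypothetical heap table; the behavior is inherited from Postgres):

```sql
CREATE TABLE accounts (id int PRIMARY KEY, balance numeric) DISTRIBUTED BY (id);

-- Session A
BEGIN;
INSERT INTO accounts VALUES (1, 100);  -- key 1 is not yet visible to other transactions

-- Session B
INSERT INTO accounts VALUES (1, 200);  -- finds a "dirty" duplicate in the index,
                                       -- releases the node lock and waits for A (step 6)

-- Session A
COMMIT;  -- B now fails with a unique violation;
         -- had A rolled back, B's insert would have succeeded
```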
What can go wrong here for Append-Optimized tables? To begin with, this type of table is optimized for sequential operations, be it insertion or scanning. Random access by a row version’s identifier without an index is impossible because of the following differences from heap tables:
- AO table blocks have variable size, which means you can’t skip N * size bytes in a file to get to the N-th block.
- Consequently, the block number is not part of the row version identifier. All we can learn about a row from its identifier is the number of the segment file it is stored in.
Only the user’s desire to build an index on the table forces Greenplum to create and maintain a block map that provides a relatively quick jump to the block containing the needed row. Relatively quick, but still optimized only for scanning blocks sequentially, in the order they are located in the data file.
The buffer cache is not used for AO table data files, as there is not even a shadow of hope that a typical table would fit in memory and stay there for any length of time. Hence, every process that starts accessing such a table initializes the entire infrastructure for working with it, beginning with opening file descriptors. As a result, retrieving a random row with the approach described in step 4 is very costly.
Moreover, we don’t even need the row itself. What matters to us is whether it is visible or not (steps 5 and 6). And in AO tables this information is stored separately from the data, in auxiliary heap tables.
Here we face the first situation where the existing table access method interface required an extension. The index_unique_check function was added, which does exactly what is needed — it returns information about the visibility of the desired row by its identifier.
To do this, the block map is searched for a block whose range of row version numbers covers the desired identifier. If there is no such block, the reference in the index is no longer valid; this can happen if the transaction that had earlier added a row with a matching key was aborted. If such a block exists, we still need to make sure the row version has not been deleted, which is checked against another auxiliary table — the visibility map.
These checks are also performed with a visibility snapshot that allows "dirty reads", so the changes made by concurrent transactions are visible. The waiting described in step 6 relies on information about the transaction that modified the block and visibility maps: if the transaction that recorded a new block containing a potential duplicate has not yet committed, we wait for it to commit.
Here another problem arose that required changes. Previously, information about a new block was written to the block map page (minipage) only after the block was filled. Updating the modified page in the buffer cache was also delayed until it was filled or the insertion completed. This meant that competing transactions could not verify the uniqueness of the added keys immediately after they were inserted into the table. To avoid this, the transaction now writes a placeholder block to the block map before starting the insertion, covering the entire remaining range of row version numbers of the segment file into which the insertion is being made. Information about this block will be overwritten when the first real block is written, but, thanks to the versioning of the auxiliary heap tables in Postgres and "dirty reads", the placeholder block remains visible to "competitors" until the end of the transaction. This is the block they will see, and they will wait for the completion of the transaction that recorded it.
The block map format has also been changed (258ec96). In previous versions, it was not allowed to have gaps in the row version identifiers between adjacent blocks. If part of the generated row version identifiers was not used by the end of the insertion, then the information about them was included in the last block when inserting the next one. This behavior contradicts the previously stated idea that the visibility of a tuple depends on the visibility of the block covering the range of row version identifiers, and it has been changed.
Another specific issue is VACUUM for AO tables with unique indexes, since it generates duplicate rows while repacking the table’s segment files. The transaction performing the cleanup must still see the rows from the file being cleaned and insert the rows it has read into the target file. Owing to the internal structure of AO tables, when row versions are moved from one segment file to another, their identifiers (which, as mentioned above, contain the segment file number) change. To solve this problem, the gp_bypass_unique_check workaround flag (in the executor state) was added, which disables uniqueness checks during insertion. The vacuum itself cannot generate duplicates (so there is no need to waste resources on rechecking), and "competitors" will see the row versions either before the cleanup or after it. The logic was implemented in patch 9dd1471.
The question of how useful unique indexes are for AO tables remains open for me. The obvious downsides are the increased overhead of loading data into the table, data duplication in the index (and we are not here for some twenty thousand rows, right?), and the considerable size of such indexes, which, unlike the table itself, will use the buffer cache and evict more in-demand data from it. On the other hand, a unique index simplifies the loading procedure in cases where duplicates are unacceptable and also serves as a hint to the planners about the nature of the data. Ultimately, having a working feature is better than having none, and nobody forbids checking the data during loading, as was done in previous versions.
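For comparison, a typical way to guard against duplicates without a unique index is to deduplicate while loading from a staging table (a sketch; all names here are hypothetical):

```sql
-- Deduplicate the incoming batch and skip keys already present in the target.
-- Without a unique index this does not protect against two batches
-- being loaded concurrently.
INSERT INTO orders (order_id, amount)
SELECT DISTINCT ON (order_id) order_id, amount
FROM orders_stage s
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.order_id = s.order_id)
ORDER BY order_id;
```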
INSERT ON CONFLICT (UPSERT)
In the previous section we hung unique indexes on all our tables, and now, when attempting to load rows with duplicate key values, the user gets an error message and a transaction rollback. The next step is to handle such situations. Starting from Postgres 9.5 (168d580), the user has two options: discard such rows (ON CONFLICT DO NOTHING) or update the existing row, for instance with the attribute values of the inserted row (ON CONFLICT DO UPDATE). The MERGE statement appeared only in Postgres 15, so it makes no sense to discuss it in this article.
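Both options in one hypothetical example (the table and column names are illustrative):

```sql
-- A heap table whose primary key doubles as the distribution key
CREATE TABLE rates (currency text PRIMARY KEY, rate numeric, updated_at timestamp)
DISTRIBUTED BY (currency);

-- Silently skip rows whose key already exists
INSERT INTO rates VALUES ('EUR', 98.5, now())
ON CONFLICT (currency) DO NOTHING;

-- Or overwrite the stored row with the incoming values
INSERT INTO rates VALUES ('EUR', 99.1, now())
ON CONFLICT (currency) DO UPDATE
SET rate = EXCLUDED.rate, updated_at = EXCLUDED.updated_at;
```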
I recommend getting acquainted with the capabilities of INSERT ON CONFLICT in the Postgres documentation, which is even available in Russian. Here, let’s look closely at how this functionality is structured and where the limitations imposed by Greenplum come from.
In the previous section, we discussed how a B-tree index ensures key uniqueness. First, a new row version is added to the table, then an attempt is made to add a reference to it to the index, and only at that moment do we find out that we might not be the first after all. In this case, the unique index itself serves as the arbiter — the first transaction continues with the insertion, while the subsequent ones have to somehow roll back the insertion into the table, wait for the first transaction to commit, and perform the prescribed ON CONFLICT handling (under Read Committed). Rolling back the added row becomes an even bigger issue in the presence of an exclusion constraint (EXCLUDE), since in that case the index does not ensure uniqueness and the condition is checked with a snapshot that allows "dirty reads".
To make it possible to roll back the insertion of a single row version without rolling back the entire transaction, a speculative insertion mechanism was proposed, which must be supported by the table access method. For this purpose, the tuple_insert_speculative and tuple_complete_speculative functions should be implemented. For heap tables, they allow inserting a row version while manipulating the auxiliary attributes in its header, recording the identifier of the transaction that creates it, and then, depending on the outcome of the constraint checks, either confirming the insertion or rolling it back, making the version invisible even to "dirty reads". This is reflected in the write-ahead log and still requires subsequent cleanup by VACUUM.
In the example above, the ninth line on the page was inserted successfully, while the tenth was rolled back.
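Such leftovers can be examined with the pageinspect extension (a rough sketch; in Greenplum it would have to be run directly against a segment, and the table name t is hypothetical):

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Line pointers and row version headers of the first page of table t.
-- While a speculative insert is in progress, t_ctid of that version holds a
-- speculative token instead of the row's own address; a rolled-back speculative
-- insert leaves a dead row version behind until VACUUM reclaims it.
SELECT lp, t_xmin, t_xmax, t_ctid, t_infomask
FROM heap_page_items(get_raw_page('t', 0));
```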
At this point, it becomes clear that this approach is not applicable to AO tables: Greenplum stores auxiliary information per block rather than per row, and updating an already inserted row version in place, without creating a new one, is not supported. Therefore, these access method functions were not implemented for AO tables. However, this is not the main issue. AO tables currently do not support concurrent updates, and an insertion that involves conflict handling is essentially equivalent to an update. When updating a row version, we need to lock it (the tuple_lock function of the table access method). For heap tables, this is again done through manipulations of the row header [1]. For AO tables, lacking such an option, an exclusive lock on the entire table is taken for UPDATE/DELETE operations, which rules out concurrent updates altogether (parallel insertion is still possible by writing to different segment files). It might have been feasible to implement an exclusive UPSERT in the same manner as UPDATE, eliminating the need for speculative inserts and row locking; whether that is needed is a question for the end users. On the other hand, the inability to lock rows blocks the implementation not only of UPSERT, but also of UPDATE/DELETE triggers, logical replication, concurrent UPDATE/DELETE, and SELECT FOR UPDATE/SHARE queries.
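The table-level lock is easy to observe (a sketch; the table name is hypothetical):

```sql
CREATE TABLE ao_t (id int, v int) USING ao_row DISTRIBUTED BY (id);
INSERT INTO ao_t SELECT g, 0 FROM generate_series(1, 10) g;

BEGIN;
UPDATE ao_t SET v = v + 1 WHERE id = 1;

-- For an AO table the whole relation is locked in ExclusiveLock mode,
-- so any concurrent UPDATE/DELETE will wait; a heap table would only
-- take RowExclusiveLock here.
SELECT mode, granted FROM pg_locks WHERE relation = 'ao_t'::regclass;

COMMIT;
```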
This is why support for this operation in the current versions of Greenplum 7 is limited to heap tables. However, even for them, the default implementation of UPSERT requires an exclusive lock (893c529). Why? The answer lies in the distributed nature of Greenplum. Let’s imagine the following situation:
- Transaction A inserts a row with key K1 on segment SDW1.
- Transaction B inserts a row with key K2 on segment SDW2.
- Transaction A inserts a row with key K2 on segment SDW2 and, upon detecting a conflict, waits for transaction B to complete.
- Transaction B inserts a row with key K1 on segment SDW1 and, upon detecting a conflict, waits for transaction A to complete.
Since the waits occur on two different Postgres instances, which are Greenplum segments, they are unable to resolve this conflict on their own. This problem is meant to be addressed by the Global Deadlock Detector process on the coordinator, which is activated with the gp_enable_global_deadlock_detector GUC. Enabling it removes the limitation on parallel UPSERT for heap tables, but the user must be prepared for transaction rollbacks and the need to retry operations.
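A sketch of the scenario above (names are hypothetical, and I assume that keys 1 and 2 hash to different segments):

```sql
-- A heap table; id is both the primary key and the distribution key
CREATE TABLE t (id int PRIMARY KEY, v text) DISTRIBUTED BY (id);

-- With gp_enable_global_deadlock_detector = off (the default), the whole table is
-- locked exclusively for UPSERT, so the sessions below would simply queue up.
-- With it on, both proceed and a cross-segment wait cycle can form.

-- Session A
BEGIN;
INSERT INTO t VALUES (1, 'a') ON CONFLICT (id) DO UPDATE SET v = EXCLUDED.v;

-- Session B
BEGIN;
INSERT INTO t VALUES (2, 'b') ON CONFLICT (id) DO UPDATE SET v = EXCLUDED.v;

-- Session A: conflicts with B's uncommitted row and waits on B's segment
INSERT INTO t VALUES (2, 'a') ON CONFLICT (id) DO UPDATE SET v = EXCLUDED.v;

-- Session B: conflicts with A's uncommitted row and waits on A's segment.
-- The Global Deadlock Detector spots the cycle and aborts one of the
-- transactions; the client has to retry it.
INSERT INTO t VALUES (1, 'b') ON CONFLICT (id) DO UPDATE SET v = EXCLUDED.v;
```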
Index Only Scan for AO tables, covering indexes
As discussed earlier, the keys stored in Postgres indexes do not contain information about the visibility of the rows they reference. To ensure that a row is visible to our transaction, we must refer to the table itself. This is how standard index scanning occurs.
This leads to random access to the table itself. Even for heap tables, the planner weighs whether this will be cheaper than a sequential scan: how many rows need to be fetched, and whether the order of rows in the index correlates with their order in the table. For AO tables, random access performs even worse, which is why index scans over them used to be prohibited at the planning stage. Even when a non-unique index was present, a Bitmap scan was used instead; it mitigates the inefficiency of random access to AO row versions by first building a bitmap and then sequentially extracting the blocks that contain the requested rows.
However, Postgres has an optimization that avoids visiting the table when we know in advance that the row version referenced by the index key is stored on a page containing only visible rows. For heap tables, this information is kept in the visibility map, which is populated by vacuum, while data modifications clear its bits. The operator implementing this approach is called Index Only Scan. Its name should not mislead you: if the bit corresponding to the page we are interested in is cleared, we still have to consult the table, just as with a regular Index Scan.
At first glance, it may seem that this story has nothing to do with Append-Optimized tables. However, when discussing unique indexes, we already covered the possibility of checking the visibility of rows without touching the table itself. Moreover, for AO tables this can be done unconditionally, with no risk of falling back to scanning the table. For this purpose, the index_fetch_tuple_visible function was added to the table access method interface; it verifies that the row version with a given identifier is visible.
In many ways, this function duplicates the uniqueness check performed when a key is inserted into a unique index. The difference lies in the visibility snapshot used, which in this case is the snapshot of the current command, i.e. an MVCC snapshot. Unifying the handling of the different snapshot types would require refactoring both the snapshot logic of the auxiliary Append-Optimized tables and the Postgres code paths responsible for index scans and index insertion. The Greenplum developers decided to set this task aside for now.
As a result, individual column values covered by an index are now effectively extracted from AO tables through the index alone, without touching the AO table itself.
This can be a nice bonus if you have dismissed all the arguments against indexes on AO tables listed above and already have such indexes.
I would also like to highlight the functionality of covering indexes, which first appeared in Postgres 11. It allows storing not only the values of the key itself in the index but also user-specified columns that can be retrieved from the index without accessing the actual table under the conditions specified in this section. Support for such index extensions has been implemented for the ORCA planner (b9f43b9). This logic works for AO tables as well, but I believe it will be more useful for smaller reference tables, which are more likely to be stored as heap tables.
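A sketch of a covering index on an AO table (names are hypothetical; the exact plan depends on statistics and the planner):

```sql
CREATE TABLE sales (id bigint, customer_id int, amount numeric)
USING ao_column
DISTRIBUTED BY (id);

-- amount is stored in the index leaf entries only, alongside the key
CREATE INDEX sales_customer_idx ON sales (customer_id) INCLUDE (amount);

-- If the plan shows Index Only Scan, amount comes straight from the index,
-- and row visibility is checked through the auxiliary tables,
-- not the data files of the AO table
EXPLAIN SELECT customer_id, amount FROM sales WHERE customer_id = 42;
```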
Column addition without rewriting the whole table
Once upon a time, in the era of Postgres 9.4, on which Greenplum 6 is based, adding a new column with a default value other than NULL required rewriting the entire table. However, starting from Postgres 11 (16828d5), such a value, if it can be computed at the time of altering the table schema (for example, it does not involve calling volatile functions), is stored in the pg_attribute catalog table, and the attribute itself is marked as having a missing value.
In this case, the table is not rewritten; the missing attributes are added to the tuple "lazily", when it is deformed into an array of values at query execution time. At that moment, the actual number of attributes stored in the row version is extracted from its header, and if more attributes are requested, the row is supplemented with the values stored in the catalog. For more details, see the slot_getsomeattrs_int function.
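A quick way to see this for a heap table (a sketch with hypothetical names):

```sql
CREATE TABLE events (id bigint, payload text) DISTRIBUTED BY (id);

SELECT pg_relation_filenode('events');   -- remember the data file

ALTER TABLE events ADD COLUMN status text DEFAULT 'new';

SELECT pg_relation_filenode('events');   -- unchanged: no rewrite happened

-- The default is kept in the catalog, and the column is marked
-- as having a missing value
SELECT attname, atthasmissing, attmissingval
FROM pg_attribute
WHERE attrelid = 'events'::regclass AND attname = 'status';
```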
All this works wonderfully for heap tables, but what about the Append-Optimized ones? Here it is again worth recalling that one of the optimization goals was to reduce the overhead of storing the headers of individual row versions. Rows are stored in the MemTuple format (see memtuple.c). This format does not contain information about the number of attributes, leaving that to the table access method, which in turn gets it from the catalog. Moreover, to reduce the number of intermediate transformations, tuples of Append-Optimized tables are passed to the query executor straight away as an already "unpacked" array of values (a virtual tuple). With this approach, reusing the Postgres machinery does not look feasible.
At first glance, one could define a custom table slot type (TupleTableSlotOps) that would implement "lazy" unpacking of MemTuple; the number of attributes could be stored at the level of the variable-length block, within which all row versions share it. Moreover, Postgres allows each table access method to have its own slot implementation (see the slot_callbacks function). However, the developers took a different path (d3dcb9b).
Since insertion into the segment files of AO tables always happens at the end, Greenplum remembers the current value of the row version number counter for each file at the moment a new column is added (the lastrownums field). Rewriting the entire table is not required either.
When scanning the segment files, all row versions with an identifier less than the remembered one are supplemented with default values from the catalog. For more details, see the comment in the AppendOnlyExecutorReadBlock_BindingInit function.
What has been said above relates to row-oriented Append-Optimized tables. Adding a new column to a column-oriented table in earlier versions only required generating the new column's data for all existing row versions, and this behavior has been preserved. However, for these purposes the table access method interface had to be extended with the aoco_relation_add_columns function (as well as aoco_relation_rewrite_columns for rewriting the values of only selected columns, for example when changing their type).
Summary
In this article, we have only examined a portion of the changes brought by Greenplum 7. Next in line are BRIN indexes, JIT compilation, and an overview of changes in the resource groups mechanism. We will continue this conversation in the next part. Thank you for your attention!
Read on: What’s new in Greenplum 7. Part 2.
References:
1. Rogov E. V. Postgres 16 Internals. DMK Press, 2024, p. 258.