Arenadata DB table changes tracking


Overview

The history of this task began for us in May 2024. One of the major Greenplum/Arenadata DB users asked us to implement the ability to track changes in the size of table data files. This functionality would become a source of events for the user's cluster monitoring system. The task seemed fascinating and promising to us. However, as often happens, the user needed the solution yesterday.

On the one hand, we understood the complexity of this task in a fully functional implementation for all users of our product (and, as a result, adequately estimated the expected workload). On the other hand, we considered it wrong to tailor the solution to a specific user; we wanted to deliver this implementation as part of the overall product. As a result, the development team continued working at its own pace and in accordance with its understanding of the implementation.

In this article:

  • We will introduce our new feature for tracking the size of table data files (I call it a changes track), which has become a part of the arenadata_toolkit extension. It is also shipped with the open source version of Arenadata DB (we’ll likely consider porting this feature in some form to Greengage DB in the future).

  • We will explain the basic principles of the proposed solution.

  • We will tell you about the difficulties we had to face and dive into the implementation details.

  • At the end of the article we will give an example of the simplest monitoring system based on this tracking of changes.

If you want to skip the "theory" and just see an example of using this solution in practice — here is a link to the relevant section.

User scenarios

Let’s start by formulating one of the basic user scenarios.

Imagine that the user has a cluster that is large in terms of the number of tables it manages (at this stage, we will not complicate things by assuming that there can be many such clusters). As developers, in addition to a number of important requirements, we are primarily interested in the quantitative characteristics of the infrastructure where our solution will work and the performance expected of our solution.

Assume that we have the following input conditions (of course, we are talking about approximate values).

Approximate number of tables
Table type                   Total number of tables
AO tables (row-oriented)     10 000
AO tables (column-oriented)  200 000
Heap tables                  1 000 000

Since it is necessary to provide a "safety margin", the solution must be able to work with millions of tables and, depending on the size of the tables, with a number of physical data files many times larger than that.

Now the general scenario itself, from which the expected time characteristics can be determined:

The user wants to track table size changes by requesting a changes track and storing the resulting data in an external system. The changes track will be requested every 5 minutes.

Calculating and displaying the sizes of such a number of tables in a database with such frequency is definitely not an easy task.

Calculate the size of tables

At the very beginning, it is necessary to clarify what it means to calculate the full table size in PostgreSQL and Greenplum in particular.

Structure of heap table relationships regarding the data storage

If we consider heap tables schematically, segment data files are associated with the table (the heap table → heap segments relationship). The larger the table, the more files are associated with it: for Greenplum, the maximum size of a heap segment file is 1 GB, i.e. RELSEG_SIZE (32768 blocks) * BLCKSZ (32768 bytes); when this limit is exceeded, a new segment file is created.
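The arithmetic behind this limit can be sketched in Python (constant values as given above; note that vanilla PostgreSQL reaches the same 1 GB limit with BLCKSZ = 8192 and RELSEG_SIZE = 131072 blocks):

```python
# Greenplum compile-time constants, using the values from the text.
BLCKSZ = 32768        # heap block size, bytes
RELSEG_SIZE = 32768   # maximum segment file size, in blocks

MAX_SEGMENT_BYTES = RELSEG_SIZE * BLCKSZ  # 1 GiB per segment data file

def heap_segment_files(table_bytes: int) -> int:
    """How many segment data files a heap table of a given size occupies."""
    return max(1, -(-table_bytes // MAX_SEGMENT_BYTES))  # ceiling division
```

So a 5 GB heap table spans five segment files, and one byte past 5 GB opens a sixth.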

Also, there are visibility and free space map files (heap table → vm and heap table → fsm). If there are attributes stored in TOAST tables, then there are TOAST table data files (heap table → toast). TOAST tables, in turn, have associated btree indexes (heap table → toast → toast btree).

Both btree indexes and TOAST tables are heap tables, meaning they have segment files, visibility (only for TOAST) and free space maps. A table can also have user-defined indexes.

To calculate the full size of a table, the DBMS needs to extract these relationships from the system catalog metadata, go through all of them, determine each file's size (for AO table data files, this will be the "logical" size, more on that later), and sum them up to get the final table size.

For AO tables, the approach is even more complicated, since they also have service aux tables. These are heap tables, which means again indexes, visibility maps, etc.

Structure of AO table relationships regarding the data storage

What existing tools are at our disposal

First, there are the basic system functions pg_relation_size, pg_table_size, and pg_total_relation_size.

These functions allow you to calculate the table size taking into account the selected fork (main, fsm, vm, init), the table size without indexes, and the total size with indexes.

Sizes calculated by the pg_relation_size, pg_table_size, pg_total_relation_size functions

Second, the gp_toolkit system schema implements several views, which, among other things, rely on the above functions. Here are some of them: gp_size_of_table_disk, gp_size_of_table_and_indexes_disk, gp_size_of_index. The convenience of these views is that they display detail by storage type (the main fork, data in TOAST, aux data for AO tables), both with and without indexes. The arenadata_toolkit schema has views for determining data skew by tables, along with analogs that are several times faster.

The diskquota extension solves a slightly different problem of limiting disk space consumption for tables, but in theory it could be improved in some way for our purposes.

How costly all this can be

To test queries and evaluate execution time, I created a database with the following set of test tables.

Set of test tables
relkind                                                  relstorage               Number of tables
ordinary relation ('r')                                  heap                     450 092
index ('i')                                              heap                     250 166
internal append-optimized segment files and EOFs ('o')   heap                     250 005
ordinary relation ('r')                                  append-optimized table   250 005
append-only visibility map ('M')                         heap                     250 005

As a test load to obtain the execution time and a performance profile of these operations, I executed a number of queries (3 runs each) with cleared and warmed-up caches. The queries were executed on a DELL Latitude 5410 laptop (Intel Core i7, 256 GB SSD, 32 GB RAM).

Time taken to obtain table sizes under different conditions
Operation type                                                                                                  Operation time (cache is cleared)  Operation time (cache is warmed up)
Calculating the sizes of ~1.5 million tables using the pg_relation_size function                                12 min 21 sec                      8 min 37 sec
Creating the gp_size_of_table_and_indexes_disk view for 700k tables (only user tables, per its implementation)  13 min 45 sec                      10 min 48 sec

As the table above shows, even on a test database the time required to obtain the table sizes does not allow solving the problem within the specified requirements. For production clusters, you also need to take into account the number of segments, the actual load, a larger number of data files, etc.

However, there are other reasons why we were not satisfied with the basic tools.

The main reason is that for AO tables, the system functions calculate the logical size: for each table, the DBMS needs to open its metadata table and read a row containing the offset. The GetAOTotalBytes and GetAOCSTotalBytes kernel functions are responsible for this. These functions get the logical size from the aux table associated with each AO table (the aux table’s OID can be found in the segrelid field).

In addition to the expensive read from that table, this value does not reflect the actual size of the table in some cases. For example, a long write transaction to such a table may end in a rollback due to lack of space, and the logical size will not be updated. This is correct from the perspective of the kernel’s work with AO tables, but it is clearly not suitable for tracking the real size of the table data files. The user sees the previous logical table size, while in fact the files have filled the entire disk!

Let’s create an AO table, interrupt the insertion of a large batch of data (100 million rows), and calculate the table size using the pg_total_relation_size function after the insertion is interrupted and the transaction is rolled back:

andrey=# CREATE TABLE aoro_rollbacked(c1 INT) WITH (appendoptimized=true, orientation=row);
CREATE TABLE

andrey=# INSERT INTO aoro_rollbacked SELECT generate_series(1,1000);
INSERT 0 1000

andrey=# SELECT pg_size_pretty(pg_total_relation_size('aoro_rollbacked'::regclass));

 pg_size_pretty
----------------
 138 kB
(1 row)

andrey=# INSERT INTO aoro_rollbacked SELECT generate_series(1,100000000); (1)
^CCancel request sent
ERROR:  canceling statement due to user request

andrey=# SELECT pg_size_pretty(pg_total_relation_size('aoro_rollbacked'::regclass));

 pg_size_pretty
----------------
 138 kB  (2)
(1 row)

As you can see, after the transaction rollback (the line with marker 1), the logical size did not change (marker 2), although in fact the size of the first segment file (the zero file is reserved for AO tables) took up 220 MB on one of the segments at the time of the transaction rollback:

$ PGOPTIONS='-c gp_session_role=utility' psql -p 6002 -c 'SELECT pg_relation_filepath('\''aoro_rollbacked'\''::regclass);' -At | xargs -I % ls -lh '%.1'

-rw------- 1 andrey andrey 220M Jan 25 12:13 base/5963210/2422978.1

Now, regarding the details that affect the final performance.

The biggest cost when creating the gp_size_of_table_and_indexes_disk system view

If you look at the profiling results of building the gp_size_of_table_and_indexes_disk view (the image above), you can see that a rather significant contribution to the total time is made by the table opening function (try_relation_open) and the above-mentioned function for obtaining the logical size of AO tables (GetAOTotalBytes). The costs of obtaining the physical file sizes themselves by the stat function are lost against the general background.

Since each view creation retrieves data from all tables, most of which can be assumed to remain unchanged, this overhead can fairly be considered unacceptable for users working with large amounts of data whose load profile already puts significant pressure on the disk subsystem.

The diskquota extension

The diskquota extension deployed on a large cluster can create many problems for its users. In short, this extension has a potentially heavy state (both at runtime and in the DBMS itself in the form of tables).

Our team periodically implements optimizations for this solution, but it still remains quite problematic for large clusters, although there are currently no open and widely used alternatives when it comes to limiting writes by storage space once the quota is exceeded.

What we offer at the idea level

Now, let’s talk about the proposed solution. It is based on the following principles:

  • Avoid maintaining heavy state in the database itself (for example, some updatable tables or something similar).

  • Build a changes track "on the fly".

  • Do not open tables unless absolutely necessary (for example, metadata in tables, catalogs, etc.).

  • Use compact data structures whenever possible.

As opposed to diskquota, we wanted to avoid storing heavy state in the database whose changes are being tracked. As experience with diskquota shows, such state can create a side load on the database and lead to performance problems for the main workload.

We also do not want to probe hash tables, which in theory can grow to large sizes, on every event. Unfortunately, this cannot be completely avoided for all scenarios (more on this below). As a result, whenever possible, we use compact data structures that are bounded in terms of memory consumption.

Basis for the solution

The Greenplum core provides the following basic tools that can be used to solve the problem of tracking table sizes:

  • The smgr.c and md.c low-level modules for working with physical files.

  • The actions we are interested in tracking are create, extend, truncate, and unlink. (Further on I will call them events. The mapping of actions to events is given in the table below).

  • Related hooks (originally introduced for the needs of diskquota).

  • Identification of tables and files (the table OID, the relfilenode file identifier).

The modules that are responsible for physical data storage (smgr and low-level md) have a mechanism for tracking some events in the system.

smgr is a common interface for storage managers. It provides a contract for operations like init, close, create, unlink, extend, read, write, truncate, etc. The md manager implements them. There is also a specific part for AO tables in aomd.

The main functions that interest us in the context of event tracking are: smgrcreate, smgrcreate_ao, smgrextend, smgrtruncate, and smgrdounlinkall. Some events are triggered from the implementation code of the AORO and AOCO tables.

Mapping kernel functions to events
Function Mapping event Table type Event description

smgrcreate

create

heap

Event of creating a heap table data file (taking into account the required ForkNumber)

smgrcreate_ao

create

AO

Event of creating a new segment data file for the AO table

smgrextend

extend

heap

Event of extending a heap table data file with a new block (for heap tables the block size is BLCKSZ = 32 KB)

BufferedAppendWrite

extend

AO

Event of expanding an AO table segment data file with a new block

smgrtruncate

truncate

heap

Event of truncating the segment data file of a heap table to the required number of blocks (BlockNumber nblocks)

TruncateAOSegmentFile

truncate

AO

Event of truncating an AO table segment file, starting from some offset and up to the end of file

smgrdounlinkall

unlink

heap/AO

Deleting files associated with a table (unlink)

All these functions invoke user hook callbacks, which are executed after the system implementations:

typedef void (*file_create_hook_type) (RelFileNodeBackend rnode);
typedef void (*file_extend_hook_type) (RelFileNodeBackend rnode);
typedef void (*file_truncate_hook_type) (RelFileNodeBackend rnode);
typedef void (*file_unlink_hook_type) (RelFileNodeBackend rnode);
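A simplified Python model of this callback mechanism (the hook variable name and the tracking logic here are illustrative; only the typedef shapes come from the C definitions above):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class RelFileNode:
    spcNode: int  # tablespace
    dbNode: int   # database
    relNode: int  # relation

# Module-level hook pointer, as in the C pattern: None (NULL) until an
# extension installs its callback.
file_extend_hook: Optional[Callable[[RelFileNode], None]] = None

def smgrextend(rnode: RelFileNode) -> None:
    """Kernel side: perform the block write, then fire the user hook."""
    # ... the actual write to the data file would happen here ...
    if file_extend_hook is not None:
        file_extend_hook(rnode)

# Extension side: remember every relfilenode that was extended.
extended = []

def track_extend(rnode: RelFileNode) -> None:
    extended.append(rnode.relNode)

file_extend_hook = track_extend
smgrextend(RelFileNode(spcNode=1663, dbNode=16384, relNode=24576))
```

The kernel always calls its own implementation first; the hook only observes the event, which is why the tracking code must be as cheap as possible.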

Table files are identified by the relfilenode identifier.

RelFileNode definition
typedef struct RelFileNode
{
    Oid     spcNode;        /* tablespace */
    Oid     dbNode;         /* database */
    Oid     relNode;        /* relation */
} RelFileNode;

For example, for heap tables, relfilenode is the same for all of a table's data files; as rows are written and the table grows, new segment files are added by appending an integer suffix to the base file name.

The RelFileNodeBackend instance can be simply considered as an identifier of a file to which some changes have occurred. The type of changes depends on the hook type.

RelFileNodeBackend definition
typedef struct RelFileNodeBackend
{
    RelFileNode node;
    BackendId   backend;
} RelFileNodeBackend;

Relativity of events to the transaction life cycle

Now let’s consider the events in the context of their occurrence under different query and transaction scenarios. Each of the events we are interested in affects either the consumption of disk space (extend, truncate, unlink) or, in the case of the create event, the total number of tables in the database (some data files may initially be of zero size).

So, our main task is to track the create, extend, truncate, and unlink events with minimal impact on DBMS performance. It is important to spend a minimum of resources on this tracking; those resources are better devoted to the main workload.

While the extend and truncate events are quite simple and can be processed as they occur, processing the events of creating and deleting table files presents certain difficulties. These difficulties are related to transactionality, or more precisely, to multi-statement transactions.

As examples, consider the following queries:

BEGIN; CREATE TABLE t(...); INSERT INTO t ...; COMMIT;
BEGIN; CREATE TABLE t(...); INSERT INTO t ...; ROLLBACK;
BEGIN; CREATE TABLE t(...); DROP TABLE t; ROLLBACK;
CREATE TABLE t(...); BEGIN; ...; DROP TABLE t; ...; COMMIT;
CREATE TABLE t(...); BEGIN; ...; DROP TABLE t; ...; ROLLBACK;

For the sake of brevity, we will leave out the details concerning the table structure and the inserted data, as they are not of fundamental importance.

If we examine these scenarios in more detail in terms of kernel events, the picture is as follows.

BEGIN; CREATE TABLE t(...); INSERT INTO t ...; COMMIT;
Creating a table inside a transaction followed by COMMIT

The distinguishing feature of the first scenario shown in the image is that the events (in this case, create and extend at times t1, t2, t3) occur in the context of a table whose OID is not yet visible in the pg_class system table to other transactions according to the isolation rules.

Thus, the transaction retrieving the changes track cannot "pull out" the table information from pg_class by standard means. However, we must still track these events and report them in the changes track as changes to the table, since in this case the transaction is committed (COMMIT at time t6).

In the image above, at time t5, there is a call to the smgrDoPendingDeletes function, which is responsible for deleting files based on the transaction outcome. Since there are no table deletions here, it has nothing to do.

A different situation occurs for transaction ROLLBACK events:

BEGIN; CREATE TABLE t(...); INSERT INTO t ...; ROLLBACK;
BEGIN; CREATE TABLE t(...); DROP TABLE t; ROLLBACK;
Rollback a transaction with table creation events

In the first case, the transaction that creates the table is rolled back (time t4). In the second case, there is an explicit DROP TABLE, but the unlink event still occurs when the transaction rollback is processed (time t6). So in terms of the moment the unlink event occurs, the two cases are equivalent. Table files are created and filled with data, and should be deleted in smgrDoPendingDeletes. This function has work to do (time t5): the data files into which rows were written (the extend events occurred) must be deleted at the end of the transaction.

Two more cases that are identical in terms of unlink events: the delete event occurs only at the moment the transaction is committed:

CREATE TABLE t(...); BEGIN; DROP TABLE t; ...; COMMIT;
BEGIN; CREATE TABLE t(...); DROP TABLE t; ...; COMMIT;
Committing a transaction containing table delete events

It is important to note that the unlink event occurs not when DROP TABLE is called at time t1, but as a result of committing the transaction in smgrDoPendingDeletes at time t4.

If the table was created previously, calling DROP TABLE followed by rolling back the transaction will not generate any events:

CREATE TABLE t (...); BEGIN; ...; DROP TABLE t; ...; ROLLBACK;
Rolling back a transaction with DROP TABLE will not generate unlink events
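All five scenarios can be reproduced with a small Python model of this outcome-dependent bookkeeping (a sketch of the idea behind smgrDoPendingDeletes, not the kernel's actual code):

```python
# Files created in the current transaction must be unlinked on ABORT;
# files dropped in it must be unlinked on COMMIT.
pending = []  # list of (relfilenode, delete_at_commit)

def create_table_file(relfilenode: int) -> None:
    pending.append((relfilenode, False))   # remove if the xact aborts

def drop_table_file(relfilenode: int) -> None:
    pending.append((relfilenode, True))    # remove only if the xact commits

def smgr_do_pending_deletes(is_commit: bool):
    """Return the relfilenodes actually unlinked at transaction end."""
    global pending
    unlinked = [rfn for rfn, at_commit in pending if at_commit == is_commit]
    pending = []
    return unlinked
```

A create followed by ROLLBACK unlinks the new files, a DROP followed by COMMIT unlinks the old ones, and a DROP followed by ROLLBACK unlinks nothing at all.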

How getting a changes track works

To get the changes track, the user needs to make a SELECT query to the arenadata_toolkit.tables_track view. Internally, this results in a call to the tracking_get_track function on the master and segment servers.

The changes track is a set of records in the following form:

  relid  |          relname           | relfilenode | size  | state | segid | relnamespace | relkind | relstorage | parent_relid
---------+----------------------------+-------------+-------+-------+-------+--------------+---------+------------+--------------
 5971396 | pg_aovisimap_5971392       |     2422956 |     0 | a     |    -1 |         6104 | M       | h          |      5971392
 5971394 | pg_aoseg_5971392           |     2422955 | 32768 | a     |    -1 |         6104 | o       | h          |      5971392
 5971402 | pg_aoblkdir_5971392_index  |     2422960 | 32768 | a     |    -1 |         6104 | i       | h          |      5971400
 5971392 | t1                         |     2422954 |     0 | a     |    -1 |         2200 | r       | a          |
 5971400 | pg_aoblkdir_5971392        |     2422959 |     0 | a     |    -1 |         6104 | b       | h          |      5971392
 5971399 | t1_c1_idx                  |     2422958 | 32768 | a     |    -1 |         2200 | i       | h          |      5971392
 5971398 | pg_aovisimap_5971392_index |     2422957 | 32768 | a     |    -1 |         6104 | i       | h          |      5971396
 5971396 | pg_aovisimap_5971392       |     2422949 |     0 | a     |     0 |         6104 | M       | h          |      5971392
 5971394 | pg_aoseg_5971392           |     2422948 | 32768 | a     |     0 |         6104 | o       | h          |      5971392
 5971402 | pg_aoblkdir_5971392_index  |     2422953 | 65536 | a     |     0 |         6104 | i       | h          |      5971400
 5971392 | t1                         |     2422947 |    32 | a     |     0 |         2200 | r       | a          |
 5971400 | pg_aoblkdir_5971392        |     2422952 | 32768 | a     |     0 |         6104 | b       | h          |      5971392
 5971399 | t1_c1_idx                  |     2422951 | 65536 | a     |     0 |         2200 | i       | h          |      5971392
 5971398 | pg_aovisimap_5971392_index |     2422950 | 32768 | a     |     0 |         6104 | i       | h          |      5971396
(14 rows)

Each record contains information about a table whose files had change track events. The time interval for tracking events spans from the previous request for the changes track until the current request via the tracking_get_track function.

The relfilenode value identifies the file that was modified. For example, for heap tables, changes to different segment files will still be reported under the original identifier from pg_class.relfilenode. The actual physical size of the table data files at the time the track is retrieved is shown in the size field. All information is provided per segment (the segid field), according to the set of tracked events and the filtering parameters (the relnamespace schema, the relkind table type, the relstorage storage type). Each file is uniquely matched to a table (the relid and relname fields).

This set of fields allows you to build quite interesting analytics; we will see this in the section on using the solution in practice.

Description of change track fields
Track field   Description
relid         Table OID (pg_class.oid)
relname       Table name
relfilenode   The relfilenode identifier (pg_class.relfilenode) of the table data file that was modified (one of the create, extend, truncate, or unlink events occurred)
size          The current (physical) size of the table data files
state         Possible values: a — active table; d — deleted table data files; i — initial snapshot
segid         The identifier of the segment hosting the table data file (the content field of the gp_segment_configuration table)
relnamespace  Schema OID
relkind       Table type (pg_class.relkind)
relstorage    Table data storage type (pg_class.relstorage)
parent_relid  OID of the parent table, used to aggregate data within a table (NULL at the top level)

One of the options for obtaining a changes track involves retrieving information on all required tables (according to the specified filters). This data is called the initial snapshot (the state field value for all records will be i). Such a snapshot is useful for obtaining the initial information on table sizes, which is then updated with tracks of subsequent changes. Obviously, getting the initial snapshot can take a significant amount of time on large clusters, but the subsequent change tracks can be extremely compact. We expect that in most cases only a limited, small set of tables will be affected in the few minutes between change track queries.

A simplified algorithm for the tracking_get_track function is:

  1. Open the pg_catalog.pg_class table for reading.

  2. For each entry in pg_catalog.pg_class:

    2.1    Check relnamespace according to the set of tracked schemas arenadata_toolkit.tracking_schemas.

    2.2    Check relkind according to the set of tracked table types arenadata_toolkit.tracking_relkinds.

    2.3    Check relstorage according to the set of tracked table data storage types arenadata_toolkit.tracking_relstorages.

    2.4    If pg_class.relfilenode is included in the set of modified files (checked via the bloom filter; details are in the following sections), then move on; otherwise, go to the next entry in step 2.

    2.5    Calculate the table size.

  3. If an initial snapshot is requested, then:

    3.1    Table deletion events are not reported.

  4. Else:

    4.1    Report deleted table data files with the d flag.
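The algorithm above can be sketched as follows (a simplified Python model; the row layout, filter sets, and size calculation are placeholders, not the extension's actual interfaces):

```python
def build_track(pg_class_rows, tracked_schemas, tracked_relkinds,
                tracked_relstorages, bloom_test, calc_size,
                initial_snapshot, unlinked):
    """A model of tracking_get_track's filtering scan over pg_class.

    Each row is (oid, relfilenode, relnamespace, relkind, relstorage).
    """
    track = []
    for oid, relfilenode, nsp, kind, storage in pg_class_rows:
        # Steps 2.1-2.3: cheap catalog-level filters.
        if (nsp not in tracked_schemas or kind not in tracked_relkinds
                or storage not in tracked_relstorages):
            continue
        # Step 2.4: discard files with no recorded change events; the bloom
        # filter may yield false positives but never false negatives.
        if not initial_snapshot and not bloom_test(relfilenode):
            continue
        # Step 2.5: only now pay for the size calculation.
        state = "i" if initial_snapshot else "a"
        track.append((oid, calc_size(relfilenode), state))
    if not initial_snapshot:
        # Step 4.1: deleted files are reported with the 'd' state
        # (only their relfilenode is known by this point).
        track.extend((rfn, 0, "d") for rfn in unlinked)
    return track
```

The key property is that the expensive size calculation runs only for rows that survive the bloom filter check.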

How events are handled in changes track

Now let’s talk more about the implementation of step 2.4 in the above-described algorithm for the tracking_get_track function.

This step is key in terms of saving resources and speeding up table size tracking in general. The purpose of this step is to discard relfilenode identifiers that were not associated with monitored events and obtain sizes only for tables that have changed. Yes, the algorithm involves going through all pg_class rows, but this takes little time even on large clusters.

Another important factor is that when file change events occur, saving them must be extremely fast so they can later be checked against this set in step 2.4. When data is inserted and the extend event fires, this operation is extremely sensitive for the overall performance of the solution.

Taking into account the above principles of creating a track on the fly and using compact data structures, we decided to build our implementation based on storing information about a part of the events in the bloom filter.

This data structure implements a "probabilistic approach" to determining whether an element belongs to a certain set. Given the compactness of the representation and the speed of operation, it is generally suitable for solving the problem. Even if, in case of a false positive, we report that some file has changed and add information about this event to the changes track, this can be perceived as a refresh of the information. Since the bloom filter hash functions are implementation-specific and cannot be customized by the user, the only parameter that controls the rate of false positives is the size of the filter. A bloom filter is, in essence, a bit array. To configure its size, we chose a value in bytes, determined by the arenadata_toolkit.tracking_bloom_size GUC setting. The main thing here is to choose the correct bloom filter size. After experimenting a bit, I chose 2 MB (with values below 1 MB, files not affected by changes regularly appeared in the changes track).
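To make the mechanics concrete, here is a minimal Python model of such a filter (the hash functions are illustrative, not the ones used in the extension):

```python
class BloomFilter:
    """A minimal bloom filter over relfilenode identifiers.

    The real filter is a bit array whose size is set by the
    arenadata_toolkit.tracking_bloom_size GUC (e.g. 2 MB).
    """

    def __init__(self, size_bytes: int):
        self.nbits = size_bytes * 8
        self.bits = bytearray(size_bytes)

    def _positions(self, relfilenode: int):
        # Two cheap hashes combined by double hashing (illustrative only).
        h1 = (relfilenode * 2654435761) % self.nbits
        h2 = ((relfilenode ^ (relfilenode >> 13)) * 40503 + 1) % self.nbits
        return h1, (h1 + h2) % self.nbits

    def set(self, relfilenode: int) -> None:
        """The hook-side path: extremely cheap, just sets two bits."""
        for p in self._positions(relfilenode):
            self.bits[p // 8] |= 1 << (p % 8)

    def test(self, relfilenode: int) -> bool:
        """Track-side check: false positives possible, false negatives not."""
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(relfilenode))
```

Setting bits on every extend is what keeps the hot path cheap; the pg_class scan then calls test for each candidate relfilenode.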

As an example, let’s consider handling the extend event (truncate is fundamentally no different).

Sequence diagram of processing the extend event

The extend event handler checks that the given file is trackable (the is_file_node_trackable call). Trackable relations are tables and materialized views with RELPERSISTENCE_PERMANENT, RELPERSISTENCE_TEMP, or RELPERSISTENCE_UNLOGGED persistence.

Next, the relfilenode identifier is recorded in the bloom filter (bloom_set_set) according to the results of the bloom filter hash functions. This is a very fast operation. After that, by traversing pg_class and checking identifiers against the bloom filter, it is possible to find out whether any file associated with a given table has changed (allowing for the accepted risk of false positives).

The peculiarity of the unlink event is that at the moment the hook is called, pg_class no longer contains information about the table linked to the deleted files. In fact, we only have the relfilenode identifier at our disposal. Obviously, traversing pg_class will not give us data about this table, so we have to save the relfilenode identifier received in the hook in a linked list until the changes track is requested.
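This bookkeeping can be sketched as follows (a Python model with a deque standing in for the linked list; names are illustrative):

```python
from collections import deque

# relfilenode identifiers whose files were unlinked since the previous track
# request; at unlink time the row is already gone from pg_class, so the
# identifier itself is all we can keep.
pending_unlinks = deque()

def on_unlink(relfilenode: int) -> None:
    """The unlink hook: just remember the identifier."""
    pending_unlinks.append(relfilenode)

def drain_unlinks():
    """Called at track time: report and clear the accumulated unlinks."""
    out = []
    while pending_unlinks:
        out.append((pending_unlinks.popleft(), "d"))  # state 'd' = deleted
    return out
```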

A detailed discussion of handling the create event is beyond the scope of this article. I will only say that at the moment, the implementation does not allow the changes track to include tables created within a transaction that has not yet been committed. In the current implementation, we will report on such a table only after the transaction is committed. We will prepare an additional article describing the create event in more detail when this feature is released.

Transactional semantics of the tracking_get_track function

The changes track allows the following types of queries to be received:

-- We return information about changes in tables and "reset" the track.
-- Thus, changes included in this set of records will not be returned in the next query.
BEGIN; SELECT * FROM arenadata_toolkit.tables_track; COMMIT;

-- We return information about changes in tables, but the track is not reset.
-- In the next query, the changes will be a "superposition" of all the changes.
BEGIN; SELECT * FROM arenadata_toolkit.tables_track; ROLLBACK;

This can be useful for users who process records in some way and, say, transfer intermediate aggregate metrics to some other storage. If this processing fails, the transaction can be rolled back and the records (both new and old) can be processed in the next query.

To implement such a scheme, we used an interesting approach based on a version counter.

The idea is to compute a counter value on the master segment in a stable function. This value is then passed to the segments for comparison with their current value. If the transaction is committed, the version counter is incremented (both on the master and on the segments); otherwise, it remains the same. The nuance is that since the track-receiving transaction is not a writing transaction, and the track preparation function may execute on a reader slice, the master will not learn the result of this operation on the segments.

Thus, there are two possible options for the segment:

  • The counter value received from the master is equal to the current value on the segment, which means that the previous track-obtaining transaction was committed at the level of the entire cluster.

  • The value is different, which is equivalent to a transaction rollback.

We will see how these situations affect track processing in the next section.

Implementation details based on bloom filter

Each database is associated with its own set of bloom filters that record change events occurring in the table data files. In essence, the filter’s hash functions compute the bits to set from the RelFileNodeBackend identifier. For some events (extend, truncate), simply recording the event in the filter is enough. For create and unlink, as already mentioned, changes tracking has to use additional data structures and more complex processing.

To support the transactional semantics, a pair of filters is required:

  • The first bloom filter instance is needed to store the global context, which is required for individual calls of the tracking_get_track function in different transactions.

  • The second bloom filter instance is required for the local context of the current tracking_get_track call.

Global and local context of bloom filters

The first instance of a bloom filter can be thought of as a single bit array, logically divided into two equal parts (bloom bits and bloom bits shadow). The first part stores the bits of the current bloom filter, which is populated by the callback functions that set the filter bits (hooks such as file_create_hook_type). The second part stores the filter state at the time of the previous call to tracking_get_track.

So, we have a compactly stored state for the current moment and for the moment of the previous call. The previous state, as you can guess, is needed to handle the transaction’s ROLLBACK.

For each tracking_get_track call, a local context with an initially empty filter is allocated, and the current filter from the global context (the left part of the array) is copied into it. The tracking_get_track function then works with this private state. Copying the bloom filter, of course, happens under a lock, but it is fast. Meanwhile, the global context continues to track current events without affecting the copy that the tracking_get_track transaction works with.

If the transaction is rolled back, the next query joins the two filters bitwise, returning both the previous and the current changes. If the transaction is committed, only the latest data is returned.
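Putting the pieces together, the filter pair can be sketched as follows (a simplified Python model; the real filters are fixed-size shared-memory bit arrays, and the hash functions shown are stand-ins):

```python
# Toy model of the filter pair: `bits` is the current filter fed by the hooks,
# `shadow` keeps the state handed out by the previous tracking_get_track call.
class TrackFilterPair:
    SIZE = 64  # illustrative filter width

    def __init__(self):
        self.bits = 0
        self.shadow = 0

    def _hash_bits(self, relfilenode):
        # stand-in for the real bloom hash functions
        return (1 << (relfilenode % self.SIZE)) | \
               (1 << ((relfilenode * 31) % self.SIZE))

    def record(self, relfilenode):
        # called from the hooks while events keep arriving
        self.bits |= self._hash_bits(relfilenode)

    def get_track(self, previous_committed):
        # copy into the local context (done under a lock in the real code);
        # after a rollback, OR in the shadow to replay the previous changes
        local = self.bits if previous_committed else (self.bits | self.shadow)
        self.shadow = local  # remember what this call handed out
        self.bits = 0        # the global filter keeps tracking new events
        return local

    def might_contain(self, snapshot, relfilenode):
        h = self._hash_bits(relfilenode)
        return (snapshot & h) == h
```

After a ROLLBACK, the next get_track returns the bitwise union of both halves; after a COMMIT, only the freshly accumulated bits come back.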

Practical part: implementation of basic table monitoring

In the final part of the article, I will show how the track can be used to build a simple monitoring system and basic analytics on tables.

Initial metric source configuration

Let’s start by installing the extension, configuring filtering parameters, enabling changes tracking for the selected database, and checking that the changes are tracked:

$ gpconfig -c shared_preload_libraries -v 'arenadata_toolkit' (1)
20250122:16:22:12:164571 gpconfig:ldap:andrey-[INFO]:-completed successfully with parameters '-c shared_preload_libraries -v arenadata_toolkit'

$ gpstop -ar

$ psql
psql (9.4.26)
Type "help" for help.

andrey=# CREATE EXTENSION arenadata_toolkit;
andrey=# \dx
                                          List of installed extensions
       Name        | Version |   Schema   |                             Description
-------------------+---------+------------+---------------------------------------------------------------------
 arenadata_toolkit | 1.7     | public     | extension is used for manipulation of objects created by adb-bundle
 plpgsql           | 1.0     | pg_catalog | PL/pgSQL procedural language
(2 rows)

andrey=# SELECT arenadata_toolkit.tracking_set_relkinds('r,i,t');  (2)
andrey=# SELECT arenadata_toolkit.tracking_set_relstorages('h,a'); (3)
andrey=# SHOW arenadata_toolkit.tracking_schemas;
                    arenadata_toolkit.tracking_schemas
--------------------------------------------------------------------------
 public,arenadata_toolkit,pg_catalog,pg_toast,pg_aoseg,information_schema  (4)

andrey=# SELECT arenadata_toolkit.tracking_unregister_schema('information_schema'); (5)
andrey=# SELECT arenadata_toolkit.tracking_unregister_schema('pg_aoseg');           (6)
andrey=# SHOW arenadata_toolkit.tracking_schemas;
     arenadata_toolkit.tracking_schemas
--------------------------------------------
 public,arenadata_toolkit,pg_toast,pg_catalog

andrey=# SELECT arenadata_toolkit.tracking_register_db();                 (7)
andrey=# SELECT arenadata_toolkit.tracking_trigger_initial_snapshot();    (8)

andrey=# SELECT * from arenadata_toolkit.tables_track;    (9)
 relid |                            relname                             | relfilenode |  size  | state | segid | relnamespace | relkind | relstorage | parent_relid
-------+----------------------------------------------------------------+-------------+--------+-------+-------+--------------+---------+------------+--------------
  2840 | pg_toast_2619                                                  |       12542 |      0 | i     |    -1 |           99 | t       | h          |         2619
  2836 | pg_toast_1255                                                  |           0 |      0 | i     |    -1 |           99 | t       | h          |         1255
  2830 | pg_toast_2604                                                  |       12564 |      0 | i     |    -1 |           99 | t       | h          |         2604
  2832 | pg_toast_2606                                                  |       12571 |      0 | i     |    -1 |           99 | t       | h          |         2606
  2838 | pg_toast_2618                                                  |       12600 | 393216 | i     |    -1 |           99 | t       | h          |         2618
  2336 | pg_toast_2620                                                  |       12607 |      0 | i     |    -1 |           99 | t       | h          |         2620
  2834 | pg_toast_2609                                                  |       12616 |      0 | i     |    -1 |           99 | t       | h          |         2609
  2966 | pg_toast_2964                                                  |           0 |      0 | i     |    -1 |           99 | t       | h          |         2964
  2846 | pg_toast_2396                                                  |           0 |      0 | i     |    -1 |           99 | t       | h          |         2396
  5510 | pg_toast_3079                                                  |       12667 |      0 | i     |    -1 |           99 | t       | h          |         3079
  3598 | pg_toast_3596                                                  |       12682 |      0 | i     |    -1 |           99 | t       | h          |         3596
  6092 | pg_toast_5036                                                  |           0 |      0 | i     |    -1 |           99 | t       | h          |         5036
  6233 | pg_toast_6231                                                  |       12732 |      0 | i     |    -1 |           99 | t       | h          |         6231
  9905 | pg_toast_9903                                                  |       12745 |      0 | i     |    -1 |           99 | t       | h          |         9903
  6222 | pg_toast_6220                                                  |       12752 |      0 | i     |    -1 |           99 | t       | h          |         6220
 12202 | pg_toast_12200                                                 |       12765 |      0 | i     |    -1 |           99 | t       | h          |        12200
 12207 | pg_toast_12205                                                 |       12772 |      0 | i     |    -1 |           99 | t       | h          |        12205
 12212 | pg_toast_12210                                                 |       12779 |      0 | i     |    -1 |           99 | t       | h          |        12210
 12217 | pg_toast_12215                                                 |       12786 |      0 | i     |    -1 |           99 | t       | h          |        12215
 12222 | pg_toast_12220                                                 |       12793 |      0 | i     |    -1 |           99 | t       | h          |        12220
 12227 | pg_toast_12225                                                 |       12800 |      0 | i     |    -1 |           99 | t       | h          |        12225
 12232 | pg_toast_12230                                                 |       12807 |      0 | i     |    -1 |           99 | t       | h          |        12230
  2755 | pg_opfamily_oid_index                                          |       10033 |  65536 | i     |    -1 |           11 | i       | h          |         2753
  6058 | pg_statlastshop_classid_objid_staactionname_index              |           0 |      0 | i     |    -1 |           11 | i       | h          |         6056
  6056 | pg_stat_last_shoperation                                       |           0 |      0 | i     |    -1 |           11 | r       | h          |
  2692 | pg_rewrite_oid_index                                           |       10047 |  65536 | i     |    -1 |           11 | i       | h          |         2618
  2677 | pg_authid_oid_index                                            |           0 |      0 | i     |    -1 |           11 | i       | h          |         1260
  2611 | pg_inherits                                                    |       12573 |      0 | i     |    -1 |           11 | r       | h          |
 12214 | pg_toast_12210_index                                           |       10155 |  32768 | i     |    -1 |           99 | i       | h          |        12212
--More--

Let’s explain some steps:

1 arenadata_toolkit.so should be loaded as a shared preload library before changes tracking is activated for the database.
2 arenadata_toolkit.tracking_relkinds GUC defines a set of table types for which events are tracked. By default, the following table types are tracked: r, i, t, m, o, b, M. This corresponds to the following set of types:
  RELKIND_RELATION		  'r'		/* ordinary table */
  RELKIND_INDEX			  'i'		/* secondary index */
  RELKIND_TOASTVALUE	  't'		/* for out-of-line values */
  RELKIND_MATVIEW		  'm'		/* materialized view */
  RELKIND_AOSEGMENTS	  'o'		/* AO segment files and eof's */
  RELKIND_AOBLOCKDIR	  'b'		/* AO block directory */
  RELKIND_AOVISIMAP		  'M'		/* AO visibility map */

As you can see from the line with marker 2, we reduce this set to ordinary tables, indexes, and toast tables. For example, aux tables of AO tables are not included in the track.

3 arenadata_toolkit.tracking_relstorages GUC defines a set of table storage types. By default, the following types of tables are tracked: h, a, c. This corresponds to the set:
  RELSTORAGE_HEAP	'h'
  RELSTORAGE_AOROWS	'a'
  RELSTORAGE_AOCOLS	'c'

From the line with marker 3, it can be seen that this set is reduced to heap tables and row-oriented AO tables.

4 Display the list of schemas included in the changes track selection by default.
5 Exclude information_schema from the list of schemas we are interested in.
6 Exclude pg_aoseg from the list of schemas we are interested in.
7 Register the current database as being tracked by the changes track.
8 To demonstrate the tracking_trigger_initial_snapshot function, we set the flag to receive the full track on the next query.
9 A track request from the tables_track view will give us a complete snapshot, as indicated by the i flag in each value of the state field.

As a result, we got a complete snapshot, including the public, arenadata_toolkit, pg_toast, and pg_catalog schemas, but from which the columnar AO tables and all aux tables are excluded. This is easy to check and reconfigure on the fly:

andrey=# CREATE TABLE aoro (c1 INT) WITH (appendoptimized=true, orientation=row);
CREATE TABLE

andrey=# CREATE TABLE aoco (c1 INT) WITH (appendoptimized=true, orientation=column);
CREATE TABLE

andrey=# BEGIN; SELECT * FROM arenadata_toolkit.tables_track WHERE relnamespace = 2200; ROLLBACK;
BEGIN
 relid | relname | relfilenode | size | state | segid | relnamespace | relkind | relstorage | parent_relid
-------+---------+-------------+------+-------+-------+--------------+---------+------------+--------------
 16542 | aoro    |       16438 |    0 | a     |    -1 |         2200 | r       | a          |
 16542 | aoro    |       16438 |    0 | a     |     1 |         2200 | r       | a          |
 16542 | aoro    |       16438 |    0 | a     |     2 |         2200 | r       | a          |
 16542 | aoro    |       16438 |    0 | a     |     0 |         2200 | r       | a          |
(4 rows)

ROLLBACK

andrey=# SELECT arenadata_toolkit.tracking_set_relstorages('h,a,c');    (1)
andrey=# SELECT arenadata_toolkit.tracking_set_relkinds('r,i,t,o,M');   (2)
andrey=# SELECT arenadata_toolkit.tracking_register_schema('pg_aoseg'); (3)

andrey=# BEGIN; SELECT * FROM arenadata_toolkit.tables_track WHERE relnamespace IN (2200, 6104); ROLLBACK;
BEGIN
 relid |         relname          | relfilenode | size  | state | segid | relnamespace | relkind | relstorage | parent_relid
-------+--------------------------+-------------+-------+-------+-------+--------------+---------+------------+--------------
 16546 | pg_aovisimap_16542       |       16440 |     0 | a     |    -1 |         6104 | M       | h          |        16542
 16553 | pg_aovisimap_16549       |       16444 |     0 | a     |    -1 |         6104 | M       | h          |        16549
 16551 | pg_aocsseg_16549         |       16443 |     0 | a     |    -1 |         6104 | o       | h          |        16549
 16544 | pg_aoseg_16542           |       16439 |     0 | a     |    -1 |         6104 | o       | h          |        16542
 16542 | aoro                     |       16438 |     0 | a     |    -1 |         2200 | r       | a          |
 16555 | pg_aovisimap_16549_index |       16445 | 32768 | a     |    -1 |         6104 | i       | h          |        16553
 16548 | pg_aovisimap_16542_index |       16441 | 32768 | a     |    -1 |         6104 | i       | h          |        16546
 16549 | aoco                     |       16442 |     0 | a     |    -1 |         2200 | r       | c          |
 16553 | pg_aovisimap_16549       |       16444 |     0 | a     |     0 |         6104 | M       | h          |        16549
 16553 | pg_aovisimap_16549       |       16444 |     0 | a     |     1 |         6104 | M       | h          |        16549
 16553 | pg_aovisimap_16549       |       16444 |     0 | a     |     2 |         6104 | M       | h          |        16549
 16548 | pg_aovisimap_16542_index |       16441 | 32768 | a     |     0 |         6104 | i       | h          |        16546
 16548 | pg_aovisimap_16542_index |       16441 | 32768 | a     |     1 |         6104 | i       | h          |        16546
 16548 | pg_aovisimap_16542_index |       16441 | 32768 | a     |     2 |         6104 | i       | h          |        16546
 16551 | pg_aocsseg_16549         |       16443 |     0 | a     |     0 |         6104 | o       | h          |        16549
 16551 | pg_aocsseg_16549         |       16443 |     0 | a     |     1 |         6104 | o       | h          |        16549
 16551 | pg_aocsseg_16549         |       16443 |     0 | a     |     2 |         6104 | o       | h          |        16549
 16549 | aoco                     |       16442 |     0 | a     |     0 |         2200 | r       | c          |
 16549 | aoco                     |       16442 |     0 | a     |     1 |         2200 | r       | c          |
 16549 | aoco                     |       16442 |     0 | a     |     2 |         2200 | r       | c          |
 16555 | pg_aovisimap_16549_index |       16445 | 32768 | a     |     1 |         6104 | i       | h          |        16553
 16555 | pg_aovisimap_16549_index |       16445 | 32768 | a     |     0 |         6104 | i       | h          |        16553
 16555 | pg_aovisimap_16549_index |       16445 | 32768 | a     |     2 |         6104 | i       | h          |        16553
 16546 | pg_aovisimap_16542       |       16440 |     0 | a     |     1 |         6104 | M       | h          |        16542
 16546 | pg_aovisimap_16542       |       16440 |     0 | a     |     0 |         6104 | M       | h          |        16542
 16546 | pg_aovisimap_16542       |       16440 |     0 | a     |     2 |         6104 | M       | h          |        16542
 16544 | pg_aoseg_16542           |       16439 |     0 | a     |     1 |         6104 | o       | h          |        16542
 16544 | pg_aoseg_16542           |       16439 |     0 | a     |     0 |         6104 | o       | h          |        16542
 16544 | pg_aoseg_16542           |       16439 |     0 | a     |     2 |         6104 | o       | h          |        16542
 16542 | aoro                     |       16438 |     0 | a     |     1 |         2200 | r       | a          |
 16542 | aoro                     |       16438 |     0 | a     |     0 |         2200 | r       | a          |
 16542 | aoro                     |       16438 |     0 | a     |     2 |         2200 | r       | a          |
(32 rows)

ROLLBACK

Note that in the lines with markers 1-3, we brought back the column-oriented AO tables, the visibility maps (RELKIND_AOVISIMAP), the main metadata tables of AO tables (RELKIND_AOSEGMENTS), and the pg_aoseg service schema.

Moreover, the information on these tables was included in the second selection, since we did ROLLBACK!

Implementing the simplest metrics monitoring

Let’s build a simple, but nevertheless useful, basic monitoring based on our solution.

Diagram of monitoring components

The architecture of interaction between the components will be as follows:

  • Arenadata DB cluster with installed arenadata_toolkit supporting the changes track.

  • A separate PostgreSQL 16 instance that will request the track every 30 seconds, store the raw track data, and calculate aggregates.

  • A Grafana instance for storing and displaying metrics.

Table structure for storing track data and data aggregation

To obtain metrics from the Arenadata DB cluster, we use the postgres_fdw extension, which we install on the PostgreSQL cluster, and configure the server, user mapping, and external tables:

postgres=# CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER adb6_server
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'localhost', port '6000', dbname 'andrey');

CREATE USER MAPPING FOR CURRENT_USER
SERVER adb6_server
OPTIONS (user 'adb_track_files', password 'qwerty');

CREATE FOREIGN TABLE public.adb_tables_track (
  relid oid,
  relname name,
  relfilenode oid,
  size bigint,
  state "char",
  segid int,
  relnamespace oid,
  relkind "char",
  relstorage "char",
  parent_relid oid
)
SERVER adb6_server
OPTIONS (schema_name 'arenadata_toolkit', table_name 'tables_track');

CREATE FOREIGN TABLE public.is_initial_snapshot_triggered (is_triggered int)
SERVER adb6_server
OPTIONS (schema_name 'arenadata_toolkit', table_name 'is_initial_snapshot_triggered');

On the Arenadata DB side, configure a user to receive a track from the PostgreSQL side:

$ psql -c 'CREATE USER adb_track_files WITH ENCRYPTED PASSWORD '\''qwerty'\'';'
$ echo 'host andrey adb_track_files 127.0.0.1/32 md5' >> $MASTER_DATA_DIRECTORY/pg_hba.conf
$ gpstop -u

On the PostgreSQL side, check the possibility of obtaining a track:

postgres=# BEGIN; SELECT * FROM public.adb_tables_track; ROLLBACK;
 relid |                   relname                   | relfilenode | size  | state | segid | relnamespace | relkind | relstorage | parent_relid
-------+---------------------------------------------+-------------+-------+-------+-------+--------------+---------+------------+--------------
  5002 | gp_distribution_policy                      |       12704 | 32768 | a     |    -1 |           11 | r       | h          |
  6231 | pg_attribute_encoding                       |       12727 | 32768 | a     |    -1 |           11 | r       | h          |
  6105 | pg_appendonly                               |       12717 | 32768 | a     |    -1 |           11 | r       | h          |
  6067 | gp_fastsequence_objid_objmod_index          |       10127 | 65536 | a     |    -1 |           11 | i       | h          |         5043
 16555 | pg_aovisimap_16549_index                    |       16445 | 32768 | a     |    -1 |         6104 | i       | h          |        16553
 16546 | pg_aovisimap_16542                          |       16440 |     0 | a     |    -1 |         6104 | M       | h          |        16542
  5007 | pg_appendonly_relid_index                   |       10126 | 65536 | a     |    -1 |           11 | i       | h          |         6105
 16553 | pg_aovisimap_16549                          |       16444 |     0 | a     |    -1 |         6104 | M       | h          |        16549
 16548 | pg_aovisimap_16542_index                    |       16441 | 32768 | a     |    -1 |         6104 | i       | h          |        16546
 16551 | pg_aocsseg_16549                            |       16443 |     0 | a     |    -1 |         6104 | o       | h          |        16549
 16544 | pg_aoseg_16542                              |       16439 |     0 | a     |    -1 |         6104 | o       | h          |        16542
 16542 | aoro                                        |       16438 |     0 | a     |    -1 |         2200 | r       | a          |
  6236 | pg_attribute_encoding_attrelid_index        |       10137 | 65536 | a     |    -1 |           11 | i       | h          |         6231
 16549 | aoco                                        |       16442 |     0 | a     |    -1 |         2200 | r       | c          |
  6103 | gp_policy_localoid_index                    |       10121 | 65536 | a     |    -1 |           11 | i       | h          |         5002
  5043 | gp_fastsequence                             |       12719 | 32768 | a     |    -1 |           11 | r       | h          |
  6237 | pg_attribute_encoding_attrelid_attnum_index |       10138 | 65536 | a     |    -1 |           11 | i       | h          |         6231
  5002 | gp_distribution_policy                      |       12704 | 32768 | a     |     1 |           11 | r       | h          |
  6231 | pg_attribute_encoding                       |       12727 | 32768 | a     |     1 |           11 | r       | h          |
  6105 | pg_appendonly                               |       12717 | 32768 | a     |     1 |           11 | r       | h          |
  6067 | gp_fastsequence_objid_objmod_index          |       10127 | 65536 | a     |     1 |           11 | i       | h          |         5043
 16555 | pg_aovisimap_16549_index                    |       16445 | 32768 | a     |     1 |         6104 | i       | h          |        16553
 16546 | pg_aovisimap_16542                          |       16440 |     0 | a     |     1 |         6104 | M       | h          |        16542
  5007 | pg_appendonly_relid_index                   |       10126 | 65536 | a     |     1 |           11 | i       | h          |         6105
 16553 | pg_aovisimap_16549                          |       16444 |     0 | a     |     1 |         6104 | M       | h          |        16549
 16548 | pg_aovisimap_16542_index                    |       16441 | 32768 | a     |     1 |         6104 | i       | h          |        16546
 16551 | pg_aocsseg_16549                            |       16443 |     0 | a     |     1 |         6104 | o       | h          |        16549
 16544 | pg_aoseg_16542                              |       16439 |     0 | a     |     1 |         6104 | o       | h          |        16542
 16542 | aoro                                        |       16438 |     0 | a     |     1 |         2200 | r       | a          |
  6236 | pg_attribute_encoding_attrelid_index        |       10137 | 65536 | a     |     1 |           11 | i       | h          |         6231
 16549 | aoco                                        |       16442 |     0 | a     |     1 |         2200 | r       | c          |
  6103 | gp_policy_localoid_index                    |       10121 | 65536 | a     |     1 |           11 | i       | h          |         5002
  5043 | gp_fastsequence                             |       12719 | 32768 | a     |     1 |           11 | r       | h          |
  6237 | pg_attribute_encoding_attrelid_attnum_index |       10138 | 65536 | a     |     1 |           11 | i       | h          |         6231
  5002 | gp_distribution_policy                      |       12704 | 32768 | a     |     2 |           11 | r       | h          |
  6231 | pg_attribute_encoding                       |       12727 | 32768 | a     |     2 |           11 | r       | h          |
  6105 | pg_appendonly                               |       12717 | 32768 | a     |     2 |           11 | r       | h          |
  6067 | gp_fastsequence_objid_objmod_index          |       10127 | 65536 | a     |     2 |           11 | i       | h          |         5043
 16555 | pg_aovisimap_16549_index                    |       16445 | 32768 | a     |     2 |         6104 | i       | h          |        16553
 16546 | pg_aovisimap_16542                          |       16440 |     0 | a     |     2 |         6104 | M       | h          |        16542
  5007 | pg_appendonly_relid_index                   |       10126 | 65536 | a     |     2 |           11 | i       | h          |         6105
:

Create service tables to save "raw" track data:

CREATE TABLE public.adb_tables_track_snapshot (relid oid,
            relname name,
            relfilenode oid,
            size bigint,
            state "char",
            segid int,
            relnamespace oid,
            relkind name,
            relstorage name,
            parent_relid oid,
            ts timestamp);

CREATE UNIQUE INDEX relfilenode_segid_relid_idx ON public.adb_tables_track_snapshot (relfilenode, segid);

CREATE OR REPLACE VIEW public.table_sizes AS
WITH RECURSIVE t AS (
  SELECT
    relid as toprelid,
    relid,
    relname,
    relkind,
    relstorage,
    relnamespace,
    size,
    segid
  FROM public.adb_tables_track_snapshot
  WHERE
  parent_relid IS NULL
  AND
  state <> 'd'
  UNION
  SELECT
    c.toprelid,
    p.relid,
    c.relname,
    c.relkind,
    c.relstorage,
    c.relnamespace,
    p.size,
    p.segid
  FROM public.adb_tables_track_snapshot p
  INNER JOIN t c ON c.relid = p.parent_relid
  WHERE
  state <> 'd'
)
SELECT toprelid, relname, relkind, relstorage, relnamespace, SUM(size) AS total_size
FROM t
GROUP BY toprelid, relname, relkind, relstorage, relnamespace;
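What the view computes can be illustrated with a small model (Python, with illustrative data only): every auxiliary object carries a parent_relid, and sizes roll up along the parent chain to the top-level table.

```python
# Toy model of the recursive rollup performed by table_sizes: toast tables,
# indexes, and AO aux tables point at a parent, and their sizes accumulate
# on the top-level relation.
def rollup_sizes(rows):
    """rows: dicts with relid, parent_relid (None for a top-level table), size."""
    parent = {r["relid"]: r["parent_relid"] for r in rows}

    def top(relid):
        # walk the parent chain up to the top-level relation
        while parent.get(relid) is not None:
            relid = parent[relid]
        return relid

    totals = {}
    for r in rows:
        t = top(r["relid"])
        totals[t] = totals.get(t, 0) + r["size"]
    return totals
```

For example, the 32768 bytes of the pg_aoseg_16542 helper from the earlier listings would be attributed to the aoro table (relid 16542), even though the helper lives in the pg_aoseg schema.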

To collect and process track data, use this function:

CREATE OR REPLACE FUNCTION public.collect_track()
RETURNS void
LANGUAGE plpgsql
AS $function$
BEGIN

    DROP TABLE IF EXISTS adb_tables_track_snapshot_temp;

    CREATE TEMP TABLE adb_tables_track_snapshot_temp AS
    SELECT
        relid,
        relname,
        relfilenode,
        size,
        state,
        segid,
        relnamespace,
        case
            when relkind='r' then 'ordinary table'
            when relkind='i' then 'secondary index'
            when relkind='s' then 'sequence object'
            when relkind='t' then 'toast'
            when relkind='v' then 'view'
            when relkind='m' then 'materialized view'
            when relkind='c' then 'composite type'
            when relkind='f' then 'foreign table'
            when relkind='p' then 'partitioned table'
            when relkind='I' then 'partitioned index'
            when relkind='o' then 'AO segment files'
            when relkind='b' then 'AO block directory'
            when relkind='M' then 'AO visibility map'
            else 'unknown relkind'
        end as relkind,
        case
            when relstorage='h' then 'heap storage'
            when relstorage='a' then 'append only storage'
            when relstorage='c' then 'append only column storage'
            when relstorage='v' then 'virtual storage'
            when relstorage='x' then 'stored externally'
            when relstorage='f' then 'stored in another server'
            else 'unknown relstorage'
        end as relstorage,
        parent_relid,
        current_timestamp AS ts
    FROM adb_tables_track;

	IF EXISTS (SELECT 1 FROM is_initial_snapshot_triggered WHERE is_triggered = 1) (3)
	THEN

		TRUNCATE TABLE adb_tables_track_snapshot;

		INSERT INTO adb_tables_track_snapshot
		SELECT *
		FROM adb_tables_track_snapshot_temp
		WHERE relfilenode <> 0; (1)

	ELSE

        INSERT INTO adb_tables_track_snapshot
		SELECT
			relid,
			relname,
			relfilenode,
			size,
			state,
			segid,
			relnamespace,
			relkind,
			relstorage,
			parent_relid,
			current_timestamp AS ts
		FROM adb_tables_track_snapshot_temp
		WHERE relfilenode <> 0
		ON CONFLICT (relfilenode, segid) (2)
		DO UPDATE SET
		  size = EXCLUDED.size,
		  ts = EXCLUDED.ts,
		  state = EXCLUDED.state;

	END IF;

END$function$;

There are some important points to note:

1 With the WHERE relfilenode <> 0 condition, we exclude mapped system catalogs (in PostgreSQL terms), for which the relfilenode field in pg_class is always 0 — for example, pg_type, pg_attribute, pg_database, pg_tablespace, and a number of others. Nothing prevents tracking changes for them, but handling them would complicate this example: such rows would have to be identified by the (relid, segid) pair rather than by (relfilenode, segid) (the line with the 2 marker).
2 To update information on tables, the INSERT INTO …​ ON CONFLICT (…​) DO UPDATE SET statement is used. It allows updating data in the snapshot based on the changes received in the track.
3 If an initial snapshot was requested on the source side, then the full initial snapshot is saved.

Before getting the initial snapshot, call the arenadata_toolkit.tracking_trigger_initial_snapshot() function on the Arenadata DB side:

andrey=# SELECT arenadata_toolkit.tracking_trigger_initial_snapshot();
 tracking_trigger_initial_snapshot
-----------------------------------
 t
(1 row)

Check that the changes track is received and updated:

postgres=# SELECT public.collect_track();
NOTICE:  table "adb_tables_track_snapshot_temp" does not exist, skipping
 collect_track
---------------

(1 row)

postgres=# SELECT * FROM table_sizes WHERE relnamespace = 2200;

 toprelid |         relname         |    relkind     |         relstorage         | relnamespace | total_size
----------+-------------------------+----------------+----------------------------+--------------+------------
    16542 | aoro                    | ordinary table | append only storage        |         2200 |     131072
    16549 | aoco                    | ordinary table | append only column storage |         2200 |     131072
(2 rows)

On the Arenadata DB side, insert rows into the aoro table and drop the aoco table:

andrey=# INSERT INTO aoro SELECT generate_series(1,100000000);
INSERT 0 100000000

andrey=# DROP TABLE aoco;

andrey=# BEGIN; SELECT * from arenadata_toolkit.tables_track; ROLLBACK;
BEGIN
 relid |    relname     | relfilenode |   size    | state | segid | relnamespace | relkind | relstorage | parent_relid
-------+----------------+-------------+-----------+-------+-------+--------------+---------+------------+--------------
       |                |       16442 |         0 | d     |    -1 |              |         |            |
       |                |       16443 |         0 | d     |    -1 |              |         |            |
       |                |       16444 |         0 | d     |    -1 |              |         |            |
       |                |       16445 |         0 | d     |    -1 |              |         |            |
 16544 | pg_aoseg_16542 |       16439 |     32768 | a     |    -1 |         6104 | o       | h          |        16542
       |                |       16442 |         0 | d     |     2 |              |         |            |
       |                |       16443 |         0 | d     |     2 |              |         |            |
       |                |       16444 |         0 | d     |     2 |              |         |            |
       |                |       16445 |         0 | d     |     2 |              |         |            |
 16544 | pg_aoseg_16542 |       16439 |     32768 | a     |     2 |         6104 | o       | h          |        16542
 16542 | aoro           |       16438 | 333872320 | a     |     2 |         2200 | r       | a          |
       |                |       16442 |         0 | d     |     0 |              |         |            |
       |                |       16443 |         0 | d     |     0 |              |         |            |
       |                |       16444 |         0 | d     |     0 |              |         |            |
       |                |       16445 |         0 | d     |     0 |              |         |            |
 16544 | pg_aoseg_16542 |       16439 |     32768 | a     |     0 |         6104 | o       | h          |        16542
 16542 | aoro           |       16438 | 333860888 | a     |     0 |         2200 | r       | a          |
       |                |       16442 |         0 | d     |     1 |              |         |            |
       |                |       16443 |         0 | d     |     1 |              |         |            |
       |                |       16444 |         0 | d     |     1 |              |         |            |
       |                |       16445 |         0 | d     |     1 |              |         |            |
 16544 | pg_aoseg_16542 |       16439 |     32768 | a     |     1 |         6104 | o       | h          |        16542
 16542 | aoro           |       16438 | 333829336 | a     |     1 |         2200 | r       | a          |
(23 rows)

Update the metrics and check the changes (to reduce the amount of output, filter by the public schema, relnamespace = 2200):

postgres=# SELECT public.collect_track();
 collect_track
---------------

(1 row)

postgres=# SELECT * FROM table_sizes WHERE relnamespace = 2200;
 toprelid |         relname         |    relkind     |         relstorage         | relnamespace | total_size
----------+-------------------------+----------------+----------------------------+--------------+------------
    16542 | aoro                    | ordinary table | append only storage        |         2200 | 1001824688
(1 row)

As can be seen from the updated snapshot, the size of the aoro table has changed, and the aoco table has disappeared from the selection because it was dropped.
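The way such track rows are folded into per-table totals can be sketched in a few lines. The helper below is a hypothetical illustration, assuming (as the output above suggests) that state 'a' marks a file that exists with its current size and 'd' marks a dropped file:

```python
from collections import defaultdict

def fold_track(snapshot, track):
    """Apply track rows to a snapshot dict keyed by (relfilenode, segid).

    Illustrative only: 'd' removes the file's entry, 'a' stores its
    current size. Real collect_track() does this in SQL.
    """
    for relfilenode, segid, size, state in track:
        if state == "d":
            snapshot.pop((relfilenode, segid), None)
        else:  # 'a' — the file is present; record its current size
            snapshot[(relfilenode, segid)] = size
    return snapshot

def total_sizes(snapshot):
    # Sum per-segment sizes into a per-file total, like the total_size column.
    totals = defaultdict(int)
    for (relfilenode, _segid), size in snapshot.items():
        totals[relfilenode] += size
    return dict(totals)

snap = {(16438, 0): 65536, (16441, 0): 65536}   # aoro and aoco on one segment
track = [(16441, 0, 0, "d"),                    # aoco dropped
         (16438, 0, 333860888, "a")]            # aoro grew
print(total_sizes(fold_track(snap, track)))     # → {16438: 333860888}
```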

To request the track and update the metrics periodically, I suggest using the pg_cron extension:

postgres=# CREATE EXTENSION IF NOT EXISTS pg_cron;
postgres=# SELECT cron.schedule('vacuum_track', '0 10 * * *', 'VACUUM adb_tables_track_snapshot');
postgres=# SELECT cron.schedule('collect_track', '30 seconds', 'SELECT collect_track()');

postgres=# SELECT * FROM cron.job_run_details ORDER BY start_time DESC LIMIT 5;
 jobid | runid | job_pid | database | username |        command         |  status   | return_message |          start_time           |           end_time
-------+-------+---------+----------+----------+------------------------+-----------+----------------+-------------------------------+-------------------------------
     2 |     4 |  185730 | postgres | andrey   | SELECT collect_track() | succeeded | 1 row          | 2025-01-24 18:21:17.044724+03 | 2025-01-24 18:21:17.096463+03
     2 |     3 |  185688 | postgres | andrey   | SELECT collect_track() | succeeded | 1 row          | 2025-01-24 18:20:47.044481+03 | 2025-01-24 18:20:47.103201+03
     2 |     2 |  185660 | postgres | andrey   | SELECT collect_track() | succeeded | 1 row          | 2025-01-24 18:20:17.044327+03 | 2025-01-24 18:20:17.097553+03
     2 |     1 |  185617 | postgres | andrey   | SELECT collect_track() | succeeded | 1 row          | 2025-01-24 18:19:47.043519+03 | 2025-01-24 18:19:47.102101+03

Draw Grafana dashboards

Next, I configured a PostgreSQL data source for the metrics and created basic Grafana dashboards.

The first group gives per-table details in the form of a single list ordered by size.

Dashboards with lists of top 10 and top 200 tables by size

The second group, with analytics on table types (relkind), storage types (relstorage), schema distribution (relnamespace), and data distribution across segments, is extremely useful for tracking data skew.

Dashboards with analytics by type and distribution by segment

To demonstrate the data skew across segments, create a table and insert the data as follows:

andrey=# CREATE TABLE aoro_part (id INT, part INT) WITH (appendoptimized=true, orientation=row) DISTRIBUTED BY (part);
CREATE TABLE

andrey=# INSERT INTO aoro_part SELECT generate_series(1,10000), 1;
INSERT 0 10000
andrey=# INSERT INTO aoro_part SELECT generate_series(1,9000000), 2;
INSERT 0 9000000

After updating the track, you can see that the aoro_part table has taken up significantly more space.

Tracking changes after insertion into the aoro_part table

You can also see a data skew on the first segment (seg0).

Data skew between aoro_part table segments

When the data is processed in this way, our solution can serve as an alternative to the gp_skew_coefficients and possibly gp_skew_idle_fractions views.
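For reference, a skew coefficient in the spirit of gp_skew_coefficients can be derived directly from the per-segment sizes in the snapshot. The function below uses the coefficient of variation (standard deviation over mean, in percent), a common way to express skew — the exact formula used by gp_toolkit may differ:

```python
from statistics import mean, pstdev

def skew_coefficient(seg_sizes):
    """Coefficient of variation of per-segment table sizes, in percent.
    0 means a perfectly even distribution; large values mean heavy skew.
    An illustration of the idea behind gp_skew_coefficients, not its formula."""
    avg = mean(seg_sizes)
    if avg == 0:
        return 0.0
    return pstdev(seg_sizes) / avg * 100

# The aoro per-segment sizes from the track above: nearly even,
# so the coefficient is close to zero.
print(skew_coefficient([333860888, 333829336, 333872320]))

# An aoro_part-like distribution with one hot segment: large coefficient.
print(skew_coefficient([1000, 1000, 90000]))
```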
