Work with arenadata_toolkit

Daria Barysheva

Contents

Schema contents
- Tables
- Views
Cluster management scripts
- Overview
- Example of a custom function for VACUUM

When you install an ADB cluster, the arenadata_toolkit schema is automatically created in the default database. This schema is used to collect information on cluster operations. The schema includes several tables and views, which allow users to get results of the VACUUM/ANALYZE commands that are launched on a schedule, view information about data files, skew coefficients, etc. Additionally, ADB provides a set of scripts that interact with arenadata_toolkit and automatically perform various cluster management operations (according to the predefined schedule).

NOTE

In custom databases that are added after ADB installation, the arenadata_toolkit schema is created automatically when the operation.py script is first run.

Schema contents

Tables

The tables included in the arenadata_toolkit schema are described below.

daily_operation

The arenadata_toolkit.daily_operation table contains information about the VACUUM and ANALYZE operations that are automatically applied to database tables on a schedule. The table contents are updated daily when the vacuum, analyze, vacuum_analyze_pg_catalog scripts are run.

Column Description

schema_name

A schema name

table_name

A table name

action

An operation name. Possible values:

ANALYZE
VACUUM

status

An operation status. Possible values:

SCS — success.
ERROR — error.

time

An operation duration (in seconds)

processed_dttm

Operation date and time

db_files_current

The arenadata_toolkit.db_files_current table contains information about data files on all cluster segments with links to tables, indexes, and other database objects (if possible). The table contents are actual at the moment of the last collect_table_stats script launch. The table is overwritten every time the script is run.

Column Description

oid

A table object identifier (OID)

table_name

A table name

table_schema

A schema name

type

A table type. See pg_class.relkind

storage

A table storage mode. See pg_class.relstorage

table_parent_table

A parent table name (NULL if missing). Available for partitions

table_parent_schema

A parent table schema (NULL if missing). Available for partitions

table_database

A database name

table_tablespace

A tablespace name

content

A content identifier of the segment instance where the data file is located. See gp_segment_configuration.content

segment_preferred_role

The role that a segment was originally assigned at initialization time. See gp_segment_configuration.preferred_role

hostname

A name of the host where the data file is located

address

The hostname used to access a particular segment instance on a segment host. This value may be the same as hostname on systems that do not have per-interface hostnames configured

file

A full path to the data file

modifiedtime

The last file modification time that is returned by the system command stat. The value of this timestamp is updated during the file modifications, e.g. when the following commands are run: mknod, truncate, utime, and write (if more than zero bytes are written). The timestamp value does not change when the inode information is being modified (file owner, group, number of hard links, access mode, etc.)

file_size

A data file size (in bytes)

tablespace_location

The file system location of the tablespace

db_files_history

The arenadata_toolkit.db_files_history table contains the history of data file changes on all cluster segments with links to tables, indexes, and other database objects (if possible). The table may be useful for tracking the database size changes. The table structure is identical to arenadata_toolkit.db_files_current, except for the additional column collecttime described below.

The arenadata_toolkit.db_files_history contents are updated daily when the collect_table_stats script is run (by copying actual data from arenadata_toolkit.db_files_current).

Column

Description

collecttime

Data collection time

operation_exclude

The arenadata_toolkit.operation_exclude table contains information on database schemas that should be skipped by the VACUUM and ANALYZE commands when the vacuum and analyze scripts are run.

Column

Description

schema_name

A schema name

Views

The arenadata_toolkit schema contains several views. The most useful of them are listed below. Other views are for internal usage only.

adb_skew_coefficients

The arenadata_toolkit.adb_skew_coefficients view allows you to indicate data distribution skew by calculating a coefficient of variation (CV). Unlike the Greenplum gp_skew_coefficients view, adb_skew_coefficients takes into account the physical size of table data files rather than a number of table rows, which is a faster and more accurate approach.

Column Description

skcoid

A table object identifier (OID)

skcnamespace

A schema name

skcrelname

A table name

skccoeff

A coefficient of variation (CV) for data distribution. The value is calculated as the standard deviation of <file_size> divided by the average of <file_size> among all segments, where <file_size> is a size of the current table files per single segment. skccoeff values indicate data distribution between cluster segments and query processing skew. The lower the coefficient, the better. If the coefficient value is high, it is recommended to evaluate distribution policies (see Recommendations on avoiding data skew). For example, if at least one segment has more than 25% table data in comparison with other cluster segments, the skccoeff value will be more than 10. However, the critical coefficient value should be determined for each environment individually

adb_hba_file_rules_view

The arenadata_toolkit.adb_hba_file_rules_view provides a summary of the contents of the pg_hba.conf file, which is used to configure client authentication. A row appears in this view for each non-empty, non-comment line in the pg_hba.conf file, with annotations indicating whether the rule could be applied successfully.

This view can be helpful for checking whether planned changes in the authentication configuration file will work, or for diagnosing a previous failure.

Column Description

line_number

A line number of the rule in pg_hba.conf

type

A connection type. Possible values:

host
local
hostssl
hostnossl

database

A list of database names to which the rule applies

user_name

A list of user and group names to which the rule applies

address

A host name or IP address, or one of all, samehost, samenet, or null for local connections

netmask

An IP address mask, or null if not applicable

auth_method

An authentication method

options

Options specified for the authentication method, if any

error

If not null, an error message indicating why the rule could not be processed

__db_files_current

The arenadata_toolkit.__db_files_current view provides the same information that is regularly added to the table db_files_current. This view allows you to get actual data without waiting for the next start of the corresponding script. The view columns are identical to the corresponding table columns except for one described below.

Column Description

changed_dttm

The last file change time that is returned by the system command stat. The timestamp value changes when the inode information is being modified (file owner, group, number of hard links, access mode, etc.)

__db_files_current_unmapped

The arenadata_toolkit.__db_files_current_unmapped view displays information about data files that are currently not linked to tables, indexes, and other database objects. For these data files, such fields as oid, table_name, table_schema are set to NULL when the db_files_current table is automatically filled in.

Note that the __db_files_current_unmapped view can show files that have no links with database objects temporarily. For example, the view can return data files of tables that have been created within a transaction, but not committed yet. Also, you can see files of deleted tables for which the CHECKPOINT command has not yet been executed (after which such files will be deleted).

For information on the view columns, see corresponding descriptions in the db_files_current table.

Cluster management scripts

Overview

ADB maintenance scripts are displayed on the Primary configuration tab of the ADB service configuration page. To view all scripts, expand the Crontab → Crontab maintenance scripts node in the tree of configuration settings.

ADB cluster management scripts

CAUTION

If you need to configure custom scripts, use the system crontab (for example, /etc/cron.d/custom_gpadmin) under the user other than gpadmin. Do not use the Crontab → Crontab maintenance scripts section in ADCM for this purpose. Otherwise, after switching off the Crontab toggle and reconfiguring ADB, the custom scripts added via ADCM will be deleted.

The cron scheduler is used to run scripts automatically. Scripts are launched on behalf of the system user (by default, gpadmin). In the ADCM interface, each script is described as a separate string including:

Start time — a cron expression that defines how often a script should be run.
Path to the script — a full path to the script. Currently, two main scripts are used:
- run_sql_to_gpssh.sh — sends the result of the selected SQL query as a bash command via gpssh. Requires a path to the SQL script as an argument.
- operation.py — the Python script that performs various operations at the ADB cluster level. The operation type is defined by the first argument (e.g. vacuum, analyze, etc.).
Each of these scripts can in turn refer to other SQL queries, scripts, and functions of the arenadata_toolkit schema. All bash, Python, and SQL scripts are stored in the /home/gpadmin/<Arenadata_configs_directory_name>/ directory, where <Arenadata_configs_directory_name> is a value of the Advanced → Arenadata configs directory name ADB configuration parameter (by default, arenadata_configs).
Arguments — input parameters that are used to run the script (if available). The arguments available for the operation.py script are listed below.

operation.py arguments

Argument Description Default value

The first positional parameter (without alias)

An operation name:

vacuum_system_db
remove_orphaned_temp_schemas
vacuum
analyze
collect_table_stats
vacuum_analyze_pg_catalog

For information on all operations, see Purpose of ADB cluster management scripts

—

--loglevel

A logging level. Possible values:

2 — INFO;
3 — WARNING;
4 — ERROR;
5 — CRITICAL.

For more information on logging levels, see Python documentation

--processes

A number of worker processes. Possible values: 1 — 10

--timelimit

A time period after which the script is stopped (in minutes)

—

--excludedb

A list of databases that should be excluded from the script process. This list extends databases that are skipped by operation.py by default:

template0
template1
postgres
gpperfmon

—

--vacuum-order

The order in which the VACUUM command should be applied to tables from custom databases. You can use this parameter only when running the vacuum operation. Possible values:

newest-first — the tables for which there is no information on the VACUUM command in pg_catalog.pg_stat_last_operation are processed first. The arenadata_toolkit.adb_vacuum_strategy_newest_first function is used to sort tables.
newest-last — the tables for which there is no information on the VACUUM command in pg_catalog.pg_stat_last_operation are processed last. The arenadata_toolkit.adb_vacuum_strategy_newest_last function is used to sort tables.
A custom function name in the <schema_name>.<function_name> format. The custom function signature should be identical to the two functions mentioned above. It should return a list of rows, each containing the schema name and table name. If the function is absent in some databases, a warning message is stored in the log and the VACUUM command is not applied to the corresponding databases. For more information, see Example of a custom function for VACUUM.

newest-first

Script logs are stored in the /home/gpadmin/<Arenadata_configs_directory_name>/operation_log/ directory (for operation.py) and the gzip_pg_log.log file (for gzip_pg_log.sql). You can configure the log retention period via the Crontab → Delete old maintenane script logs configuration parameter of the ADB service. The default parameter value is 30 days.

TIP

It is not recommended to change scripts in any way. However, if you need to update the script start time or add/modify script arguments, follow the steps below:

Make necessary changes in the Crontab → Crontab maintenance scripts section on the Primary configuration tab of the ADB service configuration page.
Click Save.
Apply the Reconfigure action to the ADB service.

Most scripts interact with the arenadata_toolkit schema. The purpose of all scripts is described below.

Purpose of ADB cluster management scripts
№	Script	Description	Start time HH:mm (UTC)
0	run_sql_to_gpssh.sh → gzip_pg_log.sql	For each cluster segment, archives CSV logs that are stored in <data_directory>/pg_log and have not changed for more than 3 days (where `<data_directory>` is a segment data directory)	01:00
1	operation.py → vacuum_system_db	Applies the VACUUM FREEZE command to the `postgres` and `template1` system databases (by running `vaccumdb -F`)	02:00
2	operation.py → remove_orphaned_temp_schemas	In all custom databases, drops temporary schemas that are no longer in use and whose names start with `pg_temp_` and `pg_toast_temp_`	03:00
3	operation.py → vacuum	Runs the VACUUM command in all custom databases, except for schemas defined in the arenadata_toolkit.operation_exclude table. Writes operation results to the arenadata_toolkit.daily_operation table.	06:00
4	operation.py → analyze	Runs the ANALYZE command in all custom databases, except for schemas defined in the arenadata_toolkit.operation_exclude table. Writes operation results to the arenadata_toolkit.daily_operation table.	11:00
5	operation.py → collect_table_stats	In all custom databases: Collects statistics for data files on cluster segments (size, location, etc.) and tries to define file links with tables, indexes, and other database objects. To do this, the script runs `arenadata_toolkit` views, which get information from the `gp_segment_configuration`, `pg_class`, `pg_namespace`, `pg_tablespace`, and `pg_database` tables. Saves the received information in the arenadata_toolkit.db_files_current table after clearing its contents first. Some table columns may remain NULL if it was impossible to find corresponding database objects for data files in the previous step (e.g. `oid`, `table_name`, etc.). Adds new rows to the arenadata_toolkit.db_files_history table (by copying data from arenadata_toolkit.db_files_current).	21:00
6	operation.py → vacuum_analyze_pg_catalog	Runs the VACUUM and ANALYZE commands in the `pg_catalog` schema of all custom databases. Writes operation results to the arenadata_toolkit.daily_operation table.	00:04 08:04 16:04

Example of a custom function for VACUUM

Below you can find an example of using a custom function that determines the order in which user tables should be processed when performing the VACUUM operation (see --vacuum-order in the operation.py arguments table):

Create a function that meets the following conditions:

The function should have one input parameter of the text type. When the operation.py script is run, VACUUM is always passed as the parameter value. There is no need to use the parameter in the function body.
The function should return a table with the following columns: table_schema (schema in DB) and table_name (table in DB). This is a sorted list of database tables, to which the VACUUM operation should be applied.

For example, the following function sorts tables by schema names and then by table names in the alphabetical order.

CREATE OR REPLACE FUNCTION public.adb_vacuum_strategy_custom(actionname text)
	RETURNS TABLE (table_schema name, table_name name)
	LANGUAGE sql
	STABLE
AS $$
	SELECT nspname, relname
	FROM pg_catalog.pg_class c
	JOIN pg_catalog.pg_namespace n ON relnamespace = n.oid
	LEFT JOIN pg_catalog.pg_partition_rule ON parchildrelid = c.oid
	WHERE relkind = 'r' AND relstorage != 'x' AND parchildrelid IS NULL
	AND nspname NOT IN (SELECT schema_name FROM arenadata_toolkit.operation_exclude)
	ORDER BY nspname, relname;

$$
EXECUTE ON MASTER;

CAUTION

The function should be added to all custom databases to which the VACUUM operation will be applied, i.e. all databases except for system ones (template0, template1, postgres, gpperfmon) and those that can be optionally specified in the --excludedb argument of the operation.py script.

Ensure the function works:

SELECT * FROM public.adb_vacuum_strategy_custom('VACUUM');

The result:

   table_schema    |    table_name
-------------------+-------------------
 arenadata_toolkit | daily_operation
 arenadata_toolkit | db_files_current
 arenadata_toolkit | db_files_history
 arenadata_toolkit | operation_exclude
 diskquota         | quota_config
 diskquota         | state
 diskquota         | table_size
 diskquota         | target
 kadb              | offsets
 madlib            | migrationhistory
 public            | spatial_ref_sys
 public            | test
 public            | test2
 public            | test3
(14 rows)

In ADCM, open the Primary configuration tab on the configuration page of the ADB service.
Expand the Crontab → Crontab maintenance scripts node in the tree of configuration settings and click a description of the script that performs vacuum.

Proceed to editing the script arguments

In the window that opens, add the --vacuum-order <schema_name>.<function_name> argument to the end of the current script description. For example: --vacuum-order public.adb_vacuum_strategy_custom. Click Apply.

IMPORTANT

The custom function name should be defined in the following format: <schema_name>.<function_name>, where <schema_name> — a name of the schema where the function was created; <function_name> — the function name. Note that passing arguments to the function is not supported.

Add the --vacuum-order argument

Click Save to save the configuration changes. Then, apply the Reconfigure action to the ADB service.

Save changes in the ADB configuration

After the operation.py script is run according to the schedule, view the contents of the log created in the /home/gpadmin/<Arenadata_configs_directory_name>/operation_log/ directory (by default, /home/gpadmin/arenadata_configs/operation_log/). Each log name contains date and time of the script launch — for example, 2024-01-24_10:05:01.172592.log.

Ensure that the table processing order corresponds to the one defined in the custom function. To do this, pay attention to the log entries marked as DO, which mean that the VACUUM command is started for the corresponding tables.

Note that the order of events with the same timestamp in the log may differ from the actual one. In this case, compare identifiers of worker processes PoolWorker. The smaller the identifier, the earlier the start of processing.

Log fragment for the VACUUM operation

2024-01-24 10:05:02,281: INFO: PoolWorker-1: pid=10232: DO: 'vacuum "arenadata_toolkit"."daily_operation";' for 'adb'
2024-01-24 10:05:02,281: INFO: PoolWorker-2: pid=10233: DO: 'vacuum "arenadata_toolkit"."db_files_current";' for 'adb'
2024-01-24 10:05:02,281: INFO: PoolWorker-3: pid=10234: DO: 'vacuum "arenadata_toolkit"."db_files_history";' for 'adb'
2024-01-24 10:05:02,281: INFO: PoolWorker-5: pid=10239: DO: 'vacuum "diskquota"."quota_config";' for 'adb'
2024-01-24 10:05:02,282: INFO: PoolWorker-7: pid=10243: DO: 'vacuum "diskquota"."table_size";' for 'adb'
2024-01-24 10:05:02,281: INFO: PoolWorker-4: pid=10236: DO: 'vacuum "arenadata_toolkit"."operation_exclude";' for 'adb'
2024-01-24 10:05:02,282: INFO: PoolWorker-6: pid=10241: DO: 'vacuum "diskquota"."state";' for 'adb'
2024-01-24 10:05:02,283: INFO: PoolWorker-8: pid=10245: DO: 'vacuum "diskquota"."target";' for 'adb'
2024-01-24 10:05:02,284: INFO: PoolWorker-9: pid=10247: DO: 'vacuum "kadb"."offsets";' for 'adb'
2024-01-24 10:05:02,285: INFO: PoolWorker-10: pid=10249: DO: 'vacuum "madlib"."migrationhistory";' for 'adb'
2024-01-24 10:05:02,341: INFO: PoolWorker-5: pid=10239: DONE: 'vacuum "diskquota"."quota_config";' for 'adb'
2024-01-24 10:05:02,347: INFO: PoolWorker-10: pid=10249: DONE: 'vacuum "madlib"."migrationhistory";' for 'adb'
2024-01-24 10:05:02,347: INFO: PoolWorker-2: pid=10233: DONE: 'vacuum "arenadata_toolkit"."db_files_current";' for 'adb'
2024-01-24 10:05:02,350: INFO: PoolWorker-8: pid=10245: DONE: 'vacuum "diskquota"."target";' for 'adb'
2024-01-24 10:05:02,351: INFO: PoolWorker-6: pid=10241: DONE: 'vacuum "diskquota"."state";' for 'adb'
2024-01-24 10:05:02,352: INFO: PoolWorker-7: pid=10243: DONE: 'vacuum "diskquota"."table_size";' for 'adb'
2024-01-24 10:05:02,353: INFO: PoolWorker-9: pid=10247: DONE: 'vacuum "kadb"."offsets";' for 'adb'
2024-01-24 10:05:02,513: INFO: PoolWorker-5: pid=10239: DO: 'vacuum "public"."spatial_ref_sys";' for 'adb'
2024-01-24 10:05:02,546: INFO: PoolWorker-5: pid=10239: DONE: 'vacuum "public"."spatial_ref_sys";' for 'adb'
2024-01-24 10:05:02,607: INFO: PoolWorker-7: pid=10243: DO: 'vacuum "public"."test";' for 'adb'
2024-01-24 10:05:02,633: INFO: PoolWorker-4: pid=10236: DONE: 'vacuum "arenadata_toolkit"."operation_exclude";' for 'adb'
2024-01-24 10:05:02,647: INFO: PoolWorker-7: pid=10243: DONE: 'vacuum "public"."test";' for 'adb'
2024-01-24 10:05:02,692: INFO: PoolWorker-9: pid=10247: DO: 'vacuum "public"."test2";' for 'adb'
2024-01-24 10:05:02,692: INFO: PoolWorker-10: pid=10249: DO: 'vacuum "public"."test3";' for 'adb'
2024-01-24 10:05:02,967: INFO: PoolWorker-9: pid=10247: DONE: 'vacuum "public"."test2";' for 'adb'
2024-01-24 10:05:02,972: INFO: PoolWorker-10: pid=10249: DONE: 'vacuum "public"."test3";' for 'adb'
2024-01-24 10:05:03,054: INFO: PoolWorker-3: pid=10234: DONE: 'vacuum "arenadata_toolkit"."db_files_history";' for 'adb'
2024-01-24 10:05:03,202: INFO: PoolWorker-1: pid=10232: DONE: 'vacuum "arenadata_toolkit"."daily_operation";' for 'adb'

IMPORTANT

If there is no selected function in some databases, the following message is written to the log: This function <schema_name>.<function_name> does not exist in the database.
If the --vacuum-order argument has an invalid value (other than newest-first/newest-last and not containing a point), the following message is written to the log: --vacuum-order parameter is only valid for vacuum action or invalid function name.

In both cases, the VACUUM operation is not applied to the databases for which errors have been detected.

Found a mistake? Seleсt text and press Ctrl+Enter to report it