Work with arenadata_toolkit

When you install an ADB cluster, the arenadata_toolkit schema is automatically created in the default database. This schema is used to collect information on cluster operations. The schema includes several tables, which allow users to get results of the VACUUM/ANALYZE commands that are launched on a schedule, view information about data files, etc. Additionally, ADB provides a set of scripts that interact with arenadata_toolkit and automatically perform various cluster management operations (according to the predefined schedule).

NOTE

In custom databases that are added after ADB installation, the arenadata_toolkit schema is created automatically when the operation.py script is first run.

Schema tables

The tables included in the arenadata_toolkit schema are described below.

daily_operation

 

The arenadata_toolkit.daily_operation monthly-partitioned table contains information about the VACUUM and ANALYZE operations that are automatically applied to database tables on a schedule. The table contents are updated daily when the vacuum, analyze, vacuum_analyze_pg_catalog scripts are run.

Column Description

schema_name

A schema name

table_name

A table name

action

An operation name. Possible values:

  • ANALYZE

  • VACUUM

status

An operation status. Possible values:

  • SCS — success.

  • ERROR — error.

time

An operation duration (in seconds)

processed_dttm

Operation date and time

db_files_current

 

The arenadata_toolkit.db_files_current table contains information about data files on all cluster segments with links to tables, indexes, and other database objects (if possible). The table contents reflect the state at the time of the last collect_table_stats script launch. The table is overwritten every time the script is run.

Column Description

oid

A table object identifier (OID)

table_name

A table name

table_schema

A schema name

type

A table type. See pg_class.relkind

relamname

A table access method. See pg_am.amname

table_parent_table

A parent table name (NULL if missing). Available for partitions

table_parent_schema

A parent table schema (NULL if missing). Available for partitions

table_database

A database name

table_tablespace

A tablespace name

content

A content identifier of the segment instance where the data file is located. See gp_segment_configuration.content

segment_preferred_role

The role that a segment was originally assigned at initialization time. See gp_segment_configuration.preferred_role

hostname

A name of the host where the data file is located

address

The hostname used to access a particular segment instance on a segment host. This value may be the same as hostname on systems that do not have per-interface hostnames configured

file

A full path to the data file

modifiedtime

The last file modification time that is returned by the system command stat. The value of this timestamp is updated during the file modifications, for example when the following commands are run: mknod, truncate, utime, and write (if more than zero bytes are written). The timestamp value does not change when the inode information is being modified (file owner, group, number of hard links, access mode, etc.)

file_size

A data file size (in bytes)

tablespace_location

The file system location of the tablespace

db_files_history

 

The arenadata_toolkit.db_files_history monthly-partitioned table contains the history of data file changes on all cluster segments with links to tables, indexes, and other database objects (if possible). The table may be useful for tracking the database size changes. The table structure is identical to arenadata_toolkit.db_files_current, except for the additional column collecttime described below.

The arenadata_toolkit.db_files_history contents are updated daily when the collect_table_stats script is run (by copying actual data from arenadata_toolkit.db_files_current).

Column Description

collecttime

Data collection time

operation_exclude

 

The arenadata_toolkit.operation_exclude table contains information on database schemas that should be skipped by the VACUUM and ANALYZE commands when the vacuum and analyze scripts are run.

Column Description

schema_name

A schema name

backup

The flag that controls whether the row should be included in backups when using the gpbackup, gprestore, and pg_dump utilities. The default value is true, meaning the row will be restored. Rows that are automatically added by the arenadata_toolkit extension upon initial table creation (gp_toolkit, information_schema, pg_aoseg, pg_bitmapindex, pg_catalog, and pg_toast) have a default value of false to prevent them from being re-added during restoration

Cluster management scripts

Overview

ADB maintenance scripts are displayed on the Primary configuration tab of the ADB service configuration page. To view all scripts, expand the Crontab → Crontab maintenance scripts node in the tree of configuration settings.

ADB cluster management scripts
ADB cluster management scripts
CAUTION

If you need to configure custom scripts, use the system crontab (for example, /etc/cron.d/custom_gpadmin) under the user other than gpadmin. Do not use the Crontab → Crontab maintenance scripts section in ADCM for this purpose. Otherwise, after switching off the Crontab toggle and reconfiguring ADB, the custom scripts added via ADCM will be deleted.

The cron scheduler is used to run scripts automatically. Scripts are launched on behalf of the system user (by default, gpadmin). In the ADCM interface, each script is described as a separate string including:

  • Start time — a cron expression that defines how often a script should be run.

  • Path to the script — a full path to the script. Currently, two main scripts are used:

    • run_sql_to_gpssh.sh — sends the result of the selected SQL query as a bash command via gpssh. Requires a path to the SQL script as an argument.

    • operation.py — the Python script that performs various operations at the ADB cluster level. The operation type is defined by the first argument (for example, vacuum or analyze).

    Each of these scripts can in turn refer to other SQL queries, scripts, and functions of the arenadata_toolkit schema. All bash, Python, and SQL scripts are stored in the /home/gpadmin/<Arenadata_configs_directory_name>/ directory, where <Arenadata_configs_directory_name> is a value of the Advanced → Arenadata configs directory name ADB configuration parameter (by default, arenadata_configs).

  • Arguments — input parameters that are used to run the script (if available). The arguments available for the operation.py script are listed below.

operation.py arguments
Argument Description Default value

The first positional parameter (without alias)

An operation name:

  • vacuum_system_db

  • remove_orphaned_temp_schemas

  • vacuum

  • analyze

  • collect_table_stats

  • vacuum_analyze_pg_catalog

For information on all operations, see Purpose of ADB cluster management scripts

 — 

--loglevel

A logging level. Possible values:

  • 2 — INFO;

  • 3 — WARNING;

  • 4 — ERROR;

  • 5 — CRITICAL.

For more information on logging levels, see Python documentation

2

--processes

A number of worker processes. Possible values: 1 — 10

10

--timelimit

The limit on the script execution time (in minutes). If the limit is reached, the operation will finish processing the current object, and then the script will stop:

  • For vacuum operations, it will stop after processing the current table.

  • For analyze operations — after processing the entire schema.

  • For vacuum_system_db operations — after processing the current system database (postgres or template1).

  • For collect_table_stats, remove_orphaned_temp_schemas, and vacuum_analyze_pg_catalog operations — after processing a database.

When multiple workers are used (via --processes), multiple objects can be held at the same time. The script will wait for each of them to be processed, and the workers won’t start processing a new object after the limit is reached

 — 

--excludedb

A list of databases that should be excluded from the script process. This list extends databases that are skipped by operation.py by default:

  • template0

  • template1

  • postgres

 — 

--vacuum-order

The order in which the VACUUM command should be applied to tables from custom databases. You can use this parameter only when running the vacuum operation. Possible values:

  • newest-first — the tables for which there is no information on the VACUUM command in pg_catalog.pg_stat_last_operation are processed first. The arenadata_toolkit.adb_vacuum_strategy_newest_first function is used to sort tables.

  • newest-last —  the tables for which there is no information on the VACUUM command in pg_catalog.pg_stat_last_operation are processed last. The arenadata_toolkit.adb_vacuum_strategy_newest_last function is used to sort tables.

  • A custom function name in the <schema_name>.<function_name> format. The custom function signature should be identical to the two functions mentioned above. It should return a list of rows, each containing the schema name and table name. If the function is absent in some databases, a warning message is stored in the log and the VACUUM command is not applied to the corresponding databases. For more information, see Example of a custom function for VACUUM.

newest-first

Script logs are stored in the /home/gpadmin/<Arenadata_configs_directory_name>/operation_log/ directory (for operation.py) and the gzip_pg_log.log file (for gzip_pg_log.sql). You can configure the log retention period via the Crontab → Delete old maintenance script logs configuration parameter of the ADB service. The default parameter value is 30 days.

TIP

It is not recommended to change scripts in any way. However, if you need to update the script start time or add/modify script arguments, follow the steps below:

  1. Make necessary changes in the Crontab → Crontab maintenance scripts section on the Primary configuration tab of the ADB service configuration page.

  2. Click Save.

  3. Apply the Reconfigure action to the ADB service.

Most scripts interact with the arenadata_toolkit schema. The purpose of all scripts is described below.

Purpose of ADB cluster management scripts
# Script Description Start time HH:mm (UTC)

0

run_sql_to_gpssh.sh → gzip_pg_log.sql

For each cluster segment, archives CSV logs that are stored in <data_directory>/pg_log and have not changed for more than 3 days (where <data_directory> is a segment data directory)

01:00

1

operation.py → vacuum_system_db

Applies the VACUUM FREEZE command to the postgres and template1 system databases (by running vacuumdb -F)

02:00

2

operation.py → remove_orphaned_temp_schemas

In all custom databases, drops temporary schemas that are no longer in use and whose names start with pg_temp_ and pg_toast_temp_

03:00

3

operation.py → vacuum

06:00

4

operation.py → analyze

11:00

5

operation.py → collect_table_stats

In all custom databases:

  1. Collects statistics for data files on cluster segments (size, location, etc.) and tries to define file links with tables, indexes, and other database objects. To do this, the script runs a function, which gets information from the gp_toolkit.gp_db_files_current view.

  2. Saves the received information in the arenadata_toolkit.db_files_current table after clearing its contents first. Some table columns may remain NULL if it was impossible to find corresponding database objects for data files in the previous step (for example, oid and table_name).

  3. Adds new rows to the arenadata_toolkit.db_files_history table (by copying data from arenadata_toolkit.db_files_current).

21:00

6

operation.py → vacuum_analyze_pg_catalog

00:04

08:04

16:04

7

gzip_gpadmin_log.sh

Performs automatic compression of log files that are located in /home/gpadmin/gpAdminLogs/ and are older than 3 days

01:00

Example of a custom function for VACUUM

Below you can find an example of using a custom function that determines the order in which user tables should be processed when performing the VACUUM operation (see --vacuum-order in the operation.py arguments table):

  1. Create a function that meets the following conditions:

    • The function should have one input parameter of the text type. When the operation.py script is run, VACUUM is always passed as the parameter value. There is no need to use the parameter in the function body.

    • The function should return a table with the following columns: table_schema (schema in DB) and table_name (table in DB). This is a sorted list of database tables, to which the VACUUM operation should be applied.

    Suppose that you want certain ETL data in the staging environment to be processed by the VACUUM operation only after all other data has been processed. The following function moves tables whose names start with etl_staging to the end of the list and sorts tables by schema names and then by table names in the alphabetical order.

    CREATE OR REPLACE FUNCTION public.adb_vacuum_strategy_etl_staging(actionname text)
        RETURNS TABLE (table_schema name, table_name name)
        LANGUAGE sql
        STABLE
    AS $$
        SELECT nspname, relname
        FROM pg_catalog.pg_class c
        JOIN pg_catalog.pg_namespace n ON relnamespace = n.oid
        WHERE relkind = 'r'
          AND nspname NOT IN (SELECT schema_name FROM arenadata_toolkit.operation_exclude)
        ORDER BY
            CASE WHEN relname LIKE 'etl_staging%' THEN 1 ELSE 0 END,
            nspname,
            relname;
    $$
    EXECUTE ON MASTER;
    CAUTION

    The function should be added to all custom databases to which the VACUUM operation will be applied, i.e. all databases except for system ones (template0, template1, postgres) and those that can be optionally specified in the --excludedb argument of the operation.py script.

  2. Ensure the function works:

    SELECT * FROM public.adb_vacuum_strategy_etl_staging('VACUUM');

    In the resulting output, tables related to ETL staging (etl_staging_*) are placed at the very bottom:

       table_schema    |    table_name
    -------------------+--------------------------------
     arenadata_toolkit | daily_operation_1_prt_other
     arenadata_toolkit | daily_operation_1_prt_p202510
     arenadata_toolkit | daily_operation_1_prt_p202511
     arenadata_toolkit | db_files_current
     arenadata_toolkit | db_files_history_1_prt_other
     arenadata_toolkit | db_files_history_1_prt_p202510
     arenadata_toolkit | db_files_history_1_prt_p202511
     arenadata_toolkit | operation_exclude
     kadb              | offsets
     public            | spatial_ref_sys
     public            | etl_staging_main
     public            | etl_staging_orders
     public            | etl_staging_products
     public            | etl_staging_users
    (14 rows)
  3. In ADCM, open the Primary configuration tab on the configuration page of the ADB service.

  4. Expand the Crontab → Crontab maintenance scripts node in the tree of configuration settings and click a description of the script that performs vacuum.

    Proceed to editing the script arguments
    Proceed to editing the script arguments
  5. In the window that opens, add the --vacuum-order <schema_name>.<function_name> argument to the end of the current script description. For example: --vacuum-order public.adb_vacuum_strategy_custom. Click Apply.

    IMPORTANT

    The custom function name should be defined in the following format: <schema_name>.<function_name>, where <schema_name> — a name of the schema where the function was created; <function_name> — the function name. Note that passing arguments to the function is not supported.

    Add the --vacuum-order argument
    Add the --vacuum-order argument
  6. Click Save to save the configuration changes. Then, apply the Reconfigure action to the ADB service.

    Save changes in the ADB configuration
    Save changes in the ADB configuration
  7. After the operation.py script is run according to the schedule, view the contents of the log created in the /home/gpadmin/<Arenadata_configs_directory_name>/operation_log/ directory (by default, /home/gpadmin/arenadata_configs/operation_log/). Each log name contains date and time of the script launch — for example, 2025-11-10_10:05:01.172592.log.

    Ensure that the table processing order corresponds to the one defined in the custom function. To do this, pay attention to the log entries marked as DO, which mean that the VACUUM command is started for the corresponding tables.

    Note that the order of events with the same timestamp in the log may differ from the actual one. In this case, compare identifiers of worker processes ForkPoolWorker. The smaller the identifier, the earlier the start of processing.

    Log fragment for the VACUUM operation
    2025-11-10 08:25:02,665: INFO: ForkPoolWorker-3: pid=3437270: DO: 'vacuum "arenadata_toolkit"."daily_operation_1_prt_p202511";' for 'adb'
    2025-11-10 08:25:02,665: INFO: ForkPoolWorker-7: pid=3437279: DO: 'vacuum "arenadata_toolkit"."db_files_history_1_prt_p202511";' for 'adb'
    2025-11-10 08:25:02,665: INFO: ForkPoolWorker-2: pid=3437269: DO: 'vacuum "arenadata_toolkit"."daily_operation_1_prt_p202510";' for 'adb'
    2025-11-10 08:25:02,665: INFO: ForkPoolWorker-1: pid=3437268: DO: 'vacuum "arenadata_toolkit"."daily_operation_1_prt_other";' for 'adb'
    2025-11-10 08:25:02,665: INFO: ForkPoolWorker-4: pid=3437272: DO: 'vacuum "arenadata_toolkit"."db_files_current";' for 'adb'
    2025-11-10 08:25:02,665: INFO: ForkPoolWorker-5: pid=3437275: DO: 'vacuum "arenadata_toolkit"."db_files_history_1_prt_other";' for 'adb'
    2025-11-10 08:25:02,667: INFO: ForkPoolWorker-8: pid=3437281: DO: 'vacuum "arenadata_toolkit"."operation_exclude";' for 'adb'
    2025-11-10 08:25:02,669: INFO: ForkPoolWorker-9: pid=3437283: DO: 'vacuum "kadb"."offsets";' for 'adb'
    2025-11-10 08:25:02,665: INFO: ForkPoolWorker-6: pid=3437277: DO: 'vacuum "arenadata_toolkit"."db_files_history_1_prt_p202510";' for 'adb'
    2025-11-10 08:25:02,988: INFO: ForkPoolWorker-4: pid=3437272: DONE: 'vacuum "arenadata_toolkit"."db_files_current";' for 'adb'
    2025-11-10 08:25:02,990: INFO: ForkPoolWorker-9: pid=3437283: DONE: 'vacuum "kadb"."offsets";' for 'adb'
    2025-11-10 08:25:03,090: INFO: ForkPoolWorker-5: pid=3437275: DONE: 'vacuum "arenadata_toolkit"."db_files_history_1_prt_other";' for 'adb'
    2025-11-10 08:25:03,092: INFO: ForkPoolWorker-8: pid=3437281: DONE: 'vacuum "arenadata_toolkit"."operation_exclude";' for 'adb'
    2025-11-10 08:25:03,186: INFO: ForkPoolWorker-6: pid=3437277: DONE: 'vacuum "arenadata_toolkit"."db_files_history_1_prt_p202510";' for 'adb'
    2025-11-10 08:25:03,186: INFO: ForkPoolWorker-2: pid=3437269: DONE: 'vacuum "arenadata_toolkit"."daily_operation_1_prt_p202510";' for 'adb'
    2025-11-10 08:25:03,196: INFO: ForkPoolWorker-1: pid=3437268: DONE: 'vacuum "arenadata_toolkit"."daily_operation_1_prt_other";' for 'adb'
    2025-11-10 08:25:03,398: INFO: ForkPoolWorker-3: pid=3437270: DONE: 'vacuum "arenadata_toolkit"."daily_operation_1_prt_p202511";' for 'adb'
    2025-11-10 08:25:03,431: INFO: ForkPoolWorker-7: pid=3437279: DONE: 'vacuum "arenadata_toolkit"."db_files_history_1_prt_p202511";' for 'adb'
    2025-11-10 08:25:04,020: INFO: ForkPoolWorker-4: pid=3437272: DO: 'vacuum "public"."spatial_ref_sys";' for 'adb'
    2025-11-10 08:25:04,123: INFO: ForkPoolWorker-10: pid=3437285: DO: 'vacuum "public"."etl_staging_main";' for 'adb'
    2025-11-10 08:25:04,123: INFO: ForkPoolWorker-8: pid=3437281: DO: 'vacuum "public"."etl_staging_orders";' for 'adb'
    2025-11-10 08:25:04,124: INFO: ForkPoolWorker-2: pid=3437269: DO: 'vacuum "public"."etl_staging_products";' for 'adb'
    2025-11-10 08:25:04,129: INFO: ForkPoolWorker-5: pid=3437275: DO: 'vacuum "public"."etl_staging_users";' for 'adb'
    2025-11-10 08:25:04,326: INFO: ForkPoolWorker-4: pid=3437272: DONE: 'vacuum "public"."spatial_ref_sys";' for 'adb'
    2025-11-10 08:25:04,424: INFO: ForkPoolWorker-9: pid=3437283: DONE: 'vacuum "public"."etl_temp_processing";' for 'adb'
    2025-11-10 08:25:04,526: INFO: ForkPoolWorker-2: pid=3437269: DONE: 'vacuum "public"."etl_staging_products";' for 'adb'
    2025-11-10 08:25:04,526: INFO: ForkPoolWorker-5: pid=3437275: DONE: 'vacuum "public"."etl_staging_users";' for 'adb'
    2025-11-10 08:25:04,641: INFO: ForkPoolWorker-8: pid=3437281: DONE: 'vacuum "public"."etl_staging_orders";' for 'adb'
    2025-11-10 08:25:04,727: INFO: ForkPoolWorker-10: pid=3437285: DONE: 'vacuum "public"."etl_staging_main";' for 'adb'
    IMPORTANT
    • If there is no selected function in some databases, the following message is written to the log: This function <schema_name>.<function_name> does not exist in the database.

    • If the --vacuum-order argument has an invalid value (other than newest-first/newest-last and not containing a point), the following message is written to the log: --vacuum-order parameter is only valid for vacuum action or invalid function name.

    In both cases, the VACUUM operation is not applied to the databases for which errors have been detected.

Found a mistake? Seleсt text and press Ctrl+Enter to report it