ADB monitoring metrics

This article describes metrics for monitoring an ADB cluster. For information on how to install monitoring, refer to Install monitoring.

Overview

The Monitoring service consists of the following components:

  • Node Exporter — exposes hardware- and OS-related metrics, such as memory and CPU usage or filesystem space. These metrics are available on the port and endpoint specified in the Node Exporter settings section of the Monitoring service configuration (11203/metrics by default).

  • Process Exporter — collects metrics for the specified processes. In ADB, these are the processes related to ADBM and ADB Control agents. The Process Exporter metrics are available on the port specified in the Process exporter settings section of the Monitoring service configuration (9256 by default). The Process Exporter component is available in the Enterprise edition of ADB.

  • Greengage Exporter — collects cluster- and database-related metrics, such as segment health or active connections. These metrics are available on the port and endpoint specified in the Greengage Exporter settings section of the Monitoring service configuration (9080/metrics by default).

  • Prometheus — scrapes and stores metrics from the configured data sources: Node Exporter, Process Exporter, and Greengage Exporter. These metrics are available in the Prometheus web interface on the port specified in the Prometheus settings section of the Monitoring service configuration (11200 by default).

  • Grafana — uses Prometheus as a data source and visualizes its metrics as graphs and charts organized into dashboards. These dashboards are available in the Grafana web interface on the port specified in the Grafana settings section of the Monitoring service configuration (11210 by default).
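The default ports and endpoints listed above can be collected into a small lookup for quick availability checks. The following is a minimal sketch, not part of the product: the `endpoint_urls` helper, the `http` scheme, and the example hostname are assumptions, and in a real deployment the exporters and the Prometheus and Grafana servers may run on different hosts.

```python
# Default ports and paths as listed above. All of them can be changed
# in the Monitoring service configuration.
DEFAULT_ENDPOINTS = {
    "Node Exporter": (11203, "/metrics"),
    "Process Exporter": (9256, ""),  # the article states only the port
    "Greengage Exporter": (9080, "/metrics"),
    "Prometheus": (11200, ""),
    "Grafana": (11210, ""),
}


def endpoint_urls(host: str, scheme: str = "http") -> dict[str, str]:
    """Build a default URL for every monitoring component on the given host.

    Hypothetical helper: the scheme and the single shared host are assumptions.
    """
    return {
        name: f"{scheme}://{host}:{port}{path}"
        for name, (port, path) in DEFAULT_ENDPOINTS.items()
    }


if __name__ == "__main__":
    # "monitoring.example.com" is a placeholder hostname.
    for name, url in endpoint_urls("monitoring.example.com").items():
        print(f"{name}: {url}")
```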

View metrics in Prometheus

Prometheus is a monitoring and alerting toolkit. Prometheus collects metrics from exporters, and Grafana sends requests to Prometheus to collect data for its dashboards. If Grafana dashboards show empty panels or unexpected values, checking Prometheus helps determine whether the issue is related to metric collection or dashboard configuration:

  1. In your browser, enter <IP address of monitoring server>:<port>. The default port is 11200, and it can be changed in the Prometheus settings section in the Monitoring service configuration.

    The IP address, port, and hostname of Prometheus are also available on the Info tab of the Monitoring service.

  2. In the window that opens, enter the user name and the password that you have configured in the Prometheus users to login/logout to Prometheus field in the Monitoring service configuration.

In the Prometheus web interface, you can check its configuration and the state of the exporters (on the Targets page). You can also use the Prometheus Query Language (PromQL) to check specific metrics.
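The same checks can be scripted against the Prometheus HTTP API. The sketch below builds an instant query for the built-in `up` metric, which Prometheus sets to 1 for every target it scraped successfully and to 0 otherwise. It assumes that the Prometheus instance accepts HTTP basic authentication with the configured user name and password; the address and credentials shown are placeholders.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_query_request(base_url, promql, user, password):
    """Build an instant-query request for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    request = urllib.request.Request(f"{base_url}/api/v1/query?{params}")
    # Assumption: the server is protected with HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    return request


def run_query(request, timeout=10):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholder address and credentials; replace them with your own.
    req = build_query_request("http://192.0.2.10:11200", "up", "admin", "secret")
    print(req.full_url)
    # Call run_query(req) against a reachable server to get the result.
```

If `up` returns 0 for an exporter, the problem is in metric collection rather than in the Grafana dashboard configuration.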

Using Prometheus Query Language

Grafana dashboards

Grafana allows you to visualize metrics stored in Prometheus, create your own dashboards, or modify existing ones.

Open Grafana

  1. In your browser, enter <IP address of monitoring server>:<port>. The default port is 11210, and it can be changed in the Grafana settings section in the Monitoring service configuration.

    The IP address, port, and hostname of Grafana are also available on the Info tab of the Monitoring service.

  2. In the window that opens, enter admin in the Email or username field and, in the Password field, enter the password that you have configured in the Grafana administrator’s password field in the Monitoring service configuration.

By default, the following dashboards are available in Grafana:

Greengage - Cluster Overview

This dashboard provides a view of your ADB cluster health and performance including cluster status, segment health, and connection usage.

Greengage - Cluster Overview dashboard in Grafana
Panel name Description

Cluster Status

Indicates whether the cluster is running and reachable (UP) or not (DOWN)

Cluster Uptime

Total uptime of the cluster since the last restart

Total Databases

Total number of databases present in the cluster

Total Database Size

Total size of all databases on the cluster

Connection Usage (% of Max)

Current connections as a percentage of the connection limit (defined in the max_connections server configuration parameter). Shows how close the cluster is to connection exhaustion

Locked Sessions

Number of sessions currently locked, indicating potential contention issues. To further analyze locks, see the Query Performance dashboard

Total Segments

Total number of segments (primary and mirror) configured in the cluster

Segments UP

Number of segments currently up and running

Segments DOWN

Number of segments currently down

Segment Status by Host

Graph showing the status (1 — up or 0 — down) of each segment over time, identified by hostname, port, content ID, and role

Connections by State

Graph showing the number of connections grouped by state (active, idle, idle in transaction) over time

Query Activity

Distribution of queries by state:

  • Total Active Queries — active queries as determined by the state column of the pg_stat_activity view.

  • Slow Queries — queries that run for more than 180 seconds.

  • Waiting for Locks — active queries blocked while attempting to acquire a lock (reflected by entries in the pg_locks system view where granted is false).

Active Queries by Duration Bucket

Distribution of active queries by their duration buckets. The buckets categorize queries by their execution time: 0-10 seconds, 10-60 seconds, 60-180 seconds, 180-600 seconds, and over 600 seconds

Replication Lag (Replay)

Replication lag in bytes for each segment, representing the amount of WAL data to be replayed on mirrors

Max Replication Lag

Maximum replication lag across all segments, shown as a single value with thresholds for warning and critical levels
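The Connection Usage (% of Max) panel in this dashboard relates the current number of connections to the max_connections limit. A minimal sketch of that calculation follows; the function name is hypothetical and the exporter's actual implementation may differ.

```python
def connection_usage_percent(current_connections: int, max_connections: int) -> float:
    """Return current connections as a percentage of the connection limit."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return 100.0 * current_connections / max_connections


if __name__ == "__main__":
    # Example values only: 150 open connections with max_connections = 200.
    print(f"{connection_usage_percent(150, 200):.1f}% of max_connections in use")
```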

Greengage - Database Health

This dashboard monitors database-level health metrics: vacuum operations, table bloat, and data distribution skew.

Greengage - Database Health dashboard in Grafana
Panel name Description

Vacuum Running

Indicates whether a vacuum process is currently running

Max Time Since Last Vacuum

Maximum time elapsed since any table in the database was last vacuumed. Greengage Exporter compares the timestamps of the last vacuum operation (manual or autovacuum) on every table and shows the largest interval found across all tables

Max Dead Tuple Ratio

Maximum ratio of dead tuples to live tuples across all tables, indicating potential vacuum need

Average Dead Tuple Ratio

Average dead tuple ratio across all tables, showing the overall table bloat level

Top Tables by Time Since Last Vacuum

The list of tables with the longest time since their last vacuum, sorted by time descending. The Value column shows the time elapsed since the last vacuum was performed on this table.

During each metric collection, only tables where the sum of dead and live tuples exceeds a configured threshold (COLLECTOR_TABLE_VACUUM_TUPLE_THRESHOLD) are included. This means that after a successful vacuum, a table may no longer satisfy this condition and will be absent from the metric query; in this case, new data from this table will not be reflected in this list

Vacuum & Autovacuum Count

Graph showing the cumulative count of manual and automatic vacuum operations over time per database.

During each metric collection, only tables where the sum of dead and live tuples exceeds a configured threshold (COLLECTOR_TABLE_VACUUM_TUPLE_THRESHOLD) are included. This means that after a successful vacuum, a table may no longer satisfy this condition and will be absent from the metric query; in this case, new data from this table will not be reflected in this graph

Top Tables by Dead Tuple Ratio

Graph tracking the top tables with the highest dead tuple ratio over time

Table Bloat State (0=none, 1=moderate, 2=severe)

Graph displaying the bloat state of tables over time

Top Tables by Skew Factor (>1.5 is significant)

Graph showing how the skew factor changes over time for the top tables exceeding the 1.5 threshold

Tables with Bloat

Shows tables currently experiencing bloat, with bloat state (1 — moderate, 2 — severe)

Tables with High Skew Factor

Shows the current top tables ordered by skew factor
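Two ideas from this dashboard can be illustrated in code: the dead tuple ratio and the tuple threshold that decides whether a table is included in vacuum statistics. The sketch below is an illustration under assumptions, not the exporter's code: the `TableStats` type, the helper names, the guard for tables without live tuples, and the threshold value of 1000 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    name: str
    live_tuples: int
    dead_tuples: int


def dead_tuple_ratio(table: TableStats) -> float:
    """Ratio of dead tuples to live tuples.

    Assumption: a table without live tuples is divided by 1 to avoid
    a division by zero.
    """
    return table.dead_tuples / max(table.live_tuples, 1)


def ratio_summary(tables):
    """Return (maximum, average) dead tuple ratio across the given tables."""
    ratios = [dead_tuple_ratio(t) for t in tables]
    if not ratios:
        return 0.0, 0.0
    return max(ratios), sum(ratios) / len(ratios)


def tables_over_threshold(tables, tuple_threshold):
    """Keep only tables whose dead + live tuples exceed the threshold."""
    return [t for t in tables if t.live_tuples + t.dead_tuples > tuple_threshold]


if __name__ == "__main__":
    threshold = 1000  # placeholder for COLLECTOR_TABLE_VACUUM_TUPLE_THRESHOLD
    before_vacuum = TableStats("sales", live_tuples=600, dead_tuples=500)
    after_vacuum = TableStats("sales", live_tuples=600, dead_tuples=0)
    print(tables_over_threshold([before_vacuum], threshold))  # table is included
    print(tables_over_threshold([after_vacuum], threshold))   # table drops out
```

The second call shows why a table can disappear from the vacuum panels after a successful vacuum: removing dead tuples lowers the sum below the threshold.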

Greengage - Exporter Monitoring

This dashboard monitors the health and performance of Greengage Exporter itself.

Greengage - Exporter Monitoring dashboard in Grafana
Panel name Description

Exporter Uptime

Total uptime of the exporter process

Total Scrapes

Total number of scrapes — that is, the number of times Prometheus successfully collected metrics from Greengage Exporter

Total Errors

Total number of errors encountered during scrapes

Max Scrape Duration

Maximum duration of a single scrape operation in seconds, with thresholds

Average Scrape Duration (5m)

Average scrape duration over the last 5 minutes

Scrape Duration Over Time

Graph showing max and average scrape duration over time

Scrape & Error Rate (per second)

Rate of scrape operations and errors per second over 5-minute windows

Collector Durations

Duration of each collector execution over time. In Greengage Exporter, collectors are components responsible for gathering specific types of metrics, such as the collector for table vacuum statistics or the one for memory usage

Circuit Breaker State

Current state of the internal circuit breakers:

  • Closed — normal operation of the database method calls.

  • Open — the failure threshold is reached, and the requests are blocked.

  • Half Open — the breaker checks if the failure is resolved; if so, it transitions to the Closed state.

Circuit Breaker Opened Count (5m increase)

Number of times circuit breakers opened in the last 5 minutes

Circuit Breaker Calls Rate (per second)

Rate of calls to circuit breakers per second

Timeout Calls Rate (per second)

Rate of timeout-protected method calls per second, labeled by whether they timed out

Timeout Execution Duration (Average)

Average execution duration of timeout-protected method calls

Retry Calls Rate (per second)

Rate of retry-protected method calls per second, labeled by retry status and result

Total Retry Attempts (5m increase)

Total number of retry attempts across all methods over the last 5 minutes

Method Invocation Rate (per second)

Rate of all method invocations per second. Invocations are split by their results, for example valueReturned or exceptionThrown
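The circuit breaker states shown in the Circuit Breaker State panel follow the common circuit breaker pattern. The sketch below illustrates that pattern in general terms and is not the implementation used by Greengage Exporter. The failure threshold, the cooldown before the Half Open trial call, and all names are assumptions made for the example.

```python
import enum
import time


class State(enum.Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds  # assumption: time-based reopening
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise RuntimeError("circuit open: call blocked")
            self.state = State.HALF_OPEN  # allow one trial call
        try:
            result = func()
        except Exception:
            self._on_failure()
            raise
        # A successful call closes the breaker and resets the failure count.
        self.state = State.CLOSED
        self.failures = 0
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many failures, opens the breaker.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

The injectable `clock` parameter is a design choice for the sketch: it lets the Open to Half Open transition be exercised without waiting for real time to pass.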

Greengage - Host & Resource Group Resources

This dashboard provides a view of resource utilization in an ADB cluster at the host level and the resource group level.

To view data for a specific host or resource group, use the Hostname and Resource Group filters at the top of the dashboard.

Greengage - Host & Resource Group Resources dashboard in Grafana
Panel name Description

AVG CPU Usage vs Rate Limit (% of Limit Used)

Average CPU usage as a percentage of the CPU rate limit for each resource group, showing how close each group is to its configured limit

AVG CPU Usage vs Limit Over Time (% of Limit)

Graph of average CPU usage over time, expressed as a percentage of the CPU rate limit

CPU Usage Skew Ratio

Ratio of maximum to average CPU usage across hosts; values greater than 1.3 indicate uneven CPU distribution

Absolute CPU Usage by Host and Resource Group

Graph of absolute CPU usage percentage per host and resource group, with limit annotation

Memory Usage vs Limit (% of Limit Used) - LIMITED GROUPS ONLY

Memory usage as a percentage of the memory limit for resource groups with finite limits (unlimited groups excluded)

Memory Usage Skew Ratio

Ratio of maximum to average memory usage across hosts; values greater than 1.3 indicate uneven memory distribution

Average Memory Usage

Average memory usage across all hosts in the cluster

Max Memory Usage

Maximum memory usage observed on any host in the cluster

Memory Usage by Host and Resource Group (with Limits)

Graph of absolute memory usage per host and resource group, showing configured limits

Running Sessions by Resource Group

Number of currently running (active) sessions per resource group; high values indicate high workload

Queueing Sessions by Resource Group (Resource Saturation Indicator)

Number of sessions queueing due to resource limits in a resource group

Running vs Queueing Sessions Over Time

Graph comparing running and queueing sessions per resource group over time

Average Disk Total

Average total disk space across all hosts

Average Disk Used

Average used disk space across all hosts

Disk Usage Skew Ratio

Ratio of maximum to average disk usage across hosts; values greater than 1.3 indicate uneven disk usage distribution

Max Disk Usage Percent

Maximum disk usage percentage observed on any host

Disk Usage Percent by Host

Graph of disk usage percentage per host

Disk Usage by Host (Total/Used/Available)

Graph showing total, used, and available disk space per host

Database Size by Name

Shows database sizes in MB for selected databases

Max Spill Usage

Maximum spill (temp file) usage observed on any host

Average Spill Usage

Average spill usage across all hosts

Spill Usage Skew Ratio

Ratio of maximum to average spill usage

Spill Usage by Host

Shows spill usage per host
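The skew ratio panels in this dashboard (CPU, memory, disk, and spill) all use the same idea: the maximum value across hosts divided by the average. A minimal sketch of that calculation follows. The function name, the host names, and the treatment of an all-zero cluster as balanced are assumptions; the 1.3 threshold is the one quoted in the panel descriptions.

```python
from statistics import mean

SKEW_THRESHOLD = 1.3  # value quoted in the panel descriptions


def skew_ratio(usage_by_host: dict[str, float]) -> float:
    """Return the ratio of maximum to average usage across hosts."""
    values = list(usage_by_host.values())
    if not values:
        raise ValueError("at least one host is required")
    average = mean(values)
    if average == 0:
        return 1.0  # assumption: an idle cluster is treated as balanced
    return max(values) / average


if __name__ == "__main__":
    # Hypothetical per-host usage values in percent.
    usage = {"sdw1": 90.0, "sdw2": 30.0, "sdw3": 30.0}
    ratio = skew_ratio(usage)
    status = "uneven" if ratio > SKEW_THRESHOLD else "balanced"
    print(f"skew ratio {ratio:.2f}: {status}")
```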

Greengage - Query Performance

This dashboard provides monitoring of query performance: active queries, slow queries, lock contention, and connection statistics.

Greengage - Query Performance dashboard in Grafana
Panel name Description

Total Active Queries

Current number of active (running) queries across the cluster, with thresholds for normal, elevated, and high activity

Slow Queries (>180s)

Number of queries currently running that have exceeded 180 seconds (3 minutes) of execution time, indicating potential performance issues

Queries Waiting for Locks

Number of active queries blocked while attempting to acquire a lock. Reflected by entries in the pg_locks system view where granted is false

Total Locked Sessions

Total number of sessions currently waiting for locks

Query Activity Over Time

Distribution of queries by state for the selected time range:

  • Total Active Queries — active queries as determined by the state column of the pg_stat_activity view;

  • Slow Queries — queries that run for more than 180 seconds;

  • Waiting for Locks — active queries blocked while attempting to acquire a lock.

Active Queries by Duration Bucket (Stacked)

Distribution of active queries across predefined duration buckets. The buckets categorize queries by their execution time: 0-10 seconds, 10-60 seconds, 60-180 seconds, 180-600 seconds, and over 600 seconds

Query Duration Distribution (Donut)

Current distribution of active queries across duration buckets, with absolute counts and percentages

Total Connections (All States)

Total number of connections to the database in all states

Active Connections

Number of connections currently executing queries

Idle Connections

Number of connections that are idle (waiting for client activity)

Connections by State Over Time

Shows connections grouped by state (active, idle, idle in transaction, and idle in transaction (aborted)) over time

Queries Waiting for Locks Over Time

The number of queries blocked waiting for locks, with thresholds for warning and critical levels

Locked Sessions Over Time

Total locked sessions over time

Max Lock Wait Time by Type and Mode

Maximum wait time for locks grouped by lock type and mode. When locks are released, metrics disappear from this panel (no stale data shown)

Waiting Queries by Lock Type and Mode

Number of queries waiting for locks, grouped by lock type and mode. When locks are released, metrics disappear from this panel (no stale data shown)
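The duration buckets used by the Active Queries by Duration Bucket (Stacked) and Query Duration Distribution (Donut) panels can be sketched as a simple classifier. The bucket boundaries come from the panel descriptions; the labels, the function names, and the choice to treat each upper bound as inclusive (so that exactly 600 seconds is not yet "over 600") are assumptions.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds of the first four buckets, in seconds.
BUCKET_BOUNDS = [10, 60, 180, 600]
BUCKET_LABELS = ["0-10s", "10-60s", "60-180s", "180-600s", ">600s"]


def duration_bucket(seconds: float) -> str:
    """Return the label of the bucket that a query duration falls into."""
    # bisect_left keeps a value equal to a bound in the lower bucket.
    return BUCKET_LABELS[bisect_left(BUCKET_BOUNDS, seconds)]


def bucket_counts(durations):
    """Count active queries per bucket, including empty buckets."""
    counts = Counter(duration_bucket(d) for d in durations)
    return {label: counts.get(label, 0) for label in BUCKET_LABELS}


if __name__ == "__main__":
    # Hypothetical durations of currently active queries, in seconds.
    print(bucket_counts([5, 30, 45, 200, 700]))
```

With these boundaries, every query in the last two buckets also counts as a slow query, because it has been running for more than 180 seconds.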

Greengage - Replication & Segments

This dashboard provides monitoring of segment health and replication state.

Greengage - Replication & Segments dashboard in Grafana
Panel name Description

Total Segments

Total number of segments (primary and mirror) configured in the cluster

Segments UP

Number of segments currently up and running

Segments DOWN

Number of segments currently down

Sync Replicas Active

Indicates whether synchronous replication is active

Segment Status

Graph showing segment status over time, identified by hostname, port, content ID, and role

Segment Role (1=Primary, 2=Mirror)

Graph displaying segment role (primary or mirror) over time

Segment Mode (1=Sync, 2=Resync, 3=Change Track, 4=Not Sync)

Graph of segment replication mode over time

Max Replication Lag

Maximum replication lag in bytes across all segments, with thresholds

Average Replication Lag

Average replication lag in bytes across all segments, with thresholds

Minimum Sync State (2=sync, 1=async, 0.5=potential, 0=unknown)

Minimum sync state among replication connections, indicating the worst‑case sync status

Replication Lag Details (Replay, Write, Flush)

Graph of replication lag in bytes for each segment, broken down by replay lag, write lag, and flush lag

Replication State (1=streaming, 2=catchup, 3=backup, 0=unknown)

Graph of replication state for each mirror

Replication Sync State (2=sync, 1=async, 0.5=potential, 0=unknown)

Graph indicating the synchronization policy over time:

  • sync — the primary segment waits for acknowledgment from the standby/mirror server.

  • potential — the standby server is now asynchronous, but can potentially become synchronous if one of the current synchronous ones fails.

  • async — the standby server is asynchronous.

Segment Details Table

The table lists all segments with their current status, hostname, port, content ID, role, and preferred role
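The numeric encoding of sync states (2=sync, 1=async, 0.5=potential, 0=unknown) makes the Minimum Sync State panel a simple minimum over all replication connections. The sketch below shows how such a value could be decoded; the helper name and the sample values are hypothetical.

```python
# Encoding taken from the panel titles in this dashboard.
SYNC_STATE_NAMES = {2: "sync", 1: "async", 0.5: "potential", 0: "unknown"}


def minimum_sync_state(values):
    """Return the lowest sync state value and its name.

    Raises ValueError if no values are given.
    """
    worst = min(values)
    return worst, SYNC_STATE_NAMES.get(worst, "unrecognized")


if __name__ == "__main__":
    # Hypothetical sync state values reported for three mirrors.
    value, name = minimum_sync_state([2, 2, 1])
    print(f"minimum sync state: {value} ({name})")
```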

Node Exporter statistics

The Node Exporter statistics dashboard provides system metrics for each host in the cluster where Node Exporter is installed. You can select a host in the host filter at the top of the page.

The Node Exporter statistics dashboard in Grafana

Process exporter metrics

This dashboard provides monitoring of ADB Control and ADBM agents (adcc-agent and adbm-agent processes) using metrics exposed by Process Exporter.

The Process exporter metrics dashboard in Grafana
Panel name Description

Agents uptime

Uptime of each agent instance

Agents memory usage

Memory usage of each agent over time

Agents CPU usage

CPU usage percentage of each agent over time
