Obtain statistics via ANALYZE

Eugenia Kuzina

Contents

The ANALYZE command
Statistics used by the planner
- Single-column statistics
- Extended statistics

Database statistics play a significant role in improving performance. The ANALYZE command collects statistics about the contents of tables in a database. The query planner uses these statistics to create efficient query plans.

The ANALYZE command

Overview

The ANALYZE command collects statistics about the contents of tables in the database and stores the results in the pg_statistic system catalog. Subsequently, the query planner uses these statistics to determine the most efficient execution plans for queries. ANALYZE has the following syntax:

ANALYZE [ ( <option> [, ...] ) ] [ <table_and_columns> [, ...] ]
ANALYZE [ VERBOSE ] [ <table_and_columns> [, ...] ]

where <option> can be one of:

    VERBOSE [ <boolean> ]
    SKIP_LOCKED [ <boolean> ]

and <table_and_columns> is:

    <table_name> [ ( <column_name> [, ...] ) ]

Parameters of the ANALYZE command

VERBOSE

Enables display of progress messages. When VERBOSE is specified, ANALYZE emits progress messages to indicate which table is currently being processed and prints various statistics

SKIP_LOCKED

Specifies that ANALYZE should not wait for any conflicting locks to be released when processing a relation. If a relation cannot be locked without waiting, the relation is skipped. Note that even with this option, ANALYZE can wait when opening relation indexes or when acquiring sample rows from partitions, child tables in the inheritance hierarchy, and some types of foreign tables. This option also forces ANALYZE to skip all partitions if there is a conflicting lock on a partitioned table

<boolean>

Specifies whether the selected option should be turned on or off. You can write TRUE, ON, or 1 to enable the option, and FALSE, OFF, or 0 to disable it. If the boolean value is omitted, TRUE is assumed

<table_name>

A name (possibly schema-qualified) of a table to analyze. If omitted, all regular tables, partitioned tables, and materialized views in the current database are analyzed but not foreign tables. If the specified table is the partitioned table, both the statistics of the entire partitioned table and the statistics of the individual partitions are updated

<column_name>

A name of a specific column to analyze. By default, all columns are analyzed

To analyze a table, a user must be the table’s owner or a superuser. Database owners are allowed to analyze all tables in their databases, except shared catalogs. The ANALYZE skips over any tables that the user does not have permission to analyze.

Examples

ANALYZE VERBOSE on a database

To run ANALYZE on a database, first connect to the database. For example, you can do it using psql:

\c example

Run the ANALYZE VERBOSE command:

ANALYZE VERBOSE;

Result:

INFO:  analyzing "public.pgbench_accounts"
INFO:  "pgbench_accounts": scanned 30000 of 81968 pages, containing 1830000 live rows and 0 dead rows; 30000 rows in sample, 5000048 estimated total rows
INFO:  analyzing "public.pgbench_branches"
INFO:  "pgbench_branches": scanned 1 of 1 pages, containing 50 live rows and 0 dead rows; 50 rows in sample, 50 estimated total rows
INFO:  analyzing "public.pgbench_history"
INFO:  "pgbench_history": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows
INFO:  analyzing "public.pgbench_tellers"
INFO:  "pgbench_tellers": scanned 3 of 3 pages, containing 500 live rows and 0 dead rows; 500 rows in sample, 500 estimated total rows
INFO:  analyzing "pg_catalog.pg_type"
INFO:  "pg_type": scanned 15 of 15 pages, containing 609 live rows and 0 dead rows; 609 rows in sample, 609 estimated total rows
INFO:  analyzing "pg_catalog.pg_foreign_table"
INFO:  "pg_foreign_table": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows
INFO:  analyzing "pg_catalog.pg_authid"
INFO:  "pg_authid": scanned 1 of 1 pages, containing 16 live rows and 3 dead rows; 16 rows in sample, 16 estimated total rows
INFO:  analyzing "pg_catalog.pg_statistic_ext_data"
INFO:  "pg_statistic_ext_data": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows
...

Since the VERBOSE option is used, the output contains detailed information on analyzed relations.

ANALYZE on a table

To analyze a table, specify its name in the command:

ANALYZE pgbench_accounts;

ANALYZE for specific columns

To analyze a column, specify a table name and a column name in parentheses:

ANALYZE pgbench_accounts (bid, abalance);

Foreign tables are analyzed only when explicitly selected. Not all foreign data wrappers support ANALYZE. If a table’s wrapper does not support ANALYZE, the command prints a warning and does nothing.

The autovacuum daemon runs ANALYZE for tables when they are first loaded with data, and if they are changed during regular operations. If autovacuum is disabled (not recommended), you should run ANALYZE periodically and after major changes in the contents of tables. Accurate statistics help the planner to choose the most appropriate query plan, and improve the speed of query processing. A common strategy for databases that are mainly intended for reading is to run VACUUM and ANALYZE once a day during a low-usage time of day. This is not sufficient if a database is being actively updated.

ANALYZE requires only a read lock on the target table, so it can run in parallel with other operations on the table.

If a table is empty, ANALYZE does not write new statistics for that table. Any existing statistics are retained.

Each backend running ANALYZE reports its progress to the pg_stat_progress_analyze view. For more information, see ANALYZE progress reporting.

Configuration parameters affecting ANALYZE

The statistics collected by ANALYZE usually include a list of the most common values and a histogram showing the approximate data distribution in each column. One or both of them can be omitted if they are not useful or if the column data type does not support the appropriate operators. For example, there are no common values in a column with unique keys.

For large tables, ANALYZE takes a random sample of the table’s contents, rather than examining each row. This allows even very large tables to be analyzed in a short period of time. Note, that the statistics are only approximate, and will change slightly each time ANALYZE is run, even if the actual contents of the table have not changed. It results in small changes in the planner’s estimated costs shown by EXPLAIN. In rare situations, this non-determinism will cause the planner’s choice of a query plan to change after ANALYZE is run. To avoid this, increase the amount of statistics collected by the ANALYZE command as described below.

The amount of statistics can be controlled by setting the default_statistics_target configuration parameter. Specify this parameter in the ADPG configurations section on the Clusters → ADPG cluster → Services → ADPG → Primary configuration tab in ADCM UI (see Configuration parameters). You can also set the per-column statistics target with the following command ALTER TABLE … ALTER COLUMN … SET STATISTICS. See ALTER TABLE for details.

ALTER TABLE table1 ALTER COLUMN column1 SET STATISTICS 0;

The target value sets the maximum number of entries in the most common value list and the maximum number of bins in the histogram. The default target value is 100, but it can be increased or decreased to find a balance between the accuracy of the planner’s estimates and the time required to run ANALYZE, as well as the amount of statistics in pg_statistic. Setting the statistics target to 0 disables collection of statistics for that column. It can be useful for columns that are never used as part of the WHERE, GROUP BY, or ORDER BY clauses of queries, since the planner does not use the statistics on such columns.

The largest statistics target among the columns determines the number of table rows selected for preparing statistics. Increasing the target causes a proportional increase in the time and space required for ANALYZE.

The second value estimated by ANALYZE is the number of distinct values that appear in each column. Because only a subset of the rows is examined, this estimate can be quite inaccurate, even with the largest possible statistics target. If this leads to bad query plans, a more accurate value can be determined manually and installed with the command: ALTER TABLE … ALTER COLUMN … SET (n_distinct = <value>).

ALTER TABLE table2 ALTER COLUMN column2 SET (n_distinct = 500);

ANALYZE on partitioned and child tables

If a table has inheritance children, ANALYZE collects two sets of statistics: one on the rows of the parent table only, and the second including rows of the parent table and all of its children. The planner uses the second set of statistics when processing queries that operate with the entire inheritance tree. In this case, the child tables are not individually analyzed. The autovacuum daemon only considers inserts or updates on the parent table itself when deciding whether to trigger an automatic ANALYZE for that table. If the table is rarely updated, the statistics will be outdated unless you run ANALYZE manually.

For partitioned tables, ANALYZE gathers statistics by sampling rows from all partitions. In addition, it will recurse into each partition and update its statistics. Each partition is analyzed only once, even with multi-level partitioning. ANALYZE does not collect statistics for a partitioned table (without data from its partitions), because it is empty.

The autovacuum daemon does not process partitioned tables, and parent tables if only children or partitions were modified. It is usually necessary to periodically run ANALYZE manually to keep the statistics of the table hierarchy up to date.

If any child tables or partitions are foreign tables whose foreign data wrappers do not support ANALYZE, these tables are ignored while gathering statistics.

Statistics used by the planner

As mentioned above, the query planner needs to estimate the number of rows retrieved by a query to make efficient query plans. This section provides a quick look at the statistics that the system uses for these estimates.

Single-column statistics

One component of the statistics is the total number of entries and the number of disk blocks occupied by each table and index. The pg_class table contains this information in the reltuples and relpages columns.

SELECT relname, relkind, reltuples, relpages
FROM pg_class
WHERE relname LIKE 'pgbench_accounts%';

Result:

        relname        | relkind |  reltuples   | relpages
-----------------------+---------+--------------+----------
 pgbench_accounts      | r       | 5.000048e+06 |    81968
 pgbench_accounts_pkey | i       | 5.000048e+06 |    13713

The pgbench_accounts table and its index contain 5.000048e+06 rows, but the index is smaller than the table.

For efficiency reasons, reltuples and relpages are updated by VACUUM, ANALYZE, and a few DDL commands such as CREATE INDEX. The VACUUM or ANALYZE operation may not scan the entire table but incrementally updates reltuples on the basis of the part of the table it did scan, resulting in an approximate value. The planner scales the values from pg_class to match the current physical table size and obtain a closer approximation.

Most queries retrieve only a fraction of the rows in a table limited by the WHERE clauses. The planner needs to estimate the selectivity of WHERE clauses, that is, the fraction of rows that match each condition in the WHERE clause. To do this, the planner uses information stored in the pg_statistic system catalog. The ANALYZE and VACUUM ANALYZE commands update entries in pg_statistic, but they are always approximate even immediately after an update.

Rather than look at pg_statistic directly, it is better to look at the pg_stats view when examining the statistics. The pg_stats view is readable by all, whereas pg_statistic is only available for a superuser. This prevents unprivileged users from obtaining information about the contents of other user’s tables from the statistics. The pg_stats view is restricted to show only rows of information about those tables that the current user can read.

You can execute the following query to pg_stats where terminals is a table name:

SELECT attname, n_distinct,
array_to_string(most_common_vals, E'\n') as most_common_vals
FROM pg_stats
WHERE tablename = 'terminals';

Result:

   attname     | n_distinct  |  most_common_vals
---------------+-------------+--------------------
 terminal_code |          -1 |
 timezone      | -0.16346154 | Europe/Moscow     +
               |             | Asia/Yekaterinburg+
               |             | Asia/Krasnoyarsk  +
               |             | Asia/Irkutsk      +
               |             | Asia/Yakutsk      +
               |             | Europe/Samara     +
               |             | Asia/Vladivostok  +
               |             | Asia/Novokuznetsk +
               |             | Europe/Volgograd
 terminal_name |          -1 |
 city          | -0.97115386 | Moscow            +
               |             | Ulyanovsk

where:

attname — name of column described by this row.
n_distinct — if greater than zero, the value is the estimated number of distinct values in the column. If this number is less than zero, its modulus represents the number of distinct values divided by the number of rows. The negated form is used when ANALYZE expects that the number of distinct values is likely to increase as the table grows; the positive form is used when the column seems to have a fixed number of possible values. -1 indicates a unique column in which the number of distinct values is the same as the number of rows.
most_common_vals — a list of the most common values in the column. Null (empty in the result above) if no values seem to be more common than any others.

Extended statistics

It is common to see slow queries running bad execution plans because multiple columns used in the query clauses are correlated. The planner usually assumes that multiple conditions are independent of each other. Regular statistics cannot take into account cross-column correlation. However, ADPG/PostgreSQL can compute multivariate statistics that collect such information.

Since the number of possible column combinations is very large, it is impractical to compute multivariate statistics automatically. Instead, ADPG/PostgreSQL allows you to create extended statistics objects to obtain statistics on selected sets of columns.

Statistics objects are created using the CREATE STATISTICS command. The creation of such an object adds an entry expressing the need for these statistics. ANALYZE collects the required statistical data, which can be viewed in the pg_statistic_ext_data catalog.

The simplest type of extended statistics is tracking functional dependencies. This concept is used in the definitions of database normal forms. The b column is functionally dependent on the a column if knowing the value of a is sufficient to determine the value of b, that is, there are no two rows with the same value of a but different values of b.

The existence of functional dependencies directly affects the accuracy of estimates in certain queries. To inform the planner about functional dependencies, ANALYZE can gather measurements of cross-column dependency. Data is collected only for those groups of columns that are referenced in the extended statistics object defined with the dependencies option. It is recommended to create dependencies statistics only for column groups that are strongly correlated to avoid unnecessary overhead in ANALYZE and later query planning.

The following code creates the stat1 extended statistics object to collect functional-dependency statistics on the terminal_name and city columns of the terminals table:

CREATE STATISTICS stat1 (dependencies) ON terminal_name, city FROM terminals;

ANALYZE terminals;

Check the result:

SELECT stxname, stxkeys, stxddependencies
FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
WHERE stxname = 'stat1';

Result:

 stxname | stxkeys |             stxddependencies
---------+---------+------------------------------------------
 stat1   | 2 3     | {"2 => 3": 1.000000, "3 => 2": 0.86751}

The column 2 (terminal_name) fully determines the column 3 (city) — the coefficient is 1.0, while city only determines terminal_name about 87% of all cases, meaning that there are cities that have more than one terminal.

For more examples of multivariate statistics, refer to How the planner uses statistics.

Found a mistake? Seleсt text and press Ctrl+Enter to report it