Obtain statistics via ANALYZE
Database statistics play a significant role in improving performance. The ANALYZE command collects statistics about the contents of tables in a database. The query planner uses these statistics to create efficient query plans.
The ANALYZE command
Overview
The ANALYZE command collects statistics about the contents of tables in the database and stores the results in the pg_statistic system catalog. Subsequently, the query planner uses these statistics to determine the most efficient execution plans for queries. ANALYZE
has the following syntax:
ANALYZE [ ( <option> [, ...] ) ] [ <table_and_columns> [, ...] ]
ANALYZE [ VERBOSE ] [ <table_and_columns> [, ...] ]
where option
can be one of:
VERBOSE [ <boolean> ]
SKIP_LOCKED [ <boolean> ]
and table_and_columns
is:
<table_name> [ ( <column_name> [, ...] ) ]
VERBOSE |
Enables display of progress messages. When |
SKIP_LOCKED |
Specifies that |
boolean |
Specifies whether the selected option should be turned on or off. You can write |
table_name |
A name (possibly schema-qualified) of a table to analyze. If omitted, all regular tables, partitioned tables, and materialized views in the current database are analyzed but not foreign tables. If the specified table is the partitioned table, both the statistics of the entire partitioned table and the statistics of the individual partitions are updated |
column_name |
A name of a specific column to analyze. By default, all columns are analyzed |
To analyze a table, a user must be the table’s owner or a superuser. Database owners are allowed to analyze all tables in their databases, except shared catalogs. The ANALYZE
skips over any tables that the user does not have permission to analyze.
Examples
To run ANALYZE
on a database, first connect to the database. For example, you can do it using psql:
\c example
Run the ANALYZE VERBOSE
command:
ANALYZE VERBOSE;
Result:
INFO: analyzing "public.pgbench_accounts" INFO: "pgbench_accounts": scanned 30000 of 81968 pages, containing 1830000 live rows and 0 dead rows; 30000 rows in sample, 5000048 estimated total rows INFO: analyzing "public.pgbench_branches" INFO: "pgbench_branches": scanned 1 of 1 pages, containing 50 live rows and 0 dead rows; 50 rows in sample, 50 estimated total rows INFO: analyzing "public.pgbench_history" INFO: "pgbench_history": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows INFO: analyzing "public.pgbench_tellers" INFO: "pgbench_tellers": scanned 3 of 3 pages, containing 500 live rows and 0 dead rows; 500 rows in sample, 500 estimated total rows INFO: analyzing "pg_catalog.pg_type" INFO: "pg_type": scanned 15 of 15 pages, containing 609 live rows and 0 dead rows; 609 rows in sample, 609 estimated total rows INFO: analyzing "pg_catalog.pg_foreign_table" INFO: "pg_foreign_table": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows INFO: analyzing "pg_catalog.pg_authid" INFO: "pg_authid": scanned 1 of 1 pages, containing 16 live rows and 3 dead rows; 16 rows in sample, 16 estimated total rows INFO: analyzing "pg_catalog.pg_statistic_ext_data" INFO: "pg_statistic_ext_data": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows ...
Since the VERBOSE
option is used, the output contains detailed information on analyzed relations.
To analyze a table, specify its name in the command:
ANALYZE pgbench_accounts;
To analyze a column, specify a table name and a column name in parentheses:
ANALYZE pgbench_accounts (bid, abalance);
Foreign tables are analyzed only when explicitly selected. Not all foreign data wrappers support ANALYZE
. If a table’s wrapper does not support ANALYZE
, the command prints a warning and does nothing.
The autovacuum daemon runs ANALYZE
for tables when they are first loaded with data, and if they are changed during regular operations. If autovacuum is disabled (not recommended), you should run ANALYZE
periodically and after major changes in the contents of tables. Accurate statistics help the planner to choose the most appropriate query plan, and improve the speed of query processing. A common strategy for databases that are mainly intended for reading is to run VACUUM
and ANALYZE
once a day during a low-usage time of day. This is not sufficient if a database is being actively updated.
ANALYZE
requires only a read lock on the target table, so it can run in parallel with other operations on the table.
If a table is empty, ANALYZE
does not write new statistics for that table. Any existing statistics are retained.
Each backend running ANALYZE
reports its progress to the pg_stat_progress_analyze
view. For more information, see ANALYZE progress reporting.
Configuration parameters affecting ANALYZE
The statistics collected by ANALYZE
usually include a list of the most common values and a histogram showing the approximate data distribution in each column. One or both of them can be omitted if they are not useful or if the column data type does not support the appropriate operators. For example, there are no common values in a column with unique keys.
For large tables, ANALYZE
takes a random sample of the table’s contents, rather than examining each row. This allows even very large tables to be analyzed in a short period of time. Note, that the statistics are only approximate, and will change slightly each time ANALYZE
is run, even if the actual contents of the table have not changed. It results in small changes in the planner’s estimated costs shown by EXPLAIN
. In rare situations, this non-determinism will cause the planner’s choice of a query plan to change after ANALYZE
is run. To avoid this, increase the amount of statistics collected by the ANALYZE
command as described below.
The amount of statistics can be controlled by setting the default_statistics_target configuration parameter. Specify this parameter in the ADPG configurations section on the Clusters → ADPG cluster → Services → ADPG → Primary configuration tab in ADCM UI (see Configuration parameters). You can also set the per-column statistics target with the following command ALTER TABLE … ALTER COLUMN … SET STATISTICS
. See ALTER TABLE for details.
ALTER TABLE table1 ALTER COLUMN column1 SET STATISTICS 0;
The target value sets the maximum number of entries in the most common value list and the maximum number of bins in the histogram. The default target value is 100
, but it can be increased or decreased to find a balance between the accuracy of the planner’s estimates and the time required to run ANALYZE
, as well as the amount of statistics in pg_statistic
. Setting the statistics target to 0
disables collection of statistics for that column. It can be useful for columns that are never used as part of the WHERE
, GROUP BY
, or ORDER BY
clauses of queries, since the planner does not use the statistics on such columns.
The largest statistics target among the columns determines the number of table rows selected for preparing statistics. Increasing the target causes a proportional increase in the time and space required for ANALYZE
.
The second value estimated by ANALYZE
is the number of distinct values that appear in each column. Because only a subset of the rows is examined, this estimate can be quite inaccurate, even with the largest possible statistics target. If this leads to bad query plans, a more accurate value can be determined manually and installed with the command: ALTER TABLE … ALTER COLUMN … SET (n_distinct = <value>)
.
ALTER TABLE table2 ALTER COLUMN column2 SET (n_distinct = 500);
ANALYZE on partitioned and child tables
If a table has inheritance children, ANALYZE
collects two sets of statistics: one on the rows of the parent table only, and the second including rows of the parent table and all of its children. The planner uses the second set of statistics when processing queries that operate with the entire inheritance tree. In this case, the child tables are not individually analyzed. The autovacuum daemon only considers inserts or updates on the parent table itself when deciding whether to trigger an automatic ANALYZE
for that table. If the table is rarely updated, the statistics will be outdated unless you run ANALYZE
manually.
For partitioned tables, ANALYZE
gathers statistics by sampling rows from all partitions. In addition, it will recurse into each partition and update its statistics. Each partition is analyzed only once, even with multi-level partitioning. ANALYZE
does not collect statistics for a partitioned table (without data from its partitions), because it is empty.
The autovacuum daemon does not process partitioned tables, and parent tables if only children or partitions were modified. It is usually necessary to periodically run ANALYZE
manually to keep the statistics of the table hierarchy up to date.
If any child tables or partitions are foreign tables whose foreign data wrappers do not support ANALYZE
, these tables are ignored while gathering statistics.
Statistics used by the planner
As mentioned above, the query planner needs to estimate the number of rows retrieved by a query to make efficient query plans. This section provides a quick look at the statistics that the system uses for these estimates.
Single-column statistics
One component of the statistics is the total number of entries and the number of disk blocks occupied by each table and index. The pg_class table contains this information in the reltuples
and relpages
columns.
SELECT relname, relkind, reltuples, relpages
FROM pg_class
WHERE relname LIKE 'pgbench_accounts%';
Result:
relname | relkind | reltuples | relpages -----------------------+---------+--------------+---------- pgbench_accounts | r | 5.000048e+06 | 81968 pgbench_accounts_pkey | i | 5.000048e+06 | 13713
The pgbench_accounts
table and its index contain 5.000048e+06
rows, but the index is smaller than the table.
For efficiency reasons, reltuples
and relpages
are updated by VACUUM
, ANALYZE
, and a few DDL commands such as CREATE INDEX
. The VACUUM
or ANALYZE
operation may not scan the entire table but incrementally updates reltuples
on the basis of the part of the table it did scan, resulting in an approximate value. The planner scales the values from pg_class
to match the current physical table size and obtain a closer approximation.
Most queries retrieve only a fraction of the rows in a table limited by the WHERE
clauses. The planner needs to estimate the selectivity of WHERE
clauses, that is, the fraction of rows that match each condition in the WHERE
clause. To do this, the planner uses information stored in the pg_statistic system catalog. The ANALYZE
and VACUUM ANALYZE
commands update entries in pg_statistic
, but they are always approximate even immediately after an update.
Rather than look at pg_statistic
directly, it is better to look at the pg_stats view when examining the statistics. The pg_stats
view is readable by all, whereas pg_statistic
is only available for a superuser. This prevents unprivileged users from obtaining information about the contents of other user’s tables from the statistics. The pg_stats
view is restricted to show only rows of information about those tables that the current user can read.
You can execute the following query to pg_stats
where terminals
is a table name:
SELECT attname, n_distinct,
array_to_string(most_common_vals, E'\n') as most_common_vals
FROM pg_stats
WHERE tablename = 'terminals';
Result:
attname | n_distinct | most_common_vals ---------------+-------------+-------------------- terminal_code | -1 | timezone | -0.16346154 | Europe/Moscow + | | Asia/Yekaterinburg+ | | Asia/Krasnoyarsk + | | Asia/Irkutsk + | | Asia/Yakutsk + | | Europe/Samara + | | Asia/Vladivostok + | | Asia/Novokuznetsk + | | Europe/Volgograd terminal_name | -1 | city | -0.97115386 | Moscow + | | Ulyanovsk
where:
-
attname
— name of column described by this row. -
n_distinct
— if greater than zero, the value is the estimated number of distinct values in the column. If this number is less than zero, its modulus represents the number of distinct values divided by the number of rows. The negated form is used whenANALYZE
expects that the number of distinct values is likely to increase as the table grows; the positive form is used when the column seems to have a fixed number of possible values.-1
indicates a unique column in which the number of distinct values is the same as the number of rows. -
most_common_vals
— a list of the most common values in the column.Null
(empty in the result above) if no values seem to be more common than any others.
Extended statistics
It is common to see slow queries running bad execution plans because multiple columns used in the query clauses are correlated. The planner usually assumes that multiple conditions are independent of each other. Regular statistics cannot take into account cross-column correlation. However, ADPG/PostgreSQL can compute multivariate statistics that collect such information.
Since the number of possible column combinations is very large, it is impractical to compute multivariate statistics automatically. Instead, ADPG/PostgreSQL allows you to create extended statistics objects to obtain statistics on selected sets of columns.
Statistics objects are created using the CREATE STATISTICS command. The creation of such an object adds an entry expressing the need for these statistics. ANALYZE
collects the required statistical data, which can be viewed in the pg_statistic_ext_data catalog.
The simplest type of extended statistics is tracking functional dependencies. This concept is used in the definitions of database normal forms. The b
column is functionally dependent on the a
column if knowing the value of a
is sufficient to determine the value of b
, that is, there are no two rows with the same value of a
but different values of b
.
The existence of functional dependencies directly affects the accuracy of estimates in certain queries. To inform the planner about functional dependencies, ANALYZE
can gather measurements of cross-column dependency. Data is collected only for those groups of columns that are referenced in the extended statistics object defined with the dependencies
option. It is recommended to create dependencies statistics only for column groups that are strongly correlated to avoid unnecessary overhead in ANALYZE
and later query planning.
The following code creates the stat1
extended statistics object to collect functional-dependency statistics on the terminal_name
and city
columns of the terminals
table:
CREATE STATISTICS stat1 (dependencies) ON terminal_name, city FROM terminals;
ANALYZE terminals;
Check the result:
SELECT stxname, stxkeys, stxddependencies
FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
WHERE stxname = 'stat1';
Result:
stxname | stxkeys | stxddependencies ---------+---------+------------------------------------------ stat1 | 2 3 | {"2 => 3": 1.000000, "3 => 2": 0.86751}
The column 2
(terminal_name
) fully determines the column 3
(city
) — the coefficient is 1.0
, while city
only determines terminal_name
about 87% of all cases, meaning that there are cities that have more than one terminal.
For more examples of multivariate statistics, refer to How the planner uses statistics.