Конференция Arenadata
Новое время — новый Greenplum
Мы приглашаем вас принять участие в конференции, посвященной будущему Open-Source Greenplum 19 сентября в 18:00:00 UTC +3. Встреча будет проходить в гибридном формате — и офлайн, и онлайн. Онлайн-трансляция будет доступна для всех желающих.
Внезапное закрытие Greenplum его владельцем — компанией Broadcom - стало неприятным сюрпризом для всех, кто использует или планирует начать использовать решения на базе этой технологии. Многие ожидают выхода стабильной версии Greenplum 7 и надеются на её дальнейшее активное развитие.
Arenadata не могла допустить, чтобы разрабатываемый годами Open-Source проект Greenplum прекратил своё существование, поэтому 19 сентября мы представим наш ответ на данное решение Broadcom, а участники сообщества получат исчерпывающие разъяснения на все вопросы о дальнейшей судьбе этой технологии.

На конференции вас ждёт обсуждение следующих тем:

  • План возрождения Greenplum;
  • Дорожная карта;
  • Экспертное обсуждение и консультации.
Осталось до события

Indexes

Overview

Indexes are a common way to increase database performance. Without indexes, the database must perform a full table scan (sequential scan) to find the data. This can be a slow and resource-intensive process, especially for large tables. Indexes can significantly speed up searching because they provide a data structure that points to the location of the information you need in a table. An index allows the database server to find and retrieve specific rows much faster than it could be without an index.

Indexes are not always a positive factor for performance. When data is inserted, updated, or deleted, the indexes must also be updated, which can result in additional time and resource overhead. It is important to carefully analyze and select indexes that improve query performance and do not create unnecessary load of the system.

Index types

PostgreSQL provides multiple index types: B-tree, Hash, GiST, SP-GiST, GIN, BRIN, and the bloom extension. Each index type uses a different algorithm that is best suited for a certain type of clauses.

B-Tree

B-tree indexes can handle equality and range queries on data that can be sorted in some order. The query planner considers using a B-tree index whenever an indexed column is involved in a comparison using one of these operators: <, <=, =, >=, and >.

Constructs equivalent to combinations of these operators, such as BETWEEN and IN, can be implemented with a B-tree index search. A B-tree index can also handle the IS NULL or IS NOT NULL condition on an index column.

In addition, the optimizer can also use a B-tree index for queries with the pattern matching operators LIKE and ~ if the pattern is a constant and is anchored to the beginning of the string. For example, column1 LIKE 'abc%' or column1 ~ '^abc', but not column1 LIKE '%abc'. If your database does not use the C locale, you need to create an index with a special operator class to support indexing of pattern-matching queries, see Operator Classes and Operator Families.

It is also possible to use B-tree indexes for ILIKE and ~*, but only if the pattern starts with non-alphabetic characters (characters that are not affected by upper/lower case conversion).

B-tree indexes can also be used to retrieve data in sorted order. This is not always faster than a simple scanning and sorting, but can sometimes be useful.

By default, PostgreSQL creates a B-tree index. The following command adds a B-tree index to the book_id column in the books table:

CREATE INDEX ind_book_id ON books(book_id);

Hash

Hash indexes store a 32-bit hash code derived from the value of the indexed column. Hence, such indexes can only handle simple equality comparisons. The query planner considers using a hash index when an indexed column is involved in a comparison using the equality operator =.

The following command creates a Hash index on the id column in the book table:

CREATE INDEX idx_id ON book USING HASH (id);

GiST

GiST (Generalized Search Tree) indexes are designed to work with complex data types such as geometric objects, text, and arrays. GiST indexes allow you to quickly search spatial, text, and hierarchical data. GiST indexes are not a single kind of index. They are an infrastructure within which multiple different indexing strategies can be implemented. As a consequence, GiST indexes can be used with different operators, depending on the indexing strategy (operator class). The ADPG distribution includes GiST operator classes for several two-dimensional geometric data types, which support indexed queries with the following operators: <<, &<, &>, >>, <<|, &<|, |&>, |>>, @>, <@, ~=, and &&. These operators are described in the Geometric Functions and Operators article.

The GiST operator classes included in the standard distribution are described in the Built-in Operator Classes article. Many other GiST operator classes can be found in the contrib collection and in other separate projects. The contrib extension is preinstalled in the ADPG cluster. For more information on GiST, see GiST Indexes.

GiST indexes are also capable of optimizing the "nearest-neighbor" searches, for example:

SELECT * FROM places ORDER BY location <-> point '(101,456)' LIMIT 10;

The code above finds ten places closest to a given target point. Whether an index can be used in this way depends on the class of operator. In the Built-in GiST Operator Classes table, operators that can be used in this way are listed in the column Ordering Operators.

You can also create a GiST index for searching text data:

CREATE INDEX ind_gist ON my_table USING gist (to_tsvector('english', ind_column));

SP-GiST

SP-GiST indexes (space-partitioned GiST), like GiST indexes, offer an infrastructure that supports various kinds of searches. SP-GiST allows you to implement a wide range of different unbalanced disk-based data structures, such as quadtrees, k-d trees, and radix trees (radix tries). For example, PostgreSQL includes SP-GiST operator classes for two-dimensional points, which support indexed queries with the following operators: <<, >>, ~=, <@, <<|, and |>>. These operators are described in the Geometric Functions and Operators article. The SP-GiST operator classes included in the standard distribution are listed in the Built-in SP-GiST Operator Classes table. For more information, see SP-GiST Indexes.

SP-GiST also supports the "nearest-neighbor" searches. For SP-GiST operator classes that support distance ordering, the corresponding operator is listed in the Ordering Operators column in the Built-in SP-GiST Operator Classes table.

The following example creates an SP-GiST index for a field storing IP addresses:

CREATE INDEX ind_spgist ON my_table USING spgist (inet(ind_column));

GIN

GIN (Generalized Inverted Index) indexes are inverted indexes that are appropriate for data values with multiple component values. They are used for full-text search and search in arrays, JSON, and trigrams. GIN indexes provide high performance when searching large volumes of data. An inverted index contains a separate entry for each component value, and can work effectively in queries that check for the presence of certain component values.

GIN supports multiple different user-defined indexing strategies, and the particular operators with which a GIN index can be used vary depending on the indexing strategy. For example, PostgreSQL includes a GIN operator class for arrays, which supports indexed queries using the operators: <@, @>, =, and &&. These operators are described in the Array Functions and Operators article.

The GIN operator classes included in the standard distribution are listed in the Built-in GIN Operator Classes table. Many other GIN operator classes can be found in the contrib collection and in other separate projects. The contrib extension is preinstalled in the ADPG cluster. For more information on GIN, see GIN Indexes.

The following example creates a GIN index for full-text search:

CREATE INDEX ind_gin ON my_table USING gin (to_tsvector('english', ind_column));

BRIN

BRIN (Block Range Indexes) indexes store summaries about the values saved in consecutive physical block ranges of a table. They are most effective for columns whose values are well-correlated with the physical order of the table rows. They are effective for storing and processing time series and geographic data. BRIN supports multiple different indexing strategies, and the particular operators with which a BRIN index can be used vary depending on the indexing strategy. For data types that have a linear sort order, the indexed data corresponds to the minimum and maximum values in the column for each block range. This data types support indexed queries with the operators: <, <=, =, >=, and >.

The BRIN operator classes included in the standard distribution are listed in the Built-in BRIN Operator Classes table. For more information, see BRIN Indexes.

The following example creates a BRIN index:

CREATE INDEX ind_brin ON my_table USING brin (ind_column);

Manage indexes

You can create, alter, and drop indexes using corresponding commands.

The CREATE INDEX command constructs an index on the specified columns of a relation that can be a table or a materialized view. By default, CREATE INDEX creates B-tree indexes, which are efficient in most cases. The following example creates a B-tree index on the title column in the books table:

CREATE INDEX table_idx ON books (title);

To change the index type, add the USING keyword with the index type name. For example, to create a Hash index, use the following command:

CREATE INDEX name ON table USING HASH (column);

Creating an index can interfere with regular operation of a database. Normally, PostgreSQL locks the table to be indexed against writes and performs the entire index build with a single table scan. Other transactions can still read the table, but if they try to insert, update, or delete rows, they are blocked until the index build is finished. This behavior can have a severe effect if the system is a live production database. Large tables can take many hours to be indexed, and even for smaller tables, an index build can lock out writers for periods that are unacceptably long for a production system.

PostgreSQL supports building indexes without locking out writes. This method is invoked by specifying the CONCURRENTLY option of CREATE INDEX. When this option is used, PostgreSQL performs two scans of the table, and in addition it must wait for all existing transactions that could potentially modify or use the index. This method requires more work than a standard index build and takes significantly longer to complete. However, since it allows normal operations to continue while the index is built, this method is useful for adding new indexes in a production environment.

You can use the ALTER INDEX command to change the definition of an existing index. ALTER INDEX does not allow changing the index method. Note that the required lock level may differ for different ALTER INDEX form and used parameters.

The following example renames an existing index:

ALTER INDEX distributors_idx RENAME TO suppliers_idx;

The example below moves an index to a different tablespace:

ALTER INDEX suppliers_idx SET TABLESPACE my_tablespace;

To delete an index, utilize DROP INDEX.

This command removes the suppliers_idx index:

DROP INDEX suppliers_idx;

Rebuild indexes

PostgreSQL provides the REINDEX command that allows you to rebuild indexes. REINDEX rebuilds an index using the data stored in the index’s table and replaces the old copy of the index. REINDEX can be useful in the following scenarios:

  • An index is corrupted, and no longer contains valid data. Although in theory this should never happen, in practice indexes can become corrupted due to software bugs or hardware failures. REINDEX provides a recovery method.

  • An index contains many empty or nearly-empty pages. This can occur with B-tree indexes in PostgreSQL under certain access patterns. REINDEX writes a new index version without empty pages to reduce the space consumption of the index. For more information, see Routine Reindexing.

  • You changed a storage parameter (such as fillfactor) for an index, and wish to ensure that the change has full effect.

  • If an index build fails with the CONCURRENTLY option, this index becomes invalid. Use REINDEX to rebuild it. Note that only the REINDEX INDEX command can perform a concurrent build on an invalid index.

The REINDEX command can rebuild the specified index, all indexes of the specified table, all indexes of the specified schema, all indexes within the current database, and all indexes on system catalogs within the current database. See REINDEX for details.

The following example rebuilds a single index:

REINDEX INDEX my_index;

The command below rebuilds all the indexes on the my_table table:

REINDEX TABLE my_table;

Multicolumn indexes

An index can be defined on more than one column of a table. For example, you have the following table:

CREATE TABLE authors (
    id SERIAL PRIMARY KEY,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL
);

And you frequently issue the queries:

SELECT * FROM authors WHERE first_name = 'value1' AND last_name = 'value2';

You can define an index on the first_name and last_name columns together:

CREATE INDEX idx_authors_names ON authors (last_name, first_name);

Currently, only the B-tree, GiST, GIN, and BRIN index types support multiple-key-column indexes. The ability to build an index on multiple key columns does not depend on the ability to add non-key columns to the index using INCLUDE. The number of columns in the index is limited to 32 with INCLUDE columns.

Multicolumn indexes should be used carefully. In most cases, an index on a single column works efficiently and saves time and space. Indexes with more than three columns are unlikely to be useful unless the table is used in a very uniform way. The merits of different index configurations are described in the Combining Multiple Indexes and Index-Only Scans and Covering Indexes articles.

Unique indexes

Indexes can also ensure that a value in a column is unique or a combination of values in multiple columns is unique. To create a unique index, utilize the UNIQUE keyword:

CREATE UNIQUE INDEX <name> ON <table> (<column> [, ...]);

For example:

CREATE UNIQUE INDEX ind_unique ON table1 (column1, column2);

Currently, only B-tree indexes can be declared as unique.

If an index is created as unique, multiple rows with the same index key values cannot be added to the table. The NULL values are not considered equal. A multicolumn unique index only rejects rows in which all indexed columns contain the same values.

PostgreSQL defines a unique index when a unique constraint or primary key is specified for a table. The index is created for the primary key columns or a unique constraint (it can be a multicolumn index). This mechanism enforces the constraint. This way, you do not need to manually create indexes for unique columns — they will duplicate the automatically created indexes.

Indexes on expressions

An index can be created on a function or scalar expression computed from one or more columns of the table. This feature is useful to obtain fast access to tables based on the results of computations.

For example, a common way to do case-insensitive comparisons is to use the lower function:

SELECT * FROM table1 WHERE lower(column1) = 'value1';

The query can use the index that is defined on the result of the lower(column1) function:

CREATE INDEX idx_lower_column1 ON table1 (lower(column1));

If this index is declared as unique (UNIQUE), it will prevent the addition of rows where the values of column1 differ only in case and if the values of column1 are the same. This way, indexes on expressions can also be used to enforce constraints that cannot be written as simple uniqueness constraints.

As another example, the following query can be often executed:

SELECT * FROM authors WHERE (first_name || ' ' || last_name) = 'Ivan Bunin';

Then it might be worth to create the following index:

CREATE INDEX idx_names ON authors ((first_name || ' ' || last_name));

The syntax of the CREATE INDEX command requires writing the additional parentheses around index expressions, as shown in the second example. These parentheses can be omitted when the expression is a function call, as in the first example.

Index expressions are expensive to maintain, because the expressions must be computed for each row insertion and update. However, the index expressions are not recomputed during an indexed search, since they are already stored in the index. In both examples above, PostgreSQL considers the query as WHERE <indexed column> = 'constant' and the speed of the search is equivalent to any other simple index query. Indexes on expressions are useful when retrieval speed is more important than insertion and update speed.

Partial indexes

A partial index is an index built over a subset of a table. The subset is defined by a conditional expression called the predicate of the partial index. The index contains entries only for table rows that satisfy the predicate.

One major reason for using a partial index is to avoid indexing common values. If a value is contained in a significant percentage of all rows, a query searching for this value does not use an index. So, there is no point to keep these rows in the index. It reduces the size of the index and speed up the queries that use this index. It also makes faster table update operations since the index is not always updated.

Consider the following example: web server access logs are stored in a database.

CREATE TABLE access_log (
    url varchar,
    client_ip inet,
    ...
);

Most connections come from the organization’s IP address range, but some come from other sources, such as remote employee connections. If you usually search by IP of outside accesses, you do not need to index the IP range that corresponds to the organization subnet. It is possible to create the following index:

CREATE INDEX log_ip_idx ON access_log (client_ip)
    WHERE NOT (client_ip > inet '192.168.100.0' AND
               client_ip < inet '192.168.100.255');

A typical query that can use this index:

SELECT * FROM access_log
    WHERE url = '/index.html' AND client_ip = inet '212.78.10.32';

The following query cannot use a partial index because it uses an IP address that is excluded from the index:

SELECT * FROM access_log
    WHERE url = '/index.html' AND client_ip = inet '192.168.100.23';

This type of partial index requires that the common values be predetermined, so such partial indexes are best used for data distributions that do not change. These indexes can be recreated occasionally to adjust for new data distributions, but this makes them much more difficult to maintain.

As a second example, consider a table that stores both paid and unpaid invoices. The unpaid invoices are only a small part of the entire table, but are the most interesting. You can improve query performance by creating an index on only unpaid invoices. The following command creates the index:

CREATE INDEX orders_unpaid_inx ON orders (order_nr)
    WHERE paid is not true;

A possible query to use this index can be:

SELECT * FROM orders WHERE paid is not true AND order_nr < 10000;

However, the index can also be used in queries that do not involve order_nr:

SELECT * FROM orders WHERE paid is not true AND amount > 5000.00;

This is not as efficient as a partial index on the amount column, since the system should scan the entire index. If there are relatively few unpaid orders, using this partial index just to find the unpaid orders can be effective.

Note that the following query cannot use this index:

SELECT * FROM orders WHERE order_nr = 4201;

The order 4201 can be among the paid or unpaid orders.

This example shows that the indexed column and the column used in the predicate do not need to match. PostgreSQL supports partial indexes with arbitrary predicates if they include only the columns of the table being indexed. However, keep in mind that the predicate must match the conditions of the queries that the index is intended to optimize.

The third possible use of partial indexes does not involve using the index in queries. The idea is to create a unique index on a subset of the table’s rows. This enforces uniqueness among the rows that satisfy the predicate condition without constraining those that do not satisfy.

Suppose that a table describes test results. It is necessary to ensure that there is only one "successful" entry for a given subject and target combination, but there might be any number of "unsuccessful" entries. To achieve this goal, you can use the following index:

CREATE TABLE tests (
    subject text,
    target text,
    success boolean,
    ...
    );

CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target)
    WHERE success;

This approach will be especially effective when there are more unsuccessful attempts than successful ones.

Examine the use of indexes

Although indexes in PostgreSQL do not need maintenance or tuning, it is important to check which indexes are actually used. You can utilize the EXPLAIN command to examine the use of indexes for an individual query. This approach is described in the Analyze queries article.

It is also possible to collect overall statistics about index usage on a running server, as described in The Statistics Collector. For example, the idx.scan field of the pg_stat_all_indexes view shows the number of index scans initiated on this index. If it is zero, the index is not used.

It is difficult to formulate a general approach for determining which indexes to create. Often a lot of experimentation is necessary. You can also use the following tips:

  • Always run ANALYZE first. This command collects statistics about the value distribution in the table. This information is required to estimate the number of rows returned by a query, which is needed by the planner to assign realistic costs to each possible query plan. Without real statistics, some default values are assumed that are almost certainly not to be true. It is almost impossible to understand how the index is used by an application without running ANALYZE. See Updating Planner Statistics and The Autovacuum Daemon for details.

  • Use real data for experimentation. By analyzing the system’s performance with test data, you will understand what indexes are needed for test data, but nothing more.

    It is especially fatal to use very small test data sets. While an index can be used to retrieve 1000 rows out of 100 000, it is unlikely to be needed to select 1 out of 100. 100 rows would likely fit on one page of data on disk, and no other plan would be better than just scanning 1 page.

    Also, be careful when making up test data while the application is not in production. With values that are very similar, completely random, or inserted in sorted order, you will get very different distribution statistics than real data.

  • When indexes are not used, it can be useful for testing to force their use. PostgreSQL provides run-time parameters that can turn off various plan types (see Planner Method Configuration). For instance, turning off sequential scans (enable_seqscan) and nested-loop joins (enable_nestloop), which are the most basic plans, will force the system to use a different plan. If the system still chooses a sequential scan or nested-loop join, there is probably a more fundamental reason why the index is not being used. For example, the query condition does not match the index.

  • If the system only uses indexes when it is forced, there are two possibilities: either the system is right and the use of the index is actually ineffective, or the cost estimates of the query plans are not reflecting reality. In this case, you should measure the execution time of the query with and without indexes. The EXPLAIN ANALYZE command can be useful in analyzing this situation.

  • If it turns out that the cost estimates are wrong, there are, again, two possibilities. The total cost is calculated as the cost of each row of each plan node multiplied by the selectivity estimate of the plan node. The costs estimated for the plan nodes can be adjusted via run-time parameters described in Planner Cost Constants. On the other side, selectivity estimates may be inaccurate due to insufficient statistics. It might be possible to improve this by tuning the statistics-gathering parameters (see ALTER TABLE).

The total cost is calculated as the cost of each row of each plan node multiplied by the selectivity estimate of the plan node.

Found a mistake? Seleсt text and press Ctrl+Enter to report it