Конференция Arenadata
Новое время — новый Greenplum
Мы приглашаем вас принять участие в конференции, посвященной будущему Open-Source Greenplum 19 сентября в 18:00:00 UTC +3. Встреча будет проходить в гибридном формате — и офлайн, и онлайн. Онлайн-трансляция будет доступна для всех желающих.
Внезапное закрытие Greenplum его владельцем — компанией Broadcom - стало неприятным сюрпризом для всех, кто использует или планирует начать использовать решения на базе этой технологии. Многие ожидают выхода стабильной версии Greenplum 7 и надеются на её дальнейшее активное развитие.
Arenadata не могла допустить, чтобы разрабатываемый годами Open-Source проект Greenplum прекратил своё существование, поэтому 19 сентября мы представим наш ответ на данное решение Broadcom, а участники сообщества получат исчерпывающие разъяснения на все вопросы о дальнейшей судьбе этой технологии.

На конференции вас ждёт обсуждение следующих тем:

  • План возрождения Greenplum;
  • Дорожная карта;
  • Экспертное обсуждение и консультации.
Осталось до события

Full text search

PostgreSQL provides the functionality to implement the full text search. It allows you to identify documents that match a query and sort them by relevance.

The pattern text search is a part of the SQL standard since the beginning. PostgreSQL provides ~, ~*, LIKE, and ILIKE operators to support it, but they lack many essential properties required by modern information systems. The pattern search does not take into account word forms. For example, if you search try, tries is not included in the result. It also does not rank search results. When thousands of matching documents are found, more relevant documents may be at the end of the result list. Pattern search operators are slow because they do not support indexes and must process all documents for each search request.

The full text search solves these problems. PostgreSQL has the to_tsvector function to convert a document to the tsvector data type. A tsvector value is a sorted list of distinct lexemes that are normalized to merge different variants of the same word. PostgreSQL also includes to_tsquery and other functions that parse a user query and create the tsquery value from the query text. The tsquery type contains normalized lexemes combined with the logical and phrase search operators. PostgreSQL uses dictionaries to converts text to lexemes. These dictionaries help to determine stop words that should be ignored, and combine different word forms in a single lexeme. This technique allows PostgreSQL to overcome the pattern search limitations.

The tsvector and tsquery formats

This section describes inner formats of the tsvector and tsquery types.

tsvector

Execute the query below to see in what format PostgreSQL stores data into the tsvector type. Use english language dictionaries to combine different word forms in a single lexeme. For this, specify the english configuration parameter in the to_tsvector function. See Configuration Example for configuration details.

SELECT to_tsvector('english', 'A row satisfies the condition if it returns true.');

The result:

to_tsvector
----------------------------------------------------
'condit':5 'return':8 'row':2 'satisfi':3 'true':9

The to_tsvector function calls a parser which breaks text into tokens and assigns a type to each token. PostgreSQL uses a list of dictionaries for each token, the list can vary depending on the token type. The first dictionary that recognizes the token emits one or more normalized lexemes to represent the token. In our example, the words satisfies, condition, and returns become satisfi, condit, and return respectively.

The resulting tsvector value does not contain the words a, the, if, and it. These words are recognized as stop words and are excluded from the result since they occur too frequently to be useful in search.

The numbers in the result determines the word position in the initial string (5 for 'condit', 8 for 'return', 2 for 'row', and etc.).

Lexemes in the tsvector type can have a weight label, which can be A, B, C, or D. D is the default and is not shown in the output.

Example:

SELECT 'condit:5B return:8C row:2A satisfi:3C true:9D'::tsvector;

The result:

                        tsvector
--------------------------------------------------------
 'condit':5B 'return':8C 'row':2A 'satisfi':3C 'true':9

Weight labels are typically used to mark entries coming from different parts of a document (title, body, and others).

PostgreSQL also contains the setweight function to set weights of tsvector entries. The example below uses setweight to label the source of each lexeme, and merges the labeled tsvector values using the tsvector concatenation operator ||. For more information on this operation, see Manipulating Documents.

UPDATE docs SET tsvector_document =
    setweight(to_tsvector(coalesce(title,'')), 'A')    ||
    setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
    setweight(to_tsvector(coalesce(summary,'')), 'C') ||
    setweight(to_tsvector(coalesce(body,'')), 'D');

After you specify lexeme labels, you can use search rank functions described below to assign different priority values for different weight labels.

tsquery

The tsquery data type contains lexemes combined with the phrase search operator <-> (FOLLOWED BY) and logical operators & (AND), | (OR), ! (NOT). Run the following query to see the tsquery value:

SELECT to_tsquery('english', 'row & satisfy');

The result:

   to_tsquery
-----------------
'row' & 'satisfi'

The result contains lexemes and the logical operator &.

The tsquery format can also include weight labels.

SELECT to_tsquery('english', 'row & satisfy:AB');

The result:

      to_tsquery
----------------------
 'row' & 'satisfi':AB

The tsquery lexemes with weight labels match only tsvector lexemes with the same weight.

A tsquery lexeme can contain * to specify prefix matching:

SELECT to_tsquery('super:*A');

This lexeme matches any word in a tsvector value that begins with the super string.

to_tsquery can also accept single-quoted phrases. This is useful, if the configuration includes a thesaurus dictionary that triggers on these phrases.

In addition to to_tsquery, PostgreSQL provides the plainto_tsquery, phraseto_tsquery, and websearch_to_tsquery functions for converting a query to the tsquery data type.

plainto_tsquery transforms unformatted text to a tsquery value. The function parses and normalizes text and inserts the & (AND) operator between words. plainto_tsquery does not recognize operators, weight labels, or prefix-match labels in its input.

Example:

SELECT plainto_tsquery('english', 'A joined table:B');

The result:

    plainto_tsquery
-----------------------
 'join' & 'tabl' & 'b'

phraseto_tsquery parses and normalizes text and inserts the <->(FOLLOWED BY) operator between words. The FOLLOWED BY operator can be represented as <N>, where N is an integer value that specifies the distance between two lexemes. <-> is equivalent to <1>. If a phrase contains stop words, they are discarded, but the <N> operator shows distance including them. The phraseto_tsquery function is useful to search for lexeme sequences (phrases), since the FOLLOWED BY operators check lexeme order. phraseto_tsquery does not recognize operators, weight labels, or prefix-match labels in its input.

Example:

SELECT phraseto_tsquery('english', 'A row satisfies the condition if it returns true.');

The result:

                     phraseto_tsquery
----------------------------------------------------------
 'row' <-> 'satisfi' <2> 'condit' <3> 'return' <-> 'true'

websearch_to_tsquery uses an alternative syntax to create a tsquery value from query text. It supports the following syntax:

  • Unquoted text is converted to lexemes separated by the & operator.

  • Quoted text is converted to lexemes separated by the <N> operator.

  • OR is converted to the | operator.

  • - is converted to the ! operator.

Example:

SELECT websearch_to_tsquery('english', '"A row satisfies the condition" if it returns true OR "false" -"index".');

The result:

                           websearch_to_tsquery
--------------------------------------------------------------------------
 'row' <-> 'satisfi' <2> 'condit' & 'return' & 'true' | 'fals' & !'index'

The websearch_to_tsquery function does not raise syntax errors. You can use it for a raw user-supplied input. websearch_to_tsquery does not recognize operators, weight labels, or prefix-match labels in its input.

TIP
You can configure a parser, dictionaries, and types of tokens that should be indexed for the to_tsvector and to_tsquery functions. See Configuration Example.

The match operator @@

PostgreSQL provides the match operator @@ for the full text search. Create a table to test it:

CREATE TABLE documents(
    document_id SERIAL,
    document_text TEXT
);

INSERT INTO documents (document_text) VALUES
('If the condition is not satisfied, rows are not returned.'),
('A joined table is a table derived from two other tables according to the rules of the particular join type.'),
('Indexes can be added to and removed from tables at any time.'),
('An index defined on a column that is part of a join condition can also significantly speed up queries with joins.'),
('A row satisfies the condition if it returns true.'),
('The type numeric can store numbers with a very large number of digits.'),
('It allows you to specify that the value in a certain column must satisfy a boolean expression.');

The match operator @@ returns true if a tsvector value (document) matches a tsquery value (query). To use english language dictionaries, specify the english configuration parameter in the to_tsvector and to_tsquery functions:

SELECT * FROM documents
    WHERE to_tsvector('english', document_text) @@ to_tsquery('english', 'satisfy');

The result includes documents that contain satisfy and its modified forms, satisfies and satisfied:

 document_id |                 document_text
-------------+----------------------------------------------------
           5 | A row satisfies the condition if it returns true.
           7 | It allows you to specify that the value in a certain column
               must satisfy a boolean expression.
           1 | If the condition is not satisfied, rows are not returned.

Increase search speed

You can use indexes to speed up the search.

Create a GIN index on an expression with the to_tsvector function call:

CREATE INDEX idx_gin_document_text ON documents
    USING GIN (to_tsvector('english', document_text));

The search query remains the same:

SELECT * FROM documents
    WHERE to_tsvector('english', document_text) @@ to_tsquery('english', 'satisfy');

Another approach is to store the to_tsvector output to a separate tsvector column and create indexes on it as follows:

ALTER TABLE documents
    ADD COLUMN tsv1 tsvector
        GENERATED ALWAYS AS (to_tsvector('english',document_text))
        STORED;

CREATE INDEX idx_gin_document_text11
    ON documents USING gin (tsv);

After the table is updated, use the tsv field for search:

SELECT document_id, document_text FROM documents
    WHERE tsv @@ to_tsquery('english', 'satisfy');

The same rows are returned:

 document_id |                 document_text
-------------+----------------------------------------------------
           7 | It allows you to specify that the value in a certain column
               must satisfy a boolean expression.
           1 | If the condition is not satisfied, rows are not returned.
           5 | A row satisfies the condition if it returns true.

In the separate-column approach, it is not necessary to explicitly specify the text search configuration in queries. Also, searches are faster, since PostgreSQL does not repeat the to_tsvector calls to verify index matches.

The expression-index approach is simpler to set up and it requires less disk space since tsvector values are not stored explicitly.

Rank search results

When searching in a large database and a lot of matching documents are included in the result, more relevant documents may be at the end of the list. To avoid this problem, PostgreSQL provides two predefined functions to rank the results: ts_rank and ts_rank_cd.

ts_rank ranks tsvector values based on the frequency of their matching lexemes. ts_rank_cd computes the cover density ranking for the given document and query. Refer to the following link for details: Ranking Search Results.

These functions have similar signatures:

ts_rank([ <weights> float4[], ] <vector> tsvector, <query> tsquery [, <normalization> integer ]) returns float4

ts_rank_cd([ <weights> float4[], ] <vector> tsvector, <query> tsquery [, <normalization> integer ]) returns float4

Where:

  • vector is the document tsvector value.

  • query is the query tsquery value.

  • weights contains weight values for different weight labels (optional).

  • normalization specifies how a document’s length impacts its rank (optional).

Execute the query with the ts_rank function without optional parameters:

SELECT document_id, ts_rank(tsv, query) AS rank, document_text
FROM documents, to_tsquery('english', 'table') query
WHERE tsv @@ query
ORDER BY rank DESC;

The result:

document_id|    rank    |                document_text
-----------+------------+--------------------------------------------------------
          2| 0.082745634|A joined table is a table derived from two other tables
           |            |according to the rules of the particular join type.
          3|  0.06079271|Indexes can be added to and removed from tables at any time.

The document that contains table twice has a higher rank.

The optional weights parameter offers the ability to change the weight of lexemes in the search based on their weight labels. A weight label is a part of the tsvector and tsquery formats described above. The weight array specifies weight values for each category of lexemes in the following order: {D-weight, C-weight, B-weight, A-weight}. If you do not set this parameter, rank functions use the default value {0.1, 0.2, 0.4, 1.0}.

Update a tsv field value for the third row with the label weight:

UPDATE documents SET tsv = setweight(tsv,'A')
WHERE document_id = 3
RETURNING tsv;

The result:

                        tsv
----------------------------------------------------
 'ad':4A 'index':1A 'remov':7A 'tabl':9A 'time':12A

Execute the query with the specified weight array:

SELECT document_id, ts_rank('{0.05, 0.2, 0.4, 1.0}', tsv, query) AS rank, document_text
FROM documents, to_tsquery('english', 'table') query
WHERE tsv @@ query
ORDER BY rank DESC;

The result is different. The row, in which the tabl lexeme has the A weight label, obtains a higher rank:

document_id|    rank     |            document_text
-----------+-------------+------------------------------------------------------------
          3|   0.6079271 |Indexes can be added to and removed from tables at any time.
          2| 0.041372817 |A joined table is a table derived from two other tables
                            according to the rules of the particular join type.

A longer document has a greater chance to contain a query term. It is reasonable to take into account the document size. A hundred-word document with five instances of a searched word is more relevant than a thousand-word document with five instances. Both rank functions have the normalization option that specifies how a document’s length impacts its rank. The table below contains the normalization values:

Value Description

0

Ignores the document length. This is the default value

1

Divides the rank by 1 + the logarithm of the document length

2

Divides the rank by the document length

4

Divides the rank by the mean harmonic distance between extents (only for ts_rank_cd)

8

Divides the rank by the number of unique words in the document

16

Divides the rank by 1 + the logarithm of the number of unique words in the document

32

Divides the rank by itself + 1

You can use | to specify several values, for example 2|4. The transformations are applied in the order listed.

Add a normalization value to the query above:

SELECT document_id, ts_rank('{0.05, 0.2, 0.4, 1.0}', tsv, query, 8) AS rank, document_text
    FROM documents, to_tsquery('english', 'table') query
    WHERE tsv @@ query
    ORDER BY rank DESC;

The result:

 document_id |    rank     |             document_text
-------------+-------------+--------------------------------------------------------------------
           3 | 0.121585414 | Indexes can be added to and removed from tables at any time.
           2 | 0.005171602 | A joined table is a table derived from two other tables according
                                to the rules of the particular join type.
NOTE
Ranking search results can be an expensive operation since it requires reading the tsvector value of each matching document. It can take significant time if PostgreSQL reads tsvector values from the disk. It is almost impossible to avoid, since search queries often result in large numbers of matches.
Found a mistake? Seleсt text and press Ctrl+Enter to report it