Full text search
PostgreSQL provides the functionality to implement the full text search. It allows you to identify documents that match a query and sort them by relevance.
The pattern text search is a part of the SQL standard since the beginning. PostgreSQL provides ~
, ~*
, LIKE
, and ILIKE
operators to support it, but they lack many essential properties required by modern information systems. The pattern search does not take into account word forms. For example, if you search try
, tries
is not included in the result. It also does not rank search results. When thousands of matching documents are found, more relevant documents may be at the end of the result list. Pattern search operators are slow because they do not support indexes and must process all documents for each search request.
The full text search solves these problems. PostgreSQL has the to_tsvector function to convert a document to the tsvector data type. A tsvector
value is a sorted list of distinct lexemes that are normalized to merge different variants of the same word. PostgreSQL also includes to_tsquery
and other functions that parse a user query and create the tsquery value from the query text. The tsquery
type contains normalized lexemes combined with the logical and phrase search operators. PostgreSQL uses dictionaries to converts text to lexemes. These dictionaries help to determine stop words that should be ignored, and combine different word forms in a single lexeme. This technique allows PostgreSQL to overcome the pattern search limitations.
The tsvector and tsquery formats
This section describes inner formats of the tsvector
and tsquery
types.
tsvector
Execute the query below to see in what format PostgreSQL stores data into the tsvector
type. Use english language dictionaries to combine different word forms in a single lexeme. For this, specify the english
configuration parameter in the to_tsvector
function. See Configuration Example for configuration details.
SELECT to_tsvector('english', 'A row satisfies the condition if it returns true.');
The result:
to_tsvector ---------------------------------------------------- 'condit':5 'return':8 'row':2 'satisfi':3 'true':9
The to_tsvector
function calls a parser which breaks text into tokens and assigns a type to each token. PostgreSQL uses a list of dictionaries for each token, the list can vary depending on the token type. The first dictionary that recognizes the token emits one or more normalized lexemes to represent the token. In our example, the words satisfies
, condition
, and returns
become satisfi
, condit
, and return
respectively.
The resulting tsvector
value does not contain the words a
, the
, if
, and it
. These words are recognized as stop words and are excluded from the result since they occur too frequently to be useful in search.
The numbers in the result determines the word position in the initial string (5 for 'condit', 8 for 'return', 2 for 'row', and etc.).
Lexemes in the tsvector
type can have a weight label, which can be A
, B
, C
, or D
. D
is the default and is not shown in the output.
Example:
SELECT 'condit:5B return:8C row:2A satisfi:3C true:9D'::tsvector;
The result:
tsvector -------------------------------------------------------- 'condit':5B 'return':8C 'row':2A 'satisfi':3C 'true':9
Weight labels are typically used to mark entries coming from different parts of a document (title, body, and others).
PostgreSQL also contains the setweight
function to set weights of tsvector
entries. The example below uses setweight
to label the source of each lexeme, and merges the labeled tsvector values using the tsvector concatenation operator ||
. For more information on this operation, see Manipulating Documents.
UPDATE docs SET tsvector_document =
setweight(to_tsvector(coalesce(title,'')), 'A') ||
setweight(to_tsvector(coalesce(keyword,'')), 'B') ||
setweight(to_tsvector(coalesce(summary,'')), 'C') ||
setweight(to_tsvector(coalesce(body,'')), 'D');
After you specify lexeme labels, you can use search rank functions described below to assign different priority values for different weight labels.
tsquery
The tsquery
data type contains lexemes combined with the phrase search operator <->
(FOLLOWED BY) and logical operators &
(AND), |
(OR), !
(NOT). Run the following query to see the tsquery
value:
SELECT to_tsquery('english', 'row & satisfy');
The result:
to_tsquery ----------------- 'row' & 'satisfi'
The result contains lexemes and the logical operator &
.
The tsquery
format can also include weight labels.
SELECT to_tsquery('english', 'row & satisfy:AB');
The result:
to_tsquery ---------------------- 'row' & 'satisfi':AB
The tsquery
lexemes with weight labels match only tsvector
lexemes with the same weight.
A tsquery
lexeme can contain *
to specify prefix matching:
SELECT to_tsquery('super:*A');
This lexeme matches any word in a tsvector
value that begins with the super
string.
to_tsquery
can also accept single-quoted phrases. This is useful, if the configuration includes a thesaurus dictionary that triggers on these phrases.
In addition to to_tsquery
, PostgreSQL provides the plainto_tsquery
, phraseto_tsquery
, and websearch_to_tsquery
functions for converting a query to the tsquery
data type.
plainto_tsquery
transforms unformatted text to a tsquery
value. The function parses and normalizes text and inserts the &
(AND) operator between words. plainto_tsquery
does not recognize operators, weight labels, or prefix-match labels in its input.
Example:
SELECT plainto_tsquery('english', 'A joined table:B');
The result:
plainto_tsquery ----------------------- 'join' & 'tabl' & 'b'
phraseto_tsquery
parses and normalizes text and inserts the <->
(FOLLOWED BY) operator between words. The FOLLOWED BY operator can be represented as <N>
, where N
is an integer value that specifies the distance between two lexemes. <->
is equivalent to <1>
. If a phrase contains stop words, they are discarded, but the <N>
operator shows distance including them. The phraseto_tsquery
function is useful to search for lexeme sequences (phrases), since the FOLLOWED BY operators check lexeme order. phraseto_tsquery
does not recognize operators, weight labels, or prefix-match labels in its input.
Example:
SELECT phraseto_tsquery('english', 'A row satisfies the condition if it returns true.');
The result:
phraseto_tsquery ---------------------------------------------------------- 'row' <-> 'satisfi' <2> 'condit' <3> 'return' <-> 'true'
websearch_to_tsquery
uses an alternative syntax to create a tsquery
value from query text. It supports the following syntax:
-
Unquoted text is converted to lexemes separated by the
&
operator. -
Quoted text is converted to lexemes separated by the
<N>
operator. -
OR
is converted to the|
operator. -
-
is converted to the!
operator.
Example:
SELECT websearch_to_tsquery('english', '"A row satisfies the condition" if it returns true OR "false" -"index".');
The result:
websearch_to_tsquery -------------------------------------------------------------------------- 'row' <-> 'satisfi' <2> 'condit' & 'return' & 'true' | 'fals' & !'index'
The websearch_to_tsquery
function does not raise syntax errors. You can use it for a raw user-supplied input. websearch_to_tsquery
does not recognize operators, weight labels, or prefix-match labels in its input.
TIP
You can configure a parser, dictionaries, and types of tokens that should be indexed for the to_tsvector and to_tsquery functions. See Configuration Example.
|
The match operator @@
PostgreSQL provides the match operator @@
for the full text search. Create a table to test it:
CREATE TABLE documents(
document_id SERIAL,
document_text TEXT
);
INSERT INTO documents (document_text) VALUES
('If the condition is not satisfied, rows are not returned.'),
('A joined table is a table derived from two other tables according to the rules of the particular join type.'),
('Indexes can be added to and removed from tables at any time.'),
('An index defined on a column that is part of a join condition can also significantly speed up queries with joins.'),
('A row satisfies the condition if it returns true.'),
('The type numeric can store numbers with a very large number of digits.'),
('It allows you to specify that the value in a certain column must satisfy a boolean expression.');
The match operator @@
returns true
if a tsvector
value (document) matches a tsquery
value (query). To use english language dictionaries, specify the english
configuration parameter in the to_tsvector
and to_tsquery
functions:
SELECT * FROM documents
WHERE to_tsvector('english', document_text) @@ to_tsquery('english', 'satisfy');
The result includes documents that contain satisfy
and its modified forms, satisfies
and satisfied
:
document_id | document_text -------------+---------------------------------------------------- 5 | A row satisfies the condition if it returns true. 7 | It allows you to specify that the value in a certain column must satisfy a boolean expression. 1 | If the condition is not satisfied, rows are not returned.
Increase search speed
You can use indexes to speed up the search.
Create a GIN
index on an expression with the to_tsvector
function call:
CREATE INDEX idx_gin_document_text ON documents
USING GIN (to_tsvector('english', document_text));
The search query remains the same:
SELECT * FROM documents
WHERE to_tsvector('english', document_text) @@ to_tsquery('english', 'satisfy');
Another approach is to store the to_tsvector
output to a separate tsvector
column and create indexes on it as follows:
ALTER TABLE documents
ADD COLUMN tsv1 tsvector
GENERATED ALWAYS AS (to_tsvector('english',document_text))
STORED;
CREATE INDEX idx_gin_document_text11
ON documents USING gin (tsv);
After the table is updated, use the tsv
field for search:
SELECT document_id, document_text FROM documents
WHERE tsv @@ to_tsquery('english', 'satisfy');
The same rows are returned:
document_id | document_text -------------+---------------------------------------------------- 7 | It allows you to specify that the value in a certain column must satisfy a boolean expression. 1 | If the condition is not satisfied, rows are not returned. 5 | A row satisfies the condition if it returns true.
In the separate-column approach, it is not necessary to explicitly specify the text search configuration in queries. Also, searches are faster, since PostgreSQL does not repeat the to_tsvector
calls to verify index matches.
The expression-index approach is simpler to set up and it requires less disk space since tsvector
values are not stored explicitly.
Rank search results
When searching in a large database and a lot of matching documents are included in the result, more relevant documents may be at the end of the list. To avoid this problem, PostgreSQL provides two predefined functions to rank the results: ts_rank
and ts_rank_cd
.
ts_rank
ranks tsvector
values based on the frequency of their matching lexemes. ts_rank_cd
computes the cover density ranking for the given document and query. Refer to the following link for details: Ranking Search Results.
These functions have similar signatures:
ts_rank([ <weights> float4[], ] <vector> tsvector, <query> tsquery [, <normalization> integer ]) returns float4
ts_rank_cd([ <weights> float4[], ] <vector> tsvector, <query> tsquery [, <normalization> integer ]) returns float4
Where:
-
vector
is the documenttsvector
value. -
query
is the querytsquery
value. -
weights
contains weight values for different weight labels (optional). -
normalization
specifies how a document’s length impacts its rank (optional).
Execute the query with the ts_rank
function without optional parameters:
SELECT document_id, ts_rank(tsv, query) AS rank, document_text
FROM documents, to_tsquery('english', 'table') query
WHERE tsv @@ query
ORDER BY rank DESC;
The result:
document_id| rank | document_text -----------+------------+-------------------------------------------------------- 2| 0.082745634|A joined table is a table derived from two other tables | |according to the rules of the particular join type. 3| 0.06079271|Indexes can be added to and removed from tables at any time.
The document that contains table
twice has a higher rank.
The optional weights
parameter offers the ability to change the weight of lexemes in the search based on their weight labels. A weight label is a part of the tsvector
and tsquery
formats described above. The weight array specifies weight values for each category of lexemes in the following order: {D-weight, C-weight, B-weight, A-weight}
. If you do not set this parameter, rank functions use the default value {0.1, 0.2, 0.4, 1.0}
.
Update a tsv
field value for the third row with the label weight:
UPDATE documents SET tsv = setweight(tsv,'A')
WHERE document_id = 3
RETURNING tsv;
The result:
tsv ---------------------------------------------------- 'ad':4A 'index':1A 'remov':7A 'tabl':9A 'time':12A
Execute the query with the specified weight array:
SELECT document_id, ts_rank('{0.05, 0.2, 0.4, 1.0}', tsv, query) AS rank, document_text
FROM documents, to_tsquery('english', 'table') query
WHERE tsv @@ query
ORDER BY rank DESC;
The result is different. The row, in which the tabl
lexeme has the A
weight label, obtains a higher rank:
document_id| rank | document_text -----------+-------------+------------------------------------------------------------ 3| 0.6079271 |Indexes can be added to and removed from tables at any time. 2| 0.041372817 |A joined table is a table derived from two other tables according to the rules of the particular join type.
A longer document has a greater chance to contain a query term. It is reasonable to take into account the document size. A hundred-word document with five instances of a searched word is more relevant than a thousand-word document with five instances. Both rank functions have the normalization
option that specifies how a document’s length impacts its rank. The table below contains the normalization
values:
Value | Description |
---|---|
0 |
Ignores the document length. This is the default value |
1 |
Divides the rank by 1 + the logarithm of the document length |
2 |
Divides the rank by the document length |
4 |
Divides the rank by the mean harmonic distance between extents (only for |
8 |
Divides the rank by the number of unique words in the document |
16 |
Divides the rank by 1 + the logarithm of the number of unique words in the document |
32 |
Divides the rank by itself + 1 |
You can use |
to specify several values, for example 2|4
. The transformations are applied in the order listed.
Add a normalization value to the query above:
SELECT document_id, ts_rank('{0.05, 0.2, 0.4, 1.0}', tsv, query, 8) AS rank, document_text
FROM documents, to_tsquery('english', 'table') query
WHERE tsv @@ query
ORDER BY rank DESC;
The result:
document_id | rank | document_text -------------+-------------+-------------------------------------------------------------------- 3 | 0.121585414 | Indexes can be added to and removed from tables at any time. 2 | 0.005171602 | A joined table is a table derived from two other tables according to the rules of the particular join type.
NOTE
Ranking search results can be an expensive operation since it requires reading the tsvector value of each matching document. It can take significant time if PostgreSQL reads tsvector values from the disk. It is almost impossible to avoid, since search queries often result in large numbers of matches.
|