Full-text search

Overview

For efficient full-text search on String or FixedString columns (the Array(String), Array(FixedString), Map(String), and Map(String) types are also supported) in tables with very large numbers of records, you can use inverted indexes — a special type of data skipping index.

ADQM/ClickHouse builds an inverted index as follows:

  • Text of each cell in a column on which the index is compiled is split into tokens — terms. By default, a string is split based on spaces, but it is also possible to split text into n-grams. All terms are written into a sorted list — dictionary.

  • Each term is associated with a sorted list of identifiers of rows which contain this term — a posting list.

In other words, an inverted index stores a mapping of all terms to their locations in a table (row numbers). If a table has a defined inverted index on a text column, ADQM automatically applies it when executing a query that searches for a term in that column (for example, a query like SELECT …​ WHERE <column_name> == | IN | LIKE '<term>') — it quickly finds a term in the dictionary and returns the required table rows using posting list.

For example, consider a table with the text column of the String type:

┌─row_id─┬─text────────────┐
│      0 │ Lower East Side │
│      1 │ Upper West Side │
│      2 │ West Village    │
│      3 │ East Village    │
└────────┴─────────────────┘

An inverted index on the text column will generally be composed as follows.

Dictionary (terms) Posting lists

East

0, 3

Lower

0

Side

0, 1

Upper

1

Village

2, 3

West

1, 2

IMPORTANT
Inverted indexes are currently supported as an experimental functionality — it is not recommended to use them on production environments.

Manage indexes

Before creating an inverted index, enable the following option:

SET allow_experimental_inverted_index = true;

Describe an inverted index in the same way as other types of data skipping index — use the INDEX clause within the column description section of the CREATE TABLE query :

CREATE TABLE <table_name>
(   <column_name> <column_type>, ...
    INDEX <index_name> <index_expr> TYPE inverted([<n>], [<max_rows_per_posting_list>]), ...)
ENGINE = MergeTree
...;

where:

  • <index_name> — index name.

  • <index_expr> — expression based on which the index will be built (often it is just a text column name).

  • <n> — tokenizer that specifies how a string in a text column cell is split into terms. For example:

    • inverted(0) (or shorter: inverted(), inverted) — a string is split into words by spaces;

    • inverted(n) (where n is a number in the [2, 8] range) — a string is split into n-grams (substrings of n characters).

  • <max_rows_per_posting_list> — the maximum number of row identifiers per posting list (if 0 — the size of a posting list for a term is not limited, if the parameter is not specified — the default limit set to 64K is used).

Since inverted indexes map terms to row identifiers (instead of granule identifiers as other data skipping indexes), the GRANULARITY parameter supplied to index creation is not relevant.

You can add an inverted index to an existing table using the following query:

ALTER TABLE <table_name> ADD INDEX <index_name> <index_expr> TYPE inverted([<n>], [<max_rows_per_posting_list>]);

When an index is added to an existing table, it is not automatically updated. ALTER TABLE changes metadata, and the index will only be calculated for new data inserted into the table. To apply the index to existing data, run the additional command:

ALTER TABLE <table_name> MATERIALIZE INDEX <index_name>;

To delete indexes, use the following queries:

  • ALTER TABLE <table_name> DROP INDEX <index_name> — removes index description from table metadata and deletes index files from disk;

  • ALTER TABLE <table_name> CLEAR INDEX <index_name> — removes index files from disk without removing the index description from metadata.

Example

Create the trips table with New York City taxi data as the ClickHouse documentation describes:

CREATE TABLE trips (
    trip_id             UInt32,
    pickup_datetime     DateTime,
    dropoff_datetime    DateTime,
    pickup_longitude    Nullable(Float64),
    pickup_latitude     Nullable(Float64),
    dropoff_longitude   Nullable(Float64),
    dropoff_latitude    Nullable(Float64),
    passenger_count     UInt8,
    trip_distance       Float32,
    fare_amount         Float32,
    extra               Float32,
    tip_amount          Float32,
    tolls_amount        Float32,
    total_amount        Float32,
    payment_type        Enum('CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4, 'UNK' = 5),
    pickup_ntaname      LowCardinality(String),
    dropoff_ntaname     LowCardinality(String)
)
ENGINE = MergeTree
PRIMARY KEY (pickup_datetime, dropoff_datetime);
INSERT INTO trips
SELECT
    trip_id,
    pickup_datetime,
    dropoff_datetime,
    pickup_longitude,
    pickup_latitude,
    dropoff_longitude,
    dropoff_latitude,
    passenger_count,
    trip_distance,
    fare_amount,
    extra,
    tip_amount,
    tolls_amount,
    total_amount,
    payment_type,
    pickup_ntaname,
    dropoff_ntaname
FROM s3(
    'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{0..2}.gz',
    'TabSeparatedWithNames');

Execute a test query for data selection with a condition on the pickup_ntaname column:

SELECT pickup_datetime,pickup_datetime,pickup_ntaname,dropoff_ntaname
FROM trips
WHERE (pickup_ntaname LIKE '%Village%') OR (pickup_ntaname LIKE '%Square%');
┌─────pickup_datetime─┬─────pickup_datetime─┬─pickup_ntaname─────────────────────────────┬─dropoff_ntaname───────┐
│ 2015-07-01 00:02:24 │ 2015-07-01 00:02:24 │ East Village                               │ Gramercy              │
│ 2015-07-01 00:03:15 │ 2015-07-01 00:03:15 │ Hudson Yards-Chelsea-Flatiron-Union Square │ West Village          │

...

│ 2015-07-09 00:32:13 │ 2015-07-09 00:32:13 │ West Village                               │ East Village          │
└─────────────────────┴─────────────────────┴────────────────────────────────────────────┴───────────────────────┘
  Showed first 10000.

714664 rows in set. Elapsed: 0.831 sec. Processed 3.00 million rows, 18.00 MB (3.61 million rows/s., 21.66 MB/s.)
Peak memory usage: 13.04 MiB.

Add an inverted index to the table:

SET allow_experimental_inverted_index = true;
ALTER TABLE trips ADD INDEX inv_idx(pickup_ntaname) TYPE inverted;
ALTER TABLE trips MATERIALIZE INDEX inv_idx;

Repeat the query — note that its execution speed has increased:

SELECT pickup_datetime,pickup_datetime,pickup_ntaname,dropoff_ntaname
FROM trips
WHERE (pickup_ntaname LIKE '%Village%') OR (pickup_ntaname LIKE '%Square%');
┌─────pickup_datetime─┬─────pickup_datetime─┬─pickup_ntaname─────────────────────────────┬─dropoff_ntaname───────┐
│ 2015-07-01 00:02:24 │ 2015-07-01 00:02:24 │ East Village                               │ Gramercy              │
│ 2015-07-01 00:03:15 │ 2015-07-01 00:03:15 │ Hudson Yards-Chelsea-Flatiron-Union Square │ West Village          │

...

│ 2015-07-09 00:32:13 │ 2015-07-09 00:32:13 │ West Village                               │ East Village          │
└─────────────────────┴─────────────────────┴────────────────────────────────────────────┴───────────────────────┘
  Showed first 10000.

714664 rows in set. Elapsed: 0.129 sec. Processed 3.00 million rows, 18.00 MB (23.27 million rows/s., 139.63 MB/s.)
Peak memory usage: 13.04 MiB.

If you apply the EXPLAIN clause with the indexes parameter to the query, you can ensure that the inverted index was used:

EXPLAIN indexes = 1 SELECT pickup_datetime,pickup_datetime,pickup_ntaname,dropoff_ntaname
FROM trips
WHERE (pickup_ntaname LIKE '%Village%') OR (pickup_ntaname LIKE '%Square%');
┌─explain─────────────────────────────────────┐
│ Expression ((Project names + Projection))   │
│   Expression                                │
│     ReadFromMergeTree (default.trips)       │
│     Indexes:                                │
│       PrimaryKey                            │
│         Condition: true                     │
│         Parts: 2/2                          │
│         Granules: 368/368                   │
│       Skip                                  │
│         Name: inv_idx                       │
│         Description: inverted GRANULARITY 1 │
│         Parts: 2/2                          │
│         Granules: 368/368                   │
└─────────────────────────────────────────────┘
Found a mistake? Seleсt text and press Ctrl+Enter to report it