Use projections

Elena Dvoryadkina

Contents

Overview
Manage projections
- Create/delete projections
- Settings to enable projections
View information on projections
Examples

Overview

A projection is an additional hidden table that stores data from a source table in an alternative form to be optimal for executing some type of queries. A projection can store data rows sorted in a different order than the source table (and therefore have a different primary index) or pre-compute aggregate values across columns of the source table. So, this feature can be useful to optimize:

queries with filters on source table columns that are not part of a primary key;
queries that aggregate values of source table columns (data pre-aggregation performed by projections can reduce the consumption of both computing and I/O resources).

Projections are supported only for tables of the MergeTree family (including Replicated*). A table can have one or more projections — all of them will be automatically synchronized with the source table, i.e. updated each time new data is inserted into the source table.

Projections are similar to materialized views (which also allow pre-computing aggregations and using different row orders on data from another table), but they are defined at the level of data parts — for each data part, a projection of data in that part is created based on the projection query. This guarantees that projection data is consistent with the source table data and that ADQM can automatically choose and apply the most optimal projections when executing queries provided by users (a projection is considered optimal if it returns the correct result for a query with the least data to be scanned).

Unlike materialized views, projections do not require manual creation of separate additional tables with different structures, which you need to keep in mind and each time when building a query, you need to choose which table to send it to in order to ensure the high speed of the query execution. With projections, it is enough to send all queries to the source table — ADQM itself will determine whether the table has a projection that can speed up the query execution, and if such projection exists, it will redirect the query to it. In other words, a user does not need to know anything about projections.

CAUTION

As each projection creates a new hidden table (projection files are stored in the <projection_name>.proj subdirectories within directories of the source MergeTree table’s data parts), it may require more I/O resources and disk space. For example, if a projection is created to define an alternate sorting key, all data from the source table will be duplicated in the projection.

Manage projections

Create/delete projections

You can specify a projection when creating a MergeTree table in the section with column descriptions using the PROJECTION clause. The basic syntax for creating a table with a projection is:

CREATE TABLE <table_name>
(   <column1> ...,
    <column2> ...,
    ...
    PROJECTION <projection_name> (SELECT <column_list> [ORDER BY <sorting_key>] [GROUP BY <grouping_key>])
)
ENGINE = MergeTree()
...;

After a projection name, there is a projection query (in parentheses), which specifies how data from the source table should be transformed for storing in the projection table. A projection query can include one of the following clauses:

ORDER BY — key by which data rows will be sorted, and a primary key will be set in the projection;
GROUP BY — key by which a source table data will be grouped for aggregation in the projection. In this case, a hidden projection table is based on the AggregatingMergeTree engine, and all aggregate functions are converted to AggregateFunction.

To manage projections of existing tables, you can use the ALTER TABLE queries listed below.

Add a projection to an existing table:
ALTER TABLE [<db_name.]<table_name> [ON CLUSTER <cluster_name>] ADD PROJECTION [IF NOT EXISTS] <projection_name> (SELECT <column_list> [ORDER BY <sorting_key>] [GROUP BY <grouping_key>]);
This query adds a projection description to a MergeTree table metadata, but does not populate the projection with data (it will be rebuilt when new data is inserted into the source table or when materialized).

Materialize a projection to fill it with existing data of the source table (you can select an individual partition):

ALTER TABLE [<db_name.]<table_name> [ON CLUSTER <cluster_name>]
    MATERIALIZE PROJECTION [IF EXISTS] <projection_name> [IN PARTITION <partition_name>];

Delete projection files from the disk without removing projection description from the source table’s metadata (you can select an individual partition):

ALTER TABLE [<db_name.]<table_name> [ON CLUSTER <cluster_name>]
    CLEAR PROJECTION [IF EXISTS] <projection_name> [IN PARTITION <partition_name>];

Delete a projection description from the source table metadata and remove the projection files from the disk:

ALTER TABLE [<db_name.]<table_name> [ON CLUSTER <cluster_name>] DROP PROJECTION [IF EXISTS] <partition_name>;

The ADD PROJECTION, CLEAR PROJECTION, and DROP PROJECTION commands are lightweight because they only change metadata or delete files.

Settings to enable projections

To tell ADQM whether it should use projections when executing queries, configure the following parameters (at the session level):

optimize_use_projections — enables/disables automatic optimization of SELECT queries with projections (the default value is 1, meaning the projection optimization is enabled);
force_optimize_projection — enables/disables the obligatory use of projections for executing SELECT queries if the optimization via projections is enabled by the optimize_use_projections setting (the default value is 0).

View information on projections

You can obtain information about projections from system tables:

system.projection_parts — information about currently existing data parts for projections of all MergeTree tables in a cluster (information about the corresponding data parts of the source table is stored in columns prefixed with parent_);
system.projection_parts_columns — information about columns of all currently existing projection data parts of all MergeTree tables.

Since projection tables are hidden, they are not included in the SHOW TABLES list.

Examples

Create a MergeTree table to store pageview data (the examples below will define projections for this table):

CREATE TABLE page_views (time DateTime, page_name String, time_on_page UInt64) ENGINE = MergeTree ORDER BY time;

Populate this table with test data. For example, randomly generate 1 million views for 4 pages over approximately the last 3 years relative to the current moment:

INSERT INTO page_views SELECT
    now() - toIntervalMinute(rand() % 500000 + number),
    ['landing', 'products', 'docs', 'about'][toInt32(randBinomial(4, 0.01)) + 1],
    rand() % 15
FROM numbers(1000000);

Filter without using a primary key

If you run a query filtering data by the time column, which is included in the primary key of the page_views table, a full table scan is not required:

SELECT count() FROM page_views WHERE toYear(time) = '2022';

   ┌─count()─┐
1. │  388601 │
   └─────────┘

1 row in set. Elapsed: 0.004 sec. Processed 393.22 thousand rows, 1.57 MB (99.18 million rows/s., 396.72 MB/s.)

To filter data not by a key column, ADQM performs a full table scan, for example:

SELECT count() FROM page_views WHERE page_name = 'products';

   ┌─count()─┐
1. │   38729 │
   └─────────┘

1 row in set. Elapsed: 0.006 sec. Processed 1.00 million rows, 16.04 MB (156.85 million rows/s., 2.52 GB/s.)

To optimize such queries, you can add a projection with a different primary index to the table. For example, create the page_name_projection projection which sorts rows by the page_name column:

ALTER TABLE page_views ADD PROJECTION page_name_projection (SELECT * ORDER BY page_name);

Materialize the projection to build it based on data currently existing in the page_views table:

ALTER TABLE page_views MATERIALIZE PROJECTION page_name_projection;

Repeat the query filtering by the page_name column:

SELECT count() FROM page_views WHERE page_name = 'products';

Now, ADQM has read 41.54 thousand rows instead of 1 million to execute the query:

   ┌─count()─┐
1. │   38729 │
   └─────────┘

1 row in set. Elapsed: 0.003 sec. Processed 41.54 thousand rows, 703.30 KB (12.48 million rows/s., 211.24 MB/s.)

Use the EXPLAIN clause to verify that ADQM uses the projection (ReadFromMergeTree (page_name_projection)) to execute the query:

EXPLAIN SELECT count() FROM page_views WHERE page_name = 'products';

   ┌─explain────────────────────────────────────────┐
1. │ Expression ((Project names + Projection))      │
2. │   Aggregating                                  │
3. │     Filter                                     │
4. │       ReadFromMergeTree (page_name_projection) │
   └────────────────────────────────────────────────┘

Pre-aggregate values

Run a query that returns the number of pageviews and average pageview time by month:

SELECT
    page_name,
    toStartOfMonth(time) AS month,
    count() AS page_view_count,
    round(avg(time_on_page), 2) AS avg_time_on_page
FROM page_views
GROUP BY page_name, month
ORDER BY page_name, month;

To execute this query, ADQM reads all rows of the page_views table:

Query id: 303db76e-1e68-4454-91ea-a951c36d170b

     ┌─page_name─┬──────month─┬─page_view_count─┬─avg_time_on_page─┐
  1. │ about     │ 2022-03-01 │               1 │                9 │
  2. │ about     │ 2023-06-01 │               1 │                6 │

      ...

105. │ products  │ 2024-06-01 │             313 │             6.87 │
106. │ products  │ 2024-07-01 │             116 │             6.26 │
107. │ products  │ 2024-08-01 │               7 │                7 │
     └───────────┴────────────┴─────────────────┴──────────────────┘

107 rows in set. Elapsed: 0.018 sec. Processed 1.00 million rows, 28.04 MB (54.22 million rows/s., 1.52 GB/s.)

Create and materialize a projection that calculates the number of views and average view time for each page in the page_views table by month:

ALTER TABLE page_views ADD PROJECTION page_views_monthly_projection
    (
        SELECT
            page_name,
            toStartOfMonth(time),
            count(),
            round(avg(time_on_page), 2)
        GROUP BY
            page_name,
            toStartOfMonth(time)
    );

ALTER TABLE page_views MATERIALIZE PROJECTION page_views_monthly_projection;

Repeat the query:

SELECT
    page_name,
    toStartOfMonth(time) AS month,
    count() AS page_view_count,
    round(avg(time_on_page), 2) AS avg_time_on_page
FROM page_views
GROUP BY page_name, month
ORDER BY page_name, month;

The result is the same, but the performance is now better:

Query id: acb9a600-cbd7-44a4-aa61-ee5f8cdd941d

     ┌─page_name─┬──────month─┬─page_view_count─┬─avg_time_on_page─┐
  1. │ about     │ 2022-03-01 │               1 │                9 │
  2. │ about     │ 2023-06-01 │               1 │                6 │

      ...

105. │ products  │ 2024-06-01 │             313 │             6.87 │
106. │ products  │ 2024-07-01 │             116 │             6.26 │
107. │ products  │ 2024-08-01 │               7 │                7 │
     └───────────┴────────────┴─────────────────┴──────────────────┘

107 rows in set. Elapsed: 0.004 sec.

Use EXPLAIN to verify that ADQM now uses the projection to execute the query:

EXPLAIN SELECT
    page_name,
    toStartOfMonth(time) AS month,
    count() AS page_view_count,
    round(avg(time_on_page), 2) AS avg_time_on_page
FROM page_views
GROUP BY page_name, month
ORDER BY page_name, month;

   ┌─explain────────────────────────────────────────────────────────────────────────┐
1. │ Expression ((Project names + (Before ORDER BY + Projection) [lifted up part])) │
2. │   Sorting (Sorting for ORDER BY)                                               │
3. │     Expression ((Before ORDER BY + Projection))                                │
4. │       Aggregating                                                              │
5. │         Expression                                                             │
6. │           ReadFromMergeTree (page_views_monthly_projection)                    │
   └────────────────────────────────────────────────────────────────────────────────┘

You can also inspect the system.query_log system table where the projections field shows the name of the used projection (or remains empty if no projection was used):

SELECT query, formatReadableQuantity(read_rows) AS read_rows, projections FROM system.query_log
WHERE type='QueryFinish' AND query_id='acb9a600-cbd7-44a4-aa61-ee5f8cdd941d';\G

Row 1:
──────
query:       SELECT
                 page_name,
                 toStartOfMonth(time) AS month,
                 count() AS page_view_count,
                 round(avg(time_on_page), 2) AS avg_time_on_page
             FROM page_views
             GROUP BY page_name, month
             ORDER BY page_name, month;
read_rows:   107.00
projections: ['default.page_views.page_views_monthly_projection']

A projection aggregating values of the source table is used not only for queries that completely match the projection query, but also for queries that select just some of aggregates with grouping by some GROUP BY columns of the projection query. For example:

EXPLAIN SELECT page_name, count() AS page_view_count FROM page_views GROUP BY page_name;

   ┌─explain─────────────────────────────────────────────────┐
1. │ Expression ((Project names + Projection))               │
2. │   Aggregating                                           │
3. │     Expression                                          │
4. │       ReadFromMergeTree (page_views_monthly_projection) │
   └─────────────────────────────────────────────────────────┘

View information about projections

Get information on active data parts of the source table from the system.parts table:

SELECT name, active, path FROM system.parts WHERE table='page_views';

   ┌─name────────┬─active─┬─path────────────────────────────────────────────────────────────────────────────┐
1. │ all_1_1_0_3 │      1 │ /var/lib/clickhouse/store/7f1/7f1aea44-35a3-4b96-a5c9-07640aec1340/all_1_1_0_3/ │
   └─────────────┴────────┴─────────────────────────────────────────────────────────────────────────────────┘

The directory of the page_views table’s data part contains subdirectories corresponding to projections (page_name_projection.proj and page_views_monthly_projection.proj):

$ sudo ls /var/lib/clickhouse/store/7f1/7f1aea44-35a3-4b96-a5c9-07640aec1340/all_1_1_0_3/

checksums.txt  default_compression_codec.txt  page_name.cmrk2                     primary.cidx        time.cmrk2
columns.txt    metadata_version.txt           page_name_projection.proj           serialization.json  time_on_page.bin
count.txt      page_name.bin                  page_views_monthly_projection.proj  time.bin            time_on_page.cmrk2

Each projection subdirectory contains column data files (.bin), column mark files (.cmrk2), and a primary index file (primary.cidx) of the projection hidden table:

$ sudo ls /var/lib/clickhouse/store/7f1/7f1aea44-35a3-4b96-a5c9-07640aec1340/all_1_1_0_3/page_name_projection.proj

checksums.txt  default_compression_codec.txt  page_name.cmrk2     time.bin          time_on_page.cmrk2
columns.txt    metadata_version.txt           primary.cidx        time.cmrk2
count.txt      page_name.bin                  serialization.json  time_on_page.bin

In the system.projection_parts table, see how much disk space the existing projections take up:

SELECT
    name,
    parent_name,
    formatReadableSize(bytes_on_disk) AS bytes,
    formatReadableSize(parent_bytes_on_disk) AS parent_bytes,
    bytes_on_disk / parent_bytes_on_disk AS ratio
FROM system.projection_parts;

   ┌─name──────────────────────────┬─parent_name─┬─bytes────┬─parent_bytes─┬──────────────────ratio─┐
1. │ page_name_projection          │ all_1_1_0_3 │ 5.54 MiB │ 11.24 MiB    │    0.49312069529045727 │
2. │ page_views_monthly_projection │ all_1_1_0_3 │ 1.40 KiB │ 11.24 MiB    │ 0.00012159165770921657 │
   └───────────────────────────────┴─────────────┴──────────┴──────────────┴────────────────────────┘

From the system.projection_parts_columns table, get types of projection columns:

SELECT name as projection, column, type as column_type, column_position FROM system.projection_parts_columns;

   ┌─projection────────────────────┬─column───────────────┬─column_type────────────────────┬─column_position─┐
1. │ page_name_projection          │ time                 │ DateTime                       │               1 │
2. │ page_name_projection          │ page_name            │ String                         │               2 │
3. │ page_name_projection          │ time_on_page         │ UInt64                         │               3 │
4. │ page_views_monthly_projection │ page_name            │ String                         │               1 │
5. │ page_views_monthly_projection │ toStartOfMonth(time) │ Date                           │               2 │
6. │ page_views_monthly_projection │ count()              │ AggregateFunction(count)       │               3 │
7. │ page_views_monthly_projection │ avg(time_on_page)    │ AggregateFunction(avg, UInt64) │               4 │
   └───────────────────────────────┴──────────────────────┴────────────────────────────────┴─────────────────┘

Found a mistake? Seleсt text and press Ctrl+Enter to report it