Use projections
Overview
A projection is an additional hidden table that stores data from a source table in an alternative form to be optimal for executing some type of queries. A projection can store data rows sorted in a different order than the source table (and therefore have a different primary index) or pre-compute aggregate values across columns of the source table. So, this feature can be useful to optimize:
-
queries with filters on source table columns that are not part of a primary key;
-
queries that aggregate values of source table columns (data pre-aggregation performed by projections can reduce the consumption of both computing and I/O resources).
Projections are supported only for tables of the MergeTree family (including Replicated*). A table can have one or more projections — all of them will be automatically synchronized with the source table, i.e. updated each time new data is inserted into the source table.
Projections are similar to materialized views (which also allow pre-computing aggregations and using different row orders on data from another table), but they are defined at the level of data parts — for each data part, a projection of data in that part is created based on the projection query. This guarantees that projection data is consistent with the source table data and that ADQM can automatically choose and apply the most optimal projections when executing queries provided by users (a projection is considered optimal if it returns the correct result for a query with the least data to be scanned).
Unlike materialized views, projections do not require manual creation of separate additional tables with different structures, which you need to keep in mind and each time when building a query, you need to choose which table to send it to in order to ensure the high speed of the query execution. With projections, it is enough to send all queries to the source table — ADQM itself will determine whether the table has a projection that can speed up the query execution, and if such projection exists, it will redirect the query to it. In other words, a user does not need to know anything about projections.
CAUTION
As each projection creates a new hidden table (projection files are stored in the <projection_name>.proj subdirectories within directories of the source MergeTree table’s data parts), it may require more I/O resources and disk space. For example, if a projection is created to define an alternate sorting key, all data from the source table will be duplicated in the projection. |
Manage projections
Create/delete projections
You can specify a projection when creating a MergeTree table in the section with column descriptions using the PROJECTION
clause. The basic syntax for creating a table with a projection is:
CREATE TABLE <table_name>
( <column1> ...,
<column2> ...,
...
PROJECTION <projection_name> (SELECT <column_list> [ORDER BY <sorting_key>] [GROUP BY <grouping_key>])
)
ENGINE = MergeTree()
...;
After a projection name, there is a projection query (in parentheses), which specifies how data from the source table should be transformed for storing in the projection table. A projection query can include one of the following clauses:
-
ORDER BY
— key by which data rows will be sorted, and a primary key will be set in the projection; -
GROUP BY
— key by which a source table data will be grouped for aggregation in the projection. In this case, a hidden projection table is based on the AggregatingMergeTree engine, and all aggregate functions are converted to AggregateFunction.
To manage projections of existing tables, you can use the ALTER TABLE
queries listed below.
-
Add a projection to an existing table:
ALTER TABLE [<db_name.]<table_name> [ON CLUSTER <cluster_name>] ADD PROJECTION [IF NOT EXISTS] <projection_name> (SELECT <column_list> [ORDER BY <sorting_key>] [GROUP BY <grouping_key>]);
This query adds a projection description to a MergeTree table metadata, but does not populate the projection with data (it will be rebuilt when new data is inserted into the source table or when materialized).
-
Materialize a projection to fill it with existing data of the source table (you can select an individual partition):
ALTER TABLE [<db_name.]<table_name> [ON CLUSTER <cluster_name>] MATERIALIZE PROJECTION [IF EXISTS] <projection_name> [IN PARTITION <partition_name>];
-
Delete projection files from the disk without removing projection description from the source table’s metadata (you can select an individual partition):
ALTER TABLE [<db_name.]<table_name> [ON CLUSTER <cluster_name>] CLEAR PROJECTION [IF EXISTS] <projection_name> [IN PARTITION <partition_name>];
-
Delete a projection description from the source table metadata and remove the projection files from the disk:
ALTER TABLE [<db_name.]<table_name> [ON CLUSTER <cluster_name>] DROP PROJECTION [IF EXISTS] <partition_name>;
The ADD PROJECTION
, CLEAR PROJECTION
, and DROP PROJECTION
commands are lightweight because they only change metadata or delete files.
Settings to enable projections
To tell ADQM whether it should use projections when executing queries, configure the following parameters (at the session level):
-
optimize_use_projections
— enables/disables automatic optimization ofSELECT
queries with projections (the default value is1
, meaning the projection optimization is enabled); -
force_optimize_projection
— enables/disables the obligatory use of projections for executingSELECT
queries if the optimization via projections is enabled by theoptimize_use_projections
setting (the default value is0
).
View information on projections
You can obtain information about projections from system tables:
-
system.projection_parts
— information about currently existing data parts for projections of all MergeTree tables in a cluster (information about the corresponding data parts of the source table is stored in columns prefixed withparent_
); -
system.projection_parts_columns
— information about columns of all currently existing projection data parts of all MergeTree tables.
Since projection tables are hidden, they are not included in the SHOW TABLES
list.
Examples
Create a MergeTree table to store pageview data (the examples below will define projections for this table):
CREATE TABLE page_views (time DateTime, page_name String, time_on_page UInt64) ENGINE = MergeTree ORDER BY time;
Populate this table with test data. For example, randomly generate 1 million views for 4 pages over approximately the last 3 years relative to the current moment:
INSERT INTO page_views SELECT
now() - toIntervalMinute(rand() % 500000 + number),
['landing', 'products', 'docs', 'about'][toInt32(randBinomial(4, 0.01)) + 1],
rand() % 15
FROM numbers(1000000);
Filter without using a primary key
If you run a query filtering data by the time
column, which is included in the primary key of the page_views
table, a full table scan is not required:
SELECT count() FROM page_views WHERE toYear(time) = '2022';
┌─count()─┐ 1. │ 388601 │ └─────────┘ 1 row in set. Elapsed: 0.004 sec. Processed 393.22 thousand rows, 1.57 MB (99.18 million rows/s., 396.72 MB/s.)
To filter data not by a key column, ADQM performs a full table scan, for example:
SELECT count() FROM page_views WHERE page_name = 'products';
┌─count()─┐ 1. │ 38729 │ └─────────┘ 1 row in set. Elapsed: 0.006 sec. Processed 1.00 million rows, 16.04 MB (156.85 million rows/s., 2.52 GB/s.)
To optimize such queries, you can add a projection with a different primary index to the table. For example, create the page_name_projection
projection which sorts rows by the page_name
column:
ALTER TABLE page_views ADD PROJECTION page_name_projection (SELECT * ORDER BY page_name);
Materialize the projection to build it based on data currently existing in the page_views
table:
ALTER TABLE page_views MATERIALIZE PROJECTION page_name_projection;
Repeat the query filtering by the page_name
column:
SELECT count() FROM page_views WHERE page_name = 'products';
Now, ADQM has read 41.54 thousand rows instead of 1 million to execute the query:
┌─count()─┐ 1. │ 38729 │ └─────────┘ 1 row in set. Elapsed: 0.003 sec. Processed 41.54 thousand rows, 703.30 KB (12.48 million rows/s., 211.24 MB/s.)
Use the EXPLAIN clause to verify that ADQM uses the projection (ReadFromMergeTree (page_name_projection)
) to execute the query:
EXPLAIN SELECT count() FROM page_views WHERE page_name = 'products';
┌─explain────────────────────────────────────────┐ 1. │ Expression ((Project names + Projection)) │ 2. │ Aggregating │ 3. │ Filter │ 4. │ ReadFromMergeTree (page_name_projection) │ └────────────────────────────────────────────────┘
Pre-aggregate values
Run a query that returns the number of pageviews and average pageview time by month:
SELECT
page_name,
toStartOfMonth(time) AS month,
count() AS page_view_count,
round(avg(time_on_page), 2) AS avg_time_on_page
FROM page_views
GROUP BY page_name, month
ORDER BY page_name, month;
To execute this query, ADQM reads all rows of the page_views
table:
Query id: 303db76e-1e68-4454-91ea-a951c36d170b ┌─page_name─┬──────month─┬─page_view_count─┬─avg_time_on_page─┐ 1. │ about │ 2022-03-01 │ 1 │ 9 │ 2. │ about │ 2023-06-01 │ 1 │ 6 │ ... 105. │ products │ 2024-06-01 │ 313 │ 6.87 │ 106. │ products │ 2024-07-01 │ 116 │ 6.26 │ 107. │ products │ 2024-08-01 │ 7 │ 7 │ └───────────┴────────────┴─────────────────┴──────────────────┘ 107 rows in set. Elapsed: 0.018 sec. Processed 1.00 million rows, 28.04 MB (54.22 million rows/s., 1.52 GB/s.)
Create and materialize a projection that calculates the number of views and average view time for each page in the page_views
table by month:
ALTER TABLE page_views ADD PROJECTION page_views_monthly_projection
(
SELECT
page_name,
toStartOfMonth(time),
count(),
round(avg(time_on_page), 2)
GROUP BY
page_name,
toStartOfMonth(time)
);
ALTER TABLE page_views MATERIALIZE PROJECTION page_views_monthly_projection;
Repeat the query:
SELECT
page_name,
toStartOfMonth(time) AS month,
count() AS page_view_count,
round(avg(time_on_page), 2) AS avg_time_on_page
FROM page_views
GROUP BY page_name, month
ORDER BY page_name, month;
The result is the same, but the performance is now better:
Query id: acb9a600-cbd7-44a4-aa61-ee5f8cdd941d ┌─page_name─┬──────month─┬─page_view_count─┬─avg_time_on_page─┐ 1. │ about │ 2022-03-01 │ 1 │ 9 │ 2. │ about │ 2023-06-01 │ 1 │ 6 │ ... 105. │ products │ 2024-06-01 │ 313 │ 6.87 │ 106. │ products │ 2024-07-01 │ 116 │ 6.26 │ 107. │ products │ 2024-08-01 │ 7 │ 7 │ └───────────┴────────────┴─────────────────┴──────────────────┘ 107 rows in set. Elapsed: 0.004 sec.
Use EXPLAIN
to verify that ADQM now uses the projection to execute the query:
EXPLAIN SELECT
page_name,
toStartOfMonth(time) AS month,
count() AS page_view_count,
round(avg(time_on_page), 2) AS avg_time_on_page
FROM page_views
GROUP BY page_name, month
ORDER BY page_name, month;
┌─explain────────────────────────────────────────────────────────────────────────┐ 1. │ Expression ((Project names + (Before ORDER BY + Projection) [lifted up part])) │ 2. │ Sorting (Sorting for ORDER BY) │ 3. │ Expression ((Before ORDER BY + Projection)) │ 4. │ Aggregating │ 5. │ Expression │ 6. │ ReadFromMergeTree (page_views_monthly_projection) │ └────────────────────────────────────────────────────────────────────────────────┘
You can also inspect the system.query_log
system table where the projections
field shows the name of the used projection (or remains empty if no projection was used):
SELECT query, formatReadableQuantity(read_rows) AS read_rows, projections FROM system.query_log
WHERE type='QueryFinish' AND query_id='acb9a600-cbd7-44a4-aa61-ee5f8cdd941d';\G
Row 1: ────── query: SELECT page_name, toStartOfMonth(time) AS month, count() AS page_view_count, round(avg(time_on_page), 2) AS avg_time_on_page FROM page_views GROUP BY page_name, month ORDER BY page_name, month; read_rows: 107.00 projections: ['default.page_views.page_views_monthly_projection']
A projection aggregating values of the source table is used not only for queries that completely match the projection query, but also for queries that select just some of aggregates with grouping by some GROUP BY
columns of the projection query. For example:
EXPLAIN SELECT page_name, count() AS page_view_count FROM page_views GROUP BY page_name;
┌─explain─────────────────────────────────────────────────┐ 1. │ Expression ((Project names + Projection)) │ 2. │ Aggregating │ 3. │ Expression │ 4. │ ReadFromMergeTree (page_views_monthly_projection) │ └─────────────────────────────────────────────────────────┘
View information about projections
Get information on active data parts of the source table from the system.parts
table:
SELECT name, active, path FROM system.parts WHERE table='page_views';
┌─name────────┬─active─┬─path────────────────────────────────────────────────────────────────────────────┐ 1. │ all_1_1_0_3 │ 1 │ /var/lib/clickhouse/store/7f1/7f1aea44-35a3-4b96-a5c9-07640aec1340/all_1_1_0_3/ │ └─────────────┴────────┴─────────────────────────────────────────────────────────────────────────────────┘
The directory of the page_views
table’s data part contains subdirectories corresponding to projections (page_name_projection.proj and page_views_monthly_projection.proj):
$ sudo ls /var/lib/clickhouse/store/7f1/7f1aea44-35a3-4b96-a5c9-07640aec1340/all_1_1_0_3/
checksums.txt default_compression_codec.txt page_name.cmrk2 primary.cidx time.cmrk2 columns.txt metadata_version.txt page_name_projection.proj serialization.json time_on_page.bin count.txt page_name.bin page_views_monthly_projection.proj time.bin time_on_page.cmrk2
Each projection subdirectory contains column data files (.bin), column mark files (.cmrk2), and a primary index file (primary.cidx) of the projection hidden table:
$ sudo ls /var/lib/clickhouse/store/7f1/7f1aea44-35a3-4b96-a5c9-07640aec1340/all_1_1_0_3/page_name_projection.proj
checksums.txt default_compression_codec.txt page_name.cmrk2 time.bin time_on_page.cmrk2 columns.txt metadata_version.txt primary.cidx time.cmrk2 count.txt page_name.bin serialization.json time_on_page.bin
In the system.projection_parts
table, see how much disk space the existing projections take up:
SELECT
name,
parent_name,
formatReadableSize(bytes_on_disk) AS bytes,
formatReadableSize(parent_bytes_on_disk) AS parent_bytes,
bytes_on_disk / parent_bytes_on_disk AS ratio
FROM system.projection_parts;
┌─name──────────────────────────┬─parent_name─┬─bytes────┬─parent_bytes─┬──────────────────ratio─┐ 1. │ page_name_projection │ all_1_1_0_3 │ 5.54 MiB │ 11.24 MiB │ 0.49312069529045727 │ 2. │ page_views_monthly_projection │ all_1_1_0_3 │ 1.40 KiB │ 11.24 MiB │ 0.00012159165770921657 │ └───────────────────────────────┴─────────────┴──────────┴──────────────┴────────────────────────┘
From the system.projection_parts_columns
table, get types of projection columns:
SELECT name as projection, column, type as column_type, column_position FROM system.projection_parts_columns;
┌─projection────────────────────┬─column───────────────┬─column_type────────────────────┬─column_position─┐ 1. │ page_name_projection │ time │ DateTime │ 1 │ 2. │ page_name_projection │ page_name │ String │ 2 │ 3. │ page_name_projection │ time_on_page │ UInt64 │ 3 │ 4. │ page_views_monthly_projection │ page_name │ String │ 1 │ 5. │ page_views_monthly_projection │ toStartOfMonth(time) │ Date │ 2 │ 6. │ page_views_monthly_projection │ count() │ AggregateFunction(count) │ 3 │ 7. │ page_views_monthly_projection │ avg(time_on_page) │ AggregateFunction(avg, UInt64) │ 4 │ └───────────────────────────────┴──────────────────────┴────────────────────────────────┴─────────────────┘