Use TTL to manage data

Overview

For MergeTree tables, it is possible to set TTL (time-to-live) — a time interval after which old data is automatically deleted from a table in the background. TTL can be set for a whole table (entire data rows with outdated values will be deleted) and for each individual column (outdated values in a column will be replaced with default values for the column data type).

Table-level TTL also allows you to schedule the following actions to be performed automatically on expired data rows:

  • Move data between disks or volumes of an ADQM storage — this can be useful when you deploy an architecture that separates data into "hot", "warm", and "cold" tiers (see Configure data storages).

  • Roll up data into aggregated values — you can store old data at a lower level of detail to save resources.

  • Apply a higher level of compression to old data.

Thus, you can optimize storage and improve query performance by managing your data lifecycle with the TTL mechanism.

When TTL events are triggered

ADQM deletes data with expired TTL when merging data parts (not immediately). It periodically performs "off-schedule" merges to delete or compress data according to TTL rules. You can control the frequency of such merges with special MergeTree table settings:

  • merge_with_ttl_timeout — minimum delay in seconds between merges that delete data based on TTL;

  • merge_with_recompression_ttl_timeout — minimum delay in seconds between merges that recompress data based on TTL.

By default, these parameters are set to 14400 seconds. This means that deletion/compression of outdated data is performed every 4 hours or when a background merge of data parts occurs. If you want TTL rules to apply more frequently, change the corresponding setting for your table. Note that a value that is too low can lead to many "off-schedule" merges and high server resource consumption, so it is not recommended to use values smaller than 300 seconds.
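For example, a minimal sketch of a table that lowers this delay to 30 minutes (the table structure and TTL expression are illustrative):

CREATE TABLE events (time DateTime, value UInt64)
ENGINE = MergeTree
ORDER BY time
TTL time + INTERVAL 1 DAY
SETTINGS merge_with_ttl_timeout = 1800; -- TTL-based merges can run as often as every 30 minutes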

A MergeTree table also has the ttl_only_drop_parts setting, which affects how expired data is deleted, depending on the setting value:

  • 0 (default) — ADQM deletes only rows expired according to their TTL;

  • 1 — ADQM deletes a whole data part when all rows in it are expired.

Deleting whole parts instead of individual rows allows you to set a smaller merge_with_ttl_timeout timeout and reduce the impact of TTL operations on system performance.
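You can change this setting for an existing table with the ALTER TABLE … MODIFY SETTING query, for example:

ALTER TABLE <table_name> MODIFY SETTING ttl_only_drop_parts = 1;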

Operations of moving data between disks/volumes or recompressing data are performed only when all rows in a data part are out of date.

When you run a SELECT query between merges, you can still get expired data from a table. To avoid this, run the OPTIMIZE … FINAL query before selecting (to force merging). However, it is not recommended to run this query frequently, especially for large tables.
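The general form of this query is:

OPTIMIZE TABLE <table_name> FINAL;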

Table TTL

Create a table with TTL

To set TTL for a table when creating it, use the TTL clause at the end of the table definition in the CREATE TABLE query. After the TTL clause, you can specify a single rule to delete expired rows and/or multiple TTL rules to process data differently (for example, to move data to another disk/volume or recompress it).

Below is the basic syntax of a query for creating a table with TTL:

CREATE TABLE <table_name>
(   <timestamp_column> Date|DateTime ...,
    <column1> ...,
    <column2> ...,
    ...)
ENGINE = MergeTree()
TTL <ttl_expr> [<ttl_action>] [, <ttl_expr2> [<ttl_action2>]] [, ...];

where each TTL rule includes:

  • <ttl_expr> — TTL expression that specifies when data is to be considered outdated;

  • <ttl_action> — action to be performed after the time specified by the TTL expression has passed.

A TTL expression should return a value of the Date or DateTime type and is usually specified in the following format:

<timestamp_column> [+ INTERVAL <time_interval> <interval_type>]

where:

  • <timestamp_column> — name of a column of the Date or DateTime type with timestamps relative to which the time to live of data is determined;

  • INTERVAL <time_interval> <interval_type> — value of the Interval type that describes how long a data row should live, relative to the timestamp from the <timestamp_column> column. The structure of this value includes a time interval as a positive integer value and an interval type — for example, 1 MONTH, 3 DAY, 5 HOUR.

To specify a time interval, you can use a toInterval conversion function instead of the INTERVAL operator. In this case, a TTL expression can look like one of the following:

<timestamp_column> + toIntervalMonth(1)
<timestamp_column> + toIntervalDay(3)
<timestamp_column> + toIntervalHour(5)

To specify actions to be performed on data rows that have expired according to their TTL expressions, use the clauses listed below.

  • DELETE

    TTL <ttl_expr> [DELETE] [WHERE <filter_expr>]

    Delete expired data rows. This is the default action (in other words, expired rows are removed if another action is not specified explicitly after the TTL clause).

  • TO DISK

    TTL <ttl_expr> TO DISK '<disk_name>'

    Move data to the specified disk.

  • TO VOLUME

    TTL <ttl_expr> TO VOLUME '<volume_name>'

    Move data to the specified volume.

  • RECOMPRESS

    TTL <ttl_expr> RECOMPRESS CODEC(<compression_codec>)

    Compress data using the specified codec.

  • GROUP BY

    TTL <ttl_expr>
        GROUP BY <key_expr>
        [SET <column_name> = aggr_func(<column_name>)[, ...]]
        [WHERE <filter_expr>]

    Aggregate data. When defining a rule for aggregating old data, take into account the following points:

      • A GROUP BY expression (columns in <key_expr>) should be a prefix of the table primary key.

      • If a column is not part of the GROUP BY expression and is not explicitly specified after the SET clause, in a resulting row it will contain a value chosen randomly from one of the grouped rows (as if the any aggregate function were applied to it).

For the DELETE or GROUP BY action, you can specify a filtering condition with the WHERE clause to delete/aggregate only some of the expired rows. For moving and compressing data, the WHERE clause is not applicable.
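For example, a single table definition can combine several TTL rules (a sketch that assumes the volume_cold volume and the hot_to_cold storage policy configured in the tiered-storage example at the end of this article):

CREATE TABLE <table_name>
(   time DateTime,
    value UInt64)
ENGINE = MergeTree()
ORDER BY time
TTL time + INTERVAL 1 WEEK TO VOLUME 'volume_cold',      -- move week-old data to the "cold" volume
    time + INTERVAL 1 MONTH RECOMPRESS CODEC(LZ4HC(10)), -- recompress month-old data
    time + INTERVAL 1 YEAR DELETE                        -- delete data older than one year
SETTINGS storage_policy = 'hot_to_cold';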

Modify TTL

You can change TTL of an existing table using the following query:

ALTER TABLE <table_name> MODIFY TTL <new_ttl_rules>;

After this query is executed, TTL will be calculated according to the new rules each time new data is inserted or a new data part is created as a result of a background merge operation.
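For example (an illustrative rule for the test_table table from the examples below):

ALTER TABLE test_table MODIFY TTL time + INTERVAL 1 DAY;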

To recalculate TTL for existing data, TTL materialization is required. The materialize_ttl_after_modify setting (enabled by default) controls whether TTL materialization should be performed automatically after the ALTER TABLE … MODIFY TTL query is executed. You can also force TTL materialization with the following query:

ALTER TABLE <table_name> MATERIALIZE TTL;

TTL materialization can be quite a heavy operation (especially for large tables), as it recalculates TTL in all data parts (updates ttl.txt files) and rewrites data for all columns, which can result in a lot of read/write operations. To lighten the TTL materialization operation, you can set materialize_ttl_recalculate_only = 1 in the MergeTree table settings. In this case, only the ttl.txt recalculation is performed (a lightweight operation, as only columns used in the TTL expression are read), and all columns are copied via hard links. Some time later, expired data will be deleted during merges with ttl_only_drop_parts = 1.
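For example, the following sketch enables this lightweight mode:

ALTER TABLE <table_name> MODIFY SETTING materialize_ttl_recalculate_only = 1, ttl_only_drop_parts = 1;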

In some cases, an optimal solution can be to disable TTL materialization (materialize_ttl_after_modify = 0) and delete/move old partitions manually (using the ALTER TABLE … DROP|MOVE PARTITION queries — see Manipulate partitions and data parts for details).
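A minimal sketch of this approach (the partition value 202406 is illustrative and assumes the table is partitioned by toYYYYMM):

SET materialize_ttl_after_modify = 0;            -- do not materialize TTL automatically after MODIFY TTL
ALTER TABLE <table_name> MODIFY TTL <new_ttl_rules>;
ALTER TABLE <table_name> DROP PARTITION 202406;  -- remove an outdated partition manually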

Remove TTL

To remove TTL rules from a table, use the query:

ALTER TABLE <table_name> REMOVE TTL;

Examples

Automatically delete expired data

  1. Create a table with a TTL rule to remove rows that will be considered outdated when 10 minutes have passed since the time in the time column:

    CREATE TABLE test_table (time DateTime, value UInt64)
    ENGINE = MergeTree
    ORDER BY time
    TTL time + INTERVAL 10 MINUTE;

    Fill the table with test data:

    INSERT INTO test_table
    SELECT now() - toIntervalMinute(number), number
    FROM numbers(10);
    SELECT * FROM test_table;
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€time─┬─value─┐
     1. β”‚ 2024-06-07 13:15:49 β”‚     9 β”‚
     2. β”‚ 2024-06-07 13:16:49 β”‚     8 β”‚
     3. β”‚ 2024-06-07 13:17:49 β”‚     7 β”‚
     4. β”‚ 2024-06-07 13:18:49 β”‚     6 β”‚
     5. β”‚ 2024-06-07 13:19:49 β”‚     5 β”‚
     6. β”‚ 2024-06-07 13:20:49 β”‚     4 β”‚
     7. β”‚ 2024-06-07 13:21:49 β”‚     3 β”‚
     8. β”‚ 2024-06-07 13:22:49 β”‚     2 β”‚
     9. β”‚ 2024-06-07 13:23:49 β”‚     1 β”‚
    10. β”‚ 2024-06-07 13:24:49 β”‚     0 β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
  2. Wait for 5 minutes after inserting data and run the query:

    OPTIMIZE TABLE test_table FINAL;

    Repeat the query to select data from the table:

    SELECT * FROM test_table;

    You can see that the first five rows have been removed from the table (for each row, 10 minutes have passed since the time specified in the time column):

       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€time─┬─value─┐
    1. β”‚ 2024-06-07 13:20:49 β”‚     4 β”‚
    2. β”‚ 2024-06-07 13:21:49 β”‚     3 β”‚
    3. β”‚ 2024-06-07 13:22:49 β”‚     2 β”‚
    4. β”‚ 2024-06-07 13:23:49 β”‚     1 β”‚
    5. β”‚ 2024-06-07 13:24:49 β”‚     0 β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜

    If you do not use the OPTIMIZE TABLE … FINAL query and run SELECT between merges, outdated records can still exist in the table.

Filter rows to delete

  1. Create the alerts table and define a TTL rule that removes rows with the Critical value in the alert column once 10 minutes have passed since the time in the time column:

    CREATE TABLE alerts (time DateTime, value UInt64, alert String)
    ENGINE = MergeTree
    ORDER BY time
    TTL time + INTERVAL 10 MINUTE DELETE WHERE alert = 'Critical';

    Insert test data into the table:

    INSERT INTO alerts
    SELECT now() - toIntervalMinute(number), number, ['Warning', 'Critical', 'Info'][(rand() % 3) + 1]
    FROM numbers(10);
    SELECT * FROM alerts;
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€time─┬─value─┬─alert────┐
     1. β”‚ 2024-06-10 07:14:56 β”‚     9 β”‚ Critical β”‚
     2. β”‚ 2024-06-10 07:15:56 β”‚     8 β”‚ Warning  β”‚
     3. β”‚ 2024-06-10 07:16:56 β”‚     7 β”‚ Info     β”‚
     4. β”‚ 2024-06-10 07:17:56 β”‚     6 β”‚ Critical β”‚
     5. β”‚ 2024-06-10 07:18:56 β”‚     5 β”‚ Warning  β”‚
     6. β”‚ 2024-06-10 07:19:56 β”‚     4 β”‚ Info     β”‚
     7. β”‚ 2024-06-10 07:20:56 β”‚     3 β”‚ Critical β”‚
     8. β”‚ 2024-06-10 07:21:56 β”‚     2 β”‚ Warning  β”‚
     9. β”‚ 2024-06-10 07:22:56 β”‚     1 β”‚ Warning  β”‚
    10. β”‚ 2024-06-10 07:23:56 β”‚     0 β”‚ Critical β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  2. 10 minutes after inserting the data, initiate a merge of data parts and look at the table contents:

    OPTIMIZE TABLE alerts FINAL;
    SELECT * FROM alerts;

    All old Critical alerts have been removed from the table:

       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€time─┬─value─┬─alert───┐
    1. β”‚ 2024-06-10 07:15:56 β”‚     8 β”‚ Warning β”‚
    2. β”‚ 2024-06-10 07:16:56 β”‚     7 β”‚ Info    β”‚
    3. β”‚ 2024-06-10 07:18:56 β”‚     5 β”‚ Warning β”‚
    4. β”‚ 2024-06-10 07:19:56 β”‚     4 β”‚ Info    β”‚
    5. β”‚ 2024-06-10 07:21:56 β”‚     2 β”‚ Warning β”‚
    6. β”‚ 2024-06-10 07:22:56 β”‚     1 β”‚ Warning β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Save data to another table before deleting

If you need to move data to another table before it is deleted from the main table, you can use TTL in combination with a materialized view.

  1. Create:

    • the alerts table with a TTL rule to remove alerts after 10 minutes relative to the time from the time column;

    • the alerts_history table to store Critical alerts from the alerts table.

    DROP TABLE IF EXISTS alerts;
    CREATE TABLE alerts (time DateTime, value UInt64, alert String)
    ENGINE = MergeTree
    ORDER BY time
    TTL time + INTERVAL 10 MINUTE DELETE;
    CREATE TABLE alerts_history (time DateTime, value UInt64, alert String)
    ENGINE = MergeTree
    ORDER BY time;

    Note that the alerts_history table has the same structure as alerts. However, do not use the CREATE TABLE alerts_history AS alerts query, as it would also copy the TTL rule, which is not needed here.
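    Alternatively, you can copy the table structure with AS and then remove the inherited TTL (a brief sketch using the REMOVE TTL query described above):

    CREATE TABLE alerts_history AS alerts;
    ALTER TABLE alerts_history REMOVE TTL;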

  2. Create a materialized view that will be triggered on inserting data into the alerts table and automatically load Critical alerts into alerts_history:

    CREATE MATERIALIZED VIEW alerts_history_mv TO alerts_history AS
    SELECT * FROM alerts WHERE alert = 'Critical';
  3. Insert test data into the alerts table:

    INSERT INTO alerts
    SELECT now() - toIntervalMinute(number), number, ['Warning', 'Critical', 'Info'][(rand() % 3) + 1]
    FROM numbers(10);
    SELECT * FROM alerts;
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€time─┬─value─┬─alert────┐
     1. β”‚ 2024-06-10 08:02:16 β”‚     9 β”‚ Warning  β”‚
     2. β”‚ 2024-06-10 08:03:16 β”‚     8 β”‚ Warning  β”‚
     3. β”‚ 2024-06-10 08:04:16 β”‚     7 β”‚ Info     β”‚
     4. β”‚ 2024-06-10 08:05:16 β”‚     6 β”‚ Critical β”‚
     5. β”‚ 2024-06-10 08:06:16 β”‚     5 β”‚ Warning  β”‚
     6. β”‚ 2024-06-10 08:07:16 β”‚     4 β”‚ Warning  β”‚
     7. β”‚ 2024-06-10 08:08:16 β”‚     3 β”‚ Critical β”‚
     8. β”‚ 2024-06-10 08:09:16 β”‚     2 β”‚ Warning  β”‚
     9. β”‚ 2024-06-10 08:10:16 β”‚     1 β”‚ Critical β”‚
    10. β”‚ 2024-06-10 08:11:16 β”‚     0 β”‚ Warning  β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  4. Wait for 10 minutes and run the query:

    OPTIMIZE TABLE alerts FINAL;

    TTL will delete records from the alerts table, but rows with the Critical value in the alert column will be retained in the alerts_history table:

    SELECT * FROM alerts_history;
       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€time─┬─value─┬─alert────┐
    1. β”‚ 2024-06-10 08:05:16 β”‚     6 β”‚ Critical β”‚
    2. β”‚ 2024-06-10 08:08:16 β”‚     3 β”‚ Critical β”‚
    3. β”‚ 2024-06-10 08:10:16 β”‚     1 β”‚ Critical β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Aggregate outdated data to compact it

  1. Create the alerts table and fill it with test data:

    DROP TABLE IF EXISTS alerts;
    CREATE TABLE alerts (time DateTime, value UInt64, alert String)
    ENGINE = MergeTree
    ORDER BY (toHour(time), alert);
    INSERT INTO alerts
    SELECT now() - toIntervalMinute(number), toInt64(randUniform(0, 100)), ['Warning', 'Critical', 'Info'][(rand() % 3) + 1]
    FROM numbers(15);
    SELECT * FROM alerts;
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€time─┬─value─┬─alert────┐
     1. β”‚ 2024-07-26 21:51:41 β”‚     5 β”‚ Critical β”‚
     2. β”‚ 2024-07-26 21:48:41 β”‚    18 β”‚ Critical β”‚
     3. β”‚ 2024-07-26 21:46:41 β”‚    53 β”‚ Critical β”‚
     4. β”‚ 2024-07-26 21:43:41 β”‚    33 β”‚ Critical β”‚
     5. β”‚ 2024-07-26 21:50:41 β”‚    98 β”‚ Info     β”‚
     6. β”‚ 2024-07-26 21:49:41 β”‚    74 β”‚ Info     β”‚
     7. β”‚ 2024-07-26 21:47:41 β”‚    73 β”‚ Info     β”‚
     8. β”‚ 2024-07-26 21:42:41 β”‚    52 β”‚ Info     β”‚
     9. β”‚ 2024-07-26 21:41:41 β”‚    92 β”‚ Info     β”‚
    10. β”‚ 2024-07-26 21:39:41 β”‚    70 β”‚ Info     β”‚
    11. β”‚ 2024-07-26 21:38:41 β”‚    81 β”‚ Info     β”‚
    12. β”‚ 2024-07-26 21:37:41 β”‚    72 β”‚ Info     β”‚
    13. β”‚ 2024-07-26 21:45:41 β”‚     7 β”‚ Warning  β”‚
    14. β”‚ 2024-07-26 21:44:41 β”‚    96 β”‚ Warning  β”‚
    15. β”‚ 2024-07-26 21:40:41 β”‚     0 β”‚ Warning  β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  2. Five minutes after inserting data (at 21:56:41 relative to the timestamps generated above), modify the table to add a TTL rule that groups outdated rows (those for which 10 minutes have passed since the time in the time column) by the hour of the time value and by alert type, with each group storing the maximum value in the value column and the most recent timestamp in the time column:

    ALTER TABLE alerts MODIFY TTL
        time + INTERVAL 10 MINUTE
        GROUP BY toHour(time), alert
        SET value = max(value), time = max(time);

    Note that GROUP BY columns in TTL should be a prefix of the table primary key (ORDER BY if PRIMARY KEY is not specified separately).

  3. Perform TTL materialization to apply the TTL rule to existing data:

    ALTER TABLE alerts MATERIALIZE TTL;
  4. Check the contents of the table:

    SELECT * FROM alerts;

    Rows for which the TTL rule was met (10 minutes have passed since their time values) were grouped, while the rest of the rows remained as before:

       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€time─┬─value─┬─alert────┐
    1. β”‚ 2024-07-26 21:51:41 β”‚     5 β”‚ Critical β”‚
    2. β”‚ 2024-07-26 21:48:41 β”‚    18 β”‚ Critical β”‚
    3. β”‚ 2024-07-26 21:46:41 β”‚    53 β”‚ Critical β”‚
    4. β”‚ 2024-07-26 21:50:41 β”‚    98 β”‚ Info     β”‚
    5. β”‚ 2024-07-26 21:49:41 β”‚    74 β”‚ Info     β”‚
    6. β”‚ 2024-07-26 21:47:41 β”‚    73 β”‚ Info     β”‚
    7. β”‚ 2024-07-26 21:42:41 β”‚    92 β”‚ Info     β”‚
    8. β”‚ 2024-07-26 21:45:41 β”‚    96 β”‚ Warning  β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

    If you run optimization after another 10 minutes, the data will be aggregated further — the table will have one row for the most recent alert of each type:

    OPTIMIZE TABLE alerts FINAL;
       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€time─┬─value─┬─alert────┐
    1. β”‚ 2024-07-26 21:51:41 β”‚    53 β”‚ Critical β”‚
    2. β”‚ 2024-07-26 21:50:41 β”‚    98 β”‚ Info     β”‚
    3. β”‚ 2024-07-26 21:45:41 β”‚    96 β”‚ Warning  β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Change compression for outdated data

  1. Create a table with a TTL rule to compress old data with the LZ4HC(10) codec and also set merge_with_recompression_ttl_timeout = 1200 (in seconds) in the table settings to specify how often data should be recompressed according to TTL (in this example, every 20 minutes):

    DROP TABLE IF EXISTS alerts;
    CREATE TABLE alerts (time DateTime, value UInt64, alert String)
    ENGINE = MergeTree
    ORDER BY time
    TTL time + INTERVAL 10 MINUTE RECOMPRESS CODEC(LZ4HC(10))
    SETTINGS merge_with_recompression_ttl_timeout = 1200;
  2. Insert data:

    INSERT INTO alerts
    SELECT now() - toIntervalMinute(number), number, ['Warning', 'Critical', 'Info'][(rand() % 3) + 1]
    FROM numbers(100);
  3. View information about data parts of the alerts table in the system.parts system table:

    SELECT name, active, rows, data_compressed_bytes, data_uncompressed_bytes
    FROM system.parts WHERE table = 'alerts'
    FORMAT Vertical;

    Inserted data rows were written as one part:

    Row 1:
    ──────
    name:                    all_1_1_0
    active:                  1
    rows:                    100
    data_compressed_bytes:   1114
    data_uncompressed_bytes: 1933
  4. Repeat the above INSERT INTO and SELECT … FROM system.parts queries. Now the table has two data parts:

    Row 1:
    ──────
    name:                    all_1_1_0
    active:                  1
    rows:                    100
    data_compressed_bytes:   1114
    data_uncompressed_bytes: 1933
    
    Row 2:
    ──────
    name:                    all_2_2_0
    active:                  1
    rows:                    100
    data_compressed_bytes:   1125
    data_uncompressed_bytes: 1931
  5. Some time later, repeat the SELECT … FROM system.parts query. In the query results, you can see:

    • initial data parts have been merged into one part (all_1_2_1) containing 200 rows;

    • the size of uncompressed data in the resulting part is equal to the total amount of uncompressed data in two merged parts;

    • the size of compressed data in the resulting part is smaller than the size of compressed data in both initial parts because the outdated data has been recompressed with the LZ4HC(10) codec during merging.

    Row 1:
    ──────
    name:                    all_1_2_1
    active:                  1
    rows:                    200
    data_compressed_bytes:   1509
    data_uncompressed_bytes: 3864
Move data between hot and cold storages

In this example, separate disks are used to store "hot" and "cold" data — SSD and HDD disks mounted on the following directories:

  • /mnt/ssd/adqm/

  • /mnt/hdd/adqm/

Each directory is assigned the clickhouse owner:

$ sudo chown clickhouse:clickhouse -R /mnt/ssd/adqm/ /mnt/hdd/adqm/
  1. In the ADCM interface, configure tiered storage for ADQM as described below.

    Specify local disks in the Enable additional local storage section (previously enabled) of the ADQMDB service’s configuration page:

    Configure local disks

    In the Storage policies section, create a policy (hot_to_cold) that includes two volumes for storing "hot" and "cold" data.

    Configure storage policy

    Click Save and run the Reconfig and restart action for the service to apply the changes.

    After that, information about disks and policy will appear in the appropriate tags inside the <storage_configuration> section in the /etc/clickhouse-server/config.d/storage.xml file.

    storage.xml
    <storage_configuration>
        <disks>
            <disk_ssd>
                <type>local</type>
                <path>/mnt/ssd/adqm/</path>
            </disk_ssd>
            <disk_hdd>
                <type>local</type>
                <path>/mnt/hdd/adqm/</path>
            </disk_hdd>
        </disks>
        <policies>
            <hot_to_cold>
                <volumes>
                    <volume_hot>
                        <disk>disk_ssd</disk>
                    </volume_hot>
                    <volume_cold>
                        <disk>disk_hdd</disk>
                    </volume_cold>
                </volumes>
            </hot_to_cold>
        </policies>
    </storage_configuration>

    You can view available disks in the system.disks system table and storage policies in the system.storage_policies table:

    SELECT name, path, free_space, total_space FROM system.disks;
       β”Œβ”€name─────┬─path─────────────────┬──free_space─┬─total_space─┐
    1. β”‚ default  β”‚ /var/lib/clickhouse/ β”‚ 45148381184 β”‚ 53674487808 β”‚
    2. β”‚ disk_hdd β”‚ /mnt/hdd/adqm/       β”‚ 45148381184 β”‚ 53674487808 β”‚
    3. β”‚ disk_ssd β”‚ /mnt/ssd/adqm/       β”‚ 45148381184 β”‚ 53674487808 β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    SELECT policy_name, volume_name, disks FROM system.storage_policies;
       β”Œβ”€policy_name─┬─volume_name─┬─disks────────┐
    1. β”‚ default     β”‚ default     β”‚ ['default']  β”‚
    2. β”‚ hot_to_cold β”‚ volume_hot  β”‚ ['disk_ssd'] β”‚
    3. β”‚ hot_to_cold β”‚ volume_cold β”‚ ['disk_hdd'] β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  2. Create a table with a TTL rule to move data between the volume_hot and volume_cold volumes:

    DROP TABLE IF EXISTS alerts;
    CREATE TABLE alerts (time DateTime, value UInt64, alert String)
    ENGINE = MergeTree
    ORDER BY time
    TTL time TO VOLUME 'volume_hot', time + INTERVAL 5 MINUTE TO VOLUME 'volume_cold'
    SETTINGS storage_policy = 'hot_to_cold';
  3. Insert test data into the table:

    INSERT INTO alerts
    SELECT
        now() - toIntervalMinute(number),
        number,
        ['Warning', 'Critical', 'Info'][(rand() % 3) + 1]
    FROM numbers(5);

    Repeat the INSERT INTO query some time later.

    As a result, the table will have two data parts located on the disk_ssd disk (the volume_hot volume):

    SELECT name, rows, active, disk_name FROM system.parts WHERE table = 'alerts';
       β”Œβ”€name──────┬─rows─┬─active─┬─disk_name─┐
    1. β”‚ all_1_1_0 β”‚    5 β”‚      1 β”‚ disk_ssd  β”‚
    2. β”‚ all_2_2_0 β”‚    5 β”‚      1 β”‚ disk_ssd  β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  4. 5 minutes after the first insertion, the first data part will be moved to the disk_hdd disk (the volume_cold volume). To verify this, repeat the SELECT … FROM system.parts query:

       β”Œβ”€name──────┬─rows─┬─active─┬─disk_name─┐
    1. β”‚ all_1_1_0 β”‚    5 β”‚      1 β”‚ disk_hdd  β”‚
    2. β”‚ all_2_2_0 β”‚    5 β”‚      1 β”‚ disk_ssd  β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Column TTL

When values in a column with TTL expire, ADQM replaces them with the default value for the column data type (or the value specified via DEFAULT). If all values in a column expire, the column is removed from the data part in the file system.

IMPORTANT
TTL cannot be used for key columns.

Assign TTL to columns

To specify a time to live for column values, use the TTL clause in the column description when creating a table with the CREATE TABLE query:

CREATE TABLE <table_name>
(   <timestamp_column> Date|DateTime,
    ...,
    <column_name1> [<data_type1>] TTL <timestamp_column> + INTERVAL <time_interval1> <interval_type1>,
    <column_name2> [<data_type2>] TTL <timestamp_column> + INTERVAL <time_interval2> <interval_type2>,
    ...)
ENGINE = MergeTree()
...;

A TTL expression specifying when column values should be deleted is described in the same way as a table-level TTL expression.

Modify TTL

To add a TTL to a column of an existing table or change the existing TTL of a column, use the ALTER TABLE … MODIFY COLUMN query:

ALTER TABLE <table_name> MODIFY COLUMN <column_name> <data_type> TTL <new_ttl_expr>;

Remove TTL

The following query allows you to delete TTL from a column:

ALTER TABLE <table_name> MODIFY COLUMN <column_name> REMOVE TTL;

Examples

Create a table with column-level TTL

  1. Create a table and set the following TTL rules for the columns:

    • value1 — a value in the column should be replaced with 100 if 5 minutes have passed since the time in the time column;

    • value2 — a column value should be replaced with the default one (0 for the Int data type) if 10 minutes have passed since the time in the time column.

    CREATE TABLE column_ttl_test
    (
        time DateTime,
        value1 Int DEFAULT 100 TTL time + INTERVAL 5 MINUTE,
        value2 Int TTL time + INTERVAL 10 MINUTE,
        alert String
    )
    ENGINE = MergeTree
    ORDER BY time
    SETTINGS min_rows_for_wide_part = 0, min_bytes_for_wide_part = 0;

    The min_rows_for_wide_part and min_bytes_for_wide_part settings are used to store data parts in the Wide format (for testing purposes) — each column in a separate file.
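    After data is inserted (see the next step), you can verify the part format via the part_type column of the system.parts table:

    SELECT name, part_type FROM system.parts WHERE table = 'column_ttl_test';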

  2. Insert data into the table:

    INSERT INTO column_ttl_test
    SELECT
        now() - toIntervalMinute(number),
        number,
        number + 10,
        ['Warning', 'Critical', 'Info'][(rand() % 3) + 1]
    FROM numbers(10);

    The first five rows contain 100 in the value1 column because their time values were already more than 5 minutes old at the time of insertion:

    SELECT * FROM column_ttl_test;
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€time─┬─value1─┬─value2─┬─alert────┐
     1. β”‚ 2024-06-25 11:42:36 β”‚    100 β”‚     19 β”‚ Warning  β”‚
     2. β”‚ 2024-06-25 11:43:36 β”‚    100 β”‚     18 β”‚ Critical β”‚
     3. β”‚ 2024-06-25 11:44:36 β”‚    100 β”‚     17 β”‚ Warning  β”‚
     4. β”‚ 2024-06-25 11:45:36 β”‚    100 β”‚     16 β”‚ Info     β”‚
     5. β”‚ 2024-06-25 11:46:36 β”‚    100 β”‚     15 β”‚ Info     β”‚
     6. β”‚ 2024-06-25 11:47:36 β”‚      4 β”‚     14 β”‚ Info     β”‚
     7. β”‚ 2024-06-25 11:48:36 β”‚      3 β”‚     13 β”‚ Warning  β”‚
     8. β”‚ 2024-06-25 11:49:36 β”‚      2 β”‚     12 β”‚ Info     β”‚
     9. β”‚ 2024-06-25 11:50:36 β”‚      1 β”‚     11 β”‚ Critical β”‚
    10. β”‚ 2024-06-25 11:51:36 β”‚      0 β”‚     10 β”‚ Critical β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  3. 3 minutes after inserting data into the table, run the query:

    OPTIMIZE TABLE column_ttl_test FINAL;

    Select data from the table again. In the results, you can see that values of the value1 column in rows 6-8 and values of the value2 column in rows 1-3 have been replaced:

        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€time─┬─value1─┬─value2─┬─alert────┐
     1. β”‚ 2024-06-25 11:42:36 β”‚    100 β”‚      0 β”‚ Warning  β”‚
     2. β”‚ 2024-06-25 11:43:36 β”‚    100 β”‚      0 β”‚ Critical β”‚
     3. β”‚ 2024-06-25 11:44:36 β”‚    100 β”‚      0 β”‚ Warning  β”‚
     4. β”‚ 2024-06-25 11:45:36 β”‚    100 β”‚     16 β”‚ Info     β”‚
     5. β”‚ 2024-06-25 11:46:36 β”‚    100 β”‚     15 β”‚ Info     β”‚
     6. β”‚ 2024-06-25 11:47:36 β”‚    100 β”‚     14 β”‚ Info     β”‚
     7. β”‚ 2024-06-25 11:48:36 β”‚    100 β”‚     13 β”‚ Warning  β”‚
     8. β”‚ 2024-06-25 11:49:36 β”‚    100 β”‚     12 β”‚ Info     β”‚
     9. β”‚ 2024-06-25 11:50:36 β”‚      1 β”‚     11 β”‚ Critical β”‚
    10. β”‚ 2024-06-25 11:51:36 β”‚      0 β”‚     10 β”‚ Critical β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  4. After another 2 minutes (5 minutes after the initial data insert), all values in the value1 column will be considered outdated and replaced with 100:

    OPTIMIZE TABLE column_ttl_test FINAL;
    SELECT * FROM column_ttl_test;
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€time─┬─value1─┬─value2─┬─alert────┐
     1. β”‚ 2024-06-25 11:42:36 β”‚    100 β”‚      0 β”‚ Warning  β”‚
     2. β”‚ 2024-06-25 11:43:36 β”‚    100 β”‚      0 β”‚ Critical β”‚
     3. β”‚ 2024-06-25 11:44:36 β”‚    100 β”‚      0 β”‚ Warning  β”‚
     4. β”‚ 2024-06-25 11:45:36 β”‚    100 β”‚      0 β”‚ Info     β”‚
     5. β”‚ 2024-06-25 11:46:36 β”‚    100 β”‚      0 β”‚ Info     β”‚
     6. β”‚ 2024-06-25 11:47:36 β”‚    100 β”‚     14 β”‚ Info     β”‚
     7. β”‚ 2024-06-25 11:48:36 β”‚    100 β”‚     13 β”‚ Warning  β”‚
     8. β”‚ 2024-06-25 11:49:36 β”‚    100 β”‚     12 β”‚ Info     β”‚
     9. β”‚ 2024-06-25 11:50:36 β”‚    100 β”‚     11 β”‚ Critical β”‚
    10. β”‚ 2024-06-25 11:51:36 β”‚    100 β”‚     10 β”‚ Critical β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

    The value1 column has been removed from the data part directory in the file system. You can check this as follows.

    Obtain the path to the directory with files of the active data part (all_1_1_3) from the system.parts table:

    SELECT name, active, path FROM system.parts WHERE table = 'column_ttl_test';
       β”Œβ”€name──────┬─active─┬─path──────────────────────────────────────────────────────────────────────────┐
    1. β”‚ all_1_1_0 β”‚      0 β”‚ /var/lib/clickhouse/store/62a/62a16a55-c91c-4c6b-9386-f4347ffac137/all_1_1_0/ β”‚
    2. β”‚ all_1_1_1 β”‚      0 β”‚ /var/lib/clickhouse/store/62a/62a16a55-c91c-4c6b-9386-f4347ffac137/all_1_1_1/ β”‚
    3. β”‚ all_1_1_2 β”‚      0 β”‚ /var/lib/clickhouse/store/62a/62a16a55-c91c-4c6b-9386-f4347ffac137/all_1_1_2/ β”‚
    4. β”‚ all_1_1_3 β”‚      1 β”‚ /var/lib/clickhouse/store/62a/62a16a55-c91c-4c6b-9386-f4347ffac137/all_1_1_3/ β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

    Make sure that there are no files with data of the value1 column (value1.bin and value1.cmrk2) in this directory:

    $ sudo ls /var/lib/clickhouse/store/62a/62a16a55-c91c-4c6b-9386-f4347ffac137/all_1_1_3/
    alert.bin    checksums.txt  count.txt                      metadata_version.txt  serialization.json  time.cmrk2  value2.bin
    alert.cmrk2  columns.txt    default_compression_codec.txt  primary.cidx          time.bin            ttl.txt     value2.cmrk2

    The columns.txt file also has no information about the value1 column:

    $ sudo cat /var/lib/clickhouse/store/62a/62a16a55-c91c-4c6b-9386-f4347ffac137/all_1_1_3/columns.txt
    columns format version: 1
    3 columns:
    `time` DateTime
    `value2` Int32
    `alert` String
Change a column TTL in an existing table

Use the ALTER TABLE … MODIFY COLUMN … TTL query to change the TTL for the value2 column of the table created in the previous example:

ALTER TABLE column_ttl_test MODIFY COLUMN value2 Int TTL time + INTERVAL 1 HOUR;

Check the current TTL settings with the SHOW CREATE TABLE query:

SHOW CREATE TABLE column_ttl_test;

As you can see, the new TTL rule is applied to the value2 column:

CREATE TABLE default.column_ttl_test
(
    `time` DateTime,
    `value1` Int32 DEFAULT 100 TTL time + toIntervalMinute(5),
    `value2` Int32 TTL time + toIntervalHour(1),
    `alert` String
)
ENGINE = MergeTree
ORDER BY time
SETTINGS min_rows_for_wide_part = 0, min_bytes_for_wide_part = 0, index_granularity = 8192