Data compression

Elena Dvoryadkina

Contents

Data compression codecs
- General-purpose codecs
- Specialized codecs
Apply compression codecs
- At the server level
- At the column level
General recommendations
Example

Data compression plays an important role in achieving high performance in ADQM. For example, compression reduces the amount of data that has to be read when a query is executed, which speeds up the query (see the example in the Improve query performance article).

ADQM supports data compression for the following table engines:

- MergeTree: you can set a default compression algorithm for tables and specify a compression method (or a combination of methods) for an individual column when creating or modifying a table (see Apply compression codecs below).
- Log: uses the LZ4 compression method by default and supports column-level compression.
- Set, Join: support default compression only.
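
As a minimal sketch of the Log engine case, the statement below creates a Log table in which one column keeps the engine default and another gets an explicit column-level codec. The table and column names (log_codec_demo, id, message) are hypothetical and serve only as an illustration.

```sql
-- Hypothetical table for illustration only
CREATE TABLE log_codec_demo
(
    id UInt64,                     -- no CODEC clause: the default method (LZ4) is used
    message String CODEC(ZSTD(3))  -- column-level codec set explicitly
)
ENGINE = Log;
```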

Data compression codecs

A codec defines a compression method applied to ADQM data. The following data compression codecs are available in ADQM:

- General-purpose codecs, which let you find the optimal balance between disk space consumption and CPU usage.
- Specialized codecs for specific data types. Some of these codecs do not compress data themselves but preprocess it (taking data characteristics into account) for general-purpose codecs, which then compress the prepared data more efficiently than unprepared data.

General-purpose codecs

NONE
    Compression algorithm: no compression.
    Compression level: not applicable.

LZ4
    Compression algorithm: the LZ4 lossless compression algorithm. Provides both high speed and a good compression ratio. Used by default.
    Compression level: not applicable.

LZ4HC[(level)]
    Compression algorithm: the LZ4 HC (high compression) algorithm with a configurable compression level. Compared with LZ4, it compresses data better but more slowly (decompression is still fast).
    Compression level: possible levels are [1, 12], the recommended range is [4, 9]. The default level is 9 (used if the level is not specified or is set to a value less than or equal to 0).

ZSTD[(level)]
    Compression algorithm: the ZSTD compression algorithm. Provides both speed and a high, tunable compression ratio. Even without a specialized codec, ZSTD often outperforms other compression methods or is at least competitive with them.
    Compression level: possible levels are [1, 22]. The default level is 1. Levels above 3 rarely result in a significant improvement, but high levels (better compression at the cost of higher CPU usage) can be useful in asymmetric scenarios such as "compress once, decompress repeatedly". In any case, testing is recommended.

ZSTD_QAT[(level)]
    Compression algorithm: the ZSTD compression algorithm with a configurable compression level, implemented by Intel QATlib and Intel QAT ZSTD Plugin. Limitations:
    - The ZSTD_QAT codec is disabled by default and can be used only after you enable the parameter: SET enable_zstd_qat_codec = 1.
    - For compression, ZSTD_QAT tries to use an Intel QuickAssist Technology (QAT) offloading device. If no such device is found, compression is performed in software.
    - Decompression is always performed in software.
    Compression level: possible levels are [1, 12], the recommended range is [6, 12]. The default level is 1.

DEFLATE_QPL
    Compression algorithm: the Deflate compression algorithm implemented by Intel Query Processing Library (Intel QPL). Limitations:
    - The DEFLATE_QPL codec is disabled by default and can be used only after you enable the parameter: SET enable_deflate_qpl_codec = 1.
    - DEFLATE_QPL works best if the system has an Intel In-Memory Analytics Accelerator (IAA) offloading device. See Accelerator Configuration and Benchmark with DEFLATE_QPL for more details.
    Technical support for this codec in ADQM is limited.
    Compression level: not applicable.
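
The codec names and levels listed above are written in a CODEC clause as sketched below. This is an illustration only: the table and column names are hypothetical, and the levels are simply picked from the listed ranges, not tuned recommendations.

```sql
-- Hypothetical table for illustration only
CREATE TABLE general_codecs_demo
(
    raw_payload String CODEC(NONE),       -- stored uncompressed
    hot_text String CODEC(LZ4),           -- fast compression and decompression
    warm_text String CODEC(LZ4HC(9)),     -- better ratio than LZ4, slower compression
    cold_text String CODEC(ZSTD(3)),      -- level taken from the [1, 22] range
    archive_text String CODEC(ZSTD(12))   -- higher level: more CPU spent when writing
)
ENGINE = MergeTree()
ORDER BY tuple();
```

Whether the higher levels pay off depends on the data, so it is worth comparing column sizes after loading a representative sample.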

Specialized codecs

Delta(delta_bytes)
    Description: replaces raw values with the difference (delta) between two neighboring values; the first value remains unchanged. Works well with monotonically increasing data. Delta is a data preparation codec and cannot be used stand-alone (it should be followed by a compression codec).
    Parameters: delta_bytes is the maximum number of bytes used to store a delta value, that is, the maximum size of raw values. Possible values are 1, 2, 4, 8. The default value is sizeof(type) if it equals 1, 2, 4, or 8; otherwise, it is 1.

DoubleDelta(bytes_size)
    Description: calculates the difference between neighboring delta values and stores it in compact binary form. An optimal compression ratio can be achieved for a monotonic sequence with a constant stride (such as time series data). Can be used with any fixed-length data type. Implements the algorithm used in Gorilla TSDB, extending it to support 64-bit data types. Uses 1 extra bit for 32-bit deltas: 5-bit prefixes instead of 4-bit prefixes. For more information, see the Compressing Time Stamps section of the "Gorilla: A Fast, Scalable, In-Memory Time Series Database" article. DoubleDelta is a data preparation codec and cannot be used stand-alone (it should be followed by a compression codec).
    Parameters: possible values for bytes_size are 1, 2, 4, 8. The default value is sizeof(type) if it equals 1, 2, 4, or 8; otherwise, it is 1.

GCD
    Description: calculates the greatest common divisor (GCD) of the values in a column, then divides each value by the GCD. Can be used with integer, decimal, and date/time columns. The codec is well suited for columns whose values change (increase or decrease) in multiples of the GCD, for example, 24, 28, 16, 24, 8, 24 (GCD = 4). GCD is a data preparation codec and cannot be used stand-alone (it should be followed by a compression codec).
    Parameters: none.

Gorilla(bytes_size)
    Description: calculates the XOR of the current and previous floating-point values and writes it in compact binary form. The smaller the difference between consecutive values (that is, the more slowly the values in a series change), the better the compression ratio. Implements the algorithm used in Gorilla TSDB, extending it to support 64-bit types. For more information, see the Compressing values section of the "Gorilla: A Fast, Scalable, In-Memory Time Series Database" article.
    Parameters: possible values for bytes_size are 1, 2, 4, 8. The default value is sizeof(type) if it equals 1, 2, 4, or 8; otherwise, it is 1.

FPC(level, float_size)
    Description: repeatedly predicts the next floating-point value in the sequence using the better of two predictors, then XORs the actual value with the predicted value and applies leading-zero compression to the result. Like Gorilla, FPC efficiently stores series of slowly changing floating-point values. For 64-bit values, FPC is faster than Gorilla; for 32-bit values, results may vary. For a detailed description of the algorithm, refer to High Throughput Compression of Double-Precision Floating-Point Data.
    Parameters: possible values for level are [1, 28], the default value is 12. Possible values for float_size are 4 and 8. The default value is sizeof(type) if the type is Float; otherwise, it is 4.

T64
    Description: crops unused high bits of integer values (including Enum, Date, and DateTime). At each step of the algorithm, the codec places a block of 64 values into a 64x64 bit matrix, transposes it, crops the unused bits of the values, and returns the rest as a sequence. Unused bits are the bits that do not differ between the maximum and minimum values in the whole data part for which the compression is used.
    Parameters: none.
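
The arithmetic behind Delta, DoubleDelta, and GCD can be illustrated with ordinary array functions. The queries below only mimic the transformations on literal arrays (chosen here for illustration); they do not reproduce the binary format the codecs write. Note that arrayDifference returns 0 for the first element, whereas the Delta codec keeps the first value unchanged.

```sql
-- Delta: differences between neighboring values of a monotonic sequence
SELECT arrayDifference([1000, 1010, 1020, 1030]) AS deltas;
-- deltas: [0, 10, 10, 10]

-- DoubleDelta: differences between neighboring deltas
-- (mostly zeros for a sequence with a constant stride)
SELECT arrayDifference(arrayDifference([1000, 1010, 1020, 1030])) AS double_deltas;
-- double_deltas: [0, 10, 0, 0]

-- GCD: divide every value by the greatest common divisor of the values
SELECT
    gcd(gcd(24, 28), gcd(16, 8)) AS common_divisor,
    arrayMap(x -> intDiv(x, 4), [24, 28, 16, 24, 8, 24]) AS reduced;
-- common_divisor: 4, reduced: [6, 7, 4, 6, 2, 6]
```

In each case the transformed sequence contains smaller or more repetitive numbers, which is what lets a general-purpose codec applied afterwards compress it more efficiently.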

The DoubleDelta and Gorilla codecs are used in Gorilla TSDB as components of its compression algorithm. The Gorilla approach is effective in scenarios when data is a sequence of slowly changing values with their timestamps. Timestamps are effectively compressed by the DoubleDelta codec, and values are effectively compressed by the Gorilla codec.
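
A minimal sketch of this pattern for a time-series table is shown below. The table and column names (sensor_readings, sensor_id, ts, value) are hypothetical, and the choice of the final general-purpose codec is an assumption that should be verified on real data.

```sql
-- Hypothetical time-series table for illustration only
CREATE TABLE sensor_readings
(
    sensor_id UInt32 CODEC(T64, LZ4),      -- small integer identifiers: crop unused high bits
    ts DateTime CODEC(DoubleDelta, ZSTD),  -- timestamps arriving at a near-constant interval
    value Float64 CODEC(Gorilla, ZSTD)     -- slowly changing floating-point measurements
)
ENGINE = MergeTree()
ORDER BY (sensor_id, ts);
```

Sorting by (sensor_id, ts) keeps the timestamps of each sensor monotonic within a data part, which is the case DoubleDelta is designed for.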

Apply compression codecs

At the server level

By default, ADQM applies the compression method defined in the server configuration to columns of MergeTree tables. You can change the default compression settings in the Data compression section on the configuration page of the ADQMDB service. This section becomes visible when the Show advanced option is enabled. To access and apply configuration parameters, turn on the Data compression switch.

Set up server-level data compression settings for MergeTree tables

After you have defined the data compression parameters, click Save and run the Reconfig and restart action for the ADQMDB service.

NOTE

Changing the default compression settings does not affect existing tables.

At the column level

You can also set up the compression method for each column separately in the CREATE TABLE query using the CODEC clause:

CREATE TABLE <table_name>
(   <column_name1> <data_type1> CODEC(<compression_codec1>) [...],
    <column_name2> <data_type2> CODEC(<compression_codec2>) [...],
    ...)
ENGINE = MergeTree()
...;

If the Default codec is assigned to a column (CODEC(Default)) or no codec is specified, the default compression algorithm set at the server level is used.

Codecs can be combined sequentially — for example, when using a specialized codec to prepare data for more efficient compression by a general-purpose codec (CODEC(Delta, ZSTD)).
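
The following sketch puts the three cases side by side: a column without a codec, a column with CODEC(Default), and a column with a combined codec. The table and column names are hypothetical. The second query reads the codecs recorded for the columns from the system.columns system table.

```sql
-- Hypothetical table for illustration only
CREATE TABLE codec_default_demo
(
    id UInt64,                         -- no codec: the server-level default applies
    created Date CODEC(Default),       -- explicit reference to the server-level default
    counter UInt64 CODEC(Delta, ZSTD)  -- preprocessing codec followed by a general-purpose codec
)
ENGINE = MergeTree()
ORDER BY id;

-- Check which codecs are recorded for the columns
SELECT name, type, compression_codec
FROM system.columns
WHERE database = currentDatabase() AND table = 'codec_default_demo';
```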

For an existing table, you can change the compression codec of a column using the following query:

ALTER TABLE <table_name> MODIFY COLUMN <column_name> CODEC(<new_compression_codec>);
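
A concrete sketch of such a change is given below; the table and column names are hypothetical. In ClickHouse-based systems, a codec change normally applies to newly written data parts, while existing parts are recompressed when they are merged. Treat this as an assumption and verify it for your ADQM version.

```sql
-- Hypothetical table used only for this illustration
CREATE TABLE alter_codec_demo
(
    event_time DateTime,
    amount Float64 CODEC(LZ4)
)
ENGINE = MergeTree()
ORDER BY event_time;

-- Change the codec of an existing column
ALTER TABLE alter_codec_demo MODIFY COLUMN amount CODEC(Gorilla, ZSTD(3));

-- Optional: force a merge so that existing data parts are rewritten
-- (can be expensive on large tables)
OPTIMIZE TABLE alter_codec_demo FINAL;
```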

General recommendations

To get the best performance from data compression, follow these recommendations:

- Choose the most suitable compression algorithm for a specific task, taking into account the type and characteristics of your data (if known).
- When specifying a compression level, remember that a higher level gives a better compression ratio but slower compression. Data compression improves I/O performance and reduces data size at the expense of CPU, so look for settings that avoid both long compression times and slow scan rates.
- The final performance of compressed tables depends on many factors, such as hardware and query tuning parameters. Run comparison tests to determine whether data compression is useful in your environment.

TIP

It is recommended to always balance between the available CPU resources and the required compression speed.
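
For the comparison testing mentioned above, one possible starting point is to measure how much space a table occupies before and after a codec change. The query below is a sketch that sums the sizes of active data parts from the system.parts system table; my_table is a placeholder name, not a real object.

```sql
SELECT
    table,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.parts
WHERE active
  AND database = currentDatabase()
  AND table = 'my_table'  -- placeholder: replace with your table name
GROUP BY table;
```

Size is only one side of the trade-off, so it makes sense to also compare insert and query times for the same data.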

Example

Create a MergeTree table with Int64 and Float64 columns using the LZ4 and ZSTD data compression codecs in combination with various specialized codecs for preprocessing data:

CREATE TABLE compression_test (
    n Int32,
    i64 Int64 DEFAULT n CODEC(NONE),
    i64_lz4 Int64 DEFAULT n CODEC(LZ4),
    i64_delta_lz4 Int64 DEFAULT n CODEC(Delta, LZ4),
    i64_doubledelta_lz4 Int64 DEFAULT n CODEC(DoubleDelta, LZ4),
    i64_t64_lz4 Int64 DEFAULT n CODEC(T64, LZ4),
    i64_zstd Int64 DEFAULT n CODEC(ZSTD),
    i64_delta_zstd Int64 DEFAULT n CODEC(Delta, ZSTD),
    i64_doubledelta_zstd Int64 DEFAULT n CODEC(DoubleDelta, ZSTD),
    i64_t64_zstd Int64 DEFAULT n CODEC(T64, ZSTD),
    f64 Float64 DEFAULT n CODEC(NONE),
    f64_lz4 Float64 DEFAULT n CODEC(LZ4),
    f64_delta_lz4 Float64 DEFAULT n CODEC(Delta, LZ4),
    f64_doubledelta_lz4 Float64 DEFAULT n CODEC(DoubleDelta, LZ4),
    f64_gorilla_lz4 Float64 DEFAULT n CODEC(Gorilla, LZ4),
    f64_fpc_lz4 Float64 DEFAULT n CODEC(FPC, LZ4),
    f64_zstd Float64 DEFAULT n CODEC(ZSTD),
    f64_delta_zstd Float64 DEFAULT n CODEC(Delta, ZSTD),
    f64_doubledelta_zstd Float64 DEFAULT n CODEC(DoubleDelta, ZSTD),
    f64_gorilla_zstd Float64 DEFAULT n CODEC(Gorilla, ZSTD),
    f64_fpc_zstd Float64 DEFAULT n CODEC(FPC, ZSTD))
ENGINE = MergeTree()
ORDER BY tuple();

Generate test data for the table (1 million rows) so that each column contains a monotonic sequence of values with random increment:

INSERT INTO compression_test (n) SELECT number*1000+(rand()%100) FROM numbers(1000000);

Use the system.columns system table to view the size of compressed and uncompressed data in each column and column compression codecs:

SELECT
    name,
    type,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    formatReadableSize(data_compressed_bytes) AS compressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio,
    compression_codec AS codec
FROM system.columns
WHERE table = 'compression_test' AND name != 'n';

The results show that specialized codecs can significantly change how well general-purpose codecs compress data, depending on the column data type. For example, for the Int64 type, both LZ4 and ZSTD compress best in combination with the DoubleDelta codec (ratios of 6.21 and 6.23), and for the Float64 type, the best results come from ZSTD combined with Delta (5.11) or FPC (4.08).

    ┌─name─────────────────┬─type────┬─uncompressed─┬─compressed─┬─ratio─┬─codec───────────────────────┐
 1. │ i64                  │ Int64   │ 7.63 MiB     │ 7.63 MiB   │     1 │ CODEC(NONE)                 │
 2. │ i64_lz4              │ Int64   │ 7.63 MiB     │ 4.78 MiB   │   1.6 │ CODEC(LZ4)                  │
 3. │ i64_delta_lz4        │ Int64   │ 7.63 MiB     │ 2.89 MiB   │  2.64 │ CODEC(Delta(8), LZ4)        │
 4. │ i64_doubledelta_lz4  │ Int64   │ 7.63 MiB     │ 1.23 MiB   │  6.21 │ CODEC(DoubleDelta, LZ4)     │
 5. │ i64_t64_lz4          │ Int64   │ 7.63 MiB     │ 1.62 MiB   │  4.72 │ CODEC(T64, LZ4)             │
 6. │ i64_zstd             │ Int64   │ 7.63 MiB     │ 1.97 MiB   │  3.87 │ CODEC(ZSTD(1))              │
 7. │ i64_delta_zstd       │ Int64   │ 7.63 MiB     │ 1.36 MiB   │  5.62 │ CODEC(Delta(8), ZSTD(1))    │
 8. │ i64_doubledelta_zstd │ Int64   │ 7.63 MiB     │ 1.22 MiB   │  6.23 │ CODEC(DoubleDelta, ZSTD(1)) │
 9. │ i64_t64_zstd         │ Int64   │ 7.63 MiB     │ 1.42 MiB   │  5.36 │ CODEC(T64, ZSTD(1))         │
10. │ f64                  │ Float64 │ 7.63 MiB     │ 7.63 MiB   │     1 │ CODEC(NONE)                 │
11. │ f64_lz4              │ Float64 │ 7.63 MiB     │ 5.01 MiB   │  1.52 │ CODEC(LZ4)                  │
12. │ f64_delta_lz4        │ Float64 │ 7.63 MiB     │ 2.90 MiB   │  2.63 │ CODEC(Delta(8), LZ4)        │
13. │ f64_doubledelta_lz4  │ Float64 │ 7.63 MiB     │ 3.47 MiB   │   2.2 │ CODEC(DoubleDelta, LZ4)     │
14. │ f64_gorilla_lz4      │ Float64 │ 7.63 MiB     │ 2.66 MiB   │  2.87 │ CODEC(Gorilla, LZ4)         │
15. │ f64_fpc_lz4          │ Float64 │ 7.63 MiB     │ 3.11 MiB   │  2.45 │ CODEC(FPC(12), LZ4)         │
16. │ f64_zstd             │ Float64 │ 7.63 MiB     │ 2.22 MiB   │  3.44 │ CODEC(ZSTD(1))              │
17. │ f64_delta_zstd       │ Float64 │ 7.63 MiB     │ 1.49 MiB   │  5.11 │ CODEC(Delta(8), ZSTD(1))    │
18. │ f64_doubledelta_zstd │ Float64 │ 7.63 MiB     │ 2.05 MiB   │  3.72 │ CODEC(DoubleDelta, ZSTD(1)) │
19. │ f64_gorilla_zstd     │ Float64 │ 7.63 MiB     │ 2.32 MiB   │  3.29 │ CODEC(Gorilla, ZSTD(1))     │
20. │ f64_fpc_zstd         │ Float64 │ 7.63 MiB     │ 1.87 MiB   │  4.08 │ CODEC(FPC(12), ZSTD(1))     │
    └──────────────────────┴─────────┴──────────────┴────────────┴───────┴─────────────────────────────┘

For random data, Delta and DoubleDelta are much less effective, as the example below shows. Such data is compressed better with T64 (for integer values) and Gorilla (for floating-point values), so these codecs can be recommended when the data pattern is unknown.

CREATE TABLE compression_test_rand AS compression_test;

INSERT INTO compression_test_rand (n) SELECT rand()%(1000000) FROM numbers(1000000);

SELECT
    name,
    type,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    formatReadableSize(data_compressed_bytes) AS compressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio,
    compression_codec AS codec
FROM system.columns
WHERE table = 'compression_test_rand' AND name != 'n';

    ┌─name─────────────────┬─type────┬─uncompressed─┬─compressed─┬─ratio─┬─codec───────────────────────┐
 1. │ i64                  │ Int64   │ 7.63 MiB     │ 7.63 MiB   │     1 │ CODEC(NONE)                 │
 2. │ i64_lz4              │ Int64   │ 7.63 MiB     │ 4.40 MiB   │  1.73 │ CODEC(LZ4)                  │
 3. │ i64_delta_lz4        │ Int64   │ 7.63 MiB     │ 4.63 MiB   │  1.65 │ CODEC(Delta(8), LZ4)        │
 4. │ i64_doubledelta_lz4  │ Int64   │ 7.63 MiB     │ 4.43 MiB   │  1.72 │ CODEC(DoubleDelta, LZ4)     │
 5. │ i64_t64_lz4          │ Int64   │ 7.63 MiB     │ 2.40 MiB   │  3.18 │ CODEC(T64, LZ4)             │
 6. │ i64_zstd             │ Int64   │ 7.63 MiB     │ 3.32 MiB   │   2.3 │ CODEC(ZSTD(1))              │
 7. │ i64_delta_zstd       │ Int64   │ 7.63 MiB     │ 3.42 MiB   │  2.23 │ CODEC(Delta(8), ZSTD(1))    │
 8. │ i64_doubledelta_zstd │ Int64   │ 7.63 MiB     │ 3.84 MiB   │  1.99 │ CODEC(DoubleDelta, ZSTD(1)) │
 9. │ i64_t64_zstd         │ Int64   │ 7.63 MiB     │ 2.39 MiB   │  3.19 │ CODEC(T64, ZSTD(1))         │
10. │ f64                  │ Float64 │ 7.63 MiB     │ 7.63 MiB   │     1 │ CODEC(NONE)                 │
11. │ f64_lz4              │ Float64 │ 7.63 MiB     │ 4.67 MiB   │  1.63 │ CODEC(LZ4)                  │
12. │ f64_delta_lz4        │ Float64 │ 7.63 MiB     │ 4.85 MiB   │  1.57 │ CODEC(Delta(8), LZ4)        │
13. │ f64_doubledelta_lz4  │ Float64 │ 7.63 MiB     │ 5.94 MiB   │  1.28 │ CODEC(DoubleDelta, LZ4)     │
14. │ f64_gorilla_lz4      │ Float64 │ 7.63 MiB     │ 3.12 MiB   │  2.44 │ CODEC(Gorilla, LZ4)         │
15. │ f64_fpc_lz4          │ Float64 │ 7.63 MiB     │ 5.03 MiB   │  1.52 │ CODEC(FPC(12), LZ4)         │
16. │ f64_zstd             │ Float64 │ 7.63 MiB     │ 3.26 MiB   │  2.34 │ CODEC(ZSTD(1))              │
17. │ f64_delta_zstd       │ Float64 │ 7.63 MiB     │ 3.44 MiB   │  2.22 │ CODEC(Delta(8), ZSTD(1))    │
18. │ f64_doubledelta_zstd │ Float64 │ 7.63 MiB     │ 4.07 MiB   │  1.87 │ CODEC(DoubleDelta, ZSTD(1)) │
19. │ f64_gorilla_zstd     │ Float64 │ 7.63 MiB     │ 3.05 MiB   │   2.5 │ CODEC(Gorilla, ZSTD(1))     │
20. │ f64_fpc_zstd         │ Float64 │ 7.63 MiB     │ 4.20 MiB   │  1.82 │ CODEC(FPC(12), ZSTD(1))     │
    └──────────────────────┴─────────┴──────────────┴────────────┴───────┴─────────────────────────────┘
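
The same approach can be used to check how much a higher ZSTD level actually helps on your data. The sketch below reuses the data generator from the first test; the table name and the chosen levels (1, 3, 9, 15) are assumptions made for illustration, and no results are shown because they depend on the data and the environment.

```sql
-- Hypothetical follow-up test: one column per ZSTD level
CREATE TABLE compression_test_zstd_levels (
    n Int32,
    i64_zstd1 Int64 DEFAULT n CODEC(ZSTD(1)),
    i64_zstd3 Int64 DEFAULT n CODEC(ZSTD(3)),
    i64_zstd9 Int64 DEFAULT n CODEC(ZSTD(9)),
    i64_zstd15 Int64 DEFAULT n CODEC(ZSTD(15)))
ENGINE = MergeTree()
ORDER BY tuple();

-- Same monotonic sequence with a random increment as in the first test
INSERT INTO compression_test_zstd_levels (n) SELECT number*1000+(rand()%100) FROM numbers(1000000);

-- Compare compressed sizes per level
SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio,
    compression_codec AS codec
FROM system.columns
WHERE table = 'compression_test_zstd_levels' AND name != 'n';
```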
