Integration between ADQM and ADPG

Elena Dvoryadkina

Overview

ClickHouse supports a set of tools to integrate with PostgreSQL:

the postgresql table function | the PostgreSQL table engine | the PostgreSQL database engine — provide the ability to send queries from ClickHouse to tables/databases on a PostgreSQL server for real-time data exchange. They support read and write operations on PostgreSQL data. When executing a SELECT or INSERT query, ClickHouse connects to a remote PostgreSQL table/database and retrieves up-to-date data from PostgreSQL or exports new data to it. In other words, the data is stored in PostgreSQL, and you can access it from ClickHouse.
the MaterializedPostgreSQL table engine | the MaterializedPostgreSQL database engine — allow using ClickHouse as a PostgreSQL replica. When you create a MaterializedPostgreSQL table/database, ClickHouse creates an initial data dump of a table or tables of the PostgreSQL database and starts the replication process — in the background, it applies data changes performed on the PostgreSQL side to the corresponding MaterializedPostgreSQL table/database. Thus, when MaterializedPostgreSQL engines are used, unlike PostgreSQL engines, data is stored in both PostgreSQL and ClickHouse.
the POSTGRESQL type of a data source for dictionaries — allows you to assign a PostgreSQL table to be a data source for a dictionary in ClickHouse.

This article describes ways to use these tools on the example of ADQM and ADPG (Arenadata Postgres — a relational database management system based on PostgreSQL). It provides the syntax of queries to call a table function and create tables/databases based on the mentioned integration engines, as well as sample queries to read/write ADPG data from ADQM.

Prepare ADPG for test examples

On the ADPG host, run the psql terminal client as the default user (postgres) using the following commands:

$ sudo su - postgres

$ psql

Now you can enter commands in psql to work with PostgreSQL databases and tables:

psql (14.1)
Type "help" for help.

postgres=#

NOTE

For an overview on how to work with PostgreSQL tables in psql, see Example of working with tables via psql in the ADPG documentation.

Create a PostgreSQL user

Create a user for connecting to ADPG from ADQM (for demonstration purposes, grant superuser privileges):

CREATE ROLE adqm_user SUPERUSER LOGIN PASSWORD '<adqm_user_secret_password>';

Create an ADPG database and table to use in examples with the table function and PostgreSQL engines

Create the adpg_db1 database in ADPG:

CREATE DATABASE adpg_db1;

Switch to this database:

\c adpg_db1

The command result:

You are now connected to database "adpg_db1" as user "postgres".
adpg_db1=#

Create a table:

CREATE TABLE table1 (id INTEGER PRIMARY KEY, name VARCHAR(10));

Add test data to the table:

INSERT INTO table1 (id, name) VALUES (1, 'one'), (2, 'two');

Create an ADPG database and tables to use in examples with MaterializedPostgreSQL engines

Create a new database in ADPG (adpg_db2) and connect to it:
CREATE DATABASE adpg_db2;
\c adpg_db2

In the adpg_db2 database, create three tables with the same structure:

CREATE TABLE table1 (id INTEGER PRIMARY KEY, name VARCHAR(10));

CREATE TABLE table2 (id INTEGER PRIMARY KEY, name VARCHAR(10));

CREATE TABLE table3 (id INTEGER PRIMARY KEY, name VARCHAR(10));

Add test data to these tables:

INSERT INTO table1 (id, name) VALUES (1, 'one'), (2, 'two');

INSERT INTO table2 (id, name) VALUES (3, 'three'), (4, 'four');

INSERT INTO table3 (id, name) VALUES (5, 'five'), (6, 'six');

Create an ADPG table to be a data source for a dictionary

In the adpg_db2 database, create the roles_dict table to be a data source for a dictionary:

CREATE TABLE roles_dict (id INTEGER PRIMARY KEY, role VARCHAR(10));

INSERT INTO roles_dict (id, role) VALUES (10, 'admin'), (20, 'owner'), (30, 'author'), (40, 'reviewer'), (50, 'editor'), (60, 'view only');

Set up authentication for connecting to ADPG

To allow the adqm_user user to connect to the adpg_db1 and adpg_db2 databases from an ADQM host and set password authentication, add the following records to the pg_hba.conf file (for ADPG, you can do this in the ADCM interface — see the Password authentication article for details):

host adpg_db1 adqm_user <adqm_host_address> password
host adpg_db2 adqm_user <adqm_host_address> password

where <adqm_host_address> is an address of the ADQM host (for example, if an IPv4 address of the ADQM host is 10.92.40.25, replace <adqm_host_address> with 10.92.40.25/32). You can find description of the address field and other fields of records in the pg_hba.conf file in the article The pg_hba.conf file.

Basic parameters for integration

To call the postgresql table function and create tables/databases based on engines for integration with PostgreSQL (the basic syntax of corresponding queries is provided in sections below), it is required to specify the following parameters.

<postgres_host>

PostgreSQL server address (IP or host name)

<postgres_database>

PostgreSQL database name

<postgres_table>

Name of a table in the PostgreSQL database (specify it for the postgresql table function and for the PostgreSQL/MaterializedPostgreSQL table engines)

<postgres_user>

PostgreSQL user name used to connect to PostgreSQL from ClickHouse

<user_password>

PostgreSQL user password

<schema>

PostgreSQL schema (you can optionally specify it for the postgresql table function and for the PostgreSQL table/database engines)

TIP

For production environments, it is recommended to pass parameters using named collections.

The postgresql table function

The postgresql table function allows you to send SELECT and INSERT queries to tables in PostgreSQL databases.

The function syntax is:

postgresql('<postgres_host>:5432', '<postgres_database>', '<postgres_table>', '<postgres_user>', '<user_password>'[, '<schema>'])

Function parameters are described above.

Example

Read data from ADPG

Run the following query to get ADPG data from the adpg_db1.table1 table in ADQM using the postgresql function:

SELECT id, name FROM postgresql('<adpg_host>:5432', 'adpg_db1', 'table1', 'adqm_user', '<adqm_user_secret_password>');

where <adpg_host> is an ADPG host address (IP or host name).

The result:

┌─id─┬─name──┐
│  1 │ one   │
│  2 │ two   │
└────┴───────┘

Write data to ADPG

Run the following query to send a new record to the ADPG table from ADQM:

INSERT INTO TABLE FUNCTION postgresql('<adpg_host>:5432', 'adpg_db1', 'table1', 'adqm_user', '<adqm_user_secret_password>') (id, name)
VALUES (3, 'three');

Make sure that data is written to the destination table. To do this, run the following command on the ADPG host (connect to the adpg_db1 database first):

SELECT * FROM table1;

 id | name
----+-------
  1 | one
  2 | two
  3 | three
(3 rows)

The PostgreSQL table engine

The PostgreSQL table engine allows you to send SELECT and INSERT queries to tables on a remote PostgreSQL server.

Create a table

Below is the basic syntax of a query that creates a PostgreSQL table (see parameter descriptions above):

CREATE TABLE <table_name> (<column_name> <column_type> [DEFAULT|MATERIALIZED|ALIAS <expr>] [TTL <expr>], ...)
ENGINE = PostgreSQL('<postgres_host>:5432', '<postgres_database>', '<postgres_table>', '<postgres_user>', '<user_password>'[, '<schema>']);

The structure of an ADQM table can differ from the structure of an original PostgreSQL table you are connecting to:

Column names in the ADQM table should be the same as in the original PostgreSQL table, but you can use just some of these columns and in any order.
Column types in the ADQM table may differ from types in the original PostgreSQL table — ADQM tries to cast values to ClickHouse data types.
The external_table_functions_use_nulls setting defines how Nullable columns are handled. The default value is 1. If the value is 0, columns are not converted to Nullable — default values of corresponding data types are inserted instead of NULL values. This also applies to NULL values in arrays.

Example

In ADQM, create a table based on the PostgreSQL engine and read its data:

CREATE TABLE psql_table (id Int32, name String)
ENGINE = PostgreSQL('<adpg_host>:5432', 'adpg_db1', 'table1', 'adqm_user', '<adqm_user_secret_password>');

SELECT * FROM psql_table;

┌─id─┬─name──┐
│  1 │ one   │
│  2 │ two   │
│  3 │ three │
└────┴───────┘

Insert data into the ADQM table:

INSERT INTO psql_table VALUES (4, 'four');

On the ADPG host, query the adpg_db1.table1 table to check that data from ADQM was successfully imported into it:

SELECT * FROM table1;

 id | name
----+-------
  1 | one
  2 | two
  3 | three
  4 | four
(4 rows)

The PostgreSQL database engine

The PostgreSQL database engine allows you to connect to a database on a PostgreSQL server and perform read/write operations (SELECT/INSERT queries) to exchange data between ClickHouse and PostgreSQL, as well as receive real-time information about database tables from the PostgreSQL server and their structures using the SHOW TABLES and DESCRIBE TABLE queries.

Create a database

The syntax of a query that creates a PostgreSQL database in ADQM is:

CREATE DATABASE <database_name>
ENGINE = PostgreSQL('<postgres_host>:5432', '<postgres_database>', '<postgres_user>', '<user_password>'[, `<schema>`, `use_table_cache`]);

See descriptions of basic parameters above. The optional use_table_cache parameter specifies whether structures of tables are cached. If the parameter is set to 1, a table structure is cached and its changes are not tracked — you can update the table structure manually running the DETACH TABLE and ATTACH TABLE queries sequentially.

Example

Create a database based on the PostgreSQL engine:

CREATE DATABASE psql_database
ENGINE = PostgreSQL('<adpg_host>:5432', 'adpg_db1', 'adqm_user', '<adqm_user_secret_password>', 1);

Check that this database has been added to the ADQM database list:

SHOW DATABASES;

┌─name───────────────┐
│ INFORMATION_SCHEMA │
│ default            │
│ information_schema │
│ psql_database      │
│ system             │
└────────────────────┘

View tables that are in the psql_database database:

SHOW TABLES FROM psql_database;

┌─name───┐
│ table1 │
└────────┘

Query data from a table:

SELECT * FROM psql_database.table1;

┌─id─┬─name──┐
│  1 │ one   │
│  2 │ two   │
│  3 │ three │
│  4 │ four  │
└────┴───────┘

Add new data to the table on the ADPG side:

INSERT INTO table1 (id, name) VALUES (5, 'five');

Check that this data is available from the ADQM table:

SELECT * FROM psql_database.table1;

┌─id─┬─name──┐
│  1 │ one   │
│  2 │ two   │
│  3 │ three │
│  4 │ four  │
│  5 │ five  │
└────┴───────┘

Add a new column to the adpg_db1.table1 table in ADPG to change its structure. To do this, run the following command on the ADPG host:

ALTER TABLE table1 ADD COLUMN value SMALLINT;

The table structure on the ADQM side will not change, since the database’s use_table_cache parameter is set to 1 and ADQM caches table structures:

DESCRIBE TABLE psql_database.table1;

┌─name─┬─type─────────────┬─
│ id   │ Int32            │
│ name │ Nullable(String) │
└──────┴──────────────────┴─

To update the table structure in ADQM, detach and re-attach the table:

DETACH TABLE psql_database.table1;

ATTACH TABLE psql_database.table1;

DESCRIBE TABLE psql_database.table1;

┌─name──┬─type─────────────┬─
│ id    │ Int32            │
│ name  │ Nullable(String) │
│ value │ Nullable(Int16)  │
└───────┴──────────────────┴─

The MaterializedPostgreSQL table engine

The MaterializedPostgreSQL table engine creates a ClickHouse table with an initial data dump of a PostgreSQL table and starts the replication process — applies changes made to the PostgreSQL table to the ADQM table in the background.

If you need to replicate more than one PostgreSQL table to ADQM, it is highly recommended to use the MaterializedPostgreSQL database engine instead of the table engine and use the materialized_postgresql_tables_list setting to specify the PostgreSQL tables to be replicated. It is much better in terms of CPU, fewer connections, and fewer replication slots in the PostgreSQL database.

CAUTION

Replication of TOAST values is not supported. The default value for the data type is used.

Requirements

The MaterializedPostgreSQL table engine requires the following conditions to be met:

In the PostgreSQL configuration file, the wal_level parameter should be set to logical, and the max_replication_slots parameter value should be at least 2 (for more information on these parameters and how to configure them in ADPG, see the Logical replication article).
A MaterializedPostgreSQL table should have the same primary key as the replication identity index (usually a primary key) of a PostgreSQL table.
A MaterializedPostgreSQL table can be created in an Atomic database only.

Create a table

The basic syntax of a query to create a table based on the MaterializedPostgreSQL engine is:

CREATE TABLE <table_name> (<key_column_name> <key_column_type>, <column_name> <column_type>, ...)
ENGINE = MaterializedPostgreSQL('<postgres_host>:5432', '<postgres_database>', '<postgres_table>', '<postgres_user>', '<user_password>')
PRIMARY KEY <key_column_name>;

Parameter descriptions are provided above.

Virtual columns

MaterializedPostgreSQL tables include the following virtual columns that you can use in SELECT queries:

_version (UInt64) — transaction counter (LSN value in WAL);
_sign (Int8) — deletion mark. Possible values:
- 1 — row is not deleted;
- -1 — row is deleted.

Example

Configure ADPG

In ADPG, configure the following parameters required for logical replication, based on which MaterializedPostgreSQL-engine replication is implemented:

set the wal_level parameter to logical;
set the max_replication_slots parameter to a value no less than 2.

Create a MaterializedPostgreSQL table

In ADQM, create a MaterializedPostgreSQL table:

CREATE TABLE psql_table_replica (id Int32, name String)
ENGINE = MaterializedPostgreSQL('<adpg_host>:5432', 'adpg_db2', 'table1', 'adqm_user', '<adqm_user_secret_password>')
PRIMARY KEY id;

Data from the source ADPG table is automatically inserted into this ADQM table:

SELECT * FROM psql_table_replica;

┌─id─┬─name──┐
│  1 │ one   │
│  2 │ two   │
└────┴───────┘

Test replication

Add new rows to the adpg_db2.table1 table in ADPG:

INSERT INTO table1 (id, name) VALUES (11, 'eleven'), (22, 'twenty two');

In ADQM, ensure that these rows are added to the table:

SELECT * FROM psql_table_replica;

┌─id─┬─name──┐
│  1 │ one   │
│  2 │ two   │
└────┴───────┘
┌─id─┬─name───────┐
│ 11 │ eleven     │
│ 22 │ twenty two │
└────┴────────────┘

The MaterializedPostgreSQL database engine

The MaterializedPostgreSQL database engine allows you to create an ADQM database with tables replicating tables from a PostgreSQL database. Firstly, it creates a snapshot of the PostgreSQL database and loads necessary tables into the MaterializedPostgreSQL database (it can be any subset of tables from any subset of schemas in a PostgreSQL database). The MaterializedPostgreSQL database receives the initial data dump of PostgreSQL tables and starts the replication process — reads WAL and executes DML queries, applying data changes made to the PostgreSQL database.

The MaterializedPostgreSQL-based replication is implemented with the PostgreSQL logical replication protocol, which does not support replication of DDL queries, but can detect changes breaking replication (for example, changing column types, adding/removing columns). If such changes are found, corresponding tables stop receiving updates. In this case, use the ATTACH TABLE/DETACH TABLE queries to reload tables. If a DDL query does not break replication (for example, renaming a column), a table continues to receive updates (insertion is performed by a position, not by a column name).

IMPORTANT

The MaterializedPostgreSQL database engine is currently experimental. To use it, enable the option before creating a database: SET allow_experimental_database_materialized_postgresql = 1.
Replication of TOAST values is not supported. The default value for the data type is used.

Requirements

The MaterializedPostgreSQL database engine requires the following conditions to be met:

In the PostgreSQL configuration file, the wal_level parameter should be set to logical, and the max_replication_slots parameter value should be at least 2 (to get more information on these parameters and learn how to configure them in ADPG, refer to the Logical replication article in the ADPG documentation).
Each replicated table should have a replica identity — a primary key (by default) or index defined as REPLICA IDENTITY.

Create a database

Below is the basic syntax of a query for creating a database based on the MaterializedPostgreSQL engine (parameter descriptions are provided above):

CREATE DATABASE <database_name>
ENGINE = MaterializedPostgreSQL('<postgres_host>:5432', '<postgres_database>', '<postgres_user>', '<user_password>')
[SETTINGS ...];

Settings

Using the SETTINGS clause, you can specify the following parameters for the MaterializedPostgreSQL database.

materialized_postgresql_tables_list

Comma-separated list of PostgreSQL database tables that will be replicated via the MaterializedPostgreSQL database engine.

In this list, you can specify tables from different schemas of a PostgreSQL database. To do this, specify each table name along with its schema name (separated by a dot), and also set the setting materialized_postgresql_tables_list_with_schema = 1.

In this case, to query a table of the MaterializedPostgreSQL database in ADQM, you also need to specify the table name with the schema name. For example:
SELECT * FROM <database_name>.`<schema_name>.<table_name>`;
If the MaterializedPostgreSQL database replicates tables from a single PostgreSQL schema (set via the materialized_postgresql_schema parameter or default), it is enough to specify table names (without schema names) in the materialized_postgresql_tables_list list. In this case, a table of the MaterializedPostgreSQL database can also be accessed by its name only. For example:
SELECT * FROM <database_name>.<table_name>;
If the table list is not specified, all tables from a PostgreSQL schema (set as materialized_postgresql_schema or default) will be replicated.

materialized_postgresql_schema

PostgreSQL database schema whose tables will be replicated

materialized_postgresql_schema_list

Comma-separated list of PostgreSQL database schemas — full table sets of these schemas will be replicated.

In case of replicating tables from multiple PostgreSQL schemas into the MaterializedPostgreSQL database, specify a schema name and a table name (separated by a dot) when accessing a table of the MaterializedPostgreSQL database in ADQM. For example:

SELECT * FROM <database_name>.`<schema_name>.<table_name>`;

materialized_postgresql_max_block_size

Number of rows collected in memory before flushing data into a table of a PostgreSQL database

materialized_postgresql_replication_slot

materialized_postgresql_snapshot

User-created replication slot and text string identifying a snapshot, from which an initial dump of PostgreSQL tables will be performed.

Both these settings should be specified together. Use them only if there is a real need — for example, if it is required to define a permanent replication slot to avoid breaking PostgreSQL data replication in case of a primary server failure and switching to another server (to get more details and an example, see Failover of the logical replication slot in the ClickHouse documentation). Typically, you do not need to configure a replication slot yourself — a table creates and manages its own replication slot

Add/remove tables to/from replication dynamically

Once a MaterializedPostgreSQL database is created, new tables added to a PostgreSQL database are not automatically replicated — however, you can add them manually via the ATTACH TABLE query. Use also this query if you need to extend the materialized_postgresql_tables_list list specified on the MaterializedPostgreSQL database creation.

To remove specific tables from replication, use the DETACH TABLE query.

Example

Configure ADPG

The MaterializedPostgreSQL database engine requires the same settings for logical replication in ADPG as the MaterializedPostgreSQL table engine. In ADPG, check:

the wal_level parameter is set to logical;
the max_replication_slots parameter value is not less than 2.

Create a MaterializedPostgreSQL database

In ADQM, enable the support of the MaterializedPostgreSQL database engine and create a database:

SET allow_experimental_database_materialized_postgresql=1;

CREATE DATABASE psql_db_replica
ENGINE = MaterializedPostgreSQL('<adpg_host>:5432', 'adpg_db2', 'adqm_user', '<adqm_user_secret_password>')
SETTINGS materialized_postgresql_tables_list = 'table1,table2';

View which tables the database contains:

SHOW TABLES FROM psql_db_replica;

┌─name───┐
│ table1 │
│ table2 │
└────────┘

Read one of these tables (for example, table2):

SELECT * FROM psql_db_replica.table2;

┌─id─┬─name──┐
│  3 │ three │
│  4 │ four  │
└────┴───────┘

Test replication

Add a new data row to the adpg_db2.table2 table in ADPG:

INSERT INTO table2 (id, name) VALUES (30, 'thirty');

Make sure this record is sent to the replica table in ADQM:

SELECT * FROM psql_db_replica.table2;

┌─id─┬─name───┐
│  3 │ three  │
│  4 │ four   │
│ 30 │ thirty │
└────┴────────┘

Add a new table to replication

Use the ATTACH TABLE query to specify an additional table from the ADPG database to be replicated into the ADQM database (in other words, to modify the materialized_postgresql_tables_list list of the MaterializedPostgreSQL database):

ATTACH TABLE psql_db_replica.table3;

Now table3 is also replicated to the psql_db_replica database in ADQM:

SHOW TABLES FROM psql_db_replica;

┌─name───┐
│ table1 │
│ table2 │
│ table3 │
└────────┘

SELECT * FROM psql_db_replica.table3;

┌─id─┬─name─┐
│  5 │ five │
│  6 │ six  │
└────┴──────┘

Dictionary data source of the POSTGRESQL type

You can set a PostgreSQL table as a data source for a dictionary in ClickHouse/ADQM. To do this, when creating a dictionary with the CREATE DICTIONARY query, specify the POSTGRESQL source type in the SOURCE clause and configure parameters for connecting to a PostgreSQL table as follows:

CREATE DICTIONARY <dictionary_name> (...)
...
SOURCE(POSTGRESQL(
    port 5432
    host '<postgres_host>'
    user '<postgres_user>'
    password '<postgres_user_password>'
    replica(host 'postgres_host_1' port 5432 priority 1)
    replica(host 'postgres_host_2' port 5432 priority 2)
    db '<postgres_database>'
    table '<postgres_table>'
    where '<filter_expr>'
    invalidate_query '<query_to_check_dictionary_status>'
    query '<custom_query>'))
...

Configuration of the POSTGRESQL dictionary data source includes the following fields:

port — PostgreSQL port.
host — PostgreSQL host.
user — PostgreSQL user name used to connect to PostgreSQL from ClickHouse.
password — PostgreSQL user password.
replica — settings for connecting to a PostgreSQL replica.

When defining a data source for a dictionary, you can configure general settings for connecting to PostgreSQL (host, port, user, and password) or list multiple replicas and configure each one separately — in the replica field, specify a host and port for connecting to a replica, and also set the replica priority, which ClickHouse will take into account when trying to connect to PostgreSQL (the lower the priority value, the higher the priority of the replica).
db — PostgreSQL database name.
table — name of a table in the PostgreSQL database.
where — selection criteria (optional).
invalidate_query — query to check the dictionary status (optional).
query — custom query (optional).

A data source should be declared through one of the fields: table or query. The table and where fields cannot be used together with query.

Example

In ADQM, create a dictionary that reads data from the adpg_db2.roles_dict table in ADPG:

CREATE DICTIONARY roles_dict_psql (id Int32, role String DEFAULT 'no role assigned')
PRIMARY KEY id
SOURCE(POSTGRESQL(port 5432 host '<adpg_host_name>' user 'adqm_user' password '<adqm_user_secret_password>' db 'adpg_db2' table 'roles_dict'))
LIFETIME(MIN 10 MAX 20)
LAYOUT(FLAT());

Create the users table with users and identifiers of their assigned roles:

CREATE TABLE users (user_id Int32, name String, role_id Int32) ENGINE = MergeTree ORDER BY user_id;

INSERT INTO users VALUES (1, 'john', 10), (2, 'mary', 30), (3, 'andrew', 40), (4, 'harry', 70), (5, 'ann', 50);

--user_id---name-----role_id--
│       1 │ john   │      10 │
│       2 │ mary   │      30 │
│       3 │ andrew │      40 │
│       4 │ harry  │      70 │
│       5 │ ann    │      50 │
------------------------------

Run the following query to read usernames from the users table and find their role names in the roles_dict_psql dictionary by role_id:

SELECT  name AS user, dictGet('roles_dict_psql', 'role', toUInt64(role_id)) AS role FROM users;

The result:

┌─user───┬─role─────────────┐
│ john   │ admin            │
│ mary   │ author           │
│ andrew │ reviewer         │
│ harry  │ no role assigned │
│ ann    │ editor           │
└────────┴──────────────────┘

Found a mistake? Seleсt text and press Ctrl+Enter to report it