Integration between ADQM and ADPG
Overview
ClickHouse supports a set of tools to integrate with PostgreSQL:
-
the postgresql table function | the PostgreSQL table engine | the PostgreSQL database engine — provide the ability to send queries from ClickHouse to tables/databases on a PostgreSQL server for real-time data exchange. They support read and write operations on PostgreSQL data. When executing a
SELECT
orINSERT
query, ClickHouse connects to a remote PostgreSQL table/database and retrieves up-to-date data from PostgreSQL or exports new data to it. In other words, the data is stored in PostgreSQL, and you can access it from ClickHouse. -
the MaterializedPostgreSQL table engine | the MaterializedPostgreSQL database engine — allow using ClickHouse as a PostgreSQL replica. When you create a MaterializedPostgreSQL table/database, ClickHouse creates an initial data dump of a table or tables of the PostgreSQL database and starts the replication process — in the background, it applies data changes performed on the PostgreSQL side to the corresponding MaterializedPostgreSQL table/database. Thus, when MaterializedPostgreSQL engines are used, unlike PostgreSQL engines, data is stored in both PostgreSQL and ClickHouse.
-
the POSTGRESQL type of a data source for dictionaries — allows you to assign a PostgreSQL table to be a data source for a dictionary in ClickHouse.
This article describes ways to use these tools on the example of ADQM and ADPG (Arenadata Postgres — a relational database management system based on PostgreSQL). It provides the syntax of queries to call a table function and create tables/databases based on the mentioned integration engines, as well as sample queries to read/write ADPG data from ADQM.
On the ADPG host, run the psql terminal client as the default user (postgres
) using the following commands:
$ sudo su - postgres
$ psql
Now you can enter commands in psql to work with PostgreSQL databases and tables:
psql (14.1) Type "help" for help. postgres=#
NOTE
For an overview on how to work with PostgreSQL tables in psql, see Example of working with tables via psql in the ADPG documentation. |
Create a PostgreSQL user
Create a user for connecting to ADPG from ADQM (for demonstration purposes, grant superuser privileges):
CREATE ROLE adqm_user SUPERUSER LOGIN PASSWORD '<adqm_user_secret_password>';
Create an ADPG database and table to use in examples with the table function and PostgreSQL engines
-
Create the
adpg_db1
database in ADPG:CREATE DATABASE adpg_db1;
Switch to this database:
\c adpg_db1
The command result:
You are now connected to database "adpg_db1" as user "postgres". adpg_db1=#
-
Create a table:
CREATE TABLE table1 (id INTEGER PRIMARY KEY, name VARCHAR(10));
-
Add test data to the table:
INSERT INTO table1 (id, name) VALUES (1, 'one'), (2, 'two');
Create an ADPG database and tables to use in examples with MaterializedPostgreSQL engines
-
Create a new database in ADPG (
adpg_db2
) and connect to it:CREATE DATABASE adpg_db2;
\c adpg_db2
-
In the
adpg_db2
database, create three tables with the same structure:CREATE TABLE table1 (id INTEGER PRIMARY KEY, name VARCHAR(10));
CREATE TABLE table2 (id INTEGER PRIMARY KEY, name VARCHAR(10));
CREATE TABLE table3 (id INTEGER PRIMARY KEY, name VARCHAR(10));
-
Add test data to these tables:
INSERT INTO table1 (id, name) VALUES (1, 'one'), (2, 'two');
INSERT INTO table2 (id, name) VALUES (3, 'three'), (4, 'four');
INSERT INTO table3 (id, name) VALUES (5, 'five'), (6, 'six');
Create an ADPG table to be a data source for a dictionary
In the adpg_db2
database, create the roles_dict
table to be a data source for a dictionary:
CREATE TABLE roles_dict (id INTEGER PRIMARY KEY, role VARCHAR(10));
INSERT INTO roles_dict (id, role) VALUES (10, 'admin'), (20, 'owner'), (30, 'author'), (40, 'reviewer'), (50, 'editor'), (60, 'view only');
Set up authentication for connecting to ADPG
To allow the adqm_user
user to connect to the adpg_db1
and adpg_db2
databases from an ADQM host and set password authentication, add the following records to the pg_hba.conf file (for ADPG, you can do this in the ADCM interface — see the Password authentication article for details):
host adpg_db1 adqm_user <adqm_host_address> password host adpg_db2 adqm_user <adqm_host_address> password
where <adqm_host_address>
is an address of the ADQM host (for example, if an IPv4 address of the ADQM host is 10.92.40.25
, replace <adqm_host_address>
with 10.92.40.25/32
). You can find description of the address
field and other fields of records in the pg_hba.conf file in the article The pg_hba.conf file.
Basic parameters for integration
To call the postgresql
table function and create tables/databases based on engines for integration with PostgreSQL (the basic syntax of corresponding queries is provided in sections below), it is required to specify the following parameters.
<postgres_host> |
PostgreSQL server address (IP or host name) |
<postgres_database> |
PostgreSQL database name |
<postgres_table> |
Name of a table in the PostgreSQL database (specify it for the |
<postgres_user> |
PostgreSQL user name used to connect to PostgreSQL from ClickHouse |
<user_password> |
PostgreSQL user password |
<schema> |
PostgreSQL schema (you can optionally specify it for the |
TIP
For production environments, it is recommended to pass parameters using named collections.
|
The postgresql table function
The postgresql
table function allows you to send SELECT
and INSERT
queries to tables in PostgreSQL databases.
The function syntax is:
postgresql('<postgres_host>:5432', '<postgres_database>', '<postgres_table>', '<postgres_user>', '<user_password>'[, '<schema>'])
Function parameters are described above.
Example
Read data from ADPG
Run the following query to get ADPG data from the adpg_db1.table1
table in ADQM using the postgresql
function:
SELECT id, name FROM postgresql('<adpg_host>:5432', 'adpg_db1', 'table1', 'adqm_user', '<adqm_user_secret_password>');
where <adpg_host>
is an ADPG host address (IP or host name).
The result:
┌─id─┬─name──┐ │ 1 │ one │ │ 2 │ two │ └────┴───────┘
Write data to ADPG
Run the following query to send a new record to the ADPG table from ADQM:
INSERT INTO TABLE FUNCTION postgresql('<adpg_host>:5432', 'adpg_db1', 'table1', 'adqm_user', '<adqm_user_secret_password>') (id, name)
VALUES (3, 'three');
Make sure that data is written to the destination table. To do this, run the following command on the ADPG host (connect to the adpg_db1
database first):
SELECT * FROM table1;
id | name ----+------- 1 | one 2 | two 3 | three (3 rows)
The PostgreSQL table engine
The PostgreSQL table engine allows you to send SELECT
and INSERT
queries to tables on a remote PostgreSQL server.
Create a table
Below is the basic syntax of a query that creates a PostgreSQL table (see parameter descriptions above):
CREATE TABLE <table_name> (<column_name> <column_type> [DEFAULT|MATERIALIZED|ALIAS <expr>] [TTL <expr>], ...)
ENGINE = PostgreSQL('<postgres_host>:5432', '<postgres_database>', '<postgres_table>', '<postgres_user>', '<user_password>'[, '<schema>']);
The structure of an ADQM table can differ from the structure of an original PostgreSQL table you are connecting to:
-
Column names in the ADQM table should be the same as in the original PostgreSQL table, but you can use just some of these columns and in any order.
-
Column types in the ADQM table may differ from types in the original PostgreSQL table — ADQM tries to cast values to ClickHouse data types.
-
The external_table_functions_use_nulls setting defines how Nullable columns are handled. The default value is
1
. If the value is0
, columns are not converted to Nullable — default values of corresponding data types are inserted instead of NULL values. This also applies to NULL values in arrays.
Example
In ADQM, create a table based on the PostgreSQL engine and read its data:
CREATE TABLE psql_table (id Int32, name String)
ENGINE = PostgreSQL('<adpg_host>:5432', 'adpg_db1', 'table1', 'adqm_user', '<adqm_user_secret_password>');
SELECT * FROM psql_table;
┌─id─┬─name──┐ │ 1 │ one │ │ 2 │ two │ │ 3 │ three │ └────┴───────┘
Insert data into the ADQM table:
INSERT INTO psql_table VALUES (4, 'four');
On the ADPG host, query the adpg_db1.table1
table to check that data from ADQM was successfully imported into it:
SELECT * FROM table1;
id | name ----+------- 1 | one 2 | two 3 | three 4 | four (4 rows)
The PostgreSQL database engine
The PostgreSQL database engine allows you to connect to a database on a PostgreSQL server and perform read/write operations (SELECT
/INSERT
queries) to exchange data between ClickHouse and PostgreSQL, as well as receive real-time information about database tables from the PostgreSQL server and their structures using the SHOW TABLES
and DESCRIBE TABLE
queries.
Create a database
The syntax of a query that creates a PostgreSQL database in ADQM is:
CREATE DATABASE <database_name>
ENGINE = PostgreSQL('<postgres_host>:5432', '<postgres_database>', '<postgres_user>', '<user_password>'[, `<schema>`, `use_table_cache`]);
See descriptions of basic parameters above. The optional use_table_cache
parameter specifies whether structures of tables are cached. If the parameter is set to 1
, a table structure is cached and its changes are not tracked — you can update the table structure manually running the DETACH TABLE
and ATTACH TABLE
queries sequentially.
Example
Create a database based on the PostgreSQL engine:
CREATE DATABASE psql_database
ENGINE = PostgreSQL('<adpg_host>:5432', 'adpg_db1', 'adqm_user', '<adqm_user_secret_password>', 1);
Check that this database has been added to the ADQM database list:
SHOW DATABASES;
┌─name───────────────┐ │ INFORMATION_SCHEMA │ │ default │ │ information_schema │ │ psql_database │ │ system │ └────────────────────┘
View tables that are in the psql_database
database:
SHOW TABLES FROM psql_database;
┌─name───┐ │ table1 │ └────────┘
Query data from a table:
SELECT * FROM psql_database.table1;
┌─id─┬─name──┐ │ 1 │ one │ │ 2 │ two │ │ 3 │ three │ │ 4 │ four │ └────┴───────┘
Add new data to the table on the ADPG side:
INSERT INTO table1 (id, name) VALUES (5, 'five');
Check that this data is available from the ADQM table:
SELECT * FROM psql_database.table1;
┌─id─┬─name──┐ │ 1 │ one │ │ 2 │ two │ │ 3 │ three │ │ 4 │ four │ │ 5 │ five │ └────┴───────┘
Add a new column to the adpg_db1.table1
table in ADPG to change its structure. To do this, run the following command on the ADPG host:
ALTER TABLE table1 ADD COLUMN value SMALLINT;
The table structure on the ADQM side will not change, since the database’s use_table_cache
parameter is set to 1
and ADQM caches table structures:
DESCRIBE TABLE psql_database.table1;
┌─name─┬─type─────────────┬─ │ id │ Int32 │ │ name │ Nullable(String) │ └──────┴──────────────────┴─
To update the table structure in ADQM, detach and re-attach the table:
DETACH TABLE psql_database.table1;
ATTACH TABLE psql_database.table1;
DESCRIBE TABLE psql_database.table1;
┌─name──┬─type─────────────┬─ │ id │ Int32 │ │ name │ Nullable(String) │ │ value │ Nullable(Int16) │ └───────┴──────────────────┴─
The MaterializedPostgreSQL table engine
The MaterializedPostgreSQL table engine creates a ClickHouse table with an initial data dump of a PostgreSQL table and starts the replication process — applies changes made to the PostgreSQL table to the ADQM table in the background.
If you need to replicate more than one PostgreSQL table to ADQM, it is highly recommended to use the MaterializedPostgreSQL database engine instead of the table engine and use the materialized_postgresql_tables_list
setting to specify the PostgreSQL tables to be replicated. It is much better in terms of CPU, fewer connections, and fewer replication slots in the PostgreSQL database.
CAUTION
Replication of TOAST values is not supported. The default value for the data type is used.
|
Requirements
The MaterializedPostgreSQL table engine requires the following conditions to be met:
-
In the PostgreSQL configuration file, the
wal_level
parameter should be set tological
, and themax_replication_slots
parameter value should be at least2
(for more information on these parameters and how to configure them in ADPG, see the Logical replication article). -
A MaterializedPostgreSQL table should have the same primary key as the replication identity index (usually a primary key) of a PostgreSQL table.
-
A MaterializedPostgreSQL table can be created in an Atomic database only.
Create a table
The basic syntax of a query to create a table based on the MaterializedPostgreSQL engine is:
CREATE TABLE <table_name> (<key_column_name> <key_column_type>, <column_name> <column_type>, ...)
ENGINE = MaterializedPostgreSQL('<postgres_host>:5432', '<postgres_database>', '<postgres_table>', '<postgres_user>', '<user_password>')
PRIMARY KEY <key_column_name>;
Parameter descriptions are provided above.
Virtual columns
MaterializedPostgreSQL tables include the following virtual columns that you can use in SELECT
queries:
-
_version
(UInt64) — transaction counter (LSN value in WAL); -
_sign
(Int8) — deletion mark. Possible values:-
1
— row is not deleted; -
-1
— row is deleted.
-
Example
Configure ADPG
In ADPG, configure the following parameters required for logical replication, based on which MaterializedPostgreSQL-engine replication is implemented:
-
set the
wal_level
parameter tological
; -
set the
max_replication_slots
parameter to a value no less than2
.
Create a MaterializedPostgreSQL table
In ADQM, create a MaterializedPostgreSQL table:
CREATE TABLE psql_table_replica (id Int32, name String)
ENGINE = MaterializedPostgreSQL('<adpg_host>:5432', 'adpg_db2', 'table1', 'adqm_user', '<adqm_user_secret_password>')
PRIMARY KEY id;
Data from the source ADPG table is automatically inserted into this ADQM table:
SELECT * FROM psql_table_replica;
┌─id─┬─name──┐ │ 1 │ one │ │ 2 │ two │ └────┴───────┘
Test replication
Add new rows to the adpg_db2.table1
table in ADPG:
INSERT INTO table1 (id, name) VALUES (11, 'eleven'), (22, 'twenty two');
In ADQM, ensure that these rows are added to the table:
SELECT * FROM psql_table_replica;
┌─id─┬─name──┐ │ 1 │ one │ │ 2 │ two │ └────┴───────┘ ┌─id─┬─name───────┐ │ 11 │ eleven │ │ 22 │ twenty two │ └────┴────────────┘
The MaterializedPostgreSQL database engine
The MaterializedPostgreSQL database engine allows you to create an ADQM database with tables replicating tables from a PostgreSQL database. Firstly, it creates a snapshot of the PostgreSQL database and loads necessary tables into the MaterializedPostgreSQL database (it can be any subset of tables from any subset of schemas in a PostgreSQL database). The MaterializedPostgreSQL database receives the initial data dump of PostgreSQL tables and starts the replication process — reads WAL and executes DML queries, applying data changes made to the PostgreSQL database.
The MaterializedPostgreSQL-based replication is implemented with the PostgreSQL logical replication protocol, which does not support replication of DDL queries, but can detect changes breaking replication (for example, changing column types, adding/removing columns). If such changes are found, corresponding tables stop receiving updates. In this case, use the ATTACH TABLE
/DETACH TABLE
queries to reload tables. If a DDL query does not break replication (for example, renaming a column), a table continues to receive updates (insertion is performed by a position, not by a column name).
IMPORTANT
|
Requirements
The MaterializedPostgreSQL database engine requires the following conditions to be met:
-
In the PostgreSQL configuration file, the
wal_level
parameter should be set tological
, and themax_replication_slots
parameter value should be at least2
(to get more information on these parameters and learn how to configure them in ADPG, refer to the Logical replication article in the ADPG documentation). -
Each replicated table should have a replica identity — a primary key (by default) or index defined as REPLICA IDENTITY.
Create a database
Below is the basic syntax of a query for creating a database based on the MaterializedPostgreSQL engine (parameter descriptions are provided above):
CREATE DATABASE <database_name>
ENGINE = MaterializedPostgreSQL('<postgres_host>:5432', '<postgres_database>', '<postgres_user>', '<user_password>')
[SETTINGS ...];
Settings
Using the SETTINGS
clause, you can specify the following parameters for the MaterializedPostgreSQL database.
materialized_postgresql_tables_list |
Comma-separated list of PostgreSQL database tables that will be replicated via the MaterializedPostgreSQL database engine.
|
materialized_postgresql_schema |
PostgreSQL database schema whose tables will be replicated |
materialized_postgresql_schema_list |
Comma-separated list of PostgreSQL database schemas — full table sets of these schemas will be replicated. In case of replicating tables from multiple PostgreSQL schemas into the MaterializedPostgreSQL database, specify a schema name and a table name (separated by a dot) when accessing a table of the MaterializedPostgreSQL database in ADQM. For example:
|
materialized_postgresql_max_block_size |
Number of rows collected in memory before flushing data into a table of a PostgreSQL database |
materialized_postgresql_replication_slot materialized_postgresql_snapshot |
User-created replication slot and text string identifying a snapshot, from which an initial dump of PostgreSQL tables will be performed. Both these settings should be specified together. Use them only if there is a real need — for example, if it is required to define a permanent replication slot to avoid breaking PostgreSQL data replication in case of a primary server failure and switching to another server (to get more details and an example, see Failover of the logical replication slot in the ClickHouse documentation). Typically, you do not need to configure a replication slot yourself — a table creates and manages its own replication slot |
Add/remove tables to/from replication dynamically
Once a MaterializedPostgreSQL database is created, new tables added to a PostgreSQL database are not automatically replicated — however, you can add them manually via the ATTACH TABLE
query. Use also this query if you need to extend the materialized_postgresql_tables_list
list specified on the MaterializedPostgreSQL database creation.
To remove specific tables from replication, use the DETACH TABLE
query.
Example
Configure ADPG
The MaterializedPostgreSQL database engine requires the same settings for logical replication in ADPG as the MaterializedPostgreSQL table engine. In ADPG, check:
-
the
wal_level
parameter is set tological
; -
the
max_replication_slots
parameter value is not less than2
.
Create a MaterializedPostgreSQL database
In ADQM, enable the support of the MaterializedPostgreSQL database engine and create a database:
SET allow_experimental_database_materialized_postgresql=1;
CREATE DATABASE psql_db_replica
ENGINE = MaterializedPostgreSQL('<adpg_host>:5432', 'adpg_db2', 'adqm_user', '<adqm_user_secret_password>')
SETTINGS materialized_postgresql_tables_list = 'table1,table2';
View which tables the database contains:
SHOW TABLES FROM psql_db_replica;
┌─name───┐ │ table1 │ │ table2 │ └────────┘
Read one of these tables (for example, table2
):
SELECT * FROM psql_db_replica.table2;
┌─id─┬─name──┐ │ 3 │ three │ │ 4 │ four │ └────┴───────┘
Test replication
Add a new data row to the adpg_db2.table2
table in ADPG:
INSERT INTO table2 (id, name) VALUES (30, 'thirty');
Make sure this record is sent to the replica table in ADQM:
SELECT * FROM psql_db_replica.table2;
┌─id─┬─name───┐ │ 3 │ three │ │ 4 │ four │ │ 30 │ thirty │ └────┴────────┘
Add a new table to replication
Use the ATTACH TABLE
query to specify an additional table from the ADPG database to be replicated into the ADQM database (in other words, to modify the materialized_postgresql_tables_list
list of the MaterializedPostgreSQL database):
ATTACH TABLE psql_db_replica.table3;
Now table3
is also replicated to the psql_db_replica
database in ADQM:
SHOW TABLES FROM psql_db_replica;
┌─name───┐ │ table1 │ │ table2 │ │ table3 │ └────────┘
SELECT * FROM psql_db_replica.table3;
┌─id─┬─name─┐ │ 5 │ five │ │ 6 │ six │ └────┴──────┘
Dictionary data source of the POSTGRESQL type
You can set a PostgreSQL table as a data source for a dictionary in ClickHouse/ADQM. To do this, when creating a dictionary with the CREATE DICTIONARY
query, specify the POSTGRESQL
source type in the SOURCE
clause and configure parameters for connecting to a PostgreSQL table as follows:
CREATE DICTIONARY <dictionary_name> (...)
...
SOURCE(POSTGRESQL(
port 5432
host '<postgres_host>'
user '<postgres_user>'
password '<postgres_user_password>'
replica(host 'postgres_host_1' port 5432 priority 1)
replica(host 'postgres_host_2' port 5432 priority 2)
db '<postgres_database>'
table '<postgres_table>'
where '<filter_expr>'
invalidate_query '<query_to_check_dictionary_status>'
query '<custom_query>'))
...
Configuration of the POSTGRESQL dictionary data source includes the following fields:
-
port
— PostgreSQL port. -
host
— PostgreSQL host. -
user
— PostgreSQL user name used to connect to PostgreSQL from ClickHouse. -
password
— PostgreSQL user password. -
replica
— settings for connecting to a PostgreSQL replica.When defining a data source for a dictionary, you can configure general settings for connecting to PostgreSQL (
host
,port
,user
, andpassword
) or list multiple replicas and configure each one separately — in thereplica
field, specify a host and port for connecting to a replica, and also set the replica priority, which ClickHouse will take into account when trying to connect to PostgreSQL (the lower thepriority
value, the higher the priority of the replica). -
db
— PostgreSQL database name. -
table
— name of a table in the PostgreSQL database. -
where
— selection criteria (optional). -
invalidate_query
— query to check the dictionary status (optional). -
query
— custom query (optional).
A data source should be declared through one of the fields: table
or query
. The table
and where
fields cannot be used together with query
.
Example
In ADQM, create a dictionary that reads data from the adpg_db2.roles_dict
table in ADPG:
CREATE DICTIONARY roles_dict_psql (id Int32, role String DEFAULT 'no role assigned')
PRIMARY KEY id
SOURCE(POSTGRESQL(port 5432 host '<adpg_host_name>' user 'adqm_user' password '<adqm_user_secret_password>' db 'adpg_db2' table 'roles_dict'))
LIFETIME(MIN 10 MAX 20)
LAYOUT(FLAT());
Create the users
table with users and identifiers of their assigned roles:
CREATE TABLE users (user_id Int32, name String, role_id Int32) ENGINE = MergeTree ORDER BY user_id;
INSERT INTO users VALUES (1, 'john', 10), (2, 'mary', 30), (3, 'andrew', 40), (4, 'harry', 70), (5, 'ann', 50);
--user_id---name-----role_id-- │ 1 │ john │ 10 │ │ 2 │ mary │ 30 │ │ 3 │ andrew │ 40 │ │ 4 │ harry │ 70 │ │ 5 │ ann │ 50 │ ------------------------------
Run the following query to read usernames from the users
table and find their role names in the roles_dict_psql
dictionary by role_id
:
SELECT name AS user, dictGet('roles_dict_psql', 'role', toUInt64(role_id)) AS role FROM users;
The result:
┌─user───┬─role─────────────┐ │ john │ admin │ │ mary │ author │ │ andrew │ reviewer │ │ harry │ no role assigned │ │ ann │ editor │ └────────┴──────────────────┘