Debezium overview

Debezium features

Debezium is a distributed open source platform based on the change data capture (CDC) pattern.

Debezium connectors capture changes in databases and transmit them to external applications.

External applications use Debezium to record and process data change events in a DBMS: inserts, updates, and deletes.

The figure below shows the capabilities of the Debezium platform.

Debezium features

Debezium connectors can be launched using the following methods:

  • Via Kafka Connect, using ready-made Debezium plugins to create connectors. In this case, all changes are written to Kafka topics. This method is recommended if the system streams events to/from Kafka and does not need third-party intermediaries.

  • Using Debezium server — a ready-made application that transmits change events from the source database to various messaging infrastructures.

  • Using the Debezium engine, which allows you to embed Debezium connectors directly into your application via the debezium-api module (see the sketch after this list). This approach does not provide the same level of fault tolerance and reliability as the Kafka Connect service, but it also removes intermediaries such as Kafka brokers and the Kafka Connect service.
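
As an illustration of the last method, below is a minimal embedding sketch, assuming the Debezium PostgreSQL connector and the debezium-api and debezium-embedded modules are on the classpath; all connection settings, the offset file path, and the topic prefix are hypothetical placeholders.

import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

public class EmbeddedDebeziumExample {
    public static void main(String[] args) throws Exception {
        // Connector configuration; all values are placeholders for a hypothetical PostgreSQL source
        Properties props = new Properties();
        props.setProperty("name", "embedded-postgres-connector");
        props.setProperty("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/debezium-offsets.dat");
        props.setProperty("offset.flush.interval.ms", "10000");
        props.setProperty("database.hostname", "localhost");
        props.setProperty("database.port", "5432");
        props.setProperty("database.user", "debezium");
        props.setProperty("database.password", "dbz");
        props.setProperty("database.dbname", "inventory");
        props.setProperty("topic.prefix", "inventory-server");
        props.setProperty("plugin.name", "pgoutput");

        // The engine delivers change events directly to the application callback,
        // without Kafka brokers or the Kafka Connect service in between
        try (DebeziumEngine<ChangeEvent<String, String>> engine =
                 DebeziumEngine.create(Json.class)
                     .using(props)
                     .notifying(record -> System.out.println(record.value()))
                     .build()) {
            ExecutorService executor = Executors.newSingleThreadExecutor();
            executor.execute(engine);
            Thread.sleep(60_000);   // let the engine stream changes for a minute
            executor.shutdown();
        }
    }
}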

Debezium and Kafka Connect

The figure below shows what the architecture of a change data capture pipeline looks like when using a Debezium connector, for example, when transferring events from a PostgreSQL server to any database that supports JDBC.

Change data capture architecture based on Debezium

The Debezium connector, created with Kafka Connect, writes change events from PostgreSQL to a Kafka topic whose default name consists of a user-specified prefix, the database schema name, and the table name from which changes are collected. Once change event records are stored in Kafka, the various sinks created by Kafka Connect can push the records to other databases and data stores.
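
For example, the short sketch below shows how the default topic name is composed, assuming a hypothetical topic.prefix value and a public.customers table:

public class TopicNameExample {
    public static void main(String[] args) {
        // Hypothetical values: the user-specified topic.prefix, plus the schema and table being captured
        String topicPrefix = "inventory-server";
        String schema = "public";
        String table = "customers";

        // Default Debezium topic name: <topic.prefix>.<schema>.<table>
        String topic = String.join(".", topicPrefix, schema, table);
        System.out.println(topic);  // prints: inventory-server.public.customers
    }
}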

Debezium provides a set of connectors for transferring data from various DBMSs to Kafka topics.

Features of Debezium connectors for different DBMS

The following describes the functionality of the connectors available by default in ADS Control.

PostgreSQL

Capture of data changes on a PostgreSQL server is provided by a logical replication mechanism based on the write-ahead log (WAL). This log is stored on disk and records all data change events for INSERT, UPDATE, and DELETE queries. These changes are processed using one of the following output plugins:

  • decoderbufs — plugin based on Protobuf;

  • pgoutput — standard logical decoding output plugin in PostgreSQL 10+, used by PostgreSQL itself for logical replication.

The Debezium connector for PostgreSQL has limitations, including those related to the PostgreSQL logical decoding feature it uses:

  • DDL changes are not supported — events associated with CREATE, ALTER, DROP, and TRUNCATE statements are not captured.

  • If PostgreSQL is organized as a cluster, the connector works only with the primary server; if the primary fails, the connector stops. After the primary server is restored, you can restart the connector. If another PostgreSQL server has been promoted to primary, adjust the connector configuration before restarting it.

  • The pgoutput logical decoding output plugin does not capture values for generated columns.

  • Currently, Debezium only supports databases with UTF-8 character encoding. With a single-byte character encoding, strings containing extended ASCII characters cannot be processed correctly.

When the PostgreSQL connector first connects to a PostgreSQL database, it creates a consistent snapshot of each of the database schemas. The connector then streams changes from the location where the snapshot was created.

The PostgreSQL streaming replication parameters that are important for the operation of the connector are listed below (a sketch for checking them follows the list):

  • wal_level — to enable support for logical decoding, it must be set to logical;

  • parameters affecting the number of connectors that can simultaneously access the server:

    • max_replication_slots — the maximum number of replication slots that the server can support.

    • max_wal_senders — the maximum number of concurrent connections from standby servers or streaming base backup clients.

  • wal_keep_size — the minimum size of past WAL segments kept in the pg_wal directory so that a standby server can fetch them for streaming replication.
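
A quick way to verify these settings is to read them from pg_settings. The JDBC sketch below assumes the PostgreSQL JDBC driver is available; the connection URL and credentials are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CheckReplicationSettings {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details for the PostgreSQL server
        String url = "jdbc:postgresql://localhost:5432/inventory";
        try (Connection conn = DriverManager.getConnection(url, "postgres", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT name, setting FROM pg_settings " +
                 "WHERE name IN ('wal_level', 'max_replication_slots', 'max_wal_senders', 'wal_keep_size')")) {
            while (rs.next()) {
                // wal_level must be 'logical' for the connector to stream changes
                System.out.println(rs.getString("name") + " = " + rs.getString("setting"));
            }
        }
    }
}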

Configuring the PostgreSQL server to run the Debezium connector requires a database user with the following privileges:

  • REPLICATION

  • LOGIN

Additionally, when using the pgoutput plugin, the user requires the following privileges (see the sketch after this list):

  • CREATE in the database to add publications.

  • SELECT on tables to copy the original table data.
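
A minimal sketch of creating such a user over JDBC is shown below; the role name, password, database, and table are hypothetical, and the same SQL statements can be executed directly in psql.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class GrantDebeziumPrivileges {
    public static void main(String[] args) throws Exception {
        // Placeholder superuser connection used to set up the connector's database user
        String url = "jdbc:postgresql://localhost:5432/inventory";
        try (Connection conn = DriverManager.getConnection(url, "postgres", "secret");
             Statement stmt = conn.createStatement()) {
            // REPLICATION and LOGIN privileges for the connector user
            stmt.executeUpdate("CREATE ROLE debezium REPLICATION LOGIN PASSWORD 'dbz'");
            // CREATE on the database so the connector can add publications (pgoutput)
            stmt.executeUpdate("GRANT CREATE ON DATABASE inventory TO debezium");
            // SELECT on the captured tables to copy the original table data
            stmt.executeUpdate("GRANT SELECT ON TABLE public.customers TO debezium");
        }
    }
}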

MS SQL

To allow the Debezium connector for SQL Server to capture change event records for database operations, you must first enable the change data capture (CDC) feature, which is available starting with SQL Server 2016. CDC must be enabled both on the database and on each table to be captured.
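
As an illustration of this step, the sketch below enables CDC for a hypothetical database and table over JDBC; the connection details, schema, and table names are placeholders, and the same system stored procedures can be called directly on the SQL Server instance.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class EnableSqlServerCdc {
    public static void main(String[] args) throws Exception {
        // Placeholder connection to the SQL Server instance and database
        String url = "jdbc:sqlserver://localhost:1433;databaseName=testDB;encrypt=false";
        try (Connection conn = DriverManager.getConnection(url, "sa", "Password!");
             Statement stmt = conn.createStatement()) {
            // Enable CDC at the database level
            stmt.execute("EXEC sys.sp_cdc_enable_db");
            // Enable CDC for each table whose changes should be captured
            stmt.execute("EXEC sys.sp_cdc_enable_table "
                + "@source_schema = N'dbo', "
                + "@source_name = N'customers', "
                + "@role_name = NULL, "
                + "@supports_net_changes = 0");
        }
    }
}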

The connector tracks INSERT, UPDATE, and DELETE operations at the row level and writes event records for each table to a separate Kafka topic.

When the connector first connects to an SQL Server database, it takes a consistent schema snapshot of all the tables specified in its configuration and passes it to Kafka. Next, the connector continuously monitors changes at the row level.

Create the Debezium connector

Debezium connectors, like other connectors, can be created in several ways through Kafka Connect.

On the Kafka Connect page of the ADS Control user interface, you can select plugins for creating some Debezium connectors that transfer data to ADS Kafka cluster topics, as well as control the operation of the connectors.

The remaining connectors can be installed independently. To do this, create a JAR file from the connector code downloaded from the Debezium repository and save it in the folder whose path is specified in the plugin.path parameter of the connect-distributed.properties group on the configuration page of the Kafka Connect service.
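
For example, one of these ways is to register a connector through the Kafka Connect REST API. The sketch below is a minimal illustration; the Kafka Connect address, connector name, and database settings are placeholders that must match your environment.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterDebeziumConnector {
    public static void main(String[] args) throws Exception {
        // Hypothetical registration request for a Debezium PostgreSQL connector
        String body = """
            {
              "name": "pg-inventory-connector",
              "config": {
                "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
                "database.hostname": "pg-host.example.com",
                "database.port": "5432",
                "database.user": "debezium",
                "database.password": "dbz",
                "database.dbname": "inventory",
                "topic.prefix": "inventory-server",
                "plugin.name": "pgoutput"
              }
            }
            """;

        // POST the connector configuration to the Kafka Connect REST API (placeholder address)
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://kafka-connect-host:8083/connectors"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}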
