KRaft overview
Overview
Kafka Raft (KRaft) is a protocol based on the Raft consensus algorithm that manages Kafka cluster metadata. KRaft organizes the storage of information about the cluster configuration and keeps Kafka brokers synchronized during data replication without relying on the leader election and metadata storage mechanisms provided by ZooKeeper.
NOTE
The ability to enable KRaft mode has been available since Kafka version 3.6.2.
KRaft benefits
The main characteristic of KRaft is the absence of an intermediary service, which entails the following advantages:
-
Fast and efficient metadata storage and distribution — updates are sent directly to the controller quorum, eliminating an additional transmission node (ZooKeeper).
-
Brokers and controllers can store metadata locally in their cache and access it without using a connection to an external service.
-
Reduced delays when a broker stops.
-
It is expected that the KRaft protocol will significantly increase the limit on the number of partitions both per node and per cluster.
Controller quorum
The main concept of KRaft is based on the creation of a controller quorum — a set of specialized brokers that are responsible for storing and timely updating cluster metadata.
To assign the controller role in Kafka, the process.roles parameter is set to the controller value.
The controller quorum serves only the internal __cluster_metadata topic with Kafka metadata. The topic contains a single partition that records all information about the current state of Kafka brokers, topics, and partitions.
The leader of this partition, called the active controller, is elected based on the Raft consensus algorithm by voting for the leader epoch. The active controller is responsible for writing metadata changes to the __cluster_metadata topic.
Controllers that are not the leader store an exact copy of the metadata topic as in-sync replicas (ISR) and have the right to vote for a new leader epoch. As a result of an election, any of these voters can become the leader.
Brokers whose process.roles parameter is set to broker cannot vote for the leader epoch; they are observers. Observers work like regular Kafka brokers but also store a copy of the __cluster_metadata topic, rewriting the cluster metadata state into their own cache whenever the topic changes.
The active controller also monitors the state of all controllers and brokers using timeouts and heartbeat messages sent by brokers. When a broker stops, the leader redistributes partitions to other brokers and updates information about them in the metadata topic.
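As a minimal illustration of this monitoring step, the Python sketch below tracks the last heartbeat time per broker and reports brokers whose heartbeats have timed out; the class name, field names, and timeout value are assumptions made for the example, not part of Kafka's code.

import time

# Illustrative sketch: the active controller records the time of the last
# heartbeat received from each broker and treats a broker as failed once it
# has been silent for longer than the session timeout.
class HeartbeatMonitor:
    def __init__(self, session_timeout_s=9.0):
        self.session_timeout_s = session_timeout_s
        self.last_heartbeat = {}  # broker_id -> timestamp of last heartbeat

    def on_heartbeat(self, broker_id):
        self.last_heartbeat[broker_id] = time.monotonic()

    def expired_brokers(self):
        now = time.monotonic()
        return [b for b, t in self.last_heartbeat.items()
                if now - t > self.session_timeout_s]

monitor = HeartbeatMonitor()
monitor.on_heartbeat(1)
monitor.on_heartbeat(2)
# Brokers returned here are the ones whose partitions would be reassigned
# and whose state would be updated in the metadata topic.
print(monitor.expired_brokers())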
To coordinate replicas, the high water mark (HWM) marker is used — the largest offset in the log that was written to all ISRs. Followers cache its value and update it whenever the leader’s log changes, thus maintaining a consistent state of the cluster. When a leader changes, all follower partitions truncate their logs to the HWM recorded by the new leader, and all messages above the HWM become unreadable. After the log is truncated, the follower can continue to replicate any new messages from the leader. Once all new messages are replicated to a majority of nodes, the HWM is increased.
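The truncation step can be sketched in Python as follows; the list-based log and the function name are illustrative only and assume the HWM is the offset of the last committed message, as described above.

# Simplified follower behavior at leader change: truncate the local log to the
# HWM reported by the new leader, then resume replication from that point.
def truncate_to_hwm(follower_log, leader_hwm):
    # Keep offsets 0..leader_hwm; everything above the HWM was never
    # confirmed by all ISRs and is discarded (becomes unreadable).
    del follower_log[leader_hwm + 1:]
    return follower_log

follower_log = ["m0", "m1", "m2", "m3", "m4"]       # offsets 0..4
print(truncate_to_hwm(follower_log, leader_hwm=2))  # ['m0', 'm1', 'm2']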
In addition to the process.roles parameter, the following parameters must be defined for Kafka cluster brokers configured to work in KRaft mode (an illustrative configuration example follows this list):
-
controller.quorum.voters — addresses of the nodes participating in the controller quorum, specified as a comma-separated list in the id@host:port format.
-
node.id — a node ID, must be a unique integer and is assigned during installation/extension. Cannot be the same as the ID of any other broker or controller in the cluster.
-
controller.listener.names — a comma-separated list of listener names used by the controller. When communicating with a controller quorum, a broker in KRaft mode will always use the first listener in this list.
-
metadata.log.dir — a directory for storing cluster metadata. The directory name must not be the same as any broker directory; the default is /kafka-meta.
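For illustration, a hypothetical set of values for these parameters on a quorum controller node is shown below as a Python dict whose key=value pairs mirror what would go into the node's server.properties; the host names, port, and node IDs are made up, and in ADS these values are set by ADCM.

# Hypothetical KRaft controller settings; the keys are the parameter names
# discussed above, the values are illustrative only.
kraft_controller_config = {
    "process.roles": "controller",
    "node.id": 1,  # unique integer, distinct from every other broker and controller
    "controller.quorum.voters": "1@host-1:9093,2@host-2:9093,3@host-3:9093",
    "controller.listener.names": "CONTROLLER",
    "metadata.log.dir": "/kafka-meta",  # must differ from broker log directories
}

# Render the settings as server.properties-style lines.
print("\n".join(f"{key}={value}" for key, value in kraft_controller_config.items()))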
Raft algorithm
Overview
The Raft and ZAB (used in ZooKeeper) consensus algorithms have the following conceptual similarities:
-
a single leader responsible for data processing at any given time;
-
separate processes for leader elections and log replication;
-
general focus on ensuring data consistency in distributed systems.
The Raft architecture has a simpler implementation, in which the following roles are allocated for the operation of nodes:
-
Leader of the cluster, which processes client requests and controls data replication.
-
Candidate trying to become a leader.
-
Follower recording messages and processing requests from the leader and candidates.
Vote
The leader is chosen randomly on the first run, and the remaining nodes start as followers.
The leader monitors the health of followers using timeouts and periodic heartbeats, sending heartbeat messages with or without new log entries attached.
If a follower does not receive heartbeats from the leader during the timeout, it becomes a leader candidate. Each candidate sends a request for leadership to the other nodes and also receives such requests from them. The candidate that receives a majority of votes becomes the leader. The leader is elected for a period called a term. The term is identified by an epoch, a number that is incremented with each new leader.
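The voting step can be modeled with a short Python sketch: a candidate increments the epoch, collects votes, and becomes leader only if it gains a majority. This is a toy model of the election rule, not Kafka's actual implementation, and all names in it are illustrative.

# Toy model of a Raft-style election: each node votes at most once per epoch,
# and a candidate needs a majority of the quorum to become leader.
def run_election(candidate_id, nodes, current_epoch, voted_in_epoch):
    epoch = current_epoch + 1            # new leader epoch (term)
    votes = 1                            # the candidate votes for itself
    voted_in_epoch[(candidate_id, epoch)] = candidate_id
    for node in nodes:
        if node == candidate_id:
            continue
        # A node grants its vote if it has not voted yet in this epoch.
        if (node, epoch) not in voted_in_epoch:
            voted_in_epoch[(node, epoch)] = candidate_id
            votes += 1
    is_leader = votes > len(nodes) // 2
    return is_leader, epoch

voted = {}
print(run_election(candidate_id=2, nodes=[1, 2, 3], current_epoch=7, voted_in_epoch=voted))
# (True, 8): node 2 becomes the leader for epoch 8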
Message replication
The leader sends a request to all followers to add a new record and considers the transaction successful only after receiving confirmation of the message record from the majority of nodes.
Messages are always added sequentially.
If an external client connects to the cluster through a follower, the request is still forwarded to the leader.
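As a sketch of this commit rule, the Python fragment below counts acknowledgments and treats a record as committed only once a majority of the nodes have stored it; the function and variable names are illustrative.

# Illustrative commit rule: a record written by the leader is considered
# committed only after a majority of the quorum has acknowledged it.
def is_committed(acks, quorum_size):
    return len(acks) > quorum_size // 2

acks = {"node-1", "node-3", "node-4"}     # nodes that confirmed the write
print(is_committed(acks, quorum_size=5))  # True: 3 out of 5 is a majority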
KRaft
Additionally, the KRaft protocol introduces a method for storing metadata based on Kafka's own log replication protocol, using the controller quorum. The basic principle is that each cluster broker keeps a replica of the metadata topic and does not need a connection to ZooKeeper to obtain the state of the cluster (topic partition leaders, new brokers, and so on). This avoids the extra latency and memory overhead of an external service, since storing metadata in a log is a natural part of how Kafka already works.
Metadata management
Write metadata
Each new broker or controller that joins the cluster requests a copy of the internal topic __cluster_metadata and creates a replica of it, simultaneously caching the topic data, i.e. saving the complete state of the cluster.
If the broker starts after a failure, it reads the most recent changes available in its metadata cache.
After that, it queries the active controller for events that have occurred since the last update in __cluster_metadata and in response receives the latest changes in the form of snapshots of the cluster state.
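A rough Python sketch of this catch-up step is given below: the broker starts from the last offset recorded in its local cache and asks the active controller only for the records it missed. The fetch_changes_since function is a stand-in for the real request and, like all names here, is purely illustrative.

# Hypothetical stand-in for the request a restarting broker sends to the
# active controller: return metadata records appended after last_offset.
def fetch_changes_since(controller_log, last_offset):
    return controller_log[last_offset + 1:]

# The broker's cache as it looked before the restart (offsets 0..2 applied).
broker_cache = {
    "applied_offset": 2,
    "state": {"topic-a": {"partitions": 3}, "topic-b": {"partitions": 2}},
}

controller_log = [
    (0, {"topic-a": {"partitions": 3}}),
    (1, {"topic-b": {"partitions": 1}}),
    (2, {"topic-b": {"partitions": 2}}),
    (3, {"topic-c": {"partitions": 6}}),  # appended while the broker was down
]

# After restart: read the local cache, then apply only the missing records.
for offset, delta in fetch_changes_since(controller_log, broker_cache["applied_offset"]):
    broker_cache["state"].update(delta)
    broker_cache["applied_offset"] = offset

print(broker_cache)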
Metadata replication
For Kafka running in KRaft mode, the leader tracks the HWM across all replicas of the metadata topic (not only across ISR replicas, as for regular Kafka topics).
The HWM cannot be increased until all messages written by the leader have been stored by a majority of the replicas. If there is a change of leader, the epoch increases. To reconcile their logs with the leader, followers send a request containing the last offset and the last epoch during which messages were recorded.
Based on such requests, the leader monitors the progress of brokers and responds with information about the new epoch and the last offset. The replicas then catch up with the leader or truncate their logs to the HWM sent by the leader. After a majority of brokers confirm the last offset entry, the HWM is incremented. After the log is truncated, the follower can continue to replicate any new messages from the leader.
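The reconciliation exchange can be modeled in a simplified Python sketch: the follower reports its last epoch and offset, and the leader answers with the highest offset the follower may keep, after which the follower truncates and catches up. This is a simplification of the real protocol, and all names are illustrative.

# Simplified divergence check: the leader's log is a list of (epoch, record)
# entries indexed by offset. Given the follower's last epoch and offset, the
# leader returns the highest offset the follower is allowed to keep.
def leader_divergence_point(leader_log, follower_epoch, follower_offset):
    offset = min(follower_offset, len(leader_log) - 1)
    while offset >= 0 and leader_log[offset][0] > follower_epoch:
        offset -= 1
    return offset   # the follower truncates everything above this offset

leader_log = [(1, "a"), (1, "b"), (2, "c"), (3, "d")]  # offsets 0..3
# The follower wrote up to offset 3 during epoch 2, then the leader changed.
print(leader_divergence_point(leader_log, follower_epoch=2, follower_offset=3))  # 2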
KRaft in ADS
Starting from ADS version 3.6.2.2.b1, installation of the Kafka service in KRaft mode is available via the ADCM interface. This method automatically sets the necessary parameters for brokers.
CAUTION