Kafka Tiered Storage overview
Kafka Tiered Storage is an option, available since Kafka version 3.6.0, that enables tiered data storage in a Kafka cluster.
After the option is enabled, Kafka uses two storage tiers: a local tier (the broker disks, which hold active segments) and a remote tier (the external storage, which holds closed segments).
The partition leader copies all closed segments to the remote storage using threads managed by the Remote Manager components (the interval at which closed segments are sent to the external storage is set by the remote.log.manager.task.interval.ms broker parameter). The leader then saves and publishes a link containing the indexes and metadata of the remote segment (leader epoch, producer state, offsets, and the segment's location in the remote storage). Closed segments are removed from the broker when the local retention time expires (local.retention.ms) or when the maximum local storage size is reached (local.retention.bytes).
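The parameters mentioned above can be set in server.properties; remote.log.manager.task.interval.ms is a broker parameter, while the retention limits can be set per topic (or as broker-wide defaults with the log. prefix). The values below are illustrative:

```properties
# Interval at which the RemoteLogManager task checks for closed segments
# to copy to the remote storage (broker parameter, illustrative value)
remote.log.manager.task.interval.ms=30000

# Topic-level retention for the local tier:
# keep closed segments on the broker disk for 1 hour...
local.retention.ms=3600000
# ...or until local data for the partition exceeds 1 GiB
local.retention.bytes=1073741824
```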
Followers seeking to stay in the ISR replicate the data written to the leader and cache the links to remote segments.
Writing and deleting messages are idempotent processes and are performed as transactions.
The interaction of producers and consumers with Kafka brokers does not change after the Tiered Storage option is enabled. Messages are written to active segments located at the local tier; for each consumer request that reads a message from the remote tier, a separate thread is created and served by the Remote Manager components, alongside the threads that write segments to the storage.
Benefits of Tiered Storage
- Long-term data storage in external storages.
- Less data on the Kafka brokers, and therefore less data to copy during recovery and rebalancing.
- Simpler operation of large Kafka clusters with long-term data storage.
Tiered Storage architecture
After activating the Tiered Storage option, the components responsible for managing remote segments are launched on the Kafka broker.
RemoteLogManager is an internal component that runs on each Kafka broker. Through it, the broker interacts with remote segments. RemoteLogManager does not have a public API.
ReplicaManager, the component that manages partition replicas, calls RemoteLogManager for the partitions it manages.
RemoteLogManager, in turn, passes the commands to copy segments to the storage or delete them to the RemoteStorageManager component, and maintains the corresponding remote segment metadata through RemoteLogMetadataManager.
RemoteStorageManager is an interface that manages the life cycle of remote log segments and indexes.
RemoteLogMetadataManager is an interface that manages the life cycle of remote log segment metadata with strongly consistent semantics. The default implementation stores this metadata in the internal topic __remote_log_metadata. Users can plug in their own implementation if they intend to use a different system to store remote segment metadata.
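An implementation of either interface is plugged in through broker configuration. A minimal sketch (the custom RemoteStorageManager class and plugin path are hypothetical; the metadata manager class shown is Kafka's default topic-based implementation):

```properties
# Custom RemoteStorageManager implementation (hypothetical class name and path)
remote.log.storage.manager.class.name=com.example.MyRemoteStorageManager
remote.log.storage.manager.class.path=/opt/kafka/plugins/my-rsm/*

# Default metadata manager backed by the __remote_log_metadata topic
remote.log.metadata.manager.class.name=org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager
remote.log.metadata.manager.listener.name=PLAINTEXT
```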
Tiered Storage limitations
- Once enabled for a specific topic, the option cannot be disabled.
- The option cannot be used for topics whose cleanup policy is set to compact.
- For an existing topic, remote segments cannot be migrated back to local brokers by increasing the local retention time.
- The option cannot be used with multiple log directories (JBOD) on a broker.
Kafka Tiered Storage in ADS
The Tiered Storage option in ADS is based on the Aiven implementation, which adds the following capabilities to the built-in Kafka RemoteStorageManager:
- Segment compression using a chunking mechanism: the original segments are divided into parts (chunks) and compressed before being sent to the storage.
- Segment encryption with key rotation support.
- Optimization of the number of accesses to the remote storage.
- Abstraction of the remote storage as a StorageBackend: a key/value container of byte arrays.
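In the Aiven implementation, these capabilities are configured through rsm.config.* properties passed to its RemoteStorageManager. A sketch under the assumption that the Aiven plugin is used; the exact key names should be checked against the plugin's documentation, and the values are illustrative:

```properties
remote.log.storage.manager.class.name=io.aiven.kafka.tieredstorage.RemoteStorageManager

# Chunking: split segments into 4 MiB chunks before upload
rsm.config.chunk.size=4194304

# StorageBackend implementation (filesystem backend shown as an example)
rsm.config.storage.backend.class=io.aiven.kafka.tieredstorage.storage.filesystem.FileSystemStorage
rsm.config.storage.root=/mnt/tiered-storage
```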
Additionally, ADS implements support for HDFS as remote storage, for which StorageBackend has been developed based on the Aiven solution, providing:
- Support for XML configuration files.
- Support for Kerberos authentication.
- Export of HDFS client metrics.
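The XML configuration files mentioned above are the standard Hadoop client files (core-site.xml, hdfs-site.xml). As an illustration only (the host name is a placeholder, and how ADS passes these files to the backend is described in the ADS documentation), a core-site.xml fragment for a Kerberized HDFS might look like this:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
</configuration>
```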
Tiered Storage configuration
The Tiered Storage option is configured in the /etc/kafka/conf/server.properties configuration file and consists of several blocks of parameters:
- Enabling the Tiered Storage option using the remote.log.storage.system.enable broker parameter or the remote.storage.enable topic parameter.
- Configuring the Remote Manager components.
- Setting up the storage backend.
- Configuring the chunking cache, if supported by the storage backend.
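Putting these blocks together, a minimal server.properties sketch might look as follows. The class names assume the Aiven-based implementation, the chunk cache keys follow the Aiven plugin's naming convention and should be verified against its documentation, and all values are illustrative:

```properties
# 1. Enable Tiered Storage cluster-wide
remote.log.storage.system.enable=true

# 2. Remote Manager components
remote.log.storage.manager.class.name=io.aiven.kafka.tieredstorage.RemoteStorageManager
remote.log.metadata.manager.class.name=org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager
remote.log.metadata.manager.listener.name=PLAINTEXT

# 3. Storage backend (filesystem backend shown as an example)
rsm.config.storage.backend.class=io.aiven.kafka.tieredstorage.storage.filesystem.FileSystemStorage
rsm.config.storage.root=/mnt/tiered-storage

# 4. Chunk cache, if the backend supports it (key names per the Aiven convention)
rsm.config.fetch.chunk.cache.class=io.aiven.kafka.tieredstorage.fetch.cache.DiskChunkCache
rsm.config.fetch.chunk.cache.path=/var/cache/kafka-tiered-storage
rsm.config.fetch.chunk.cache.size=1073741824
```

Per-topic activation is then done with the remote.storage.enable=true topic parameter.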
NOTE
Configuring Tiered Storage in ADS using the ADCM interface is described in the Configure and use Kafka Tiered Storage article.