ZooKeeper is a centralized coordination service that performs the following functions in Hadoop clusters:

  • Maintaining configuration information.

  • Providing a naming registry.

  • Ensuring distributed synchronization.

  • Avoiding errors such as race conditions and deadlocks.

  • Providing group services.

It is a Java-based service with bindings for Java and C. Its simple architecture and straightforward API help prevent errors when managing large distributed systems.


ZooKeeper uses the client-server architecture and contains the following components:

  • Ensemble is the collection of all ZooKeeper servers. An ensemble requires at least three nodes to be reliable: if one server goes down, the remaining two still form a quorum and keep the service working. It is recommended to use an odd number of servers in the cluster: 3, 5, 7, and so on.

  • Server is one of the nodes in the ZooKeeper ensemble. It serves client requests. There are two possible types of servers:

    • Server Leader performs automatic recovery if any of the other servers fail. The leader is elected at ZooKeeper startup.

    • Server Follower is any server in the ensemble other than the leader. It follows the instructions sent by the leader.

  • Client is a consumer of ZooKeeper services.
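The quorum arithmetic behind the odd-size recommendation can be sketched as follows (a minimal illustration, not ZooKeeper code): a quorum is a strict majority of the ensemble, so adding a fourth or sixth server does not increase the number of failures the service can survive.

```python
# Why an odd ensemble size is recommended: the service stays available
# as long as a strict majority of servers (a quorum) is up.

def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size: int) -> int:
    """How many servers may fail while a quorum remains."""
    return ensemble_size - quorum_size(ensemble_size)

for n in (3, 4, 5, 6, 7):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 and 4 servers both tolerate 1 failure; 5 and 6 both tolerate 2.
```

This is why even ensemble sizes are avoided: the extra server enlarges the quorum without improving fault tolerance.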

ZooKeeper components are displayed in the following diagram.

[Diagram: ZooKeeper components]

Data model


ZooKeeper allows distributed processes to communicate with each other through a shared hierarchical namespace, organized similarly to a standard file system. Every node in this namespace is called a znode and is identified by its path. Unlike in a typical file system, a znode can act as both a file and a directory at the same time. All data is kept in memory, which allows ZooKeeper to provide high throughput and low latency.

Since ZooKeeper was designed to keep coordination data (configurations, statuses, location information, and so on), it expects that the stored data will not be too large. Znode size cannot exceed 1 MB.

The following diagram presents an example structure containing one root znode identified by / and two children: /z1 and /z2. These znodes, in turn, can have children of their own, and so on.

[Diagram: ZooKeeper data model]

In each znode, data is always read and written atomically:

  • A read request gets all the data bytes associated with the specified znode.

  • A write request replaces all data in the znode.
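The atomic whole-znode read and write semantics, together with the 1 MB size limit, can be illustrated with a toy in-memory store (the `ZnodeStore` class and its methods are invented for this sketch; real clients use the ZooKeeper client API):

```python
# Toy in-memory znode store, for illustration only: data is read and
# written as a whole, and a znode's payload may not exceed 1 MB.

MAX_ZNODE_SIZE = 1024 * 1024  # 1 MB limit on znode data

class ZnodeStore:
    def __init__(self):
        self.tree = {"/": b""}  # path -> data bytes

    def create(self, path: str, data: bytes = b"") -> None:
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.tree:
            raise KeyError(f"parent {parent} does not exist")
        self.set_data(path, data, create=True)

    def set_data(self, path: str, data: bytes, create: bool = False) -> None:
        if len(data) > MAX_ZNODE_SIZE:
            raise ValueError("znode payload exceeds 1 MB")
        if not create and path not in self.tree:
            raise KeyError(path)
        self.tree[path] = data  # a write replaces ALL data in the znode

    def get_data(self, path: str) -> bytes:
        return self.tree[path]  # a read returns ALL data bytes

store = ZnodeStore()
store.create("/z1", b"config-v1")
store.set_data("/z1", b"config-v2")
print(store.get_data("/z1"))  # b'config-v2'
```

Note that there is no partial read or partial update: each operation touches the znode's entire payload, which is practical only because payloads are small.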

Each znode has metadata stored in a structure called stat. It includes version numbers, timestamps, related transaction IDs, and other metadata. Znodes are protected by ACLs (Access Control Lists) that list users and the privileges assigned to them.

The possible types of znodes are described below.

  • Persistent is the default type. Such a znode stays alive even after the client that created it disconnects.

  • Ephemeral is a less durable type. Such a znode stays alive only while the client that created it is connected to ZooKeeper. An ephemeral znode cannot have children.

  • Sequential. When a znode is marked as sequential during its creation, ZooKeeper appends a 10-digit sequence number to the original name, for example, /top/myspace/mynode0000000001. Sequential znodes play an important role in locking and synchronization. A sequential znode must also be persistent or ephemeral, so there are effectively two sequential types: persistent-sequential and ephemeral-sequential.

  • Container is a type created to simplify cleanup of unused nodes (parent nodes left without children). A container znode is similar to a persistent one, but it becomes a candidate for deletion by the ZooKeeper server as soon as its last child znode is deleted.

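The sequence-number suffix used by sequential znodes can be sketched as follows (assuming only the zero-padded 10-digit format shown in the example above; the real counter is maintained per parent znode by the server):

```python
# Sketch of how a sequential znode's final name is derived: a 10-digit,
# zero-padded counter is appended to the requested path.

def sequential_name(requested_path: str, counter: int) -> str:
    return f"{requested_path}{counter:010d}"

print(sequential_name("/top/myspace/mynode", 1))
# /top/myspace/mynode0000000001
```

Because the counter is monotonically increasing, clients can sort sibling znodes by name to establish an order, which is the basis of lock recipes.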

Watches are a special mechanism that enables clients to receive notifications about changes in specified znodes. A watch is triggered and removed when the monitored znode changes.

Watches can be set by operations that read the state of ZooKeeper: checking a znode's existence, reading data from a znode, and retrieving a list of a znode's children.

Watches are based on one-time triggers. After a watch notification is triggered, no further watch notifications will be sent unless the client configures a new watch.
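The one-time trigger semantics can be simulated with a small sketch (the `WatchedZnode` class is hypothetical and not the ZooKeeper client API): a watch set by a read fires once on the next change and is then removed.

```python
# Sketch of one-time watch semantics: a watch fires once on the next
# change and is removed; the client must set a new watch to keep
# receiving notifications.

class WatchedZnode:
    def __init__(self, data: bytes = b""):
        self.data = data
        self._watchers = []  # one-shot callbacks

    def get_data(self, watcher=None) -> bytes:
        if watcher is not None:      # watches are set by read operations
            self._watchers.append(watcher)
        return self.data

    def set_data(self, data: bytes) -> None:
        self.data = data
        watchers, self._watchers = self._watchers, []  # remove before firing
        for notify in watchers:
            notify("NodeDataChanged")

events = []
node = WatchedZnode(b"v1")
node.get_data(watcher=events.append)  # read with a watch
node.set_data(b"v2")  # triggers the watch once
node.set_data(b"v3")  # no watch is set, so no notification
print(events)  # ['NodeDataChanged']
```

A client that wants continuous notifications therefore re-reads the znode (setting a fresh watch) inside its notification handler.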

How it works


ZooKeeper uses a session mechanism. A session is a period during which a client interacts with a server to receive the requested service. Requests within a session are executed in FIFO order. When a client connects to a server, a session is established and the client receives a session ID.

A client connects to exactly one server to submit its requests. It can be either the leader or a follower. If the client does not receive confirmation of the successful connection, it tries to connect to another server in the ZooKeeper ensemble.

After the connection is established, the client sends heartbeats to the server at regular intervals to keep the session alive. If the ZooKeeper ensemble does not receive a heartbeat within the session timeout negotiated when the session was established, it assumes the client is unavailable.
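Session expiry based on heartbeats can be sketched as follows (a simplified model with an invented `Session` class; real timeout handling is negotiated between client and server):

```python
# Sketch of session liveness tracking: if no heartbeat arrives within
# the session timeout, the server considers the client gone.

class Session:
    def __init__(self, session_id: int, timeout: float, now: float):
        self.session_id = session_id
        self.timeout = timeout          # seconds without a heartbeat allowed
        self.last_heartbeat = now

    def heartbeat(self, now: float) -> None:
        self.last_heartbeat = now       # client pinged; session stays alive

    def is_expired(self, now: float) -> bool:
        return now - self.last_heartbeat > self.timeout

s = Session(session_id=0x1001, timeout=10.0, now=0.0)
s.heartbeat(now=8.0)
print(s.is_expired(now=15.0))  # False: last heartbeat was 7 s ago
print(s.is_expired(now=19.0))  # True: 11 s without a heartbeat
```

Expiring a session also deletes any ephemeral znodes the client created, which is what makes ephemeral nodes useful for liveness tracking.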

Read and write operations

Each server in the ZooKeeper ensemble has a local replica of the ZooKeeper database. It is an in-memory database containing the entire data tree. Updates to the znode tree are logged to disk so that the state can be restored.

With the replicated database, read requests are served directly by the server the client is connected to, without involving the leader. Requests that change data are forwarded to the leader, which processes them using an agreement protocol known as ZooKeeper Atomic Broadcast (ZAB).

The typical read operation includes the following steps:

  1. The client starts a session with one of the ensemble servers.

  2. The client sends a read request to the server. This request includes the path to the znode containing the required data.

  3. The server provides the client with the required information stored in the database replica on that server. In this operation, communication with the leader is not required.

[Diagram: Typical read sequence]

The typical write sequence looks as follows:

  1. The client starts a session with one of the ensemble servers.

  2. The client sends a write request to the server. The request includes the path to the znode where the data is to be changed.

  3. The server sends the write request to the ensemble leader.

  4. The ensemble leader forwards the write command to all the followers using atomic broadcast. The write request is committed only if a majority of the servers (a quorum) returns a positive response; in this case, the requested changes are applied to all database replicas.

  5. The response is returned to the client.
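The quorum-based commit rule in step 4 can be sketched as a single check (an illustration of the majority rule only, not of the ZAB protocol itself):

```python
# Sketch of the leader's commit rule for writes: an update is committed
# only after a majority of the ensemble (a quorum) has acknowledged it.

def commit_write(acks: int, ensemble_size: int) -> bool:
    quorum = ensemble_size // 2 + 1   # strict majority
    return acks >= quorum

# 5-server ensemble: three acknowledgements form a quorum, two do not.
print(commit_write(acks=3, ensemble_size=5))  # True
print(commit_write(acks=2, ensemble_size=5))  # False
```

This is the same majority that determines availability: a write can succeed exactly when enough servers are up to form a quorum.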

[Diagram: Typical write sequence]