YARN high availability overview

This is an overview of YARN ResourceManager (RM) High Availability (HA). The RM is responsible for tracking resources in a Hadoop cluster and for scheduling applications such as MapReduce jobs.

Prior to Hadoop 2.4, the ResourceManager was the single point of failure in a YARN cluster: if the ResourceManager went down for any reason, resource allocation and job management were interrupted, and no jobs could run on the cluster.

To avoid this issue, we need to enable the HA feature in YARN. When HA is enabled, we run another ResourceManager in parallel on a different node, called the standby ResourceManager. When the active ResourceManager goes down, the standby ResourceManager becomes active, seamlessly takes over resource management, and ensures that processing in the cluster continues.

ResourceManager HA requires ZooKeeper and HDFS services to be running.
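
Below is a minimal yarn-site.xml sketch of an HA setup. The cluster ID, host names, and ZooKeeper quorum are examples only, and the exact property set depends on your Hadoop version and distribution (older releases use yarn.resourcemanager.zk-address instead of hadoop.zk.address):

<property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>my-yarn-cluster</value>
</property>
<property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>master1.example.com</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>master2.example.com</value>
</property>
<property>
    <name>hadoop.zk.address</name>
    <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>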

Architecture

Figure: Failover components

Failover is a process in which a system transfers control to a secondary system when it detects some kind of malfunction or failure.

ResourceManager HA is implemented as an active-standby architecture: at any given time, one RM is active, and one or more RMs are in standby mode, ready to take over if the active RM fails. There are two ways to switch an RM from standby to active mode:

  • manually by the administrator;

  • automatically through the integrated failover controller.

Manual failover

To manually switch control from a failed RM to a standby RM, the administrator must first bring the active RM to standby mode and then transition a standby RM to active mode. This is done with the yarn rmadmin command.
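
For example, assuming the RMs are registered in yarn-site.xml with the IDs rm1 and rm2 (as in the sketch above), the switch might look like this:

yarn rmadmin -getServiceState rm1       # check which state rm1 is in
yarn rmadmin -transitionToStandby rm1   # demote the current active RM
yarn rmadmin -transitionToActive rm2    # promote the standby RM

Note that when automatic failover is enabled, these manual transitions are refused unless the --forcemanual flag is added.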

After the administrator completes the failover through the CLI, the standby RM becomes active, and the resource management system continues to operate. Manual failover is very costly in a dynamic environment, so it is not the recommended option.

Automatic failover

If automatic failover is enabled, the system transfers control from the failed RM to a standby RM without administrator intervention. This option is preferable to manual failover, but we always need to ensure that a split-brain scenario cannot occur.
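
Automatic failover through the embedded, ZooKeeper-based elector is turned on by default once HA is enabled; it is governed by properties such as the following in yarn-site.xml (the values shown are the defaults):

<property>
    <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
    <value>true</value>
</property>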

A split-brain scenario is a situation where multiple RMs can potentially assume the active role. This is undesirable because multiple RMs could manage parts of the cluster simultaneously, causing inconsistency in resource management and job scheduling and potentially bringing the entire system to failure.

So, at any given time, we want to make sure that exactly one RM is active in the system. We can configure ZooKeeper to coordinate this using the Leader Election method described in the next section.

Leader Election method

A ZooKeeper client (a user, a service, or an application) can create znodes in ZooKeeper. A znode is an entity provided by ZooKeeper (a construct similar to both a directory and a file in a file system) in which a client stores configuration or identification data (up to 1 MB by default). A znode can also contain child znodes, and this hierarchy is similar to that of a Linux file system.

Figure: Znode tree

In the above diagram, / is the parent znode of /zoo. duck is a child znode of the zoo znode, and it is denoted as /zoo/duck. Similarly, /zoo is the parent znode of /zoo/goat and /zoo/cow.
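
The hierarchy from the diagram can be reproduced with the ZooKeeper command-line client (zkCli.sh); the data strings here are just placeholders:

create /zoo zoo_data
create /zoo/duck duck_data
create /zoo/goat goat_data
create /zoo/cow cow_data
ls /zoo

The last command lists duck, goat, and cow as the children of /zoo.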

There are two types of znodes:

  • Ephemeral. An ephemeral znode is deleted automatically when the client session that created it ends. For example, if RM1 has created an ephemeral znode named eph_znode, this znode is deleted as soon as RM1’s session with ZooKeeper is closed.

  • Persistent. A persistent znode exists in ZooKeeper until it is explicitly deleted. Unlike an ephemeral znode, it is not affected by the lifetime of the client that created it. For example, if RM1 creates a persistent znode, the znode still exists even after RM1’s session with ZooKeeper is disconnected and RM1 goes down; it can be removed only by deleting it explicitly (see the zkCli.sh example after this list).
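
With zkCli.sh, the difference between the two types can be demonstrated directly (the -e flag creates an ephemeral znode; the data strings are placeholders):

create /persistent_znode some_data
create -e /eph_znode some_data

If the client session that issued these commands ends (for example, after quit), /eph_znode disappears automatically, while /persistent_znode remains until someone runs delete /persistent_znode.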

Suppose the cluster has two RMs – RM1 and RM2 – competing to become active. We configure ZooKeeper to handle this through Leader Election. The RM that first succeeds in creating an ephemeral znode named ActiveStandbyElectorLock gets the active role. The key point is that only one RM can succeed in creating the ActiveStandbyElectorLock znode.

If RM1 was the first to create the ActiveStandbyElectorLock znode, RM2 can no longer create such a znode. Thus, RM1 wins the election and becomes active, while RM2 takes the standby role. Before going into standby mode, RM2 places a watch on the ActiveStandbyElectorLock znode so that it is notified when the znode is updated or removed.

Figure: System state before failure

If RM1 then goes down, the ActiveStandbyElectorLock znode is deleted automatically because it is ephemeral. We now expect the standby RM2 to become active.

RM2 is notified of the removal of the znode it is watching, so it tries to create its own ephemeral ActiveStandbyElectorLock znode, succeeds, and takes the active role.
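
The following is a minimal sketch of this ephemeral-znode election pattern using the plain ZooKeeper Java client. The znode path, timeout, and class are illustrative only; the real elector built into the ResourceManager (Hadoop's ActiveStandbyElector) is considerably more involved:

import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

// Illustrative sketch only: error handling, retries, and fencing are omitted.
public class RmElectionSketch implements Watcher {

    // Hypothetical lock path; the parent znodes are assumed to already exist.
    private static final String LOCK_PATH =
            "/yarn-leader-election/my-yarn-cluster/ActiveStandbyElectorLock";

    private final ZooKeeper zk;
    private final byte[] myInfo;   // e.g. this RM's hostname and port

    public RmElectionSketch(String zkQuorum, byte[] myInfo) throws Exception {
        this.zk = new ZooKeeper(zkQuorum, 10000, this);
        this.myInfo = myInfo;
    }

    // Try to win the election by creating the ephemeral lock znode.
    public void tryToBecomeActive() throws KeeperException, InterruptedException {
        try {
            zk.create(LOCK_PATH, myInfo,
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("Won the election: acting as the active RM");
        } catch (KeeperException.NodeExistsException e) {
            // Another RM already holds the lock: go standby and watch the znode.
            Stat stat = zk.exists(LOCK_PATH, this);
            if (stat == null) {
                tryToBecomeActive();   // the lock vanished in the meantime, retry
            } else {
                System.out.println("Lost the election: acting as the standby RM");
            }
        }
    }

    // Watch callback: when the lock znode disappears, run the election again.
    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDeleted
                && LOCK_PATH.equals(event.getPath())) {
            try {
                tryToBecomeActive();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}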

Figure: System state after failover

Sometimes an RM is disconnected only temporarily and then tries to reconnect to ZooKeeper. However, ZooKeeper cannot tell the cause of the RM's disconnection, so it deletes the ephemeral znodes created by that RM. This can lead to a split-brain scenario in which both RMs act as the active RM.

To avoid this, RM2 reads the information about the active RM from the persistent ActiveBreadCrumb znode, where RM1 has already written its details (such as its hostname and TCP port). RM2 realizes that RM1 is still alive, so it tries to fence RM1 (isolate it so that it can no longer act as the active RM). On success, RM2 overwrites the data in the persistent ActiveBreadCrumb znode with its own details (the hostname and TCP port of RM2), since RM2 is now the active RM.
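
Both znodes live under a base path controlled by the yarn.resourcemanager.ha.automatic-failover.zk-base-path property; with the default base path and the example cluster-id my-yarn-cluster used above, the lock znode path would be /yarn-leader-election/my-yarn-cluster/ActiveStandbyElectorLock:

<property>
    <name>yarn.resourcemanager.ha.automatic-failover.zk-base-path</name>
    <value>/yarn-leader-election</value>
</property>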

Figure: Fencing a competing resource manager

Client, ApplicationMaster, and NodeManager on RM failover

When there are multiple RMs, the configuration in the yarn-site.xml file used by clients and nodes must list all these RMs.

Clients, ApplicationMasters, and NodeManagers try connecting to the RMs in a round-robin order until they reach the active RM. If the active RM goes down, they resume the round-robin polling until they reach the new active RM. The failover policy for them is defined by one of the following properties:

  • When HA is used, the retry logic is specified by the yarn.client.failover-proxy-provider property, which defaults to org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider. You can override this logic by implementing org.apache.hadoop.yarn.client.RMFailoverProxyProvider and setting the property to the name of your class (see the example after this list).

  • When running in the non-HA mode, the retry logic is specified by the yarn.client.failover-no-ha-proxy-provider property.
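
For example, to set the HA failover policy explicitly in yarn-site.xml (the value shown is the default mentioned above; replace it with your own class name to customize the behavior):

<property>
    <name>yarn.client.failover-proxy-provider</name>
    <value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
</property>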
