YARN high availability overview
This is an overview of YARN ResourceManager (RM) High Availability (HA). RM is responsible for tracking resources in a Hadoop cluster and scheduling applications such as MapReduce jobs.
Prior to Hadoop 2.4, the ResourceManager was the single point of failure in a YARN cluster: if it went down for any reason, resource allocation and job management stopped, and no jobs could run on the cluster.
To avoid this issue, we need to enable the HA feature in YARN. With HA enabled, a second ResourceManager runs in parallel on another node as the standby RM. When the active RM goes down, the standby RM becomes active, seamlessly takes over resource management, and keeps the cluster running.
ResourceManager HA requires ZooKeeper and HDFS services to be running.
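As a rough illustration, a minimal yarn-site.xml fragment enabling RM HA might look like the following. The RM IDs, hostnames, and ZooKeeper quorum are placeholders, and the exact property set depends on your Hadoop version (newer releases use hadoop.zk.address instead of yarn.resourcemanager.zk-address).

```xml
<!-- Sketch of an RM HA configuration; hostnames and ports are placeholders -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>yarn-cluster</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2.example.com</value>
</property>
<property>
  <!-- ZooKeeper quorum used for leader election -->
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```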
Architecture
Failover is a process in which a system transfers control to a secondary system when it detects some kind of malfunction or failure.
The ResourceManager HA is implemented as an active-standby architecture: at any given time one RM is active, and one or more RMs are in standby mode, ready to take over if the active RM fails. There are two ways to switch from standby to active mode:
- manually by the administrator;
- automatically through the integrated failover controller.
Manual failover
To manually switch control from a failed RM to the standby RM, the administrator must first bring the active RM to standby mode and then transition a standby RM to active mode, using the yarn rmadmin command.
After the administrator completes the failover through the CLI, the standby RM becomes active, and the resource management system continues to operate. Because this depends on an operator noticing the failure and reacting, manual failover is slow and error-prone in a dynamic environment, so it is not the recommended option.
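A sketch of the manual failover steps with the yarn rmadmin CLI is shown below; rm1 and rm2 are the RM IDs from yarn.resourcemanager.ha.rm-ids, and when automatic failover is enabled these transitions typically also require the --forcemanual flag.

```bash
# Check which RM is currently active
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2

# Demote the current (failed or active) RM, then promote the standby one
yarn rmadmin -transitionToStandby rm1
yarn rmadmin -transitionToActive rm2
```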
Automatic failover
If automatic failover is enabled, the system transfers control from the failed RM to a standby RM without operator intervention. This option is preferable to manual failover, but we always need to ensure that no split-brain scenario occurs.
A split-brain scenario is a situation where multiple RMs can potentially assume the active role. This is undesirable because multiple RMs could proceed to manage some of the nodes of the cluster simultaneously. This would cause inconsistency in resource management and job scheduling, potentially leading the entire system to failure.
So, at any given time, we want to make sure that there is one and only one RM active in the system. We can configure ZooKeeper to coordinate this functionality using the Leader Election method considered in the next section.
Leader Election method
A ZooKeeper client (a user, a service, or an application) can create znodes in ZooKeeper. A znode is an entity provided by ZooKeeper that behaves like both a directory and a file in a file system: a client stores a small amount of configuration or identification data in it (less than 1 MB), and a znode can also contain child znodes, so the hierarchy resembles that of a Linux file system.
Consider, for example, a hierarchy with the znodes /zoo, /zoo/duck, /zoo/goat, and /zoo/cow. The root znode / is the parent of /zoo; duck is a child znode of the zoo znode and is denoted as /zoo/duck. Similarly, /zoo is the parent znode of /zoo/goat and /zoo/cow.
There are two types of znodes:
- Ephemeral. An ephemeral znode is deleted when the session in which it was created ends. For example, if RM1 has created an ephemeral znode named eph_znode, this znode is deleted as soon as RM1's session with ZooKeeper is closed or expires.
- Persistent. A persistent znode exists in ZooKeeper until it is deleted. Unlike an ephemeral znode, it is not tied to the session of the process that created it: if RM1 creates a persistent znode and then its session with ZooKeeper ends or RM1 goes down, the znode still exists. It can be removed only by deleting it explicitly (see the zkCli.sh sketch after this list).
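To make the difference concrete, here is a small sketch using ZooKeeper's command-line client zkCli.sh; the paths and data are purely illustrative.

```bash
# Persistent znodes: stay until explicitly deleted
create /zoo "cluster-config"
create /zoo/duck "some-data"

# Ephemeral znode: -e ties it to this client session
create -e /eph_znode "created-by-rm1"

# After this session closes (quit), /eph_znode disappears,
# while /zoo and /zoo/duck remain until someone runs delete on them
```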
Suppose the cluster has two RMs, RM1 and RM2, competing to become active, and ZooKeeper is configured to handle this through leader election. The RM that first succeeds in creating an ephemeral znode named ActiveStandbyElectorLock gets the active role; ZooKeeper guarantees that only one of the RMs is able to create a znode with that name.
If RM1 is the first to create the ActiveStandbyElectorLock znode, RM2 can no longer create it. Thus, RM1 wins the election and becomes active, while RM2 gets the standby role. Before going to standby mode, RM2 places a watch on the ActiveStandbyElectorLock znode so that it is notified when the znode is updated or removed.
If RM1 then goes down, its ZooKeeper session ends and the ephemeral ActiveStandbyElectorLock znode is deleted automatically. We now expect the standby RM2 to become active. RM2 is notified of the removal of the znode it is watching, so it tries to create its own ActiveStandbyElectorLock ephemeral znode, succeeds, and takes the active role.
Sometimes an RM is disconnected only temporarily and will try to reconnect to ZooKeeper. Yet ZooKeeper cannot distinguish a temporary disconnection from a real failure, so once the session expires it deletes the ephemeral znodes created by that RM. This can lead to a split-brain scenario, where both RMs act as the active RM.
To avoid this, RM2 reads the information about the active RM from the persistent ActiveBreadCrumb znode, where RM1 has already written its identity (such as its hostname and TCP port). RM2 sees that RM1 may still be alive and acting as active, so it fences RM1, forcing it out of the active role. On success, RM2 overwrites the data in the persistent ActiveBreadCrumb znode with its own identity (the hostname and TCP port of RM2), since RM2 is now the active RM.
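In a running cluster you can inspect both znodes with zkCli.sh. Assuming the default election base path (yarn.resourcemanager.ha.automatic-failover.zk-base-path, which defaults to /yarn-leader-election) and a cluster ID of yarn-cluster, the layout looks roughly like this:

```bash
# List the election znodes for this cluster ID
ls /yarn-leader-election/yarn-cluster
# [ActiveBreadCrumb, ActiveStandbyElectorLock]

# The persistent breadcrumb holds the identity of the RM that last became active
get /yarn-leader-election/yarn-cluster/ActiveBreadCrumb
```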
Client, ApplicationMaster, and NodeManager on RM failover
When there are multiple RMs, the configuration in the yarn-site.xml file used by clients and nodes must list all these RMs.
Clients, ApplicationMasters, and NodeManagers try connecting to the RMs in a round-robin order until they hit the active RM. If the active RM goes down, they resume the round-robin polling until they hit the "new" active RM. The failover policy for them is defined by one of the following properties:
- When HA is used, the retry logic is specified by the yarn.client.failover-proxy-provider property, which defaults to org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider. You can override the logic with a class implementing org.apache.hadoop.yarn.client.RMFailoverProxyProvider and set the property to this class name (see the sketch after this list).
- When running in the non-HA mode, the retry logic is specified by the yarn.client.failover-no-ha-proxy-provider property.
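For illustration, overriding the HA retry logic in yarn-site.xml could look like the fragment below; com.example.MyRMFailoverProxyProvider is a hypothetical class implementing RMFailoverProxyProvider, not a class shipped with Hadoop.

```xml
<property>
  <name>yarn.client.failover-proxy-provider</name>
  <!-- Hypothetical custom implementation; the default is
       org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider -->
  <value>com.example.MyRMFailoverProxyProvider</value>
</property>
```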