YARN architecture

YARN schedules application launches and distributes cluster resources among different types of data processing, such as batch, interactive, and stream processing of data stored in HDFS.

YARN components

YARN includes four main components:

  • Resource Manager runs as a master daemon and manages resource allocation in the cluster.

  • Node Manager is installed on every DataNode and is responsible for allocating local resources to application tasks running on those nodes. Node Managers are subordinate to the Resource Manager.

  • Application Master coordinates the lifecycle of an application. It interacts with the Resource Manager and Node Managers to run and monitor the application tasks.

  • Container is a logical bundle of resources including RAM, CPU, network, hard drive, and other countable resources on a single node. It is allocated to an application task.

Resource Manager

This component is the ultimate authority on resource allocation in the cluster. The Resource Manager tracks the cluster resources and decides how to allocate the available resources among competing applications. When it receives a request, the Resource Manager splits it into separate resource requests for the Node Managers on whose nodes the actual processing will take place. The Resource Manager contains two major components: Scheduler and Application Manager.

Scheduler

This is a part of the Resource Manager with the following features:

  • The Scheduler is responsible for allocating resources to applications with various restrictions such as resource limits, queues, and so on.

  • This is a pure scheduler, which means it does not do any monitoring or application status tracking. If an application fails due to hardware failure or other reasons, the Scheduler does not guarantee failover.

  • It performs scheduling based on resource requirements coming from applications.

  • The Scheduler is pluggable. You can use either the Capacity Scheduler or the Fair Scheduler plugin in your Resource Manager. By default, the Capacity Scheduler is used.
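The scheduling behavior described above can be illustrated with a minimal sketch. It is not the Capacity Scheduler or Fair Scheduler implementation: the `ToyScheduler` and `Request` classes, the queue names, and the memory shares are all hypothetical, and the only assumption is that each queue may use a fixed fraction of the cluster memory.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical resource request coming from an application."""
    app_id: str
    queue: str
    memory_mb: int
    vcores: int


@dataclass
class ToyScheduler:
    """Illustrative only: grants a request if cluster and queue limits allow it."""
    total_memory_mb: int
    total_vcores: int
    queue_memory_share: dict  # queue name -> fraction of cluster memory (assumed model)
    used_memory_mb: int = 0
    used_vcores: int = 0
    queue_used_mb: dict = field(default_factory=dict)

    def allocate(self, req: Request) -> bool:
        queue_limit = self.total_memory_mb * self.queue_memory_share[req.queue]
        queue_used = self.queue_used_mb.get(req.queue, 0)
        fits_cluster = (
            self.used_memory_mb + req.memory_mb <= self.total_memory_mb
            and self.used_vcores + req.vcores <= self.total_vcores
        )
        fits_queue = queue_used + req.memory_mb <= queue_limit
        if not (fits_cluster and fits_queue):
            # A pure scheduler only decides; it does not monitor or retry.
            return False
        self.used_memory_mb += req.memory_mb
        self.used_vcores += req.vcores
        self.queue_used_mb[req.queue] = queue_used + req.memory_mb
        return True


scheduler = ToyScheduler(
    total_memory_mb=8192,
    total_vcores=8,
    queue_memory_share={"default": 0.5, "analytics": 0.5},
)
granted = [
    scheduler.allocate(Request("app-1", "default", 3072, 2)),
    scheduler.allocate(Request("app-2", "default", 2048, 2)),    # exceeds the queue share
    scheduler.allocate(Request("app-3", "analytics", 4096, 4)),
]
print(granted)
```

In this sketch the second request is rejected because the `default` queue would exceed its share, even though the cluster as a whole still has free memory.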

Application Manager

This is another part of the Resource Manager with the following features:

  • It is responsible for accepting job submissions.

  • It negotiates with the Resource Manager the first container, which runs the application-specific Application Master.

  • It manages the running Application Masters and provides a service for restarting the Application Master container on failure.

Node Manager

Node Manager takes care of an individual node in the Hadoop cluster and manages the application tasks on that node. Its primary goal is to manage the application containers that it creates at the request of the Resource Manager.

The Node Manager performs the following operations:

  • Reports its health and the resource capacity of the node to the Resource Manager.

  • Creates and monitors containers and runs processes in them as requested by the Resource Manager and Application Master.

  • Controls resource usage by individual containers.

  • Provides the log service for container processes.

  • Destroys containers as directed by the Application Master or Resource Manager.
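The operations listed above can be sketched as a small simulation. The `ToyNodeManager` class and its method names are hypothetical and do not mirror the real Node Manager API. The sketch also assumes a simple enforcement policy in which a container that exceeds its memory limit is destroyed.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNodeManager:
    """Illustrative only: tracks containers on one node."""
    node_id: str
    capacity_mb: int
    containers: dict = field(default_factory=dict)  # container id -> memory limit in MB

    def start_container(self, container_id: str, memory_mb: int) -> None:
        self.containers[container_id] = memory_mb

    def enforce_limits(self, observed_usage_mb: dict) -> list:
        """Destroy containers whose observed usage exceeds their limit (assumed policy)."""
        killed = [
            cid for cid, used in observed_usage_mb.items()
            if cid in self.containers and used > self.containers[cid]
        ]
        for cid in killed:
            del self.containers[cid]
        return killed

    def heartbeat(self) -> dict:
        """Build the status report that would be sent to the Resource Manager."""
        allocated = sum(self.containers.values())
        return {
            "node": self.node_id,
            "healthy": True,
            "available_mb": self.capacity_mb - allocated,
            "containers": sorted(self.containers),
        }


nm = ToyNodeManager("node-1", capacity_mb=4096)
nm.start_container("c1", 1024)
nm.start_container("c2", 2048)
killed = nm.enforce_limits({"c1": 900, "c2": 2500})  # c2 is over its limit
report = nm.heartbeat()
print(killed, report)
```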

Application Master

An application is submitted as a single job in YARN. Each application has its own Application Master instance. Each type of application has its own Application Master implementation, packaged as a JAR file that is one of the resources the application requires. For example, MapReduce applications require the MRAppMaster Application Master.

The Application Master coordinates the application processes in the cluster and handles task failures. It negotiates resources with the Resource Manager and works with Node Managers to create containers, execute application tasks in them, and monitor the task processes. Once started, it periodically sends heartbeats to the Resource Manager to confirm its health and update the record of its resource demands.
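The heartbeat-driven negotiation can be pictured with the following sketch. Both classes are hypothetical simplifications: resources are counted in whole containers, and each heartbeat carries the number of containers the Application Master still needs.

```python
class ToyResourceManager:
    """Illustrative only: hands out containers from a fixed pool."""

    def __init__(self, free_containers: int):
        self.free_containers = free_containers

    def allocate(self, ask: int) -> int:
        # Grant as many containers as are currently free.
        granted = min(ask, self.free_containers)
        self.free_containers -= granted
        return granted

    def release(self, count: int) -> None:
        self.free_containers += count


class ToyApplicationMaster:
    """Illustrative only: tracks pending and running tasks of one application."""

    def __init__(self, tasks: int):
        self.pending = tasks   # tasks still waiting for a container
        self.running = 0       # tasks currently running in containers

    def heartbeat(self, rm: ToyResourceManager) -> int:
        # Each heartbeat reports the outstanding demand and collects new grants.
        granted = rm.allocate(self.pending)
        self.pending -= granted
        self.running += granted
        return granted

    def task_finished(self, rm: ToyResourceManager) -> None:
        self.running -= 1
        rm.release(1)


rm = ToyResourceManager(free_containers=2)
am = ToyApplicationMaster(tasks=3)
history = []
history.append((am.heartbeat(rm), am.pending, am.running))
am.task_finished(rm)  # one container is returned to the pool
history.append((am.heartbeat(rm), am.pending, am.running))
print(history)
```

Each tuple in `history` holds the number of containers granted on that heartbeat, followed by the pending and running task counts.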

Container

This is a collection of physical resources such as RAM, CPU cores, and disk space allocated to a specific application task on a cluster node. Each YARN container has its own launch context used during the container lifecycle. This record contains an environment variable map, security tokens, a payload for Node Manager services, the command required to create the process in the container, and other parameters. It grants the task rights to use a specific amount of resources (memory, CPU, and others) on a specific node.
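As a rough illustration, a container and its launch context can be modeled as plain records. The `ToyContainer` and `ToyLaunchContext` classes, their field names, and the sample command are hypothetical. They follow the parts listed above and are not the actual YARN record types.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLaunchContext:
    """Illustrative only: what is needed to start the process in a container."""
    command: list                                       # command that creates the process
    environment: dict = field(default_factory=dict)     # environment variable map
    security_tokens: bytes = b""                        # opaque credentials
    service_data: dict = field(default_factory=dict)    # payload for Node Manager services


@dataclass
class ToyContainer:
    """Illustrative only: a grant of resources on one specific node."""
    container_id: str
    node: str
    memory_mb: int
    vcores: int
    launch_context: ToyLaunchContext

    def describe(self) -> str:
        env = " ".join(
            f"{key}={value}"
            for key, value in sorted(self.launch_context.environment.items())
        )
        cmd = " ".join(self.launch_context.command)
        return (
            f"{self.container_id}@{self.node} "
            f"[{self.memory_mb} MB, {self.vcores} vcores]: {env} {cmd}"
        )


container = ToyContainer(
    container_id="container_01",
    node="node-1",
    memory_mb=1024,
    vcores=1,
    launch_context=ToyLaunchContext(
        command=["java", "-Xmx768m", "ExampleTask"],  # hypothetical task command
        environment={"TASK_ID": "7"},
    ),
)
summary = container.describe()
print(summary)
```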
