CapacityScheduler
CapacityScheduler is a pluggable scheduler that you can use in a Hadoop framework. It improves multi-tenancy of a shared cluster by allocating a certain capacity of the entire cluster to each organization using this cluster.
Security system
CapacityScheduler enables each organization to get its own queue with a portion of the cluster capacity allocated specifically for that queue.
In a shared cluster, security is essential. Each queue has an access control list (ACL) that specifies which users can submit applications to that queue. CapacityScheduler also ensures that users cannot view or modify applications run by other users in other queues. Per-queue and system administrator roles are also supported.
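As an illustration of the per-queue ACLs described above, the following sketch restricts submission to a queue named sales that sits directly under root. The user names (alice, bob, carol) and the group name (sales-team) are hypothetical, and the snippet assumes that ACL enforcement is switched on in yarn-site.xml. An ACL value is a comma-separated list of users, then a space, then a comma-separated list of groups.

```xml
<!-- yarn-site.xml: queue ACLs are only checked when ACLs are enabled -->
<property>
  <name>yarn.acl.enable</name>
  <value>true</value>
</property>

<!-- capacity-scheduler.xml -->
<!-- A single space means "no one". The root ACL defaults to * (everyone),
     and child queues inherit access from their parents, so root is
     locked down first; otherwise the sales ACL below has no effect. -->
<property>
  <name>yarn.scheduler.capacity.root.acl_submit_applications</name>
  <value> </value>
</property>
<!-- Hypothetical users alice and bob, plus members of the hypothetical
     group sales-team, may submit applications to root.sales -->
<property>
  <name>yarn.scheduler.capacity.root.sales.acl_submit_applications</name>
  <value>alice,bob sales-team</value>
</property>
<!-- Hypothetical user carol may administer applications in root.sales -->
<property>
  <name>yarn.scheduler.capacity.root.sales.acl_administer_queue</name>
  <value>carol</value>
</property>
```

Because access is inherited down the hierarchy, it is usually simplest to keep the ACLs on parent queues narrow and grant access on the leaf queues.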
Hierarchical queues
CapacityScheduler supports hierarchical queues. A queue is hierarchical if subqueues can be created within it. The portion of the cluster resources allocated to the queue can be further distributed among its subqueues. CapacityScheduler has a predefined queue called root. All queues in the system are children of the root queue.
An added benefit is that an organization can exceed its queue capacity and use more cluster resources than it was assigned, but only when the extra capacity is available and not being used by others. This gives organizations more flexibility in a cost-effective manner.
Configuration
To use CapacityScheduler in YARN, start by assigning the appropriate scheduler class in yarn-site.xml:
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
Setting up queues
To set up queues, edit the capacity-scheduler.xml configuration file:
- In the yarn.scheduler.capacity.root.queues property, specify a comma-separated list of top-level queues.
- To set up subqueues in a queue defined as <queue-path>, specify a comma-separated list of subqueues in the yarn.scheduler.capacity.<queue-path>.queues property of this queue.
- To configure queue capacity as a percentage (%), use the yarn.scheduler.capacity.<queue-path>.capacity property of that queue. At each level, the capacities of all sibling queues must add up to 100.
- To set the maximum capacity for a queue as a percentage, use the yarn.scheduler.capacity.<queue-path>.maximum-capacity property. This property limits the elasticity for applications in the queue. It defaults to -1, which disables this limitation.
Queue configuration example
Let us configure two top-level queues (direct children of root):
- sales
- finance
Within the sales queue, there are two subqueues:
- apac
- emea
The hierarchy of queues in the CapacityScheduler configuration looks as follows:
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>sales,finance</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.sales.queues</name>
<value>apac,emea</value>
</property>
To give 70% of the cluster capacity to sales and 30% to finance, add the following settings:
<property>
<name>yarn.scheduler.capacity.root.sales.capacity</name>
<value>70</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.finance.capacity</name>
<value>30</value>
</property>
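Subqueue capacities are relative to the parent queue, so allocating 65% of the sales queue capacity to apac and 35% to emea guarantees them 0.65 × 70% = 45.5% and 0.35 × 70% = 24.5% of the cluster, respectively. To do this, add the following settings: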
<property>
<name>yarn.scheduler.capacity.root.sales.apac.capacity</name>
<value>65</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.sales.emea.capacity</name>
<value>35</value>
</property>
To ensure that the sales queue doesn’t use more than 80% of the cluster resources (even if more resources are available), add the following:
<property>
<name>yarn.scheduler.capacity.root.sales.maximum-capacity</name>
<value>80</value>
</property>
If a user, for example, yarn, submits applications without specifying a queue and there is no queue named default, you must specify a default queue for that user. The following snippet makes the marketing queue the default one for the yarn user (this assumes that a marketing queue is defined in the hierarchy; it is not part of the sales and finance example above):
<property>
<name>yarn.scheduler.capacity.queue-mappings</name>
<value>u:yarn:marketing</value>
</property>
In this configuration:
- u means that this is a user mapping (g means a mapping for a group);
- yarn is the user name;
- marketing is the name of the default queue for the specified user.
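The same property can hold several mappings separated by commas. The following sketch adds a group mapping next to the user mapping. The group name analysts is hypothetical, and both target queues (marketing and finance) are assumed to exist in the queue hierarchy.

```xml
<property>
  <name>yarn.scheduler.capacity.queue-mappings</name>
  <!-- Each mapping has the form [u|g]:[name]:[queue_name].
       u:yarn:marketing   maps the user yarn to the marketing queue.
       g:analysts:finance maps members of the hypothetical group
                          analysts to the finance queue. -->
  <value>u:yarn:marketing,g:analysts:finance</value>
</property>
```

Mappings are evaluated from left to right and the first match is used, so if the yarn user also belongs to the analysts group, its applications still go to marketing.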