CapacityScheduler

CapacityScheduler is a pluggable scheduler that you can use in a Hadoop framework. It improves multi-tenancy of a shared cluster by allocating a certain capacity of the entire cluster to each organization using this cluster.

Security system

CapacityScheduler enables each organization to get its own queue with a portion of the cluster capacity allocated specifically for that queue.

In a shared cluster, security becomes very important. For each queue, there is an access control list (ACL) containing users that can submit applications to individual queues. It also ensures that users cannot view or modify applications run by other users in other queues. Also per-queue and system administrator roles are supported.

Hierarchical queues

CapacityScheduler supports hierarchical queues. A queue is hierarchical if subqueues can be created within this queue. The portion of the cluster resources allocated to the queue can be further distributed among its subqueues. CapacityScheduler has a predefined queue called root. All queues in the system are the children of the root queue.

An added benefit is that an organization can exceed its queue capacity and use more cluster resources than it was assigned, yet only if the extra capacity is available and not being used by others. This gives organizations more flexibility in a cost-effective manner.

Configuration

To use CapacityScheduler in YARN, start with assigning the appropriate scheduler class in yarn-site.xml:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

Setting up queues

To set up queues, edit the capacity-scheduler.xml configuration file:

  • In the yarn.scheduler.capacity.root.queues property, specify a list of comma-separated top-level queues.

  • To set up subqueues in a queue defined as <queue-path>, in the yarn.scheduler.capacity.<queue-path>.queues property of this queue, specify a list of subqueues.

  • To configure queue capacity in percentage (%), use the yarn.scheduler.capacity.<queue-path>.capacity property of that queue. The sum of capacities for all queues (at each level) must be equal to 100.

  • To set up the maximum capacity for a queue in percentage, use the yarn.scheduler.capacity.<queue-path>.maximum-capacity property. This property limits the elasticity for applications in the queue. It defaults to -1, which disables this limitation.

Queue configuration example

Let us configure two top-level queues (descending directly from root):

  • sales

  • finance

Within the sales queue, there are two subqueues:

  • apac

  • emea

The hierarchy of queues in the CapacityScheduler configuration looks as follows:

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>sales, finance</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.sales.queues</name>
  <value>apac,emea</value>
</property>

To give 70% of the queue capacity to sales and 30% to finance, add the following settings:

<property>
  <name>yarn.scheduler.capacity.root.sales.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.finance.capacity</name>
  <value>30</value>
</property>

To allocate 65% of the sales queue capacity to apac and 35% to emea, add the following settings:

<property>
  <name>yarn.scheduler.capacity.root.sales.apac.capacity</name>
  <value>65</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.sales.emea.capacity</name>
  <value>35</value>
</property>

To ensure that the sales queue doesn’t use more than 80% of the cluster resources (even if the resources are available), add the following:

<property>
  <name>yarn.scheduler.capacity.root.sales.maximum-capacity</name>
  <value>80</value>
</property>

If a user, for example, yarn supplies some applications without specifying a queue and there is no default queue, then you must specify a default queue for that user. The following snippet makes the marketing queue the default one for the yarn user:

<property>
   <name>yarn.scheduler.capacity.queue-mappings</name>
   <value>u:yarn:marketing</value>
</property>

In this configuration:

  • u means that this a user property (g means a property for a group);

  • yarn is the user name;

  • marketing is the name of the default queue for the specified user.

Found a mistake? Seleсt text and press Ctrl+Enter to report it