Pluggable shuffle and sort in MapReduce

The sort and shuffle phases in MapReduce are performed by default Hadoop services, but you can configure a custom shuffle and sort implementation for your cluster.

Possible use cases include shuffling data over an application protocol other than HTTP, such as RDMA, or replacing the sort logic with custom algorithms that enable hash aggregation and Limit-N queries.

For more information on MapReduce, see the MapReduce overview and MapReduce architecture articles.

CAUTION

The pluggable shuffle and sort capabilities are experimental and unstable. The provided APIs may change and break compatibility in future versions of Hadoop.

Shuffling is performed by two components: the NodeManager’s auxiliary service, which serves map outputs, and the shuffle consumer plugin, which fetches them on the reduce side.

To change the default auxiliary service, edit the NodeManager configuration properties listed in the table below.

NodeManager’s auxiliary service parameters

| Parameter | Description | Default value |
|---|---|---|
| yarn.nodemanager.aux-services | The auxiliary service name | …,mapreduce_shuffle |
| yarn.nodemanager.aux-services.mapreduce_shuffle.class | The auxiliary service class name | org.apache.hadoop.mapred.ShuffleHandler |
| yarn.nodemanager.aux-services.<class_name>.classpath | The local path to the auxiliary service JAR files and their dependencies. The value can point to a single JAR file, use the {local_dir_to_jar}/* pattern to load all JARs in a directory, or list multiple paths separated by : | - |
| yarn.nodemanager.aux-services.<class_name>.remote-classpath | The parameter to use instead of yarn.nodemanager.aux-services.<class_name>.classpath if the auxiliary service JAR files are located in a remote file system | - |

To change these properties on all hosts via ADCM:

  1. On the Clusters page, select the desired cluster.

  2. Navigate to Services and click YARN.

  3. Toggle the Show advanced option and find yarn-site.xml.

  4. Open the parameter drop-down list, select yarn.nodemanager.aux-services, and add a new property.

  5. Select yarn.nodemanager.aux-services.mapreduce_shuffle.class and enter a new value.

  6. Find the Custom yarn-site.xml parameter, select Add property, and enter the new parameter name depending on the location of your auxiliary service JAR files: either yarn.nodemanager.aux-services.<class_name>.classpath or yarn.nodemanager.aux-services.<class_name>.remote-classpath.

  7. Confirm changes to YARN configuration by clicking Save.

  8. In the Actions drop-down menu, select Restart, make sure the Apply configs from ADCM option is set to true, and click Run.

To change the default shuffle consumer plugin, set a new value for the mapreduce.job.reduce.shuffle.consumer.plugin.class property.

Setting this property in the configuration of an individual job changes the shuffle method for that job only. To set a new shuffle consumer plugin for all nodes, and therefore as the default for all jobs, edit the same property in the mapred-site.xml configuration file via ADCM.
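The relationship between a cluster-wide value and a per-job value can be sketched with plain java.util.Properties. This is only a model of the precedence, not Hadoop's own Configuration class. The class com.example.ShuffleXConsumer is a hypothetical plugin, and org.apache.hadoop.mapreduce.task.reduce.Shuffle is assumed here to be the stock default.

```java
import java.util.Properties;

class ShufflePluginOverride {
    static final String KEY = "mapreduce.job.reduce.shuffle.consumer.plugin.class";

    /** Cluster-wide settings, standing in for what mapred-site.xml would provide. */
    static Properties clusterDefaults() {
        Properties p = new Properties();
        // Assumed stock default shuffle consumer plugin.
        p.setProperty(KEY, "org.apache.hadoop.mapreduce.task.reduce.Shuffle");
        return p;
    }

    /** Job-level settings fall back to the cluster-wide ones for any key not set on the job. */
    static Properties jobConfig(Properties cluster) {
        return new Properties(cluster);
    }

    public static void main(String[] args) {
        Properties cluster = clusterDefaults();
        Properties jobA = jobConfig(cluster);
        Properties jobB = jobConfig(cluster);
        // Hypothetical plugin class, set for job B only.
        jobB.setProperty(KEY, "com.example.ShuffleXConsumer");
        System.out.println(jobA.getProperty(KEY)); // cluster-wide value
        System.out.println(jobB.getProperty(KEY)); // per-job override
    }
}
```

In this model, job A keeps the cluster-wide plugin while job B uses its own, and the cluster-wide value itself is left unchanged.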

NOTE

If you change both the auxiliary service and the default shuffle consumer plugin, add a new service key to the yarn.nodemanager.aux-services property. The name of the property that defines the service class is derived from this key. For example, for a service named mapreduce_shufflex, the property defining the corresponding class must be yarn.nodemanager.aux-services.mapreduce_shufflex.class.
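The naming rule from the note can be sketched as a small helper that appends a service key to the yarn.nodemanager.aux-services value and derives the matching class property name. The AuxServiceKeys class and its methods are hypothetical and exist only for illustration. The sketch also assumes that service names may contain only letters, digits, and underscores and must not start with a digit, so a dotted name would be rejected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class AuxServiceKeys {
    // Assumed naming restriction: letters, digits, and underscores; no leading digit.
    private static final Pattern VALID_NAME = Pattern.compile("^[A-Za-z_][A-Za-z0-9_]*$");

    /** Returns the name of the property that holds the class of the given service. */
    static String classKey(String serviceName) {
        if (!VALID_NAME.matcher(serviceName).matches()) {
            throw new IllegalArgumentException("Invalid auxiliary service name: " + serviceName);
        }
        return "yarn.nodemanager.aux-services." + serviceName + ".class";
    }

    /** Appends a service name to a comma-separated yarn.nodemanager.aux-services value. */
    static String addService(String currentValue, String serviceName) {
        List<String> names = new ArrayList<>();
        for (String name : currentValue.split(",")) {
            if (!name.trim().isEmpty()) {
                names.add(name.trim());
            }
        }
        if (!names.contains(serviceName)) {
            names.add(serviceName);
        }
        return String.join(",", names);
    }

    public static void main(String[] args) {
        // New value for yarn.nodemanager.aux-services
        System.out.println(addService("mapreduce_shuffle", "mapreduce_shufflex"));
        // Name of the property that must hold the service class
        System.out.println(classKey("mapreduce_shufflex"));
    }
}
```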

To change the default sorting method, set a new value for the output collector property, mapreduce.job.map.output.collector.class.

You can specify a comma-separated list of output collector implementations. In this case, the map task will attempt to instantiate each in turn until one of the implementations successfully initializes. This feature can be useful if a given output collector implementation is only compatible with certain types of keys or values.
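The "try each implementation in turn" behavior can be illustrated with a minimal, self-contained sketch. The Collector interface and both collector classes below are hypothetical stand-ins, not the Hadoop API. They only model the idea that an implementation may refuse to initialize for key types it does not support, in which case the next one in the list is tried.

```java
import java.util.ArrayList;
import java.util.List;

class CollectorFallback {
    /** Hypothetical stand-in for an output collector; not the Hadoop interface. */
    interface Collector {
        void init(Class<?> keyClass) throws Exception;
    }

    /** Hypothetical collector that supports String keys only. */
    static class StringOnlyCollector implements Collector {
        public void init(Class<?> keyClass) throws Exception {
            if (keyClass != String.class) {
                throw new UnsupportedOperationException("String keys only");
            }
        }
    }

    /** Hypothetical general-purpose collector that accepts any key type. */
    static class GeneralCollector implements Collector {
        public void init(Class<?> keyClass) { }
    }

    /** Tries each class name in order and returns the first collector that initializes. */
    static Collector createCollector(String commaSeparated, Class<?> keyClass) {
        List<String> failures = new ArrayList<>();
        for (String name : commaSeparated.split(",")) {
            try {
                Collector c = (Collector) Class.forName(name.trim())
                        .getDeclaredConstructor().newInstance();
                c.init(keyClass);
                return c;
            } catch (Exception e) {
                // Remember the failure and fall through to the next candidate.
                failures.add(name.trim() + ": " + e);
            }
        }
        throw new IllegalStateException("No collector could be initialized: " + failures);
    }

    public static void main(String[] args) {
        String setting = StringOnlyCollector.class.getName() + ","
                + GeneralCollector.class.getName();
        // String keys are accepted by the first (specialized) collector.
        System.out.println(createCollector(setting, String.class).getClass().getSimpleName());
        // Long keys make the first collector fail, so the second one is used.
        System.out.println(createCollector(setting, Long.class).getClass().getSimpleName());
    }
}
```

Listing the specialized implementation first and a general one last gives every job a working collector while still using the specialized one wherever it applies.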

Setting this property in the configuration of an individual job changes the sorting method for that job only. To set a new output collector for all nodes, and therefore as the default for all jobs, edit the same property in the mapred-site.xml configuration file via ADCM.

To edit mapred-site.xml on all hosts via ADCM:

  1. On the Clusters page, select the desired cluster.

  2. Navigate to Services and click YARN.

  3. Toggle the Show advanced option and find Custom mapred-site.xml.

  4. Open the parameter drop-down list and select Add property.

  5. Enter the parameter name and the desired value.

  6. Click Apply and confirm changes to YARN configuration by clicking Save.

  7. In the Actions drop-down menu, select Restart, make sure the Apply configs from ADCM option is set to true, and click Run.
