Pluggable shuffle and sort in MapReduce
The sort and shuffle phases in MapReduce are performed by default Hadoop services, but you can configure a custom shuffle and sort implementation for your cluster.
One of the possible use cases is configuring a different application protocol, other than HTTP, such as RDMA for shuffling data; or replacing the sort logic with custom algorithms that enable hash aggregation and Limit-N query.
For more information on MapReduce, see the MapReduce overview and MapReduce architecture articles.
CAUTION
The pluggable shuffle and sort capabilities are experimental and unstable. This means that provided APIs may change and break compatibility in future versions of Hadoop. |
The shuffling is done by the NodeManager’s auxiliary service and the shuffle consumer plugin.
You can change the default auxiliary service, by changing the NodeManager configuration properties listed in the table below.
Parameter | Description | Default value |
---|---|---|
yarn.nodemanager.aux-services |
The auxiliary service name |
…,mapreduce_shuffle |
yarn.nodemanager.aux-services.mapreduce_shuffle.class |
The auxiliary service class name |
org.apache.hadoop.mapred.ShuffleHandler |
yarn.nodemanager.aux-services.<class_name>.classpath |
The path to the local directory with the auxiliary service JAR files as well as the dependencies. The path can lead to a single JAR file, use the |
— |
yarn.nodemanager.aux-services.<class_name>.remote-classpath |
The parameter to use instead of the |
— |
To change these properties on all hosts via ADCM:
-
On the Clusters page, select the desired cluster.
-
Navigate to Services and click at YARN.
-
Toggle the Show advanced option and find yarn-site.xml.
-
Open the parameter drop-down list, select yarn.nodemanager.aux-services, and add a new property.
-
Select the yarn.nodemanager.aux-services.mapreduce_shuffle.class and enter a new value.
-
Find the Custom yarn-site.xml parameter, select Add property, and enter the new parameter name depending on the location of your auxiliary service JAR files: either
yarn.nodemanager.aux-services.<class_name>.classpath
oryarn.nodemanager.aux-services.<class_name>.remote-classpath
. -
Confirm changes to YARN configuration by clicking Save.
-
In the Actions drop-down menu, select Restart, make sure the Apply configs from ADCM option is set to
true
, and click Run.
To change the default shuffle consumer plugin, set a new value for the mapreduce.job.reduce.shuffle.consumer.plugin.class
property.
Setting a new value for a single node’s environment will change the shuffle method on a per-job basis. To set a new shuffle consumer plugin for all nodes, edit the same property in the mapred-site.xml configuration file via ADCM.
NOTE
If you change both the auxiliary service and the default shuffle consumer plugin, add a new service key to the |
To change the default sorting method, set a new value for the output collector (mapreduce.job.map.output.collector.class
) property.
You can specify a comma-separated list of output collector implementations. In this case, the map task will attempt to instantiate each in turn until one of the implementations successfully initializes. This feature can be useful if a given output collector implementation is only compatible with certain types of keys or values.
Setting a new value for a single node’s environment will change the sorting method on a per-job basis. To set a new output collector for all nodes, edit the same property in the mapred-site.xml configuration file via ADCM.
To edit mapred-site.xml on all hosts via ADCM:
-
On the Clusters page, select the desired cluster.
-
Navigate to Services and click at YARN.
-
Toggle the Show advanced option and find Custom mapred-site.xml.
-
Open the parameter drop-down list and select Add property.
-
Enter the parameter name and the desired value.
-
Click Apply and confirm changes to YARN configuration by clicking Save.
-
In the Actions drop-down menu, select Restart, make sure the Apply configs from ADCM option is set to
true
, and click Run.