Encrypted shuffle in MapReduce
Overview
In Hadoop, data encryption is possible during the shuffle phase in MapReduce and YARN. This feature employs HTTPS with optional client authentication, also referred to as bi-directional HTTPS or HTTPS with client certificates.
CAUTION
Encrypted shuffle significantly affects cluster performance. Consider reserving additional resources when using encrypted shuffle.
The encrypted shuffle function includes several optional security settings for your cluster:
-
Configuration settings that switch the shuffle between HTTP and HTTPS.
-
Configuration settings that define the keystore and truststore properties (location, type, passwords).
-
A method to reload truststores across the cluster when a node is added or removed.
To enable encrypted shuffle for MapReduce and YARN, you need to update their configuration files and provide SSL certificate information in keystore and truststore settings.
The MapReduce and YARN configuration files (core-site.xml and mapred-site.xml) can be edited via ADCM. The keystore and truststore settings have to be updated manually.
MapReduce configuration
To configure the encrypted shuffle for MapReduce via ADCM:
-
On the Clusters page, select the desired cluster.
-
Navigate to Services and click HDFS.
-
Toggle the Show advanced option and find core-site.xml.
-
Open the parameter drop-down list, then select and edit the necessary properties from the table below.
-
Confirm changes to the cluster configuration by clicking Save.
Parameter | Description | Default value |
---|---|---|
hadoop.ssl.require.client.cert | Boolean value that defines whether client certificates are required | false |
hadoop.ssl.hostname.verifier | The type of hostname verification to use. Accepts the following values: DEFAULT, DEFAULT_AND_LOCALHOST, STRICT, STRICT_IE6, ALLOW_ALL | DEFAULT |
hadoop.ssl.keystores.factory.class | The KeyStoresFactory implementation to use. Currently, only the default implementation is available. It uses the properties located in ssl-server.xml and ssl-client.xml | org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory |
hadoop.ssl.server.conf | The resource file with the SSL server keystore information | ssl-server.xml |
hadoop.ssl.client.conf | The resource file with the SSL client keystore information | ssl-client.xml |
hadoop.ssl.enabled.protocols | The supported SSL protocols. This parameter is used only by DataNode HTTP Server | TLSv1.2 |
The hadoop.ssl.hostname.verifier parameter supports the following verification types:
-
DEFAULT: the hostname must match the first common name (CN) or any of the subject alt names (SAN). Names can include wildcards, for example, a hostname *.example.com will match all subdomains, including beta.test.example.com.
-
DEFAULT_AND_LOCALHOST: same as DEFAULT, but also allows all hostnames of the type: localhost, localhost.example, or 127.0.0.1.
-
STRICT: same as DEFAULT, but allows only wildcards of the same level. For instance, a hostname with a wildcard *.example.com will only match subdomains at the same level, such as test.example.com, but not beta.test.example.com.
-
STRICT_IE6: same as STRICT, but it also permits hostnames that match any of the common names (CN) within the server's X.509 certificate, not just the first one.
-
ALLOW_ALL: disables the hostname verifier mechanism.
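As a rough illustration, the parameters in the table above map to core-site.xml entries like the ones below. This is only a sketch: turning on hadoop.ssl.require.client.cert and choosing STRICT are example choices made here for illustration, not recommended values, and the remaining properties simply repeat their defaults. When the properties are set through ADCM, equivalent entries are expected to appear in the generated file.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Illustrative choice: require client certificates (default is false) -->
  <property>
    <name>hadoop.ssl.require.client.cert</name>
    <value>true</value>
  </property>
  <!-- Illustrative choice: a wildcard matches one subdomain level only (default is DEFAULT) -->
  <property>
    <name>hadoop.ssl.hostname.verifier</name>
    <value>STRICT</value>
  </property>
  <!-- The values below repeat the defaults from the table -->
  <property>
    <name>hadoop.ssl.keystores.factory.class</name>
    <value>org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory</value>
  </property>
  <property>
    <name>hadoop.ssl.server.conf</name>
    <value>ssl-server.xml</value>
  </property>
  <property>
    <name>hadoop.ssl.client.conf</name>
    <value>ssl-client.xml</value>
  </property>
  <property>
    <name>hadoop.ssl.enabled.protocols</name>
    <value>TLSv1.2</value>
  </property>
</configuration>
```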
YARN configuration
You can enable the encrypted shuffle for YARN as well. It works only if the MapReduce parameters have already been configured.
To configure the encrypted shuffle for YARN via ADCM:
-
On the Clusters page, select the desired cluster.
-
Navigate to Services and click YARN.
-
Toggle the Show advanced option.
-
Find the mapreduce.shuffle.ssl.enabled parameter and set its value to true.
-
Confirm changes to the YARN configuration by clicking Save.
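For reference, the setting from the steps above corresponds to a property entry of the following form. The sketch assumes the property ends up in mapred-site.xml, the file named in the overview; the exact file that ADCM writes it to is not confirmed here.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Switches the shuffle from HTTP to HTTPS -->
  <property>
    <name>mapreduce.shuffle.ssl.enabled</name>
    <value>true</value>
  </property>
</configuration>
```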
Keystore and truststore settings
Enabling the encrypted shuffle requires changing the keystore and truststore settings used by the MapReduce shuffle service.
These settings are located in the ssl-server.xml and ssl-client.xml files, in the etc/hadoop/conf/ directory on your cluster hosts.
After you edit the properties, make sure that the following is true:
-
the mapred user is the owner of both ssl-server.xml and ssl-client.xml files;
-
the mapred user has exclusive read access to the SSL server configuration file;
-
the mapred user has default permissions for the SSL client configuration file.
Client parameter | Server parameter | Description | Default value |
---|---|---|---|
ssl.client.keystore.type | ssl.server.keystore.type | Keystore file type | jks |
ssl.client.keystore.location | ssl.server.keystore.location | Keystore file location. The | - |
ssl.client.keystore.password | ssl.server.keystore.password | Keystore file password | - |
ssl.client.truststore.type | ssl.server.truststore.type | Truststore file type | jks |
ssl.client.truststore.location | ssl.server.truststore.location | Truststore file location. The | - |
ssl.client.truststore.password | ssl.server.truststore.password | Truststore file password | - |
ssl.client.truststore.reload.interval | ssl.server.truststore.reload.interval | How often the truststores reload their configuration, in milliseconds | 10000 |
If you copy a new truststore file over the old one, the system will re-read it, replacing the old certificates with the new ones.
This mechanism is useful when nodes or trusted clients are added to or removed from the cluster. In such cases, the client or NodeManager certificate is added to (or removed from) all the truststore files in the system. The new configuration then takes effect without the need to restart the NodeManagers.
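Example ssl-client.xml:
<?xml version="1.0"?>
<configuration>
  <!-- Client keystore file type -->
  <property>
    <name>ssl.client.keystore.type</name>
    <value>jks</value>
  </property>
  <!-- How often the client truststore is reloaded, in milliseconds -->
  <property>
    <name>ssl.client.truststore.reload.interval</name>
    <value>10000</value>
  </property>
  <!-- Client truststore file type -->
  <property>
    <name>ssl.client.truststore.type</name>
    <value>jks</value>
  </property>
</configuration>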
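Example ssl-server.xml:
<?xml version="1.0"?>
<configuration>
  <!-- Server keystore file type -->
  <property>
    <name>ssl.server.keystore.type</name>
    <value>jks</value>
  </property>
  <!-- How often the server truststore is reloaded, in milliseconds -->
  <property>
    <name>ssl.server.truststore.reload.interval</name>
    <value>10000</value>
  </property>
  <!-- Server truststore file type -->
  <property>
    <name>ssl.server.truststore.type</name>
    <value>jks</value>
  </property>
</configuration>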
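The examples above leave out the location and password properties, which have no default values. A fuller ssl-server.xml might look like the sketch below. The file paths and passwords are hypothetical placeholders invented for illustration; substitute the values used on your hosts. The client file would follow the same pattern with the ssl.client.* names from the table.

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>ssl.server.keystore.type</name>
    <value>jks</value>
  </property>
  <!-- Hypothetical path: replace with the actual keystore location -->
  <property>
    <name>ssl.server.keystore.location</name>
    <value>/path/to/server-keystore.jks</value>
  </property>
  <!-- Placeholder: replace with the actual keystore password -->
  <property>
    <name>ssl.server.keystore.password</name>
    <value>KEYSTORE_PASSWORD_PLACEHOLDER</value>
  </property>
  <property>
    <name>ssl.server.truststore.type</name>
    <value>jks</value>
  </property>
  <!-- Hypothetical path: replace with the actual truststore location -->
  <property>
    <name>ssl.server.truststore.location</name>
    <value>/path/to/server-truststore.jks</value>
  </property>
  <!-- Placeholder: replace with the actual truststore password -->
  <property>
    <name>ssl.server.truststore.password</name>
    <value>TRUSTSTORE_PASSWORD_PLACEHOLDER</value>
  </property>
  <!-- Re-read the truststore every 10 seconds (the default) -->
  <property>
    <name>ssl.server.truststore.reload.interval</name>
    <value>10000</value>
  </property>
</configuration>
```

Because the file holds passwords in plain text, the ownership and read-access requirements listed earlier for the SSL server configuration file matter in practice.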
Activation
After you have configured all necessary properties, activate the encrypted shuffle by restarting all NodeManagers via ADCM:
-
On the Clusters page, select the desired cluster.
-
Navigate to Services.
-
Expand the Actions drop-down menu for YARN and select Restart.
-
Make sure the Apply configs from ADCM option is set to true, and click Run.