
Encrypted shuffle in MapReduce

Overview

In Hadoop, data encryption is possible during the shuffle phase in MapReduce and YARN. This feature employs HTTPS with optional client authentication, also referred to as bi-directional HTTPS or HTTPS with client certificates.

CAUTION

Encrypted shuffle significantly affects cluster performance. Consider reserving additional resources when using encrypted shuffle.

The encrypted shuffle function includes several optional security settings for your cluster:

  • Configuration settings that switch the shuffle between HTTP and HTTPS.

  • Configuration settings that define the keystore and truststore properties (location, type, passwords).

  • A method to reload truststores across the cluster when a node is added or removed.

To enable encrypted shuffle for MapReduce and YARN, you need to update their configuration files and provide SSL certificate information in keystore and truststore settings.

The MapReduce and YARN configuration files (core-site.xml and mapred-site.xml) can be edited via ADCM. The keystore and truststore settings have to be updated manually.

Client certificates

 

The client certificate keystore file, which contains the private key, must be accessible to all users submitting jobs to the cluster. This means a rogue job could potentially access these keystore files and use the client certificates to establish a secure connection with a shuffle server.

A better safeguard for the data is the JobToken mechanism provided by the Hadoop environment. Each job uses its unique JobToken to retrieve only the shuffle data it owns. Without a valid JobToken, a rogue job cannot access shuffle data from the shuffle server.

If your cluster requires client certificates, make sure that browsers connecting to the web UIs present appropriately signed certificates. If your certificates are signed by a certificate authority (CA), ensure that the complete chain of CA certificates is included in the server’s keystore.

MapReduce configuration

To configure the encrypted shuffle for MapReduce via ADCM:

  1. On the Clusters page, select the desired cluster.

  2. Navigate to Services and click HDFS.

  3. Toggle the Show advanced option and find core-site.xml.

  4. Open the parameter drop-down list, select and edit the necessary properties from the table below.

  5. Confirm changes to the cluster configuration by clicking Save.

Encrypted shuffle properties

| Parameter | Description | Default value |
|---|---|---|
| hadoop.ssl.require.client.cert | The boolean value representing whether client certificates are required | false |
| hadoop.ssl.hostname.verifier | The type of hostname verification to use. Accepts the following values: DEFAULT, STRICT, STRICT_IE6, DEFAULT_AND_LOCALHOST, and ALLOW_ALL | DEFAULT |
| hadoop.ssl.keystores.factory.class | The KeyStoresFactory implementation to use. Currently, only the default implementation is available. It uses the properties located in ssl-server.xml and ssl-client.xml | org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory |
| hadoop.ssl.server.conf | The resource file with the SSL server keystore information | ssl-server.xml |
| hadoop.ssl.client.conf | The resource file with the SSL client keystore information | ssl-client.xml |
| hadoop.ssl.enabled.protocols | The supported SSL protocols. This parameter is used only by DataNode HTTP Server | TLSv1.2 |

The hadoop.ssl.hostname.verifier parameter supports the following verification types:

  • DEFAULT: the hostname must match the first common name (CN) or any of the subject alternative names (SAN) in the certificate. Names can include wildcards: for example, the name *.example.com matches all subdomains, including beta.test.example.com.

  • DEFAULT_AND_LOCALHOST: same as DEFAULT, but local hostnames such as localhost, localhost.localdomain, or 127.0.0.1 are always accepted, regardless of the certificate contents.

  • STRICT: same as DEFAULT, but a wildcard matches only one subdomain level. For instance, *.example.com matches test.example.com but not beta.test.example.com.

  • STRICT_IE6: same as STRICT, but it also accepts hostnames that match any of the common names (CN) within the server’s X.509 certificate, not just the first one.

  • ALLOW_ALL: disables hostname verification.
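
For orientation, the properties described above end up as ordinary Hadoop entries in core-site.xml. The fragment below is only a sketch that restates the default values from the table. In an ADCM-managed cluster, set these values through the UI as described in the steps above rather than editing the file by hand.

```xml
<?xml version="1.0"?>
<configuration>
        <!-- Client certificates are not required by default -->
        <property>
                <name>hadoop.ssl.require.client.cert</name>
                <value>false</value>
        </property>
        <!-- One of DEFAULT, STRICT, STRICT_IE6, DEFAULT_AND_LOCALHOST, ALLOW_ALL -->
        <property>
                <name>hadoop.ssl.hostname.verifier</name>
                <value>DEFAULT</value>
        </property>
        <property>
                <name>hadoop.ssl.keystores.factory.class</name>
                <value>org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory</value>
        </property>
        <!-- Resource files that hold the keystore and truststore settings -->
        <property>
                <name>hadoop.ssl.server.conf</name>
                <value>ssl-server.xml</value>
        </property>
        <property>
                <name>hadoop.ssl.client.conf</name>
                <value>ssl-client.xml</value>
        </property>
        <property>
                <name>hadoop.ssl.enabled.protocols</name>
                <value>TLSv1.2</value>
        </property>
</configuration>
```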

YARN configuration

You can enable the encrypted shuffle for YARN as well. It works only if the MapReduce parameters have already been configured.

To configure the encrypted shuffle for YARN via ADCM:

  1. On the Clusters page, select the desired cluster.

  2. Navigate to Services and click YARN.

  3. Toggle the Show advanced option.

  4. Find the mapreduce.shuffle.ssl.enabled parameter and set its value to true.

  5. Confirm changes to the YARN configuration by clicking Save.
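
In file terms, this step comes down to a single property. The fragment below is a minimal sketch, assuming that ADCM writes the parameter to mapred-site.xml as stock Hadoop expects. It is shown for reference, not as a file to edit manually.

```xml
<?xml version="1.0"?>
<configuration>
        <!-- Switches the shuffle from HTTP to HTTPS -->
        <property>
                <name>mapreduce.shuffle.ssl.enabled</name>
                <value>true</value>
        </property>
</configuration>
```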

Keystore and truststore settings

Enabling the encrypted shuffle requires changing the keystore and truststore settings used by the MapReduce shuffle service.

These settings are located in the ssl-server.xml and ssl-client.xml files, in the etc/hadoop/conf/ directory on your cluster hosts.

After you edit the properties, make sure that the following is true:

  • the mapred user is the owner of both ssl-server.xml and ssl-client.xml files;

  • the mapred user has exclusive read access to the SSL server configuration file;

  • the mapred user has default permissions for the SSL client configuration file.

SSL client and server properties

| Client parameter | Server parameter | Description | Default value |
|---|---|---|---|
| ssl.client.keystore.type | ssl.server.keystore.type | Keystore file type | jks |
| ssl.client.keystore.location | ssl.server.keystore.location | Keystore file location. The mapred user must be the owner of this file and have exclusive read access to it | - |
| ssl.client.keystore.password | ssl.server.keystore.password | Keystore file password | - |
| ssl.client.truststore.type | ssl.server.truststore.type | Truststore file type | jks |
| ssl.client.truststore.location | ssl.server.truststore.location | Truststore file location. The mapred user must be the owner of this file and have exclusive read access to it | - |
| ssl.client.truststore.password | ssl.server.truststore.password | Truststore file password | - |
| ssl.client.truststore.reload.interval | ssl.server.truststore.reload.interval | How often the truststore reloads its configuration, in milliseconds | 10000 |

If you copy a new truststore file over the old one, the system will re-read it, replacing the old certificates with the new ones.

This mechanism is useful when nodes or trusted clients are added to or removed from the cluster. In such cases, the client or NodeManager certificate is added to (or removed from) all the truststore files in the system, and the new configuration takes effect without restarting the NodeManagers.

Default ssl-client.xml
<?xml version="1.0"?>
<configuration>
        <property>
                <name>ssl.client.keystore.type</name>
                <value>jks</value>
        </property>
        <property>
                <name>ssl.client.truststore.reload.interval</name>
                <value>10000</value>
        </property>
        <property>
                <name>ssl.client.truststore.type</name>
                <value>jks</value>
        </property>
</configuration>
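
The default client file above leaves the store locations and passwords unset. As a sketch of a filled-in client configuration, the fragment below adds the truststore entries. The path /etc/hadoop/conf/ssl/client-truststore.jks and the password CHANGE_ME are hypothetical placeholders, not values shipped with the product. The keystore entries (ssl.client.keystore.location and ssl.client.keystore.password) follow the same pattern.

```xml
<?xml version="1.0"?>
<configuration>
        <property>
                <name>ssl.client.truststore.type</name>
                <value>jks</value>
        </property>
        <!-- Hypothetical path: replace with the location of your truststore -->
        <property>
                <name>ssl.client.truststore.location</name>
                <value>/etc/hadoop/conf/ssl/client-truststore.jks</value>
        </property>
        <!-- Placeholder password: replace with your own -->
        <property>
                <name>ssl.client.truststore.password</name>
                <value>CHANGE_ME</value>
        </property>
        <!-- Re-read the truststore every 10 seconds (value in milliseconds) -->
        <property>
                <name>ssl.client.truststore.reload.interval</name>
                <value>10000</value>
        </property>
</configuration>
```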
Default ssl-server.xml
<?xml version="1.0"?>
<configuration>
        <property>
                <name>ssl.server.keystore.type</name>
                <value>jks</value>
        </property>
        <property>
                <name>ssl.server.truststore.reload.interval</name>
                <value>10000</value>
        </property>
        <property>
                <name>ssl.server.truststore.type</name>
                <value>jks</value>
        </property>
</configuration>
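
A filled-in server configuration might look like the sketch below, which uses only the server parameters from the table above. The file paths under /etc/hadoop/conf/ssl/ and the CHANGE_ME passwords are hypothetical placeholders. Substitute the locations and passwords of your own keystore and truststore.

```xml
<?xml version="1.0"?>
<configuration>
        <property>
                <name>ssl.server.keystore.type</name>
                <value>jks</value>
        </property>
        <!-- Hypothetical path: the file must be owned by mapred and readable only by it -->
        <property>
                <name>ssl.server.keystore.location</name>
                <value>/etc/hadoop/conf/ssl/server-keystore.jks</value>
        </property>
        <!-- Placeholder password: replace with your own -->
        <property>
                <name>ssl.server.keystore.password</name>
                <value>CHANGE_ME</value>
        </property>
        <property>
                <name>ssl.server.truststore.type</name>
                <value>jks</value>
        </property>
        <!-- Hypothetical path: replace with the location of your truststore -->
        <property>
                <name>ssl.server.truststore.location</name>
                <value>/etc/hadoop/conf/ssl/server-truststore.jks</value>
        </property>
        <!-- Placeholder password: replace with your own -->
        <property>
                <name>ssl.server.truststore.password</name>
                <value>CHANGE_ME</value>
        </property>
        <!-- Re-read the truststore every 10 seconds (value in milliseconds) -->
        <property>
                <name>ssl.server.truststore.reload.interval</name>
                <value>10000</value>
        </property>
</configuration>
```

Because this file contains passwords in plain text, the ownership and permission requirements listed earlier in this section apply to it directly.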

Activation

After you have configured all necessary properties, activate the encrypted shuffle by restarting all NodeManagers via ADCM:

  1. On the Clusters page, select the desired cluster.

  2. Navigate to Services.

  3. Expand the Actions drop-down menu for YARN and select Restart.

  4. Make sure the Apply configs from ADCM option is set to true, and click Run.
