HDFS integration with S3
Overview
Simple Storage Service (S3) is an object storage service that can be accessed via the S3-type protocol. A single storage container within S3 object storage is called a bucket. Buckets can be public, meaning that anyone can access the storage, or private, which requires authentication.
Hadoop supports the S3A protocol for connecting to S3 storage. This makes it possible to use the HDFS command-line tools to interact with remote object storage.
This article provides an example of connecting to an external S3 storage to copy data to local HDFS using the distcp command.
For more information on HDFS configuration for working with an S3 storage, see the Hadoop-AWS module: Integration with Amazon Web Services article. S3 storage providers may support different tools for connecting to the storage. Consult your S3 provider for information on other ways of integrating with the storage.
Configure S3 integration
Preparation steps
On the storage side, make sure that the user for HDFS is created and has the right type of access. Authentication requires credentials: the access key ID and the secret access key.
If the bucket is public, you can access it anonymously.
Dependencies
ADH has a built-in object storage connector, S3A. It requires two additional libraries in the classpath: the hadoop-aws and aws-java-sdk-bundle JARs.
Add the libraries to the /usr/lib/hadoop/lib/ directory on each host.
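A minimal sketch of placing the libraries, assuming the JAR files have already been downloaded to the current directory (the file names are matched by a wildcard because the exact versions depend on your distribution):
$ cp hadoop-aws-*.jar aws-java-sdk-bundle-*.jar /usr/lib/hadoop/lib/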
Authentication
To connect to a private S3 bucket, a user must provide the correct credentials. When connecting to the storage, HDFS tries the following types of authentication in this order:
- URI arguments (triggers a warning);
- HDFS configuration;
- AWS environment variables (see the example after this list);
- Amazon EC2 Instance Metadata Service.
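As an illustration of the environment-variable method, the standard AWS variables can be exported before running an HDFS command. This is a sketch with placeholder values; <bucket> stands for the bucket name:
$ export AWS_ACCESS_KEY_ID=<access_key_id>
$ export AWS_SECRET_ACCESS_KEY=<secret_access_key>
$ hadoop fs -ls s3a://<bucket>/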
For more information on how to configure each type of authentication, see the Hadoop-AWS module: Integration with Amazon Web Services article.
CAUTION
Loss of credentials can compromise the account. Protect your credentials from leaking into logs and bug reports. For this reason, avoid embedding credentials in S3A URLs.
To authenticate using HDFS configuration, update the core-site.xml according to the table below.
| Parameter | Description | Default value |
|---|---|---|
| fs.s3a.access.key | Access key ID. Do not specify for anonymous connection | — |
| fs.s3a.secret.key | Secret access key. Do not specify for anonymous connection | — |
| fs.s3a.endpoint | S3 endpoint to connect to. Consult your S3 provider for information on what endpoint to use | s3.amazonaws.com |
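As a sketch, the same parameters can be added to core-site.xml directly; the key values and the endpoint below are placeholders that must be replaced with data from your S3 provider:
<property>
    <name>fs.s3a.access.key</name>
    <value><access_key_id></value>
</property>
<property>
    <name>fs.s3a.secret.key</name>
    <value><secret_access_key></value>
</property>
<property>
    <name>fs.s3a.endpoint</name>
    <value><s3_endpoint></value>
</property>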
Additional properties are described in the Apache Hadoop documentation — Hadoop-AWS module: Integration with Amazon Web Services.
To configure these parameters using ADCM:
- Navigate to the Clusters section and select the desired cluster.
- Go to the Services tab and click HDFS.
- Toggle the Show advanced option, expand the Custom core-site.xml properties list, and select Add property.
- Add the required parameter with its value and click Apply.
- Confirm the changes to the HDFS configuration by clicking Save.
- In the Actions drop-down menu, select Restart, make sure the Apply configs from ADCM option is set to true, and click Run.
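To make sure the new values have been applied after the restart, you can read them back from the client configuration. This is a sketch; fs.s3a.endpoint is used here only as an example key:
$ hdfs getconf -confKey fs.s3a.endpoint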
Anonymous connection
You can connect to a public S3 bucket anonymously. To do this, specify org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider as the value of the fs.s3a.aws.credentials.provider parameter.
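A sketch of setting this parameter persistently in core-site.xml (it can also be passed on the command line with -D, as in the example below):
<property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
</property>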
This option can be used to verify that an object storage does not permit unauthenticated access. To do this, run:
$ hadoop fs -D fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
-ls s3a://<bucket>/
Where <bucket> is the name of the bucket with which to establish a connection.
If unauthenticated access is prohibited, the connection attempt should fail like this:
ls: s3a://develop-test-bucket/: getFileStatus on s3a://develop-test-bucket/: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: dc82c760b5ab707e; S3 Extended Request ID: null; Proxy: null), S3 Extended Request ID: null:AccessDenied
Connect to the storage
To test the connection to the storage, run the -ls command:
$ hadoop fs -ls s3a://<bucket>/
Where <bucket> is the name of the bucket with which to establish a connection.
Output example:
Found 3 items
drwxrwxrwx   - admin admin          0 2023-11-14 17:28 s3a://develop-test-bucket/adbm_n1
drwxrwxrwx   - admin admin          0 2023-11-14 17:28 s3a://develop-test-bucket/adbm_n2
drwxrwxrwx   - admin admin          0 2023-11-14 17:28 s3a://develop-test-bucket/repo
Copy data
You can copy data between HDFS and S3 using the distcp command. To copy data from S3, run:
$ hadoop distcp \
s3a://<bucket>/<path>/ \
hdfs://<host>/<path>/
Where:
- <bucket> — the name of the bucket with which to establish a connection;
- <host> — the cluster nameservice or the NameNode host FQDN/IP address;
- <path> — the source and destination paths.
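Copying in the opposite direction works the same way. A sketch of copying data from local HDFS to the S3 bucket, using the same placeholders:
$ hadoop distcp \
hdfs://<host>/<path>/ \
s3a://<bucket>/<path>/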