Connect to HDFS
This page describes how to connect to HDFS using the CLI, the WebHDFS API, and HDFS libraries. You can also interact with HDFS via the web interface; for more information, see HDFS UI overview.
Command line
Using the CLI, you can access HDFS from any host in the cluster. The following steps describe how to connect to HDFS using the command line. It is assumed that passwordless SSH access to the cluster has already been configured.
- Connect to an ADH cluster host via SSH:

$ ssh <USER>@<HOST>

Where:
  - <USER> — the name of a host user;
  - <HOST> — the host's IP address.

- Once the connection to the host is established, run the desired HDFS command. For example:
$ hdfs dfs -ls /
Possible output of the ls command:

Found 4 items
drwxrwxrwt   - yarn hadoop          0 2023-07-24 13:47 /logs
drwxr-xr-x   - hdfs hadoop          0 2023-08-03 17:49 /system
drwxrwxrwx   - hdfs hadoop          0 2023-08-30 19:51 /tmp
drwxr-xr-x   - hdfs hadoop          0 2023-07-24 13:47 /user
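A few other common file operations follow the same pattern. For example (the /user/admin/demo path and the report.csv file name below are arbitrary examples, not part of the cluster setup described above):

$ hdfs dfs -mkdir -p /user/admin/demo           # create a directory in HDFS
$ hdfs dfs -put report.csv /user/admin/demo     # upload a local file
$ hdfs dfs -cat /user/admin/demo/report.csv     # print the file contents
$ hdfs dfs -rm -r /user/admin/demo              # remove the directory and its contents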
For more information on the differences between the hadoop fs, hadoop dfs, and hdfs dfs commands, see the Hadoop command-line article.
Some commands require admin-level access (the hdfs user). If you try to log in as the hdfs user directly, a password will be required, and the hdfs user has no default password. Instead, log in as root on the host: this makes it possible to switch to the hdfs user without a password.
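For example, assuming you are already logged in as root on a cluster host, the following switches to the hdfs user and runs an admin-level command:

$ su - hdfs
$ hdfs dfsadmin -report    # admin-level command that prints the HDFS usage report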
HTTP REST API
You can access HDFS using the WebHDFS REST API.
To make a query using curl or a browser, structure the request as follows:
http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?<AUTH>op=<COMMAND>
Where:
- <HOST> — the host's IP address.
- <HTTP_PORT> — the HTTP port of the active NameNode.
- <PATH> — the query's target directory.
- <AUTH> — an authentication request in the format user.name=<USER>&, where <USER> is the name of the host's user. This parameter is optional: if it is not set, the request is sent from the default user (if one is configured); otherwise, the server returns an error.
- <COMMAND> — a file system command.
If you have SSL enabled, replace webhdfs with swebhdfs.
An example command for connecting to HDFS to run the ls command can look like this:
$ curl -i "http://127.0.0.1:14000/webhdfs/v1/?user.name=admin&op=LISTSTATUS"
And the possible output of the command:
HTTP/1.1 200 OK
Date: Mon, 04 Sep 2023 11:50:00 GMT
Cache-Control: no-cache
Expires: Mon, 04 Sep 2023 11:50:00 GMT
Date: Mon, 04 Sep 2023 11:50:00 GMT
Pragma: no-cache
Content-Type: application/json
Set-Cookie: hadoop.auth="u=admin&p=admin&t=simple-dt&e=6593757/Hvhgbi/PxjQ="; Path=/; HttpOnly
Transfer-Encoding: chunked

{"FileStatuses":{"FileStatus":[
  {"pathSuffix":"logs","type":"DIRECTORY","length":0,"owner":"yarn","group":"hadoop","permission":"1777","accessTime":0,"modificationTime":1690206465850,"blockSize":0,"replication":0},
  {"pathSuffix":"system","type":"DIRECTORY","length":0,"owner":"hdfs","group":"hadoop","permission":"755","accessTime":0,"modificationTime":1691084968551,"blockSize":0,"replication":0},
  {"pathSuffix":"tmp","type":"DIRECTORY","length":0,"owner":"hdfs","group":"hadoop","permission":"777","accessTime":0,"modificationTime":1693425076362,"blockSize":0,"replication":0},
  {"pathSuffix":"user","type":"DIRECTORY","length":0,"owner":"hdfs","group":"hadoop","permission":"755","accessTime":0,"modificationTime":1690206432153,"blockSize":0,"replication":0}
]}}
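Other file system commands use the same URL structure; only the op value and, for some operations, the HTTP method change. For example, assuming the same host, port, and user as above, you could create and then delete a directory (the /user/admin/test_dir path is an arbitrary example):

$ curl -i -X PUT "http://127.0.0.1:14000/webhdfs/v1/user/admin/test_dir?user.name=admin&op=MKDIRS"
$ curl -i -X DELETE "http://127.0.0.1:14000/webhdfs/v1/user/admin/test_dir?user.name=admin&op=DELETE"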
C libhdfs library
To connect to a remote HDFS host using the libhdfs library, you need to set up an environment and install the dependencies. This guide shows how to create a program that connects to a remote HDFS host. The developer tools used in the process are the Vi text editor, a virtual machine with CentOS 7, and the GCC compiler. A different setup will require adjustments.
Preparation steps
To build and run the following example, install the following:

- ADH, or build it yourself (your local ADH version must match the version of the remote one).
- A text editor of your choice.
- A C compiler of your choice.
- JDK.
- libhdfs.so with the header files, or build it yourself from source (a quick check for these files is shown below).
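A quick way to confirm that the native library and its header are in place is to look for them under the Hadoop installation directory. The paths below match the compilation script later in this section and may differ in your deployment:

$ ls /usr/lib/hadoop/lib/native/libhdfs.so    # native libhdfs library
$ ls /usr/lib/hadoop/include/hdfs.h           # libhdfs header file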
Write a connector
The example code below creates a connection to a remote HDFS, outputs confirmation upon successful connection, and returns an error message if it was unable to connect. If you successfully connected to HDFS but no longer need it, terminate the connection.
#include <hdfs.h> (1)
#include <stdio.h>

int main(int argc, char **argv) {
    struct hdfsBuilder *builder = hdfsNewBuilder(); (2)
    hdfsBuilderSetNameNode(builder, "hdfs://127.0.0.1:8020"); (3)
    hdfsFS fs = hdfsBuilderConnect(builder); (4)
    if (fs != NULL) {
        printf("Successfully connected to HDFS\n");
        hdfsDisconnect(fs); (5)
    } else {
        printf("Failed to connect to HDFS\n");
        return 1;
    }
    return 0;
}
1 | Import the libhdfs library. |
2 | Initialize the builder object for connecting to a remote HDFS using the hdfsBuilder method. It’s preferable to use this method instead of the deprecated hdfsConnect. |
3 | Set the connection parameters in the URI format hdfs://<NN-IP>:<PORT>, where <NN-IP> is the IP address of the active NameNode and <PORT> is the NameNode metadata service port. |
4 | Call the builder to create a connection. |
5 | Terminate the connection with hdfsDisconnect once it is no longer needed. |
Compile and run
- Create a compilation script for GCC and fill in the path variables:

#!/bin/bash

export HADOOP_HOME=/usr/lib/hadoop (1)
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.382.b05-1.el7_9.x86_64 (2)
export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob` (3)
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/server

gcc -std=c99 -lpthread main.c \
    -I$HADOOP_HOME/include -L$HADOOP_HOME/lib/native -lhdfs \
    -I$JAVA_HOME/include -I$JAVA_HOME/include/linux \
    -L$JAVA_HOME/jre/lib/amd64/server -ljvm \
    -o my_hdfs_program (4)

./my_hdfs_program (5)

1 | Set the Hadoop libraries' path. |
2 | Set the JDK libraries' path. |
3 | Generate the correct CLASSPATH for your deployment using the classpath command. |
4 | Compile the program using the correct paths and additional options. |
5 | Run the program. |

- Run the compilation script:
$ ./<script-name>.sh
Where <script-name> is the name of your compilation script.
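If the connection succeeds, the program prints the confirmation message from the example above:

Successfully connected to HDFS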
Java API
To connect to a remote HDFS host using the Java API, you need to set up an environment and install the dependencies. The example code below creates a connection to a remote HDFS, outputs confirmation upon successful connection, and returns an error message if it was unable to connect.
The developer tools used in the process are the Vi text editor, a virtual machine with CentOS 7, and the Maven build tool. A different setup will require adjustments.
- Install Maven.

- Write a connector class:

package my.example.hdfs;

import org.apache.hadoop.conf.Configuration; (1)
import org.apache.hadoop.fs.FileSystem;
import java.net.URI;

public class Main {
    public static void main(String[] args) throws Exception {
        String hdfs_uri = "hdfs://127.0.0.1:8020"; (2)

        Configuration conf = new Configuration(); (3)
        conf.set("fs.defaultFS", hdfs_uri);

        try {
            FileSystem fs = FileSystem.get(URI.create(hdfs_uri), conf); (4)
            System.out.println("Successfully connected to HDFS");
            fs.close(); (5)
        } catch (Exception e) {
            System.out.print("Failed to connect to HDFS: ");
            System.out.println(e.getMessage());
        }
    }
}
1 | Import the necessary libraries. |
2 | Set the connection parameters in the URI format hdfs://<NN-IP>:<PORT>, where <NN-IP> is the IP address of the active NameNode and <PORT> is the NameNode metadata service port. |
3 | Initialize the configuration object and set the default file system URI. |
4 | Call FileSystem.get to create a connection. |
5 | Terminate the connection if it’s no longer needed. |

- Create a pom.xml file for building the project with Maven.
Sample pom.xml file:

<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>my.example.hdfs</groupId>
    <artifactId>my-hdfs-program</artifactId>
    <version>0.1.0-SNAPSHOT</version>
    <packaging>jar</packaging>
    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>1.2.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.1</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.1.0</version>
                <configuration>
                    <archive>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <mainClass>my.example.hdfs.Main</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>assemble-all</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <!-- Build an executable JAR -->
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>3.1.0</version>
                <configuration>
                    <archive>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <mainClass>my.example.hdfs.Main</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
- Build the project:
$ mvn package
- Run the resulting JAR file:

$ java -jar <java-class>.jar

Where <java-class> is the name of your Java connector program.