Connect to HDFS
This page describes how to connect to HDFS using the CLI, the WebHDFS API, and HDFS libraries. You can also interact with HDFS via the web interface; for more information, see HDFS UI overview.
Command line
Using the CLI, you can access HDFS from any host in the cluster. The following steps describe how to connect to HDFS using the command line. It is assumed that passwordless SSH access to the cluster has already been configured.
- Connect to an ADH cluster host via SSH:

$ ssh <USER>@<HOST>

Where:
  - <USER> — the name of a host user;
  - <HOST> — the host's IP address.

- Once the connection to the host is established, run the desired HDFS command. For example:
$ hdfs dfs -ls /
Possible output of the ls command:

Found 4 items
drwxrwxrwt   - yarn hadoop          0 2023-07-24 13:47 /logs
drwxr-xr-x   - hdfs hadoop          0 2023-08-03 17:49 /system
drwxrwxrwx   - hdfs hadoop          0 2023-08-30 19:51 /tmp
drwxr-xr-x   - hdfs hadoop          0 2023-07-24 13:47 /user
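A few other common file operations follow the same pattern. For example (the /user/admin/demo path and the report.csv file name below are arbitrary examples, not part of the cluster setup described above):

$ hdfs dfs -mkdir -p /user/admin/demo           # create a directory in HDFS
$ hdfs dfs -put report.csv /user/admin/demo     # upload a local file
$ hdfs dfs -cat /user/admin/demo/report.csv     # print the file contents
$ hdfs dfs -rm -r /user/admin/demo              # remove the directory and its contents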
For more information on the differences between the hadoop fs, hadoop dfs, and hdfs dfs commands, see the Hadoop command-line article.
Some commands require admin-level access (the hdfs user). If you try to log in as the hdfs user directly, a password will be required, and the hdfs user has no default password. Instead, log in as root on the host: this makes it possible to switch to the hdfs user without a password.
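For example, assuming you are already logged in as root on a cluster host, the following switches to the hdfs user and runs an admin-level command:

$ su - hdfs
$ hdfs dfsadmin -report    # admin-level command that prints the HDFS usage report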
HTTP REST API
You can access HDFS using the WebHDFS REST API.
To make a query using curl or a browser, structure the request as follows:
http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?<AUTH>op=<COMMAND>
Where:
- <HOST> — the host's IP address.
- <HTTP_PORT> — the HTTP port of the active NameNode.
- <PATH> — the query's target directory.
- <AUTH> — an authentication request in the format user.name=<USER>&, where <USER> is the name of the host's user. This parameter is optional: if it is not set, the request is sent from the default user (if one is configured); otherwise, the server returns an error.
- <COMMAND> — a file system command.
If you have SSL enabled, replace webhdfs with swebhdfs.
An example command for connecting to HDFS to run the ls command can look like this:
$ curl -i "http://127.0.0.1:14000/webhdfs/v1/?user.name=admin&op=LISTSTATUS"
And the possible output of the command:
HTTP/1.1 200 OK
Date: Mon, 04 Sep 2023 11:50:00 GMT
Cache-Control: no-cache
Expires: Mon, 04 Sep 2023 11:50:00 GMT
Date: Mon, 04 Sep 2023 11:50:00 GMT
Pragma: no-cache
Content-Type: application/json
Set-Cookie: hadoop.auth="u=admin&p=admin&t=simple-dt&e=6593757/Hvhgbi/PxjQ="; Path=/; HttpOnly
Transfer-Encoding: chunked

{"FileStatuses":{"FileStatus":[
  {"pathSuffix":"logs","type":"DIRECTORY","length":0,"owner":"yarn","group":"hadoop","permission":"1777","accessTime":0,"modificationTime":1690206465850,"blockSize":0,"replication":0},
  {"pathSuffix":"system","type":"DIRECTORY","length":0,"owner":"hdfs","group":"hadoop","permission":"755","accessTime":0,"modificationTime":1691084968551,"blockSize":0,"replication":0},
  {"pathSuffix":"tmp","type":"DIRECTORY","length":0,"owner":"hdfs","group":"hadoop","permission":"777","accessTime":0,"modificationTime":1693425076362,"blockSize":0,"replication":0},
  {"pathSuffix":"user","type":"DIRECTORY","length":0,"owner":"hdfs","group":"hadoop","permission":"755","accessTime":0,"modificationTime":1690206432153,"blockSize":0,"replication":0}
]}}
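Other file system commands use the same URL structure; only the op value and, for some operations, the HTTP method change. For example, assuming the same host, port, and user as above, you could create and then delete a directory (the /user/admin/test_dir path is an arbitrary example):

$ curl -i -X PUT "http://127.0.0.1:14000/webhdfs/v1/user/admin/test_dir?user.name=admin&op=MKDIRS"
$ curl -i -X DELETE "http://127.0.0.1:14000/webhdfs/v1/user/admin/test_dir?user.name=admin&op=DELETE"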
C libhdfs library
To connect to a remote HDFS host using the libhdfs library, you need to set up an environment and install the dependencies. This guide shows how to create a program that connects to a remote HDFS host. The developer tools used in the process are the Vi text editor, a virtual machine with CentOS 7, and the GCC compiler. A different setup will require adjustments.
Preparation steps
To build and run the following example, install the following:

- ADH, or build it yourself (your local ADH version must match the version of the remote one).
- A text editor of your choice.
- A C compiler of your choice.
- JDK.
- libhdfs.so with the header files, or build it yourself from source (a quick check for these files is shown below).
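A quick way to confirm that the native library and its header are in place is to look for them under the Hadoop installation directory. The paths below match the compilation script later in this section and may differ in your deployment:

$ ls /usr/lib/hadoop/lib/native/libhdfs.so    # native libhdfs library
$ ls /usr/lib/hadoop/include/hdfs.h           # libhdfs header file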
Write a connector
The example code below creates a connection to a remote HDFS, outputs confirmation upon successful connection, and returns an error message if it was unable to connect. If you successfully connected to HDFS but no longer need it, terminate the connection.
#include <hdfs.h> (1)
#include <stdio.h>

int main(int argc, char **argv) {
    struct hdfsBuilder *builder = hdfsNewBuilder(); (2)
    hdfsBuilderSetNameNode(builder, "hdfs://127.0.0.1:8020"); (3)
    hdfsFS fs = hdfsBuilderConnect(builder); (4)
    if (fs != NULL) {
        printf("Successfully connected to HDFS\n");
        hdfsDisconnect(fs); (5)
    } else {
        printf("Failed to connect to HDFS\n");
        return 1;
    }
    return 0;
}
1 | Import the libhdfs library. |
2 | Initialize the builder object for connecting to a remote HDFS using the hdfsBuilder method. It’s preferable to use this method instead of the deprecated hdfsConnect. |
3 | Set the connection parameters in the URI format hdfs://<NN-IP>:<PORT>, where <NN-IP> is the IP address of the active NameNode and <PORT> is the NameNode metadata service port. |
4 | Call the builder to create a connection. |
5 | Terminate the connection with hdfsDisconnect once it is no longer needed. |
Compile and run
- Create a compilation script for GCC and fill in the path variables:

#!/bin/bash

export HADOOP_HOME=/usr/lib/hadoop (1)
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.382.b05-1.el7_9.x86_64 (2)
export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob` (3)
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/server

gcc -std=c99 -lpthread main.c \
    -I$HADOOP_HOME/include -L$HADOOP_HOME/lib/native -lhdfs \
    -I$JAVA_HOME/include -I$JAVA_HOME/include/linux \
    -L$JAVA_HOME/jre/lib/amd64/server -ljvm \
    -o my_hdfs_program (4)

./my_hdfs_program (5)

1 | Set the Hadoop libraries' path. |
2 | Set the JDK libraries' path. |
3 | Generate the correct CLASSPATH for your deployment using the classpath command. |
4 | Compile the program using the correct paths and additional options. |
5 | Run the program. |

- Run the compilation script:
$ ./<script-name>.sh
Where <script-name> is the name of your compilation script.
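If the connection succeeds, the program prints the confirmation message from the example above:

Successfully connected to HDFS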
Java API
To connect to a remote HDFS host using the Java API, you need to set up an environment and install the dependencies. The example code below creates a connection to a remote HDFS, outputs confirmation upon successful connection, and returns an error message if it was unable to connect.
The developer tools used in the process are the Vi text editor, a virtual machine with CentOS 7, and the Maven build tool. A different setup will require adjustments.
- Install Maven.

- Write a connector class:

package my.example.hdfs;

import org.apache.hadoop.conf.Configuration; (1)
import org.apache.hadoop.fs.FileSystem;
import java.net.URI;

public class Main {
    public static void main(String[] args) throws Exception {
        String hdfs_uri = "hdfs://127.0.0.1:8020"; (2)

        Configuration conf = new Configuration(); (3)
        conf.set("fs.defaultFS", hdfs_uri);

        try {
            FileSystem fs = FileSystem.get(URI.create(hdfs_uri), conf); (4)
            System.out.println("Successfully connected to HDFS");
            fs.close(); (5)
        } catch (Exception e) {
            System.out.print("Failed to connect to HDFS: ");
            System.out.println(e.getMessage());
        }
    }
}
1 | Import the necessary libraries. |
2 | Set the connection parameters in the URI format hdfs://<NN-IP>:<PORT>, where <NN-IP> is the IP address of the active NameNode and <PORT> is the NameNode metadata service port. |
3 | Initialize the configuration object and set the default file system URI. |
4 | Call FileSystem.get to create a connection. |
5 | Terminate the connection if it’s no longer needed. |

- Create a pom.xml file for building the project with Maven.
Sample pom.xml file:

<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>my.example.hdfs</groupId>
    <artifactId>my-hdfs-program</artifactId>
    <version>0.1.0-SNAPSHOT</version>
    <packaging>jar</packaging>
    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-core</artifactId>
            <version>1.2.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.1</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.1.0</version>
                <configuration>
                    <archive>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <mainClass>my.example.hdfs.Main</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>assemble-all</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <!-- Build an executable JAR -->
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>3.1.0</version>
                <configuration>
                    <archive>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <mainClass>my.example.hdfs.Main</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
- Build the project:
$ mvn package
- Run the resulting JAR file:

$ java -jar <java-class>.jar

Where <java-class> is the name of your Java connector program.