Quick start with Sqoop
This guide demonstrates how to start using Sqoop for transferring data between a database and HDFS.
Install Sqoop
To install Sqoop in an existing cluster via ADCM:
- Add the Sqoop service to the cluster as described in the Add services article.
- Map Sqoop components to the cluster hosts by following the steps from the Add components article.
- If necessary, edit the default Sqoop configuration as demonstrated in the Configure services article.
- Install the service.
Connect to a database
Sqoop uses JDBC for database connections. In this example, we will be connecting to an Arenadata PostgreSQL 14 database, which means we need to download a PostgreSQL JDBC driver of the appropriate version.
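A PostgreSQL JDBC connection string, like the ones used in the commands below, follows this general form (replace the placeholders with your host, port, and database name):
jdbc:postgresql://<host>:<port>/<database>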
- Make the Sqoop DB drivers directory available for read and write access:
  $ chmod -R 777 /var/lib/sqoop
- Download the driver to that directory:
  $ curl -L 'https://jdbc.postgresql.org/download/postgresql-42.7.3.jar' -o /var/lib/sqoop/postgresql-42.7.3.jar
- Run a test query to check the connection. For example:
  $ sqoop list-tables \
  --connect jdbc:postgresql://127.0.0.1:5432/test_db \
  --username <username> -P
Example output:
2024-05-29 11:47:29,608 (main) [INFO - org.apache.sqoop.Sqoop.<init>(Sqoop.java:94)] Running Sqoop version: 1.4.7_arenadata2
2024-05-29 11:47:29,743 (main) [INFO - org.apache.sqoop.manager.SqlManager.initOptionDefaults(SqlManager.java:98)] Using default fetchSize of 1000
author
book
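As another quick connectivity check, Sqoop's list-databases tool enumerates the databases visible to the specified user; for example:
$ sqoop list-databases \
--connect jdbc:postgresql://127.0.0.1:5432/test_db \
--username <username> -P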
Import data to HDFS
To copy data from a specific table to HDFS, run the command:
$ sqoop import \
--connect <JDBC-URL> \
--username <username> -P \
--table <table-name> \
--target-dir <HDFS-directory>
Where:
- <JDBC-URL> — a JDBC connection string;
- <username> — the name of the user who has access to the requested table;
- <table-name> — the name of the table to be imported to HDFS;
- <HDFS-directory> — a new or existing directory in HDFS.
For example:
$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/test_db \
--username user1 -P \
--table book \
--target-dir books/
To make sure the data has been copied successfully, run:
$ hdfs dfs -ls hdfs://adhdev/user/sqoop/books
Example output:
Found 5 items
-rw-r--r--   3 hdfs hadoop          0 2024-05-29 07:45 hdfs://adhdev/user/sqoop/books/_SUCCESS
-rw-r--r--   3 hdfs hadoop         50 2024-05-29 07:44 hdfs://adhdev/user/sqoop/books/part-m-00000
-rw-r--r--   3 hdfs hadoop         57 2024-05-29 07:44 hdfs://adhdev/user/sqoop/books/part-m-00001
-rw-r--r--   3 hdfs hadoop         31 2024-05-29 07:45 hdfs://adhdev/user/sqoop/books/part-m-00002
-rw-r--r--   3 hdfs hadoop         35 2024-05-29 07:45 hdfs://adhdev/user/sqoop/books/part-m-00003
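By default, Sqoop writes imported rows as plain text, one part-m-* file per map task with fields separated by commas, so you can also inspect the imported records directly; for example:
$ hdfs dfs -cat hdfs://adhdev/user/sqoop/books/part-m-00000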
Create a job
The Sqoop job tool lets you save import and export commands for later reuse.
To create a job, run:
$ sqoop job --create myjob \
-- import \
--connect jdbc:postgresql://127.0.0.1:5432/test_db \
--username user1 -P \
--table book \
--target-dir books/
The above command creates a job called myjob that imports data from the book table in PostgreSQL to the books directory in HDFS.
Check that the job has been created successfully by running:
$ sqoop job --list
The output:
Available jobs: myjob
To run the saved job, use the command:
$ sqoop job --exec myjob
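A saved job can also be inspected or removed. The --show option prints the parameters stored in a job, and --delete removes it:
$ sqoop job --show myjob
$ sqoop job --delete myjob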
Export data from HDFS
You can export data back from HDFS to a database using the export command. The table for the data must already be created in the target database.
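If the target table does not exist yet, create it up front. A minimal sketch matching the book table from this guide (column types follow the \d book output shown at the end of this section; adjust names and types to your own data):
CREATE TABLE book (
    id          serial PRIMARY KEY,                     -- matches the nextval('book_id_seq') default
    title       varchar(255) NOT NULL,
    author_id   integer NOT NULL REFERENCES author(id), -- assumes the author table already exists
    public_year smallint
);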
To export data to a DB, run the command:
$ sqoop export \
--connect <JDBC-URL> \
--username <username> -P \
--table <table-name> \
--export-dir <HDFS-directory>
Where:
- <JDBC-URL> — a JDBC connection string;
- <username> — the name of the user who has access to the requested table;
- <table-name> — the name of the table to which the data needs to be transferred;
- <HDFS-directory> — the data source directory in HDFS.
For example:
$ sqoop export \
--connect jdbc:postgresql://127.0.0.1:5432/test_db \
--username user1 -P \
--table book \
--export-dir books/
To make sure the data has been copied successfully:
- Log in as the postgres user:
  $ sudo su - postgres
- Start the psql terminal client, connecting to the test_db database:
  $ psql test_db
- Request information about the table:
  \d book
Example output:
Table "public.book" Column | Type | Collation | Nullable | Default -------------+------------------------+-----------+----------+---------------------------------- id | integer | | not null | nextval('book_id_seq'::regclass) title | character varying(255) | | not null | author_id | integer | | not null | public_year | smallint | | | Indexes: "book_pkey" PRIMARY KEY, btree (id) Foreign-key constraints: "fk_author" FOREIGN KEY (author_id) REFERENCES author(id)