Quick start with Sqoop
This guide demonstrates how to start using Sqoop for transferring data between a database and HDFS.
Install Sqoop
To install Sqoop in an existing cluster via ADCM:
- Add the Sqoop service to the cluster as described in the Add services article.
- Map Sqoop components to the cluster hosts by following the steps from the Add components article.
- If necessary, edit the default Sqoop configuration as demonstrated in the Configure services article.
- Install the service.
Connect to a database
Sqoop uses JDBC for database connections. In this example, we will be connecting to an Arenadata PostgreSQL 14 database, which means we need to download a PostgreSQL JDBC driver of the appropriate version.
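A PostgreSQL JDBC connection string, like the ones used in the commands below, follows this general form (replace the placeholders with your host, port, and database name):
jdbc:postgresql://<host>:<port>/<database>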
- Make the Sqoop DB drivers directory available for read and write access:
  $ chmod -R 777 /var/lib/sqoop
- Download the driver to that directory:
  $ curl -L 'https://jdbc.postgresql.org/download/postgresql-42.7.3.jar' -o /var/lib/sqoop/postgresql-42.7.3.jar
- Run a test query to check the connection. For example:
  $ sqoop list-tables \
  --connect jdbc:postgresql://127.0.0.1:5432/test_db \
  --username <username> -P
Example output:
2024-05-29 11:47:29,608 (main) [INFO - org.apache.sqoop.Sqoop.<init>(Sqoop.java:94)] Running Sqoop version: 1.4.7_arenadata2
2024-05-29 11:47:29,743 (main) [INFO - org.apache.sqoop.manager.SqlManager.initOptionDefaults(SqlManager.java:98)] Using default fetchSize of 1000
author
book
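As another quick connectivity check, Sqoop's list-databases tool enumerates the databases visible to the specified user; for example:
$ sqoop list-databases \
--connect jdbc:postgresql://127.0.0.1:5432/test_db \
--username <username> -P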
Import data to HDFS
To copy data from a specific table to HDFS, run the command:
$ sqoop import \
--connect <JDBC-URL> \
--username <username> -P \
--table <table-name> \
--target-dir <HDFS-directory>
Where:
- <JDBC-URL> — a JDBC connection string;
- <username> — the name of the user who has access to the requested table;
- <table-name> — the name of the table to be imported to HDFS;
- <HDFS-directory> — a new or existing directory in HDFS.
For example:
$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/test_db \
--username user1 -P \
--table book \
--target-dir books/
To make sure the data has been copied successfully, run:
$ hdfs dfs -ls hdfs://adhdev/user/sqoop/books
Example output:
Found 5 items
-rw-r--r--   3 hdfs hadoop          0 2024-05-29 07:45 hdfs://adhdev/user/sqoop/books/_SUCCESS
-rw-r--r--   3 hdfs hadoop         50 2024-05-29 07:44 hdfs://adhdev/user/sqoop/books/part-m-00000
-rw-r--r--   3 hdfs hadoop         57 2024-05-29 07:44 hdfs://adhdev/user/sqoop/books/part-m-00001
-rw-r--r--   3 hdfs hadoop         31 2024-05-29 07:45 hdfs://adhdev/user/sqoop/books/part-m-00002
-rw-r--r--   3 hdfs hadoop         35 2024-05-29 07:45 hdfs://adhdev/user/sqoop/books/part-m-00003
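By default, Sqoop writes imported rows as plain text, one part-m-* file per map task with fields separated by commas, so you can also inspect the imported records directly; for example:
$ hdfs dfs -cat hdfs://adhdev/user/sqoop/books/part-m-00000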
Create a job
The Sqoop job tool lets you save import and export commands for later reuse.
To create a job, run:
$ sqoop job --create myjob \
-- import \
--connect jdbc:postgresql://127.0.0.1:5432/test_db \
--username user1 -P \
--table book \
--target-dir books/
The above command creates a job called myjob that imports data from the book table in PostgreSQL to the books directory in HDFS.
Check that the job has been created successfully by running:
$ sqoop job --list
The output:
Available jobs: myjob
To run the saved job, use the command:
$ sqoop job --exec myjob
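A saved job can also be inspected or removed. The --show option prints the parameters stored in a job, and --delete removes it:
$ sqoop job --show myjob
$ sqoop job --delete myjob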
Export data from HDFS
You can export data back from HDFS to a database using the export command. The table for the data must already be created in the target database.
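If the target table does not exist yet, create it up front. A minimal sketch matching the book table from this guide (column types follow the \d book output shown at the end of this section; adjust names and types to your own data):
CREATE TABLE book (
    id          serial PRIMARY KEY,                     -- matches the nextval('book_id_seq') default
    title       varchar(255) NOT NULL,
    author_id   integer NOT NULL REFERENCES author(id), -- assumes the author table already exists
    public_year smallint
);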
To export data to a DB, run the command:
$ sqoop export \
--connect <JDBC-URL> \
--username <username> -P \
--table <table-name> \
--export-dir <HDFS-directory>
Where:
- <JDBC-URL> — a JDBC connection string;
- <username> — the name of the user who has access to the requested table;
- <table-name> — the name of the table to which the data needs to be transferred;
- <HDFS-directory> — the data source directory in HDFS.
For example:
$ sqoop export \
--connect jdbc:postgresql://127.0.0.1:5432/test_db \
--username user1 -P \
--table book \
--export-dir books/
To make sure the data has been copied successfully:
- Log in as the postgres user:
  $ sudo su - postgres
- Start the psql terminal client, connecting to the test_db database:
  $ psql test_db
- Request information about the table:
  \d book
Example output:
Table "public.book" Column | Type | Collation | Nullable | Default -------------+------------------------+-----------+----------+---------------------------------- id | integer | | not null | nextval('book_id_seq'::regclass) title | character varying(255) | | not null | author_id | integer | | not null | public_year | smallint | | | Indexes: "book_pkey" PRIMARY KEY, btree (id) Foreign-key constraints: "fk_author" FOREIGN KEY (author_id) REFERENCES author(id)