Use Sqoop to transfer data between PostgreSQL and Hive
Overview
The Sqoop service simplifies data transfer between Hadoop, relational databases, and other data stores. It provides the following CLI commands for transferring data, each with a number of options:
- sqoop import — copies an individual table from a database using MapReduce or YARN. Each row of the table is represented as a separate record in HDFS. Records can be stored as text files (one record per line) or in binary representation as Avro or SequenceFiles. When importing data to Hive, Sqoop reads the source table row by row in parallel and copies it to the target table in Hive.

- sqoop export — reads a set of delimited text files from HDFS in parallel, using MapReduce or YARN, parses them into records, and inserts them as new rows into a target database table.
The syntax for both commands is as follows:
$ sqoop import [generic-args] [import-args]
$ sqoop export [generic-args] [export-args]
This guide demonstrates how to start using Sqoop for transferring data between PostgreSQL and Hive.
PostgreSQL connectors
Sqoop supports several PostgreSQL connectors that enable additional import/export options: the default PostgreSQL connector, direct connector, and pg_bulkload (not supported in ADPG).
PostgreSQL connector
By default, Sqoop import/export commands work with tables located in the public schema. The PostgreSQL connector enables import/export operations on data stored in schemas other than the default one. To select a schema, add the --schema option and provide the name of the chosen schema.
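For example, the following sketch imports a books table from a hypothetical store schema. Since --schema is a connector-specific argument, it is passed after the -- separator that ends the standard Sqoop options:

$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--table books \
--hive-import \
-- --schema store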
PostgreSQL direct connector
The PostgreSQL direct connector allows faster import and export by using PostgreSQL’s COPY command. To use the direct connector, specify the --direct option in your import or export requests.
When importing from PostgreSQL this way, you can split the imported data into separate files of a certain size. This size limit is controlled with the --direct-split-size argument.
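For example, the following sketch imports the books table (created later in the Preparations section) in direct mode and splits the imported data into files of roughly 32 MB each; the --direct-split-size value is specified in bytes:

$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--table books \
--direct \
--direct-split-size 33554432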
The direct connector provides new options, which are described in the table below.
Option | Description
---|---
--direct | Enables the direct connector
--direct-split-size | Enables splitting the data into separate files of a certain size
--boolean-true-string | A string that will be used to encode the true value of boolean columns (the default is TRUE)
--boolean-false-string | A string that will be used to encode the false value of boolean columns (the default is FALSE)
NOTE
The direct connector does not support importing large object columns (BLOB and CLOB), views, or importing to HBase and Accumulo.
Requirements
When importing or exporting data with Sqoop, pay attention to:

- Data types. Check that the data imported from PostgreSQL doesn’t contain data types that are not supported in Hive.

- Table and column names. If table and column names contain characters that are not valid Java identifier characters or Avro/Parquet identifiers, Sqoop translates these characters to _. Any column name starting with an underscore is translated to the same name with two underscore characters (see the example after this list).
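For instance, a hypothetical source column named book-title would be imported as book_title, and a column named _notes would become __notes.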
Preparations
The example commands in this article were executed in Arenadata Postgres (ADPG), a relational database management system based on PostgreSQL, and in Sqoop and Hive services, available in Arenadata Hadoop (ADH).
- On the ADPG host, run the psql terminal client as the default user (postgres) using the following commands:

  $ sudo su - postgres
  $ psql

  Now you can enter commands in psql to work with PostgreSQL databases and tables:

  psql (14.1)
  Type "help" for help.
  postgres=#
- Create a user for connecting to ADPG (for demonstration purposes, grant superuser privileges):

  CREATE ROLE sqoop SUPERUSER LOGIN PASSWORD '<sqoop_user_password>';

- Create a test database:

  CREATE DATABASE books_store;

- Switch to this database:

  \c books_store

  The command result:

  You are now connected to database "books_store" as user "postgres".
  books_store=#
- Create a table:

  CREATE TABLE books (
      id SERIAL PRIMARY KEY,
      title VARCHAR(255) NOT NULL,
      author VARCHAR(255) NOT NULL,
      public_year SMALLINT NULL);

- Add test data to the table:

  INSERT INTO books(title, author, public_year) VALUES
      ('Crime and Punishment','Fyodor Dostoevsky',1866),
      ('Frankenstein','Mary Shelley',1823),
      ('The Master and Margarita','Mikhail Bulgakov',1966),
      ('Nineteen Eighty-Four','George Orwell',1949),
      ('Dracula','Bram Stoker',NULL);
- To allow the sqoop user to connect to the test database, add the following record to the pg_hba.conf file (for ADPG, you can do this in the ADCM interface — see the Password authentication article for details):

  host books_store sqoop <sqoop_host_address> password

  Where <sqoop_host_address> is the address of the Sqoop host. You can find the description of the address field and the other fields of pg_hba.conf records in The pg_hba.conf file article.

NOTE
For an overview of how to work with PostgreSQL tables in psql, see Example of working with tables via psql in the ADPG documentation.
- On the host with the Hive Client component, run:

  $ sudo -u hive hive

  At the end of the output, you’ll see the Hive CLI invitation:

  hive>

- Create a table:

  CREATE TABLE books(
      id INT,
      title STRING,
      author STRING,
      public_year INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

  Example output:

  OK
  Time taken: 0.074 seconds

NOTE
For an overview of how to work with Hive, see Work with Hive tables.
- Install Sqoop by following the quickstart guide.

- Check the connection to the test database in PostgreSQL:

  $ sqoop list-tables --connect jdbc:postgresql://127.0.0.1:5432/books_store --username <username> -P

  The output will be similar to this:

  2024-05-29 11:47:29,608 (main) [INFO - org.apache.sqoop.Sqoop.<init>(Sqoop.java:94)] Running Sqoop version: 1.4.7_arenadata2
  2024-05-29 11:47:29,743 (main) [INFO - org.apache.sqoop.manager.SqlManager.initOptionDefaults(SqlManager.java:98)] Using default fetchSize of 1000
  books
Common arguments
Common arguments for both import and export commands are described in the table below.
Option | Description
---|---
--connect | A JDBC connection string for connecting to a database. It must contain the full hostname or IP address of the database host, which can be seen by all remote nodes in a cluster. Do not use localhost in the connection string if you run Sqoop on a distributed Hadoop cluster, because map tasks on other nodes will not be able to connect to the database
--connection-manager | A connection manager class to use
--driver | A JDBC driver class to use. Sqoop selects the default driver based on the provided JDBC connection URI, but you can specify a different driver using this option
--hadoop-mapred-home | A custom $HADOOP_MAPRED_HOME directory that overrides the default one
--password-file | A path to a file containing the authentication password. You can save the password in a file in the user's home directory with 400 permissions (see the example after this table)
-P | This option indicates that the password must be entered in a console. It’s a less safe option than using a password file
--password | The authentication password for connecting to a DB. This way of providing credentials is insecure, as other users may be able to read your password from the command-line arguments
--username | The name of a user who has access to the database
--verbose | Print more information in the command output. This might be useful when debugging, but the import/export commands already provide extensive output by default
--connection-param-file | A file containing the parameters for the JDBC connection. This feature is useful for lengthy import/export commands or when managing different periodic connections
--relaxed-isolation | Sets the connection transaction isolation to read uncommitted for the mappers
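The following sketch illustrates the --password-file option. It stores the password of the sqoop role in a file (the /user/hdfs/.sqoop.password path is an arbitrary example) and references that file during import. Sqoop typically resolves the path against the cluster's default file system (HDFS); to point to a local file instead, prefix the path with file://. The echo -n command is used so that no trailing newline is stored together with the password:

$ echo -n '<sqoop_user_password>' > .sqoop.password
$ hdfs dfs -put .sqoop.password /user/hdfs/.sqoop.password
$ hdfs dfs -chmod 400 /user/hdfs/.sqoop.password
$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop \
--password-file /user/hdfs/.sqoop.password \
--table books \
--hive-import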
The full list of options for each command is available in the import and export command references. Additionally, you can use the --help option with each command to read the same information in a console.
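For example, to display the available options of the import command:

$ sqoop import --help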
Transfer data from PostgreSQL to Hive
Create or overwrite tables
To transfer data from PostgreSQL to Hive, use the command:
$ sqoop import \
--connect jdbc:postgresql://<postgres-host>:5432/<db-name> \
--username <username> -P \
--table <table-name> \
--hive-import
Where:

- <postgres-host> — the IP address of the DB host;
- <db-name> — the name of the database from which to copy the data;
- <username> — the name of the user who has access to the requested table;
- <table-name> — the name of the table to be imported to Hive.
The --hive-import
option enables saving data to Hive.
For example:
$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--table books \
--hive-import
The output of the successful import ends with:
[INFO - org.apache.sqoop.hive.HiveImport.importTable(HiveImport.java:212)] Hive import complete.
When Sqoop imports data to Hive, it creates a table with the same name as the source table. If a table with that name already exists, Sqoop appends the imported data to it.
You can specify a different name for the target table by using the --hive-table
option.
For example:
$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--table books \
--hive-import \
--hive-table books2
To overwrite data in the target table, use the --hive-overwrite
option.
For example:
$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--table books \
--hive-import \
--hive-overwrite
To see the imported data, run in Hive:
SELECT * FROM books;
Example output:
OK
1    Crime and Punishment    Fyodor Dostoevsky    1866
2    Frankenstein    Mary Shelley    1823
3    The Master and Margarita    Mikhail Bulgakov    1966
4    Nineteen Eighty-Four    George Orwell    1949
5    Dracula    Bram Stoker    NULL
Time taken: 2.255 seconds, Fetched: 5 row(s)
The test table we are using in this example has a NULL
value. In this case, Sqoop will copy it to Hive as a NULL
string. To change the substitution value for NULL
, you can use the --null-string
option for string values and the --null-non-string
option for other data types.
For example:
$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--table books \
--hive-import \
--null-non-string 0
The result:
OK
1    Crime and Punishment    Fyodor Dostoevsky    1866
2    Frankenstein    Mary Shelley    1823
3    The Master and Margarita    Mikhail Bulgakov    1966
4    Nineteen Eighty-Four    George Orwell    1949
5    Dracula    Bram Stoker    0
Column import
By default, all columns within a table are selected for import. You can import only chosen columns by using the --columns
option.
This argument must include a comma-delimited list of all the columns to be imported.
For example:
$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--table books \
--hive-import \
--columns "title,author"
The result:
OK
Crime and Punishment    Fyodor Dostoevsky
Frankenstein    Mary Shelley
The Master and Margarita    Mikhail Bulgakov
Nineteen Eighty-Four    George Orwell
Dracula    Bram Stoker
Time taken: 2.168 seconds, Fetched: 5 row(s)
Row import
To select which rows to import, use the --where
option. The argument can contain an arbitrary expression to select the desired rows.
For example:
$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--table books \
--hive-import \
--where "public_year < '1900'"
The above import
command will generate statements like SELECT <column list> FROM <table name>
with the provided WHERE
clause and copy the selected data to Hive. In this case, only the books published in the 19th century will be copied.
The result:
OK
1    Crime and Punishment    Fyodor Dostoevsky    1866
2    Frankenstein    Mary Shelley    1823
Time taken: 2.07 seconds, Fetched: 2 row(s)
Free-form query import
Sqoop can also import the result set of an arbitrary SQL query. Instead of using the --table
, --columns
, and --where
arguments, you can specify an SQL statement with the --query
argument.
When importing a free-form query, you must:

- Specify a destination directory with --target-dir.
- Include the $CONDITIONS token in your query.
- Select a splitting column with the --split-by option. Alternatively, limit the number of mappers to 1.
For example:
$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--hive-import \
--query 'SELECT MIN(public_year) FROM books WHERE $CONDITIONS' \
--target-dir /user/hdfs/books2 \
--split-by id
The result:
OK
1823
Time taken: 2.051 seconds, Fetched: 1 row(s)
Control parallel tasks
When data is imported from PostgreSQL, each parallel map task performs a query to select its portion of the requested data. Alternatively, the query can be executed once and the data imported serially by specifying a single map task with the -m 1 option:
$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--hive-import \
--query 'SELECT MIN(public_year) FROM books WHERE $CONDITIONS' \
--target-dir /user/hdfs/books2 \
-m 1
Transfer data from Hive to PostgreSQL
For transferring data from Hive to PostgreSQL, there is no dedicated Hive export option, but you can export data to PostgreSQL directly from the Hive warehouse directory in HDFS.
The target table with the appropriate structure must already exist in the database.
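Before running the export, create a target table in the books_store database. The following is a minimal sketch whose structure matches the Hive books table used in this guide (adjust the column types to your data if needed):

CREATE TABLE exportbooks (
    id INT,
    title VARCHAR(255),
    author VARCHAR(255),
    public_year SMALLINT);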
The export command syntax is:
$ sqoop export \
--connect jdbc:postgresql://<postgres-host>:5432/<db-name> \
--username <username> -P \
--table <table-name> \
--export-dir <hdfs-directory>
Where:

- <postgres-host> — the IP address of the DB host;
- <db-name> — the name of the database into which to copy the data;
- <username> — the name of the user who has access to the requested table;
- <table-name> — the name of the target table in PostgreSQL;
- <hdfs-directory> — the data source directory in HDFS.
For example:
$ sqoop export \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--table exportbooks \
--export-dir /apps/hive/warehouse/books
This example takes the files in /apps/hive/warehouse/books
and injects their contents into the exportbooks
table in the books_store
database. Sqoop makes a series of INSERT INTO
operations without regard for existing content. If Sqoop attempts to insert rows that violate constraints in the database (for example, a particular primary key value already exists), then the export fails.
Alternatively, you can specify the columns to be exported by providing --columns "<column1,column2,column3>".
NOTE
The columns that are not included in the --columns argument must either have defined default values or allow NULL values. Otherwise, the database will reject the exported data, and the Sqoop job will fail.
To check if the data was copied correctly, run in psql:
SELECT * FROM exportbooks;
Example output:
 id |          title           |      author       | public_year
----+--------------------------+-------------------+-------------
  1 | Crime and Punishment     | Fyodor Dostoevsky |        1866
  2 | Frankenstein             | Mary Shelley      |        1823
  3 | The Master and Margarita | Mikhail Bulgakov  |        1966
  4 | Nineteen Eighty-Four     | George Orwell     |        1949
  5 | Dracula                  | Bram Stoker       |
(5 rows)
For more information about the export
command, refer to the Sqoop documentation.