Use Sqoop to transfer data between PostgreSQL and Hive

Overview

The Sqoop service enables simplified data transfer between Hadoop, relational databases, and other data stores. It provides the following CLI commands for transferring data, each with a number of options:

  • sqoop import — copies an individual table from a database using MapReduce or YARN. Each row from a table is represented as a separate record in HDFS. Records can be stored as text files (one record per line) or in binary representation as Avro or SequenceFiles. When importing data to Hive, Sqoop reads the source table row-by-row in parallel and copies it to the target table in Hive.

  • sqoop export — reads a set of delimited text files from HDFS in parallel, using MapReduce or YARN, parses them into records, and inserts them as new rows into a target database table.

The syntax for both commands is as follows:

$ sqoop import [generic-args] [import-args]
$ sqoop export [generic-args] [export-args]

This guide demonstrates how to start using Sqoop for transferring data between PostgreSQL and Hive.

PostgreSQL connectors

Sqoop supports several PostgreSQL connectors that enable additional import/export options: the default PostgreSQL connector, direct connector, and pg_bulkload (not supported in ADPG).

PostgreSQL connector

By default, Sqoop import/export commands work with tables located in the public schema. The PostgreSQL connector enables import/export operations on data stored in schemas other than the default one.

To select a schema, pass the --schema option with the name of the chosen schema. Connector-specific options such as --schema must follow a -- separator at the end of the command, as shown below.
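
For example, a minimal sketch of an import from a non-default schema (the placeholders have the same meaning as in the examples later in this guide):

$ sqoop import \
--connect jdbc:postgresql://<postgres-host>:5432/<db-name> \
--username <username> -P \
--table <table-name> \
--hive-import \
-- --schema <schema-name>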

PostgreSQL direct connector

The PostgreSQL direct connector allows faster import and export by using PostgreSQL’s COPY command.

To use the direct connector, specify the --direct option in your import or export requests.

When importing from PostgreSQL this way, you can split the imported data into separate files of a certain size. The size limit is set in bytes with the --direct-split-size argument.
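
For example, a minimal sketch of a direct-mode import that splits the output into files of at most 64 MB (the placeholders have the same meaning as in the examples later in this guide, and the target directory is illustrative):

$ sqoop import \
--connect jdbc:postgresql://<postgres-host>:5432/<db-name> \
--username <username> -P \
--table <table-name> \
--direct \
--direct-split-size 67108864 \
--target-dir <hdfs-directory>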

The direct connector provides new options, which are described in the table below.

Additional supported PostgreSQL arguments in the direct mode
Option Description

--direct

Enables the direct connector

--direct-split-size

Enables splitting the imported data into separate files of the specified size in bytes

--boolean-true-string

A string that will be used to encode the TRUE value of boolean columns

--boolean-false-string

A string that will be used to encode the FALSE value of boolean columns

NOTE

The direct connector does not support importing large object columns (BLOB and CLOB), views, or importing to HBase and Accumulo.

Requirements

When importing or exporting data with Sqoop, pay attention to:

  • Data types

    Check that the data imported from PostgreSQL doesn’t contain data types that are not supported in Hive (see the example query after this list).

  • Table and column names

    If table and column names have characters that are not valid Java identifier characters or Avro/Parquet identifiers, Sqoop will translate these characters to _. Any column name starting with an underscore will be translated to the same name with two underscore characters.
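
For example, a simple way to review the source column types before an import is to query information_schema in psql (the table name is a placeholder):

SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = '<table-name>';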

Preparations

The example commands in this article were executed in Arenadata Postgres (ADPG), a relational database management system based on PostgreSQL, and in Sqoop and Hive services, available in Arenadata Hadoop (ADH).

Prepare ADPG for test examples

 

  1. On the ADPG host, run the psql terminal client as the default user (postgres) using the following commands:

    $ sudo su - postgres
    $ psql

    Now you can enter commands in psql to work with PostgreSQL databases and tables:

    psql (14.1)
    Type "help" for help.
    
    postgres=#
  2. Create a user for connecting to ADPG (for demonstration purposes, grant superuser privileges):

    CREATE ROLE sqoop SUPERUSER LOGIN PASSWORD '<sqoop_user_password>';
  3. Create a test database:

    CREATE DATABASE books_store;
  4. Switch to this database:

    \c books_store

    The command result:

    You are now connected to database "books_store" as user "postgres".
    books_store=#
  5. Create a table:

    CREATE TABLE books (
        id SERIAL PRIMARY KEY,
        title VARCHAR(255) NOT NULL,
        author VARCHAR(255) NOT NULL,
        public_year SMALLINT NULL);
  6. Add test data to the table:

    INSERT INTO books(title, author, public_year) VALUES
        ('Crime and Punishment','Fyodor Dostoevsky',1866),
        ('Frankenstein','Mary Shelley',1823),
        ('The Master and Margarita','Mikhail Bulgakov',1966),
        ('Nineteen Eighty-Four','George Orwell',1949),
        ('Dracula','Bram Stoker',NULL);
  7. To allow the sqoop user to connect to the test database, add the following record to the pg_hba.conf file (for ADPG, you can do this in the ADCM interface — see the Password authentication article for details):

    host books_store sqoop <sqoop_host_address> password

    Where <sqoop_host_address> is the address of the Sqoop host. You can find the description of the address field and other fields of pg_hba.conf records in the pg_hba.conf file article.
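
If you edit pg_hba.conf manually rather than through ADCM, PostgreSQL needs to re-read its configuration for the new record to take effect. One way to do this is from psql:

SELECT pg_reload_conf();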

NOTE

For an overview of how to work with PostgreSQL tables in psql, see Example of working with tables via psql in the ADPG documentation.

Prepare Hive for test examples

 

  1. On the host with the Hive Client component, run:

    $ sudo -u hive hive

    At the end of the output, you’ll see the Hive CLI prompt:

    hive>
  2. Create a table:

    CREATE TABLE books(
        id INT,
        title STRING,
        author STRING,
        public_year INT)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ',';

    Example output:

    OK
    Time taken: 0.074 seconds
NOTE

For an overview of how to work with Hive, see Work with Hive tables.

Prepare Sqoop for test examples

 

  1. Install Sqoop by following the quickstart guide.

  2. Check the connection to the test database in PostgreSQL:

    $ sqoop list-tables --connect jdbc:postgresql://127.0.0.1:5432/books_store --username <username> -P

    The output will be similar to this:

    2024-05-29 11:47:29,608 (main) [INFO - org.apache.sqoop.Sqoop.<init>(Sqoop.java:94)] Running Sqoop version: 1.4.7_arenadata2
    2024-05-29 11:47:29,743 (main) [INFO - org.apache.sqoop.manager.SqlManager.initOptionDefaults(SqlManager.java:98)] Using default fetchSize of 1000
    books

Common arguments

Common arguments for both import and export commands are described in the table below.

Common import/export arguments
Option Description

--connect

A JDBC connection string for connecting to a database. It must contain the full hostname or IP address of the database host, visible to all remote nodes in the cluster. Do not use localhost in the connection string, since the string is used on TaskTracker nodes throughout MapReduce; if you specify the literal name localhost, each node will connect to a different database or will not connect at all

--connection-manager

A connection manager class to use

--driver

A JDBC driver class to use. Sqoop selects the default driver based on the provided JDBC connection URI, but you can specify a different driver using this option

--hadoop-mapred-home

A custom $HADOOP_MAPRED_HOME other than the default one

--password-file

A path to a file containing the authentication password. You can save the password in a file in the user's home directory with 400 permissions and specify the path to that file using this option. Sqoop reads the entire contents of the file and passes it to the MapReduce cluster as a password using secure means. The file can reside either on the local FS or in HDFS, for example: --password-file ${user.home}/.password. Note that the file contents are read verbatim, including any trailing white space characters, such as the new line characters added by default by most text editors (see the sketch after this table)

-P

This option indicates that the password must be entered in a console. It’s a less safe option than using a password file

--password

The authentication password for connecting to a DB. This way of providing credentials is insecure, as other users may be able to read your password from the command-line arguments

--username

The name of a user who has access to the database

--verbose

Prints more information in the command output. This might be useful when debugging, but the import/export commands already provide extensive output by default

--connection-param-file

A file containing the parameters for the JDBC connection. This feature is useful for lengthy import/export commands or when managing different periodic connections

--relaxed-isolation

Sets the connection transaction isolation to READ UNCOMMITTED for the mappers. By default, Sqoop uses the READ COMMITTED transaction isolation in mappers to import data
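
For example, a minimal sketch of the password file approach, assuming the file is kept in the sqoop user's HDFS home directory (the file name and path are illustrative):

# write the password without a trailing newline, upload it to HDFS, and restrict its permissions
$ echo -n '<sqoop_user_password>' > .sqoop_pass
$ hdfs dfs -put .sqoop_pass /user/sqoop/.sqoop_pass
$ hdfs dfs -chmod 400 /user/sqoop/.sqoop_pass
$ sqoop list-tables \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop \
--password-file /user/sqoop/.sqoop_pass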

The full list of options for both commands is available in the import and export argument references of the Sqoop documentation.

Additionally, you can use the --help option for each command to read the same information in a console.
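
For instance, to print the import options in a console:

$ sqoop import --help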

Transfer data from PostgreSQL to Hive

Create or overwrite tables

To transfer data from PostgreSQL to Hive, use the command:

$ sqoop import \
--connect jdbc:postgresql://<postgres-host>:5432/<db-name> \
--username <username> -P \
--table <table-name> \
--hive-import

Where:

  • <postgres-host> — the IP address of the DB host;

  • <db-name> — the name of the database from which to copy the data;

  • <username> — the name of the user who has access to the requested table;

  • <table-name> — the name of the table to be imported to Hive.

The --hive-import option enables saving data to Hive.

For example:

$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--table books \
--hive-import

The output of a successful import ends with:

[INFO - org.apache.sqoop.hive.HiveImport.importTable(HiveImport.java:212)] Hive import complete.

When Sqoop imports data to Hive, it creates a table with the same name as the source table. If a table with that name already exists, Sqoop appends the imported data to it.

You can specify a different name for the target table by using the --hive-table option.

For example:

$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--table books \
--hive-import \
--hive-table books2

To overwrite data in the target table, use the --hive-overwrite option.

For example:

$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--table books \
--hive-import \
--hive-overwrite

To see the imported data, run in Hive:

SELECT * FROM books;

Example output:

OK
1       Crime and Punishment    Fyodor Dostoevsky       1866
2       Frankenstein    Mary Shelley    1823
3       The Master and Margarita        Mikhail Bulgakov        1966
4       Nineteen Eighty-Four    George Orwell   1949
5       Dracula Bram Stoker     NULL
Time taken: 2.255 seconds, Fetched: 5 row(s)

The test table we are using in this example has a NULL value. In this case, Sqoop will copy it to Hive as a NULL string. To change the substitution value for NULL, you can use the --null-string option for string values and the --null-non-string option for other data types.

For example:

$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--table books \
--hive-import \
--null-non-string 0

The result:

OK
1       Crime and Punishment    Fyodor Dostoevsky       1866
2       Frankenstein    Mary Shelley    1823
3       The Master and Margarita        Mikhail Bulgakov        1966
4       Nineteen Eighty-Four    George Orwell   1949
5       Dracula Bram Stoker     0

Column import

By default, all columns within a table are selected for import. You can import only chosen columns by using the --columns option.

This argument must include a comma-delimited list of all the columns to be imported.

For example:

$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--table books \
--hive-import \
--columns "title,author"

The result:

OK
Crime and Punishment    Fyodor Dostoevsky
Frankenstein    Mary Shelley
The Master and Margarita        Mikhail Bulgakov
Nineteen Eighty-Four    George Orwell
Dracula Bram Stoker
Time taken: 2.168 seconds, Fetched: 5 row(s)

Row import

To select which rows to import, use the --where option. The argument can contain an arbitrary expression to select the desired rows.

For example:

$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--table books \
--hive-import \
--where "public_year < '1900'"

The above import command will generate statements like SELECT <column list> FROM <table name> with the provided WHERE clause and copy the selected data to Hive. In this case, only the books published in the 19th century will be copied.

The result:

OK
1       Crime and Punishment    Fyodor Dostoevsky       1866
2       Frankenstein    Mary Shelley    1823
Time taken: 2.07 seconds, Fetched: 2 row(s)

Free-form query import

Sqoop can also import the result set of an arbitrary SQL query. Instead of using the --table, --columns, and --where arguments, you can specify an SQL statement with the --query argument.

When importing a free-form query, you must:

  • Specify a destination directory with --target-dir.

  • Include the $CONDITIONS token in your query.

  • Select a splitting column with the --split-by option. Alternatively, limit the number of mappers to 1.

For example:

$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--hive-import \
--query 'SELECT MIN(public_year) FROM books WHERE $CONDITIONS' \
--target-dir /user/hdfs/books2 \
--split-by id

The result:

OK
1823
Time taken: 2.051 seconds, Fetched: 1 row(s)

Control parallel tasks

When data is imported from PostgreSQL, Sqoop runs several map tasks in parallel, each of which queries its own slice of the requested data. Alternatively, the query can be executed once and the data imported serially by specifying a single map task with the -m 1 option:

$ sqoop import \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--hive-import \
--query 'SELECT MIN(public_year) FROM books WHERE $CONDITIONS' \
--target-dir /user/hdfs/books2 \
-m 1

Transfer data from Hive to PostgreSQL

For transferring data from Hive to PostgreSQL, there’s no designated Hive export option, but you can export data to Postgres directly from the Hive warehouse directory in HDFS.
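
To find the warehouse directory of a specific Hive table, you can check its Location field in the Hive CLI, for example:

DESCRIBE FORMATTED books;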

The target table with the appropriate structure must already exist in the database.
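
The export example below writes to a table named exportbooks, which is not created in the preparation steps above. A possible definition, mirroring the books table, can be created in psql:

CREATE TABLE exportbooks (
    id SERIAL PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    author VARCHAR(255) NOT NULL,
    public_year SMALLINT NULL);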

The export command syntax is:

$ sqoop export \
--connect jdbc:postgresql://<postgres-host>:5432/<db-name> \
--username <username> -P \
--table <table-name> \
--export-dir <hdfs-directory>

Where:

  • <postgres-host> — the IP address of the DB host;

  • <db-name> — the name of the database into which to copy the data;

  • <username> — the name of the user who has access to the requested table;

  • <table-name> — the name of the target table in PostgreSQL;

  • <hdfs-directory> — the data source directory in HDFS.

For example:

$ sqoop export \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--table exportbooks \
--export-dir /apps/hive/warehouse/books

This example takes the files in /apps/hive/warehouse/books and injects their contents into the exportbooks table in the books_store database. Sqoop makes a series of INSERT INTO operations without regard for existing content. If Sqoop attempts to insert rows that violate constraints in the database (for example, a particular primary key value already exists), then the export fails.

Alternatively, you can specify the columns to be exported by providing --columns "<column1,column2,column3>".

NOTE

The columns that are not included in the --columns parameter need to have either a defined default value or allow NULL values. Otherwise, your database will reject the imported data, which in turn will fail the Sqoop job.
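
For example, a sketch of the same export with the column mapping stated explicitly (useful when the column order in the target table differs from the field order in the HDFS files):

$ sqoop export \
--connect jdbc:postgresql://127.0.0.1:5432/books_store \
--username sqoop -P \
--table exportbooks \
--columns "id,title,author,public_year" \
--export-dir /apps/hive/warehouse/books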

To check if the data was copied correctly, run in psql:

SELECT * FROM exportbooks;

Example output:

 id |          title           |      author       | public_year
----+--------------------------+-------------------+-------------
  1 | Crime and Punishment     | Fyodor Dostoevsky |        1866
  2 | Frankenstein             | Mary Shelley      |        1823
  3 | The Master and Margarita | Mikhail Bulgakov  |        1966
  4 | Nineteen Eighty-Four     | George Orwell     |        1949
  5 | Dracula                  | Bram Stoker       |
(5 rows)

For more information about the export command, refer to the Sqoop documentation.
