Sqoop overview

Overview

Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes, using MapReduce jobs running on YARN for parallel operation. You can use Sqoop to import data from a relational database management system (RDBMS), such as PostgreSQL or MySQL, or from a mainframe into the Hadoop Distributed File System (HDFS), transform the data with Hadoop MapReduce, and then export the data back into an RDBMS.

Sqoop usage

The main features of Sqoop are:

  • Parallel import and export

    Using Sqoop, you can import data from RDBMSs and mainframes into HDFS, Hive, or HBase, and export data from HDFS to external data stores. Sqoop provides several import options, such as importing all tables, specific columns and rows, or the results of an SQL query, as well as incremental imports and several RDBMS-specific import options.

  • SQL queries

    Using Sqoop’s eval command, you can run simple SQL statements on a database and display the results.

  • Database requests

    Sqoop provides commands that allow you to get a list of available databases on a host and list available tables in a database.

  • Code generation

    The codegen Sqoop CLI command can generate DAO classes based on a table schema structure.

  • Kerberos security integration

    Sqoop supports the Kerberos network authentication protocol, which allows nodes to authenticate each other when communicating over an insecure network.

Most of the features are provided as Sqoop CLI commands.

Here’s an example of the sqoop help command output:

usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information
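
For illustration, the following invocations sketch how a few of these commands are typically used; the JDBC connection string, username, database, table, and target directory are placeholder values rather than settings from a real environment:

sqoop list-databases --connect jdbc:postgresql://db-host:5432/postgres --username sqoop_user -P

sqoop eval --connect jdbc:postgresql://db-host:5432/sales --username sqoop_user -P \
    --query "SELECT COUNT(*) FROM orders"

sqoop import --connect jdbc:postgresql://db-host:5432/sales --username sqoop_user -P \
    --table orders --target-dir /user/sqoop/orders

The -P option makes Sqoop prompt for the database password instead of reading it from the command line.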

Components

In Arenadata Hadoop, the Sqoop service has two components: Sqoop Metastore and Sqoop Client.

Sqoop Metastore stores information about saved jobs. A job records the parameters of an import/export command and can be re-executed later by invoking it by its handle.

The Sqoop Client component enables running the Sqoop CLI commands.
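
As a sketch of how saved jobs work, the commands below create, list, and re-run a job; the job name, connection string, table, and check column are placeholders, and the --meta-connect address assumes a Sqoop Metastore listening on its default port 16000 on a hypothetical metastore-host:

sqoop job --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop \
    --create daily_orders_import \
    -- import --connect jdbc:postgresql://db-host:5432/sales --username sqoop_user -P \
    --table orders --incremental append --check-column order_id --last-value 0

sqoop job --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop --list

sqoop job --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop --exec daily_orders_import

When a job is executed through the metastore, Sqoop also stores state such as the last value of an incremental import there, so the next run continues from where the previous one stopped.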

Processing during import and export

To perform an import/export task, Sqoop first gathers the metadata from the source or target database and then creates a map-only job in which each mapper connects to the database via JDBC and transfers its chunk of data: into HDFS on import, or from HDFS into the database on export.

When importing whole tables, Sqoop reads a table row-by-row into HDFS. For mainframe datasets, Sqoop reads records from each mainframe dataset into HDFS.
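
For example, the number of mappers (and therefore the number of parallel JDBC connections) and the column used to split the data between them can be set explicitly; the table, column, and directory names below are placeholders:

sqoop import --connect jdbc:postgresql://db-host:5432/sales --username sqoop_user -P \
    --table orders --split-by order_id --num-mappers 8 \
    --target-dir /user/sqoop/orders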

The import command workflow

The result of the import process is a set of files containing a copy of the imported table or datasets. These files may be in the form of delimited text files (for example, with commas or tabs separating each field) or binary Avro or SequenceFiles containing serialized record data.

Sqoop’s export process reads a set of delimited text files from HDFS in parallel, parses them into records, and inserts them as new rows into a target database.
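
The two directions might look as follows; the paths, table names, and file formats are illustrative only. The import writes binary Avro files, while the export parses comma-delimited text files back into a database table:

sqoop import --connect jdbc:postgresql://db-host:5432/sales --username sqoop_user -P \
    --table orders --as-avrodatafile --target-dir /user/sqoop/orders_avro

sqoop export --connect jdbc:postgresql://db-host:5432/sales --username sqoop_user -P \
    --table orders_copy --export-dir /user/sqoop/orders_csv \
    --input-fields-terminated-by ','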

Integration with other services

Hive

Sqoop can create the data layout in Hive during the import process or after the data is uploaded to HDFS.

If the Hive table already exists, it can be replaced with the --hive-overwrite option during an import.

If the data was imported directly into HDFS, Sqoop can import it into Hive by generating a Hive script. This script includes a CREATE TABLE operation defining columns using Hive’s types and a LOAD DATA INPATH statement to move the data files into Hive’s warehouse directory.

Sqoop can import the data into a specific partition in Hive if the --hive-partition-key and --hive-partition-value arguments are specified.
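
A minimal sketch of an import straight into a partitioned Hive table follows; the database, table, partition key, and partition value are placeholders:

sqoop import --connect jdbc:postgresql://db-host:5432/sales --username sqoop_user -P \
    --table orders --hive-import --hive-table sales.orders --hive-overwrite \
    --hive-partition-key load_date --hive-partition-value '2024-01-01'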

Hive limitations

  • Since Hive doesn’t support escaping new-line characters and field delimiters in enclosed strings, avoid using escaping and enclosing characters as field and record-terminating delimiters. If the --escaped-by, --enclosed-by, or --optionally-enclosed-by options are specified when importing data into Hive, Sqoop prints a warning message.

  • Hive won’t work correctly if data contains string fields with Hive’s default row delimiters (\n and \r characters) or column delimiters (\01 characters). Use the --hive-drop-import-delims option to drop these characters on import, or the --hive-delims-replacement option to replace them with a specified string (see the sketch after this list).

  • Sqoop passes field and record delimiters through to Hive. If you do not set any delimiters and use --hive-import, the field delimiter is set to ^A and the record delimiter to \n to match Hive’s defaults.
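
For example, the following hypothetical import drops Hive’s delimiter characters from string fields before the data is loaded into a Hive table (all names are placeholders):

sqoop import --connect jdbc:postgresql://db-host:5432/sales --username sqoop_user -P \
    --table customer_notes --hive-import --hive-table sales.customer_notes \
    --hive-drop-import-delims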

For more information on how to import data to Hive using Sqoop, see the Use Sqoop to transfer data between PostgreSQL and Hive article.

HBase

Sqoop can also import data into an HBase table if the --hbase-table option is specified. Each input table row will be transformed into an HBase Put operation for a row in the output table. The key for each row is taken from a column of the input.

By default, Sqoop uses the split-by column as the row key column. If not specified, it will try to identify the source table’s primary key column. You can manually specify the row key column with --hbase-row-key. Each resulting column will be placed in the same column family, which must be specified with --column-family.

If the target table and column family do not exist, the Sqoop job will exit with an error. Specifying the --hbase-create-table option allows Sqoop to create the target table and column family using default parameters from your HBase configuration.

Sqoop currently serializes all values to HBase by converting each field to its string representation and then inserting the UTF-8 bytes of this string in the target cell.
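
A rough example of an HBase import, with placeholder table, column family, and row key column names:

sqoop import --connect jdbc:postgresql://db-host:5432/sales --username sqoop_user -P \
    --table orders --hbase-table orders --column-family order_data \
    --hbase-row-key order_id --hbase-create-table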
