Sqoop overview
Overview
Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes, using YARN for parallel operations. You can use Sqoop to import data from a relational database management system (RDBMS), such as PostgreSQL or MySQL, or from a mainframe into the Hadoop Distributed File System (HDFS), transform the data with Hadoop MapReduce, and then export the data back into an RDBMS.
The main features of Sqoop are:
- Parallel import and export. Using Sqoop, you can import data from RDBMSs and mainframes into HDFS, Hive, or HBase, and export data from HDFS to external data stores. Sqoop provides several import options, such as importing all tables, specific columns and rows, or the results of an SQL query, as well as incremental imports and several RDBMS-specific import options.
- SQL queries. Using Sqoop’s eval command, you can run simple SQL statements on a database and display the results.
- Database requests. Sqoop provides commands that list the available databases on a host and the available tables in a database.
- Code generation. The codegen Sqoop CLI command can generate DAO classes based on a table schema structure.
- Kerberos security integration. Sqoop supports the Kerberos computer network authentication protocol, which enables node communication over an insecure network.
Most of the features are provided as Sqoop CLI commands.
Here’s an example of the sqoop help command output:
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information
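As an illustration, here is a minimal sketch of how a few of these commands might be invoked against a hypothetical PostgreSQL database (the host, database, user, and table names are placeholders, not values from this documentation):

# List the databases on a host and the tables in a database
sqoop list-databases --connect jdbc:postgresql://db.example.com/ --username sqoop_user -P
sqoop list-tables --connect jdbc:postgresql://db.example.com/sales --username sqoop_user -P

# Run a simple SQL statement and display the result
sqoop eval --connect jdbc:postgresql://db.example.com/sales --username sqoop_user -P \
  --query "SELECT COUNT(*) FROM orders"

# Generate DAO classes for the orders table
sqoop codegen --connect jdbc:postgresql://db.example.com/sales --username sqoop_user -P \
  --table orders --outdir /tmp/sqoop-codegen --class-name Order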
Components
In Arenadata Hadoop, the Sqoop service has two components: Sqoop Metastore and Sqoop Client.
Sqoop Metastore keeps information about created jobs. A saved job stores the parameters of an import/export command and can be re-executed by invoking the job by its handle.
The Sqoop Client component enables running the Sqoop CLI commands.
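For example, a saved incremental import job might be created and re-executed like this (a minimal sketch; the metastore URL, connection string, and job name are hypothetical):

# Save an incremental import definition in the Sqoop metastore
sqoop job --create import_orders \
  --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop \
  -- import \
  --connect jdbc:postgresql://db.example.com/sales --username sqoop_user -P \
  --table orders --incremental append --check-column id --last-value 0

# List saved jobs and re-execute one by its handle
sqoop job --list --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop
sqoop job --exec import_orders --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop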
Processing during import and export
To perform an import or export task, Sqoop first gathers metadata from the database and then runs a map-only job in which each mapper connects to the database via JDBC and transfers its own chunk of data (into HDFS during an import, or from HDFS into the database during an export).
When importing whole tables, Sqoop reads a table row-by-row into HDFS. For mainframe datasets, Sqoop reads records from each mainframe dataset into HDFS.
The result of the import process is a set of files containing a copy of the imported table or datasets. These files may be in the form of delimited text files (for example, with commas or tabs separating each field) or binary Avro or SequenceFiles containing serialized record data.
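A minimal import sketch (connection details are placeholders; the split column, mapper count, and output format are illustrative assumptions):

# Import the orders table with 4 parallel mappers, splitting the work on the id column
sqoop import \
  --connect jdbc:postgresql://db.example.com/sales --username sqoop_user -P \
  --table orders --split-by id --num-mappers 4 \
  --target-dir /user/sqoop/orders --as-avrodatafile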
Sqoop’s export process reads a set of delimited text files from HDFS in parallel, parses them into records, and inserts them as new rows into a target database.
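And a corresponding export sketch, again with placeholder names:

# Export comma-delimited files from HDFS into an existing orders_export table
sqoop export \
  --connect jdbc:postgresql://db.example.com/sales --username sqoop_user -P \
  --table orders_export --export-dir /user/sqoop/orders_csv \
  --input-fields-terminated-by ',' --num-mappers 4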
Integration with other services
Hive
Sqoop can create the data layout in Hive during the import process or after the data is uploaded to HDFS.
If the Hive table already exists, its contents can be replaced by specifying the --hive-overwrite option during an import.
If the data was imported directly into HDFS, Sqoop can import it into Hive by generating a Hive script. This script includes a CREATE TABLE operation defining columns using Hive’s types and a LOAD DATA INPATH statement to move the data files into Hive’s warehouse directory.
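For example, a direct-to-Hive import with overwriting might look like this (a sketch with placeholder names):

# Import into Hive and replace the data of the existing Hive table
sqoop import \
  --connect jdbc:postgresql://db.example.com/sales --username sqoop_user -P \
  --table orders \
  --hive-import --hive-table orders --hive-overwrite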
Sqoop can import the data into a specific Hive partition if the --hive-partition-key and --hive-partition-value arguments are specified.
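A partitioned Hive import sketch (the partition key and value are hypothetical):

# Load the imported rows into the load_date='2024-01-01' partition of the Hive table
sqoop import \
  --connect jdbc:postgresql://db.example.com/sales --username sqoop_user -P \
  --table orders --hive-import --hive-table orders \
  --hive-partition-key load_date --hive-partition-value '2024-01-01'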
For more information on how to import data to Hive using Sqoop, see the Use Sqoop to transfer data between PostgreSQL and Hive article.
HBase
Sqoop can also import data into an HBase table if the --hbase-table option is specified. Each input table row will be transformed into an HBase Put operation for a row in the output table. The key for each row is taken from a column of the input.
By default, Sqoop uses the split-by column as the row key column. If it is not specified, Sqoop will try to identify the source table’s primary key column. You can manually specify the row key column with --hbase-row-key. Each resulting column will be placed in the same column family, which must be specified with --column-family.
If the target table and column family do not exist, the Sqoop job will exit with an error. Specifying the --hbase-create-table option allows Sqoop to create the target table and column family using default parameters from your HBase configuration.
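An HBase import sketch that combines the options described above (table, column family, and row key column names are placeholders):

# Import into the orders HBase table, creating it and the cf column family if they do not exist
sqoop import \
  --connect jdbc:postgresql://db.example.com/sales --username sqoop_user -P \
  --table orders \
  --hbase-table orders --column-family cf --hbase-row-key id \
  --hbase-create-table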
Sqoop currently serializes all values to HBase by converting each field to its string representation and then inserting the UTF-8 bytes of this string in the target cell.