Sqoop overview

Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle or a mainframe into the Hadoop Distributed File System (HDFS), transform the data with Hadoop MapReduce, and then export the data back into an RDBMS.

sqoop dark
Sqoop usage
sqoop light
Sqoop usage

Sqoop automates most of the process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

With Sqoop, you can import data from a relational database system or a mainframe into HDFS. The input of the import process is either a database table or mainframe datasets. For databases, Sqoop reads the table row-by-row into HDFS. For mainframe datasets, Sqoop reads records from each mainframe dataset into HDFS.

The result of the import process is a set of files containing a copy of the imported table or datasets. The import process is performed in parallel. For this reason, the output is stored in multiple files. These files are typically delimited text files (for example, the CSV format).

Processing during import and export

A by-product of the import process is a generated Java class that can encapsulate one row of the imported table. This class is used during the import process by Sqoop itself.

This class can serialize and deserialize data to and from the SequenceFile format. It can also parse the delimited text form of a record. This allows you to quickly develop MapReduce applications that use the HDFS-stored records in your processing pipeline. You are also free to parse the delimited data by yourself, using any other tools you prefer.

After manipulating the imported records (for example, with MapReduce or Hive) you may have a result dataset that you can then export back to the relational database. Sqoop’s export process reads a set of delimited text files from HDFS in parallel, parses them into records, and inserts them as new rows in a target database table to be consumed later by external applications or users.

Customization

Sqoop includes some other commands that allow you to inspect the database you are working with. For example, you can list available database schemas with the sqoop-list-databases tool and list tables within a schema using the sqoop-list-tables tool. Sqoop also includes a primitive SQL execution shell (the sqoop-eval tool).

Most aspects of the import, code generation, and export processes can be customized. For databases, you can control the specific row range or columns imported. You can specify particular delimiters and escape characters for the file-based representation of the data, as well as the file format used. You can also control the class or package names used in generated code.

Prerequisites

Before you can use Sqoop, a release of Hadoop must be installed and configured.

Sqoop is predominantly operated and tested on Linux.

Usage

Sqoop is a collection of related tools. To use Sqoop, you specify the tool you want to use and the arguments that control the tool.

If Sqoop is compiled from source, you can run Sqoop without a formal installation process by running the bin/sqoop executable. Users of a packaged deployment of Sqoop (such as an RPM-shipped with Apache Bigtop) can see this program available under /usr/bin/sqoop.

Sqoop comes with a help tool. To display the list of all available tools, run the sqoop help command. You can display help for a specific tool by running sqoop help <tool-name>; for example, sqoop help import. You can also add the --help argument to any command: sqoop import --help.

In addition to typing the sqoop <toolname>, you can use alias scripts that specify the sqoop-<toolname> syntax. For example, sqoop-import, sqoop-export, and other scripts select a specific tool.

Generic and specific arguments

To control the operation of each Sqoop tool, you can use generic and specific arguments.

You should supply the generic arguments -conf, -D, and so on after the tool name but before any tool-specific arguments (such as --connect). Note that generic Hadoop arguments are preceded by a single dash character (-), whereas tool-specific arguments start with two dashes (--), unless they are single character arguments such as -P.

Passing arguments through options file

When using Sqoop, the command line options that do not change from invocation to invocation can be put in an options file for convenience.

An options file is a text file where each line holds an option in the order that it appears otherwise on the command line. Option files allow specifying a single option on multiple lines by using the backslash character (\\) at the end of intermediate lines. Also supported are comments within option files that begin with the hash character (#).

Comments must be specified on a new line and cannot be mixed with option text. All comments and empty lines are ignored when option files are expanded. Unless options appear as quoted strings, any leading or trailing spaces are ignored. Quoted strings (if used) must not extend beyond the line on which they are specified.

Option files can be specified anywhere on the command line as long as the options within them follow the prescribed rules of options ordering. For instance, regardless of where the options are loaded from, they must follow the ordering such that generic options appear first, tool specific options next, and finally followed by options that are intended to be passed to child programs.

To specify an options file, simply create an options file in a convenient location and pass it to the command line via --options-file argument.

Whenever an options file is specified, it is expanded on the command line before the tool is invoked. You can specify more than one option files within the same invocation if needed.

For example, the following Sqoop invocation for import can be specified alternatively as shown below:

$ sqoop import --connect jdbc:mysql://localhost/db --username foo --table TEST
$ sqoop --options-file /users/homer/work/import.txt --table TEST

where the options file /users/homer/work/import.txt contains the following:

import
--connect
jdbc:mysql://localhost/db
--username
foo
Found a mistake? Seleсt text and press Ctrl+Enter to report it