Bulk loading via built-in MapReduce jobs

Let’s consider the simplest approach, based on the built-in ImportTsv and LoadIncrementalHFiles (also known as CompleteBulkLoad) utilities.

Step 1. Extract

In the first step, it is necessary to prepare sample data and upload it to HDFS. You can use the test file people_ages.csv, which contains randomly generated names and ages of one thousand people. Notice that there are three duplicate names: McGee Isabelle, Reid Janie, and Summers Blanche each occur twice. Let’s leave these names intact to demonstrate how the system behaves in such a case.
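
For reference, each line of the file contains a person's name and age separated by a comma. The fragment below only illustrates the format; the values are taken from the scan output shown later in this article, and the actual order of lines in the file may differ:

    Abbott Delia,62
    Abbott Howard,24
    Abbott Jack,29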

Do the following:

  1. Download the file to one of the servers of your HBase cluster. Make sure that the file has been downloaded successfully using the following command:

    $ ls -la ~

    The result looks similar to this:

    total 56
    drwx------. 5 dasha dasha   251 Nov 26 07:46 .
    drwxr-xr-x. 3 root  root     19 Aug 31 11:54 ..
    drwx------. 3 dasha dasha    17 Aug 31 15:06 .ansible
    -rw-------. 1 dasha dasha 14913 Nov 25 16:07 .bash_history
    -rw-r--r--. 1 dasha dasha    18 Apr  1  2020 .bash_logout
    -rw-r--r--. 1 dasha dasha   193 Apr  1  2020 .bash_profile
    -rw-r--r--. 1 dasha dasha   231 Apr  1  2020 .bashrc
    drwxrwxrwx. 2 dasha dasha    64 Nov 26 07:45 dasha
    -rw-rw-r--. 1 dasha dasha 17651 Nov 26 07:46 people_ages.csv
    drwx------. 2 dasha dasha    29 Sep 23 17:35 .ssh
  2. Create a directory for your user in HDFS if it does not exist yet:

    $ sudo -u hdfs hdfs dfs -mkdir /user/dasha
    $ hdfs dfs -ls /user

    The output looks similar to the following:

    Found 5 items
    drwxr-xr-x   - hdfs   hadoop          0 2021-11-26 09:56 /user/dasha
    drwx------   - hdfs   hadoop          0 2021-08-31 16:15 /user/hdfs
    drwxr-xr-x   - mapred hadoop          0 2021-08-31 16:22 /user/history
    drwxr-xr-x   - mapred mapred          0 2021-08-31 16:21 /user/mapred
    drwxr-xr-x   - yarn   yarn            0 2021-09-01 06:57 /user/yarn
  3. Grant wider access rights to your directory using the following command (for more information about file protection, see Protect files in HDFS):

    $ sudo -u hdfs hdfs dfs -chmod 777 /user/dasha
    $ hdfs dfs -ls /user

    The output looks similar to the following:

    Found 5 items
    drwxrwxrwx   - hdfs   hadoop          0 2021-11-26 09:57 /user/dasha
    drwx------   - hdfs   hadoop          0 2021-08-31 16:15 /user/hdfs
    drwxr-xr-x   - mapred hadoop          0 2021-08-31 16:22 /user/history
    drwxr-xr-x   - mapred mapred          0 2021-08-31 16:21 /user/mapred
    drwxr-xr-x   - yarn   yarn            0 2021-09-01 06:57 /user/yarn
  4. Copy the file from the local file system to HDFS:

    $ hdfs dfs -copyFromLocal ~/people_ages.csv /user/dasha
  5. Make sure that the file is located in HDFS:

    $ hdfs dfs -ls /user/dasha

    The result looks similar to this:

    Found 1 items
    -rw-r--r--   3 dasha hadoop      17651 2021-11-26 09:57 /user/dasha/people_ages.csv
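
    Optionally, you can also take a quick look at the beginning of the uploaded file to verify its content. This check is not required for the next steps:

    $ hdfs dfs -cat /user/dasha/people_ages.csv | head -n 5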

Step 2. Transform

In the second step, you should create a table in HBase and then use the built-in MapReduce job ImportTsv to transform the previously uploaded file into multiple HFiles that correspond to the structure of the created table:

  1. Run HBase shell:

    $ hbase shell
  2. Create a new table people_ages with one column family basic. Four split points will be used to generate five regions depending on the row key values:

    create 'people_ages', {NAME => 'basic', VERSIONS => 5}, {SPLITS => ['F', 'K', 'P', 'W']}

    The output looks similar to this:

    Created table people_ages
    Took 1.6675 seconds
    => Hbase::Table - people_ages
  3. Log out from HBase shell and run the MapReduce job ImportTsv:

    $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY,basic:age -Dimporttsv.bulk.output=test_output people_ages /user/dasha/people_ages.csv

    The arguments of this utility are described below.

    -Dimporttsv.separator

    The field separator used in the source file. In our example, it is a comma.

    -Dimporttsv.columns

    A full list of columns contained in the imported text. In our example, HBASE_ROW_KEY,basic:age means that the first column of the input text contains unique row keys, and the second one contains the values of the table column basic:age.

    -Dimporttsv.bulk.output

    A relative path to the HDFS directory where HFiles should be written. In our example, test_output resolves to the following path: /user/dasha/test_output.

    The last two positional arguments specify the name of the table (people_ages) and the path to the source file in HDFS (/user/dasha/people_ages.csv).

    TIP
    The input argument -Dimporttsv.bulk.output is optional. If you do not use it, data is loaded into HBase through the typical write path. So, you can also use the MapReduce job ImportTsv for plain loading of CSV files into a table (without performing the subsequent steps); see the example at the end of this list. But if you need to write bulk data directly to HFiles, this argument must be defined.
  4. Check the contents of the test_output directory:

    $ hdfs dfs -ls /user/dasha/test_output

    It contains a folder named basic according to the column family name in our table:

    Found 2 items
    -rw-r--r--   3 dasha hadoop          0 2021-11-26 09:59 /user/dasha/test_output/_SUCCESS
    drwxr-xr-x   - dasha hadoop          0 2021-11-26 09:59 /user/dasha/test_output/basic
  5. Check the contents of the basic folder:

    $ hdfs dfs -ls /user/dasha/test_output/basic

    It contains five HFiles, one file per region. These files, generated by ImportTsv, contain the data from the source file, sorted alphabetically by row key and stored in the HFile format:

    Found 5 items
    -rw-r--r--   3 dasha hadoop      16383 2021-11-26 09:59 /user/dasha/test_output/basic/0c17ccdbda7a465281ee063fa806ee44
    -rw-r--r--   3 dasha hadoop      15511 2021-11-26 09:59 /user/dasha/test_output/basic/13d31777206148dab3fbf3bf3c2c3ad0
    -rw-r--r--   3 dasha hadoop      14111 2021-11-26 09:59 /user/dasha/test_output/basic/5018aca2586840b5a983eff7a9b58e3f
    -rw-r--r--   3 dasha hadoop      17110 2021-11-26 09:59 /user/dasha/test_output/basic/c5f4176da2764378826652cb24b6cb05
    -rw-r--r--   3 dasha hadoop       9313 2021-11-26 09:59 /user/dasha/test_output/basic/de11b176c73a4aa29133c0064c1e5d52
    NOTE

    In this step, data is not loaded into the table yet. If you run HBase shell and apply the scan command to the table, it will return an empty result.
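
    As mentioned in the tip above, omitting the -Dimporttsv.bulk.output argument makes ImportTsv write the rows directly into the table through the regular write path instead of producing HFiles. A minimal sketch of such a call, using the same table and source file and shown here only for comparison (it is not part of this scenario):

    $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY,basic:age people_ages /user/dasha/people_ages.csv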

Step 3. Load

In this step, it is necessary to load the prepared HFiles into HBase using the built-in LoadIncrementalHFiles utility:

  1. In order to give HBase the permission to move the prepared files, change their owner to hbase:

    $ sudo -u hdfs hdfs dfs -chown -R hbase:hbase /user/dasha/test_output

    Make sure that the owner of the test_output folder and of all the files in it has been changed:

    $ hdfs dfs -ls /user/dasha/

    The output is similar to the following:

    Found 4 items
    drwx------   - dasha hadoop          0 2021-11-26 09:59 /user/dasha/.staging
    drwxrwxrwx   - dasha hadoop          0 2021-11-26 09:59 /user/dasha/hbase-staging
    -rw-r--r--   3 dasha hadoop      17651 2021-11-26 09:57 /user/dasha/people_ages.csv
    drwxr-xr-x   - hbase hbase           0 2021-11-26 09:59 /user/dasha/test_output

    The owner of the folder is changed as expected. Let’s take a look at the files in this folder:

    $ hdfs dfs -ls /user/dasha/test_output/basic

    The output is similar to the following:

    Found 5 items
    -rw-r--r--   3 hbase hbase      16383 2021-11-26 09:59 /user/dasha/test_output/basic/0c17ccdbda7a465281ee063fa806ee44
    -rw-r--r--   3 hbase hbase      15511 2021-11-26 09:59 /user/dasha/test_output/basic/13d31777206148dab3fbf3bf3c2c3ad0
    -rw-r--r--   3 hbase hbase      14111 2021-11-26 09:59 /user/dasha/test_output/basic/5018aca2586840b5a983eff7a9b58e3f
    -rw-r--r--   3 hbase hbase      17110 2021-11-26 09:59 /user/dasha/test_output/basic/c5f4176da2764378826652cb24b6cb05
    -rw-r--r--   3 hbase hbase       9313 2021-11-26 09:59 /user/dasha/test_output/basic/de11b176c73a4aa29133c0064c1e5d52
  2. Run the LoadIncrementalHFiles utility under the hbase user. Use the path to the previously created folder (test_output) and the table name (people_ages) as input parameters:

    $ sudo -u hbase hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/dasha/test_output people_ages
  3. Run HBase shell and get the number of rows in the people_ages table:

    count "people_ages"

    The output looks similar to this:

    997 row(s)
    Took 0.3057 seconds
    => 997

    As you can see, the table contains only 997 rows, not 1000 as in the uploaded file. The reason is the three duplicate names in the file, which we discussed earlier.
    CAUTION

    Be careful when selecting the source file field to be used as the table row keys. This field must contain unique values: duplicates are not loaded, even if the column family supports several value versions.

  4. Scan the people_ages table to check its content:

    scan 'people_ages'

    The first ten rows are listed below:

    ROW                          COLUMN+CELL
     Abbott Delia                column=basic:age, timestamp=1637918350575, value=62
     Abbott Howard               column=basic:age, timestamp=1637918350575, value=24
     Abbott Jack                 column=basic:age, timestamp=1637918350575, value=29
     Adams Clyde                 column=basic:age, timestamp=1637918350575, value=29
     Aguilar Myrtie              column=basic:age, timestamp=1637918350575, value=23
     Aguilar Terry               column=basic:age, timestamp=1637918350575, value=65
     Alexander Derrick           column=basic:age, timestamp=1637918350575, value=46
     Alexander Gregory           column=basic:age, timestamp=1637918350575, value=54
     Alexander Leon              column=basic:age, timestamp=1637918350575, value=42
     Allen Austin                column=basic:age, timestamp=1637918350575, value=34
  5. Check the distribution of the table data among the five regions using the following command:

    list_regions 'people_ages'

    The output looks similar to this:

                                            SERVER_NAME |                                                   REGION_NAME |  START_KEY |    END_KEY |  SIZE |   REQ |   LOCALITY |
     -------------------------------------------------- | ------------------------------------------------------------- | ---------- | ---------- | ----- | ----- | ---------- |
     bds-adh-1.ru-central1.internal,16020,1637912629824 |  people_ages,,1637920713477.bf509f13aa0dd59a5f4e9bcfc6cfbfab. |            |          F |     0 |   953 |        1.0 |
     bds-adh-3.ru-central1.internal,16020,1637912629738 | people_ages,F,1637920713477.b6693f3894a0feddba80ff15c06e2fbf. |          F |          K |     0 |   873 |        1.0 |
     bds-adh-3.ru-central1.internal,16020,1637912629738 | people_ages,K,1637920713477.50dcc6b90c412dab1f830a1b963d21e1. |          K |          P |     0 |   781 |        1.0 |
     bds-adh-2.ru-central1.internal,16020,1637912628845 | people_ages,P,1637920713477.80f4f1f471c9129f00d8f73701159145. |          P |          W |     0 |  1019 |        1.0 |
     bds-adh-2.ru-central1.internal,16020,1637912628845 | people_ages,W,1637920713477.ef6dfb19328842ce5c4ce883378329b0. |          W |            |     0 |   369 |        1.0 |
     5 rows
    Took 0.2080 seconds
  6. To check the presence of HFiles in HBase, log out from HBase shell and run the following command:

    $ hdfs dfs -ls /hbase/data/default/people_ages

    The output contains five region folders with HFiles and two service folders:

    Found 7 items
    drwxr-xr-x   - hbase hbase          0 2021-11-26 09:58 /hbase/data/default/people_ages/.tabledesc
    drwxr-xr-x   - hbase hbase          0 2021-11-26 09:58 /hbase/data/default/people_ages/.tmp
    drwxr-xr-x   - hbase hbase          0 2021-11-26 09:58 /hbase/data/default/people_ages/50dcc6b90c412dab1f830a1b963d21e1
    drwxr-xr-x   - hbase hbase          0 2021-11-26 09:58 /hbase/data/default/people_ages/80f4f1f471c9129f00d8f73701159145
    drwxr-xr-x   - hbase hbase          0 2021-11-26 09:58 /hbase/data/default/people_ages/b6693f3894a0feddba80ff15c06e2fbf
    drwxr-xr-x   - hbase hbase          0 2021-11-26 09:58 /hbase/data/default/people_ages/bf509f13aa0dd59a5f4e9bcfc6cfbfab
    drwxr-xr-x   - hbase hbase          0 2021-11-26 09:58 /hbase/data/default/people_ages/ef6dfb19328842ce5c4ce883378329b0
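
    If you want to look inside one of the generated files, you can use the HFile viewer tool shipped with HBase (org.apache.hadoop.hbase.io.hfile.HFilePrettyPrinter). A minimal sketch of its usage, assuming you substitute a real HFile path from the listing above (the region folder, then the basic column family folder, then the file name):

    $ hbase hfile -p -m -f /hbase/data/default/people_ages/<region>/basic/<hfile>

    Here the -p flag prints the key-value pairs and -m prints the HFile metadata.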

All HBase data in HDFS is stored in the /hbase/data/ folder and organized first by namespace and then by table name.
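
For example, the people_ages table was created without specifying a namespace, so it is placed under the default namespace. You can also inspect the namespace level of this hierarchy (the exact listing depends on your cluster):

    $ hdfs dfs -ls /hbase/data

Typically, this directory contains at least the default folder with user tables created without a namespace and the hbase folder with system tables.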
