Bulk loading via built-in MapReduce jobs
This article describes the simplest way to bulk load data into HBase using the built-in ImportTsv and LoadIncrementalHFiles (also known as CompleteBulkLoad) utilities.
Step 1. Extract
In the first step, it is necessary to prepare sample data and upload it to HDFS. You can use the test file people_ages.csv. This file contains randomly generated names and ages of one thousand people. Notice that there are three duplicate names: McGee Isabelle, Reid Janie, and Summers Blanche each occur twice. Let's leave these names intact to demonstrate how the system behaves in such a case.
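The exact contents of people_ages.csv are not reproduced in this article. Based on the column mapping used for ImportTsv below (the row key in the first field, the age in the second), the file is a simple two-column CSV; the following fragment is only a sketch of the expected format, with sample values taken from the scan output in Step 3:
Abbott Delia,62
Abbott Howard,24
Abbott Jack,29
Adams Clyde,29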
Do the following:
- Download the file to one of the servers of your HBase cluster. Make sure that the file has been downloaded successfully using the following command:
$ ls -la ~
The result looks similar to this:
total 56
drwx------. 5 dasha dasha   251 Nov 26 07:46 .
drwxr-xr-x. 3 root  root     19 Aug 31 11:54 ..
drwx------. 3 dasha dasha    17 Aug 31 15:06 .ansible
-rw-------. 1 dasha dasha 14913 Nov 25 16:07 .bash_history
-rw-r--r--. 1 dasha dasha    18 Apr  1  2020 .bash_logout
-rw-r--r--. 1 dasha dasha   193 Apr  1  2020 .bash_profile
-rw-r--r--. 1 dasha dasha   231 Apr  1  2020 .bashrc
drwxrwxrwx. 2 dasha dasha    64 Nov 26 07:45 dasha
-rw-rw-r--. 1 dasha dasha 17651 Nov 26 07:46 people_ages.csv
drwx------. 2 dasha dasha    29 Sep 23 17:35 .ssh
- Create a directory for your user in HDFS if it does not exist yet:
$ sudo -u hdfs hdfs dfs -mkdir /user/dasha
$ hdfs dfs -ls /user
The output looks similar to the following:
Found 5 items
drwxr-xr-x   - hdfs   hadoop          0 2021-11-26 09:56 /user/dasha
drwx------   - hdfs   hadoop          0 2021-08-31 16:15 /user/hdfs
drwxr-xr-x   - mapred hadoop          0 2021-08-31 16:22 /user/history
drwxr-xr-x   - mapred mapred          0 2021-08-31 16:21 /user/mapred
drwxr-xr-x   - yarn   yarn            0 2021-09-01 06:57 /user/yarn
- Expand access rights to your folder using the following command (see more information about file protection in Protect files in HDFS):
$ sudo -u hdfs hdfs dfs -chmod 777 /user/dasha
$ hdfs dfs -ls /user
The output looks similar to the following:
Found 5 items
drwxrwxrwx   - hdfs   hadoop          0 2021-11-26 09:57 /user/dasha
drwx------   - hdfs   hadoop          0 2021-08-31 16:15 /user/hdfs
drwxr-xr-x   - mapred hadoop          0 2021-08-31 16:22 /user/history
drwxr-xr-x   - mapred mapred          0 2021-08-31 16:21 /user/mapred
drwxr-xr-x   - yarn   yarn            0 2021-09-01 06:57 /user/yarn
- Copy the file from the local file system to HDFS:
$ hdfs dfs -copyFromLocal ~/people_ages.csv /user/dasha
- Make sure that the file is located in HDFS (an optional record-count check is sketched after this list):
$ hdfs dfs -ls /user/dasha
The result looks similar to this:
Found 1 items
-rw-r--r--   3 dasha hadoop      17651 2021-11-26 09:57 /user/dasha/people_ages.csv
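Optionally, you can also verify that the uploaded file contains the expected one thousand records. The following sketch simply streams the file from HDFS and counts its lines (it assumes the file has no header line); if the file matches the description above, the command prints 1000:
$ hdfs dfs -cat /user/dasha/people_ages.csv | wc -l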
Step 2. Transform
In the second step, you should create a table in HBase and then use the built-in MapReduce job ImportTsv to transform the previously uploaded file into multiple HFiles according to the preconfigured table:
- Run HBase shell:
$ hbase shell
- Create a new table people_ages with one column family basic. Four split points are used to generate five regions depending on the row key values:
create 'people_ages', {NAME => 'basic', VERSIONS => 5}, {SPLITS => ['F', 'K', 'P', 'W']}
The output looks similar to this:
Created table people_ages
Took 1.6675 seconds
=> Hbase::Table - people_ages
- Log out from HBase shell and run the MapReduce job ImportTsv:
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY,basic:age -Dimporttsv.bulk.output=test_output people_ages /user/dasha/people_ages.csv
The arguments used in this command are the following:
-Dimporttsv.separator: the field separator. In our example, it is a comma.
-Dimporttsv.columns: the full list of columns contained in the imported text. In our example, HBASE_ROW_KEY,basic:age means that the first column of the input text contains unique row keys and the second one contains the values of the table column basic:age.
-Dimporttsv.bulk.output: a relative path to the HDFS directory where HFiles should be written. In our example, test_output stands for the following path: /user/dasha/test_output.
The last positions in the command arguments indicate the name of the table (people_ages) and the path to the source file in HDFS (/user/dasha/people_ages.csv).
TIP: The argument -Dimporttsv.bulk.output is optional. If you omit it, data is loaded into HBase according to the typical write sequence, so you can also use the MapReduce job ImportTsv for straightforward loading from CSV files (without performing the subsequent steps). But if you need to write bulk data directly to HFiles, this argument must be defined.
- Check the contents of the test_output directory:
$ hdfs dfs -ls /user/dasha/test_output
It contains a folder named basic, corresponding to the column family name in our table:
Found 2 items
-rw-r--r--   3 dasha hadoop          0 2021-11-26 09:59 /user/dasha/test_output/_SUCCESS
drwxr-xr-x   - dasha hadoop          0 2021-11-26 09:59 /user/dasha/test_output/basic
- Check the contents of the basic folder:
$ hdfs dfs -ls /user/dasha/test_output/basic
It contains five HFiles, one file per region. These files, generated by ImportTsv, contain data from the source file, sorted alphabetically by row key and stored in the HFile format (a way to inspect an individual HFile is sketched after this list):
Found 5 items
-rw-r--r--   3 dasha hadoop      16383 2021-11-26 09:59 /user/dasha/test_output/basic/0c17ccdbda7a465281ee063fa806ee44
-rw-r--r--   3 dasha hadoop      15511 2021-11-26 09:59 /user/dasha/test_output/basic/13d31777206148dab3fbf3bf3c2c3ad0
-rw-r--r--   3 dasha hadoop      14111 2021-11-26 09:59 /user/dasha/test_output/basic/5018aca2586840b5a983eff7a9b58e3f
-rw-r--r--   3 dasha hadoop      17110 2021-11-26 09:59 /user/dasha/test_output/basic/c5f4176da2764378826652cb24b6cb05
-rw-r--r--   3 dasha hadoop       9313 2021-11-26 09:59 /user/dasha/test_output/basic/de11b176c73a4aa29133c0064c1e5d52
NOTE: In this step, data is not yet loaded into the table. If you run HBase shell and apply the scan command to the table, it returns an empty result.
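If you want to look inside the prepared files before loading them, you can optionally inspect one of them with the HFile tool shipped with HBase. This is only a sketch: the file name below is taken from the listing above and will differ in your environment; the -p option prints the stored key/value pairs and -m prints the file metadata:
$ hbase org.apache.hadoop.hbase.io.hfile.HFile -p -m -f /user/dasha/test_output/basic/0c17ccdbda7a465281ee063fa806ee44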
Step 3. Load
In this step, it is necessary to load the prepared HFiles into HBase using the built-in LoadIncrementalHFiles utility:
- To give HBase permission to move the prepared files, change their owner to hbase:
$ sudo -u hdfs hdfs dfs -chown -R hbase:hbase /user/dasha/test_output
Make sure the owner of the test_output folder and all files in it is changed:
$ hdfs dfs -ls /user/dasha/
The output is similar to the following:
Found 4 items
drwx------   - dasha hadoop          0 2021-11-26 09:59 /user/dasha/.staging
drwxrwxrwx   - dasha hadoop          0 2021-11-26 09:59 /user/dasha/hbase-staging
-rw-r--r--   3 dasha hadoop      17651 2021-11-26 09:57 /user/dasha/people_ages.csv
drwxr-xr-x   - hbase hbase           0 2021-11-26 09:59 /user/dasha/test_output
The owner of the folder is changed as expected. Let’s take a look at the files in this folder:
$ hdfs dfs -ls /user/dasha/test_output/basic
The output is similar to the following:
Found 5 items
-rw-r--r--   3 hbase hbase      16383 2021-11-26 09:59 /user/dasha/test_output/basic/0c17ccdbda7a465281ee063fa806ee44
-rw-r--r--   3 hbase hbase      15511 2021-11-26 09:59 /user/dasha/test_output/basic/13d31777206148dab3fbf3bf3c2c3ad0
-rw-r--r--   3 hbase hbase      14111 2021-11-26 09:59 /user/dasha/test_output/basic/5018aca2586840b5a983eff7a9b58e3f
-rw-r--r--   3 hbase hbase      17110 2021-11-26 09:59 /user/dasha/test_output/basic/c5f4176da2764378826652cb24b6cb05
-rw-r--r--   3 hbase hbase       9313 2021-11-26 09:59 /user/dasha/test_output/basic/de11b176c73a4aa29133c0064c1e5d52
- Run the LoadIncrementalHFiles utility under the hbase user. Use the path to the previously created folder (test_output) and the table name (people_ages) as input parameters:
$ sudo -u hbase hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/dasha/test_output people_ages
- Run HBase shell and get the number of rows in the people_ages table:
count "people_ages"
As you can see, the table contains only 997 rows instead of the 1000 records in the uploaded file. The reason is the three duplicate names in the file, which we discussed earlier.
997 row(s)
Took 0.3057 seconds
=> 997
CAUTION: Be careful when choosing which source file field to use for the table row keys. This field must contain unique values. Duplicates are not uploaded, even if the column family supports several value versions (a simple way to check a source file for duplicate keys is sketched at the end of this article).
- Scan the people_ages table to check its content:
scan 'people_ages'
The first ten rows are listed below:
ROW                  COLUMN+CELL
 Abbott Delia        column=basic:age, timestamp=1637918350575, value=62
 Abbott Howard       column=basic:age, timestamp=1637918350575, value=24
 Abbott Jack         column=basic:age, timestamp=1637918350575, value=29
 Adams Clyde         column=basic:age, timestamp=1637918350575, value=29
 Aguilar Myrtie      column=basic:age, timestamp=1637918350575, value=23
 Aguilar Terry       column=basic:age, timestamp=1637918350575, value=65
 Alexander Derrick   column=basic:age, timestamp=1637918350575, value=46
 Alexander Gregory   column=basic:age, timestamp=1637918350575, value=54
 Alexander Leon      column=basic:age, timestamp=1637918350575, value=42
 Allen Austin        column=basic:age, timestamp=1637918350575, value=34
- Check the distribution of the table data between the five regions using the following command:
list_regions 'people_ages'
The output looks similar to this:
SERVER_NAME                                         | REGION_NAME                                                    | START_KEY | END_KEY | SIZE | REQ  | LOCALITY |
--------------------------------------------------- | -------------------------------------------------------------- | --------- | ------- | ---- | ---- | -------- |
bds-adh-1.ru-central1.internal,16020,1637912629824  | people_ages,,1637920713477.bf509f13aa0dd59a5f4e9bcfc6cfbfab.   |           | F       | 0    | 953  | 1.0      |
bds-adh-3.ru-central1.internal,16020,1637912629738  | people_ages,F,1637920713477.b6693f3894a0feddba80ff15c06e2fbf.  | F         | K       | 0    | 873  | 1.0      |
bds-adh-3.ru-central1.internal,16020,1637912629738  | people_ages,K,1637920713477.50dcc6b90c412dab1f830a1b963d21e1.  | K         | P       | 0    | 781  | 1.0      |
bds-adh-2.ru-central1.internal,16020,1637912628845  | people_ages,P,1637920713477.80f4f1f471c9129f00d8f73701159145.  | P         | W       | 0    | 1019 | 1.0      |
bds-adh-2.ru-central1.internal,16020,1637912628845  | people_ages,W,1637920713477.ef6dfb19328842ce5c4ce883378329b0.  | W         |         | 0    | 369  | 1.0      |
5 rows
Took 0.2080 seconds
- To check the presence of HFiles in HBase, log out from HBase shell and run the following command:
$ hdfs dfs -ls /hbase/data/default/people_ages
The output contains five folders with HFiles and two service folders:
Found 7 items
drwxr-xr-x   - hbase hbase          0 2021-11-26 09:58 /hbase/data/default/people_ages/.tabledesc
drwxr-xr-x   - hbase hbase          0 2021-11-26 09:58 /hbase/data/default/people_ages/.tmp
drwxr-xr-x   - hbase hbase          0 2021-11-26 09:58 /hbase/data/default/people_ages/50dcc6b90c412dab1f830a1b963d21e1
drwxr-xr-x   - hbase hbase          0 2021-11-26 09:58 /hbase/data/default/people_ages/80f4f1f471c9129f00d8f73701159145
drwxr-xr-x   - hbase hbase          0 2021-11-26 09:58 /hbase/data/default/people_ages/b6693f3894a0feddba80ff15c06e2fbf
drwxr-xr-x   - hbase hbase          0 2021-11-26 09:58 /hbase/data/default/people_ages/bf509f13aa0dd59a5f4e9bcfc6cfbfab
drwxr-xr-x   - hbase hbase          0 2021-11-26 09:58 /hbase/data/default/people_ages/ef6dfb19328842ce5c4ce883378329b0
All HBase data in HDFS is stored in the /hbase/data/ folder, organized by namespace and then by table name.
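As a closing note on the CAUTION above: before running a bulk load, it can be useful to check the source file for duplicate row keys in advance. The following sketch assumes the same comma-separated layout used in this article (the row key in the first field); it extracts the first field, sorts it, and prints only the values that occur more than once:
$ hdfs dfs -cat /user/dasha/people_ages.csv | cut -d',' -f1 | sort | uniq -d
For the sample file used here, this should report the three duplicate names mentioned in Step 1: McGee Isabelle, Reid Janie, and Summers Blanche.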