Bulk loading in HBase
The typical write path in HBase includes the following steps: WAL → MemStore → HFiles. This path works well for regular workloads, but when large amounts of data are loaded at once, it can lead to significant performance degradation. Some of the possible problems are listed below:
- Increased volume requirements for the MemStore and WAL.
- Growing compaction and flush queues.
- As a consequence, high latency and SLA violations.
Bulk loading is designed to avoid these problems. It is the process of preparing data and writing it directly into HFiles in HDFS, bypassing the standard write path. This approach allows you to insert the same amount of data much faster. It can be useful for the initial loading of datasets into HBase from external sources or for the incremental insertion of large data batches, for example, during long nightly batch jobs.
Bulk loading is an ETL process that includes three main steps:
- Extract. At this step, data is extracted from an external source and placed into HDFS. The input data can come from dumps of other databases or from log files and can be presented as plain text, CSV, or another format. HBase does not manage this part of the process.
- Transform. At this step, the source data is transformed into HFiles according to the structure of the predefined HBase table: one HFile is created per region. The data is sorted lexicographically by row key and distributed between files according to the specified split points. This step usually requires a MapReduce job that divides the data into the right number of partitions, sorts it, produces key/value pairs that match the HFile format requirements, and so on. A minimal driver sketch is shown after this list.
- Load. At this step, the HFiles generated earlier are moved into HBase. Each file is loaded into the relevant region through the Region Server that serves it. A programmatic example is given at the end of this section.
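To make the Transform step more concrete, below is a minimal sketch of a MapReduce driver built around HFileOutputFormat2.configureIncrementalLoad(), which sets up partitioning, sorting, and the HFile writer so that one HFile is produced per region of the target table. The table name my_table, the column family cf, the column col, and the single-value CSV layout are assumptions made for this example, not requirements of HBase.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadTransform {

    /** Maps a CSV line "rowkey,value" to a Put keyed by the row key (assumed layout). */
    public static class CsvMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(",", 2);
            byte[] rowKey = Bytes.toBytes(parts[0]);
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(parts[1]));
            context.write(new ImmutableBytesWritable(rowKey), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "bulk-load-transform");
        job.setJarByClass(BulkLoadTransform.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(CsvMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // source data in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory for HFiles

        TableName tableName = TableName.valueOf("my_table");    // hypothetical target table
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(tableName);
             RegionLocator regionLocator = connection.getRegionLocator(tableName)) {
            // Configures total-order partitioning and sorting based on the current
            // region boundaries, so the job writes one HFile per region.
            HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator);
        }
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```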
There are many ways to perform bulk loading in HBase, including built-in MapReduce jobs (such as ImportTsv), custom MapReduce jobs, Spark jobs, and others.
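To complete the sketch above, the Load step can also be triggered programmatically. The example below assumes HBase 2.x, where the org.apache.hadoop.hbase.tool.BulkLoadHFiles tool is available (older releases use LoadIncrementalHFiles or the completebulkload command-line tool); the table name and the HFile directory continue the assumptions from the Transform sketch.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.tool.BulkLoadHFiles;

public class BulkLoadComplete {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Directory with the HFiles produced by the Transform job
        // (one subdirectory per column family), i.e. that job's output path.
        Path hfileDir = new Path(args[0]);
        // Hands each HFile over to the Region Server that serves the
        // corresponding region; the data does not pass through the WAL or MemStore.
        BulkLoadHFiles.create(conf).bulkLoad(TableName.valueOf("my_table"), hfileDir);
    }
}
```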