Data backup and restore in HBase
Overview
The backup functionality for HBase was introduced in ADH version 4.1.0 (HBase 2.6).
This functionality creates "cold" copies of data, which is a critical element of a business continuity and disaster recovery strategy. Unlike replication, which maintains "hot" copies, backups are stored in external storage and require a manual recovery procedure, providing protection against catastrophic failures.
Key concepts
The backup functionality in HBase uses the following terms and concepts (a short end-to-end sketch follows the list):
- Full backup — contains the full state of a table at the time of creation. It is the basis for subsequent incremental backups.
- Incremental backup — contains only the changes from WAL relative to the last full or incremental backup. It is efficient in terms of volume and allows you to restore data at a specific point in time.
- Backup set — a named group of tables that simplifies the management of backup and restore operations. It is recommended to store the backups of each set in a dedicated path.
- Backup ID — a unique identifier (based on Unix epoch time) assigned to each backup session.
- Increment merge — the ability to combine several successful incremental backups into one.
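The sketch below ties these concepts together. The set name, table name, and path are illustrative; the commands themselves are covered in detail in the Basic commands section:

  $ hbase backup set add demo_set demo_table                         # define a backup set
  $ hbase backup create full hdfs:///tmp/backups -s demo_set         # full backup; prints a backup ID such as backup_1706871318127
  $ hbase backup create incremental hdfs:///tmp/backups -s demo_set  # incremental backup on top of the full one
  $ hbase backup merge <backup_id1>,<backup_id2>                     # increment merge of two incremental backups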
Architectural deployment strategies
The following architectural deployment strategies can be used for backing up data in HBase (see the destination examples after this list):
- Inside the cluster — backups are stored on the same HDFS cluster. This option is only suitable for testing and does not provide fault tolerance.
- Dedicated cluster — backups are copied to a separate HDFS cluster, often with a cheaper hardware configuration. Recommended for production environments, as it provides geographical separation.
- Cloud storage — backups are saved to the cloud (for example, S3) or to an HDFS-compatible storage system (via S3A, WebHDFS). This is the most fault-tolerant option.
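As an illustration, the backup command is the same for all three strategies; only the destination URI changes. The hosts, bucket name, and paths below are placeholders, and the S3A variant assumes that the S3A connector and credentials are already configured on the cluster:

  $ hbase backup create full hdfs:///backups/hbase -s <set_name>                           # inside the cluster (testing only)
  $ hbase backup create full hdfs://backup-nn.example.com:8020/backups/hbase -s <set_name> # dedicated cluster
  $ hbase backup create full s3a://backup-bucket/hbase -s <set_name>                       # cloud storage via S3A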
Configuration
To configure the backup feature in HBase, do the following:
- Go to the ADCM UI and select your ADH cluster.
- Navigate to Services → HBase → Primary configuration and toggle Show advanced.
- Open the hbase-site.xml section and set the hbase.backup.enable parameter to true.
- Review the following parameters and make sure their values contain the corresponding classes (see the example hbase-site.xml snippet after this procedure). If a parameter already has a value, keep it and append the backup class, separating the entries with commas; this applies to every parameter in this list:
  - hbase.master.logcleaner.plugins — org.apache.hadoop.hbase.backup.master.BackupLogCleaner.
  - hbase.procedure.master.classes — org.apache.hadoop.hbase.backup.master.LogRollMasterProcedureManager.
  - hbase.procedure.regionserver.classes — org.apache.hadoop.hbase.backup.regionserver.LogRollRegionServerProcedureManager.
  - hbase.coprocessor.region.classes — org.apache.hadoop.hbase.backup.BackupObserver.
  - hbase.coprocessor.master.classes — org.apache.hadoop.hbase.backup.BackupMasterObserver.
  - hbase.master.hfilecleaner.plugins — org.apache.hadoop.hbase.backup.BackupHFileCleaner.
- Save the configuration by clicking Save → Create and restart the service by clicking Actions → Reconfig and graceful restart.
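For reference, the resulting entries in hbase-site.xml should look similar to the following. The snippet assumes the parameters had no previous values; otherwise, each <value> element would contain a comma-separated list:

  <property>
      <name>hbase.backup.enable</name>
      <value>true</value>
  </property>
  <property>
      <name>hbase.master.logcleaner.plugins</name>
      <value>org.apache.hadoop.hbase.backup.master.BackupLogCleaner</value>
  </property>
  <property>
      <name>hbase.procedure.master.classes</name>
      <value>org.apache.hadoop.hbase.backup.master.LogRollMasterProcedureManager</value>
  </property>
  <property>
      <name>hbase.procedure.regionserver.classes</name>
      <value>org.apache.hadoop.hbase.backup.regionserver.LogRollRegionServerProcedureManager</value>
  </property>
  <property>
      <name>hbase.coprocessor.region.classes</name>
      <value>org.apache.hadoop.hbase.backup.BackupObserver</value>
  </property>
  <property>
      <name>hbase.coprocessor.master.classes</name>
      <value>org.apache.hadoop.hbase.backup.BackupMasterObserver</value>
  </property>
  <property>
      <name>hbase.master.hfilecleaner.plugins</name>
      <value>org.apache.hadoop.hbase.backup.BackupHFileCleaner</value>
  </property>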
Limitations
The backup functionality in HBase has the following limitations:
- Sequential execution. You cannot run multiple backup or restore operations at the same time.
- No cancellation. A running backup or restore operation cannot be canceled.
- Superuser rights. All operations are performed on behalf of the HBase superuser.
- Restoring to an online cluster only. The target cluster for recovery must be up and running.
- One destination. A backup can be saved to only one location; copying it elsewhere requires manual actions.
- WAL volume growth. Incremental backups prevent the deletion of WAL files until the next backup, which can increase the amount of data stored in HDFS. Monitor the backup schedule and storage usage closely (see the example after this list).
- Transparent data encryption (TDE). The functionality has not been tested on clusters with data encryption enabled.
- Restoring on another cluster. A backup copied to another cluster can be restored there, but after that a new full backup is required to start a backup history on that cluster.
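To keep the WAL volume growth under control, you can periodically check the size of the archived WAL directory. The path below assumes the default HBase root directory of /hbase:

  $ hdfs dfs -du -h /hbase/oldWALs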
Basic commands
The HBase backup functionality relies on the following basic commands:
- Create sets of tables for granular backup configuration:
  $ hbase backup set add <set_name1> <table1>,<table2>
  $ hbase backup set add <set_name2> <table3>
- Create full and incremental backups for different sets:
  $ hbase backup create full hdfs://<nn>:8020/tmp/backups-path1/ -w 3 -s <set_name1> -d
  $ hbase backup create full hdfs://<nn>:8020/tmp/backups-path2/ -w 3 -s <set_name2> -d
  $ hbase backup create incremental hdfs://<nn>:8020/tmp/backups-path1/ -w 3 -s <set_name1> -d
  $ hbase backup create incremental hdfs://<nn>:8020/tmp/backups-path2/ -w 3 -s <set_name2> -d
- View the backup history:
  $ hbase backup history
- Merge incremental backups:
  $ hbase backup merge <backup_increment_id1>,<backup_increment_id2>,<backup_increment_id3>
  The IDs can be found in the history or in the root directory of the backups on the filesystem.
- Restore (see also the table mapping example after this list):
  $ hbase restore hdfs://<nn>:8020/tmp/backups-path2/ <backup_id> -o -s <set_name2> -d
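In addition to restoring a whole backup set, the restore command accepts an explicit table list with an optional mapping to new table names, which is convenient when you want to restore alongside live tables instead of overwriting them. The table names below are illustrative, and the -t/-m options are assumed to behave as in the upstream HBase backup CLI:

  $ hbase restore hdfs://<nn>:8020/tmp/backups-path1/ <backup_id> -t <table1>,<table2> -m <table1>_restored,<table2>_restored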
Examples
HBase
This example demonstrates the backup functionality with pure HBase, without Phoenix.
- Create the original table:
  $ hbase shell
  create 'user_activity', 'cf1', 'cf2'
  put 'user_activity', 'user1', 'cf1:name', 'Alice'
  put 'user_activity', 'user1', 'cf1:email', 'alice@email.com'
  put 'user_activity', 'user1', 'cf2:last_login', '2024-01-15'
  put 'user_activity', 'user1', 'cf2:login_count', '5'
  put 'user_activity', 'user2', 'cf1:name', 'Bob'
  put 'user_activity', 'user2', 'cf1:email', 'bob@email.com'
  put 'user_activity', 'user2', 'cf2:last_login', '2024-01-14'
  put 'user_activity', 'user2', 'cf2:login_count', '3'
  put 'user_activity', 'user3', 'cf1:name', 'Charlie'
  put 'user_activity', 'user3', 'cf1:email', 'charlie@email.com'
  put 'user_activity', 'user3', 'cf2:last_login', '2024-01-13'
  put 'user_activity', 'user3', 'cf2:login_count', '7'
- Exit the HBase shell and create a backup set:
  exit
  $ hbase backup set add user_backup_set user_activity
- Create a full backup:
  $ hbase backup create full hdfs:///tmp/hbase-backup -s user_backup_set -w 3
- Add new data:
  $ hbase shell
  put 'user_activity', 'user4', 'cf1:name', 'Diana'
  put 'user_activity', 'user4', 'cf1:email', 'diana@email.com'
  put 'user_activity', 'user4', 'cf2:last_login', '2024-01-16'
  put 'user_activity', 'user4', 'cf2:login_count', '2'
  put 'user_activity', 'user1', 'cf2:login_count', '6'
  put 'user_activity', 'user2', 'cf2:last_login', '2024-01-16'
- Exit the HBase shell and make an incremental backup:
  exit
  $ hbase backup create incremental hdfs:///tmp/hbase-backup -s user_backup_set -w 3
- Add more data and create a second incremental backup:
  $ hbase shell
  put 'user_activity', 'user5', 'cf1:name', 'Eve'
  put 'user_activity', 'user5', 'cf1:email', 'eve@email.com'
  deleteall 'user_activity', 'user3'
  put 'user_activity', 'user1', 'cf2:login_count', '7'
  exit
  $ hbase backup create incremental hdfs:///tmp/hbase-backup -s user_backup_set -w 3
- Check the history and the description of a specific backup:
  $ hbase backup history -s user_backup_set
  $ hbase backup describe <backup_id>
- Simulate data loss:
  $ hbase shell
  disable 'user_activity'
  drop 'user_activity'
- Exit the HBase shell and restore the data:
  exit
  $ hbase restore hdfs:///tmp/hbase-backup <backup_id> -s user_backup_set
  You can restore the data from any point in time at which one of the previous backups was made.
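To verify the result, you can scan the restored table; the rows you see depend on which backup ID you restored from:

  $ hbase shell
  scan 'user_activity'
  count 'user_activity'
  exit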
Phoenix
This example demonstrates the backup functionality with Phoenix.
- Run the /usr/lib/phoenix/bin/sqlline.py script and create the original table:
  CREATE TABLE USER_ACTIVITY (
      USER_ID VARCHAR PRIMARY KEY,
      NAME VARCHAR,
      EMAIL VARCHAR,
      LAST_LOGIN DATE,
      LOGIN_COUNT INTEGER
  ) COMPRESSION='SNAPPY', SALT_BUCKETS=4;
- Load values into the table:
  UPSERT INTO USER_ACTIVITY VALUES ('user1', 'Alice', 'alice@email.com', TO_DATE('2024-01-15'), 5);
  UPSERT INTO USER_ACTIVITY VALUES ('user2', 'Bob', 'bob@email.com', TO_DATE('2024-01-14'), 3);
  UPSERT INTO USER_ACTIVITY VALUES ('user3', 'Charlie', 'charlie@email.com', TO_DATE('2024-01-13'), 7);
- Press Ctrl+D and create a backup set. Phoenix tables require additional system tables in the backup, and you may need more of them if, for example, indexes are used. For this reason, it is better to manage backups for Phoenix tables in a single set:
  $ hbase backup set add phoenix_backup_set USER_ACTIVITY,SYSTEM.CATALOG,SYSTEM.SEQUENCE,SYSTEM.STATS
- Create a full backup:
  $ hbase backup create full hdfs:///tmp/hbase-backup -s phoenix_backup_set -w 3
- Run the /usr/lib/phoenix/bin/sqlline.py script again and add new data:
  UPSERT INTO USER_ACTIVITY VALUES ('user4', 'Alice', 'alice@email.com', TO_DATE('2024-01-15'), 5);
  UPSERT INTO USER_ACTIVITY VALUES ('user5', 'Bob', 'bob@email.com', TO_DATE('2024-01-14'), 3);
  UPSERT INTO USER_ACTIVITY VALUES ('user6', 'Charlie', 'charlie@email.com', TO_DATE('2024-01-13'), 7);
- Press Ctrl+D and make an incremental backup:
  $ hbase backup create incremental hdfs:///tmp/hbase-backup -s phoenix_backup_set -w 3
- Run the /usr/lib/phoenix/bin/sqlline.py script again and add more new data:
  UPSERT INTO USER_ACTIVITY VALUES ('user7', 'Alice', 'alice@email.com', TO_DATE('2024-01-15'), 5);
  UPSERT INTO USER_ACTIVITY VALUES ('user8', 'Bob', 'bob@email.com', TO_DATE('2024-01-14'), 3);
  UPSERT INTO USER_ACTIVITY VALUES ('user9', 'Charlie', 'charlie@email.com', TO_DATE('2024-01-13'), 7);
- Press Ctrl+D and make a second incremental backup:
  $ hbase backup create incremental hdfs:///tmp/hbase-backup -s phoenix_backup_set -w 3
- Check the history and the description of a specific backup:
  $ hbase backup history -s phoenix_backup_set
  $ hbase backup describe <backup_id>
- Run the /usr/lib/phoenix/bin/sqlline.py script again and simulate data loss:
  DROP TABLE USER_ACTIVITY;
- Press Ctrl+D and recover the data:
  $ hbase restore hdfs:///tmp/hbase-backup <backup_id> -s phoenix_backup_set
  You can restore the data from any point in time at which one of the previous backups was made.
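To verify the recovery, run the /usr/lib/phoenix/bin/sqlline.py script once more and query the restored table; the row count depends on the backup ID you chose:

  SELECT COUNT(*) FROM USER_ACTIVITY;
  SELECT * FROM USER_ACTIVITY LIMIT 5;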