Data backup and restore in HBase
Overview
The backup functionality for HBase was introduced in ADH version 4.1.0 (HBase 2.6).
This functionality creates "cold" copies of data, which is a critical element of a business continuity and disaster recovery strategy. Unlike replication, which maintains "hot" copies, backups are stored in external storage and require a manual recovery procedure, providing protection against catastrophic failures.
Key concepts
The backup functionality in HBase uses the following terms and concepts (a short end-to-end sketch follows the list):
- Full backup — contains the full state of a table at the time of creation. It is the basis for subsequent incremental backups.
- Incremental backup — contains only the changes from WAL relative to the last full or incremental backup. It is efficient in terms of volume and allows you to restore data at a specific point in time.
- Backup set — a named group of tables that simplifies the management of backup and restore operations. It is recommended to store the backups of each set in a dedicated path.
- Backup ID — a unique identifier (based on Unix epoch time) assigned to each backup session.
- Increment merge — the ability to combine several successful incremental backups into one.
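The sketch below ties these concepts together. The set name, table name, and path are illustrative; the commands themselves are covered in detail in the Basic commands section:

  $ hbase backup set add demo_set demo_table                         # define a backup set
  $ hbase backup create full hdfs:///tmp/backups -s demo_set         # full backup; prints a backup ID such as backup_1706871318127
  $ hbase backup create incremental hdfs:///tmp/backups -s demo_set  # incremental backup on top of the full one
  $ hbase backup merge <backup_id1>,<backup_id2>                     # increment merge of two incremental backups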
Architectural deployment strategies
The following architectural deployment strategies can be used for backing up data in HBase (see the destination examples after this list):
- Inside the cluster — backups are stored on the same HDFS cluster. This option is only suitable for testing and does not provide fault tolerance.
- Dedicated cluster — backups are copied to a separate HDFS cluster, often with a cheaper hardware configuration. Recommended for production environments, as it provides geographical separation.
- Cloud storage — backups are saved to the cloud (for example, S3) or to an HDFS-compatible storage system (via S3A, WebHDFS). This is the most fault-tolerant option.
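As an illustration, the backup command is the same for all three strategies; only the destination URI changes. The hosts, bucket name, and paths below are placeholders, and the S3A variant assumes that the S3A connector and credentials are already configured on the cluster:

  $ hbase backup create full hdfs:///backups/hbase -s <set_name>                           # inside the cluster (testing only)
  $ hbase backup create full hdfs://backup-nn.example.com:8020/backups/hbase -s <set_name> # dedicated cluster
  $ hbase backup create full s3a://backup-bucket/hbase -s <set_name>                       # cloud storage via S3A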
Configuration
To configure the backup feature in HBase, do the following:
- Go to the ADCM UI and select your ADH cluster.
- Navigate to Services → HBase → Primary configuration and toggle Show advanced.
- Open the hbase-site.xml section and set the hbase.backup.enable parameter to true.
- Review the following parameters and make sure their values contain the corresponding classes (see the example hbase-site.xml snippet after this procedure). If a parameter already has a value, keep it and append the backup class, separating the entries with commas; this applies to every parameter in this list:
  - hbase.master.logcleaner.plugins — org.apache.hadoop.hbase.backup.master.BackupLogCleaner.
  - hbase.procedure.master.classes — org.apache.hadoop.hbase.backup.master.LogRollMasterProcedureManager.
  - hbase.procedure.regionserver.classes — org.apache.hadoop.hbase.backup.regionserver.LogRollRegionServerProcedureManager.
  - hbase.coprocessor.region.classes — org.apache.hadoop.hbase.backup.BackupObserver.
  - hbase.coprocessor.master.classes — org.apache.hadoop.hbase.backup.BackupMasterObserver.
  - hbase.master.hfilecleaner.plugins — org.apache.hadoop.hbase.backup.BackupHFileCleaner.
- Save the configuration by clicking Save → Create and restart the service by clicking Actions → Reconfig and graceful restart.
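For reference, the resulting entries in hbase-site.xml should look similar to the following. The snippet assumes the parameters had no previous values; otherwise, each <value> element would contain a comma-separated list:

  <property>
      <name>hbase.backup.enable</name>
      <value>true</value>
  </property>
  <property>
      <name>hbase.master.logcleaner.plugins</name>
      <value>org.apache.hadoop.hbase.backup.master.BackupLogCleaner</value>
  </property>
  <property>
      <name>hbase.procedure.master.classes</name>
      <value>org.apache.hadoop.hbase.backup.master.LogRollMasterProcedureManager</value>
  </property>
  <property>
      <name>hbase.procedure.regionserver.classes</name>
      <value>org.apache.hadoop.hbase.backup.regionserver.LogRollRegionServerProcedureManager</value>
  </property>
  <property>
      <name>hbase.coprocessor.region.classes</name>
      <value>org.apache.hadoop.hbase.backup.BackupObserver</value>
  </property>
  <property>
      <name>hbase.coprocessor.master.classes</name>
      <value>org.apache.hadoop.hbase.backup.BackupMasterObserver</value>
  </property>
  <property>
      <name>hbase.master.hfilecleaner.plugins</name>
      <value>org.apache.hadoop.hbase.backup.BackupHFileCleaner</value>
  </property>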
Limitations
The backup functionality in HBase has the following limitations:
- Sequential execution. You cannot run multiple backup or restore operations at the same time.
- No cancellation. A running backup or restore operation cannot be canceled.
- Superuser rights. All operations are performed on behalf of the HBase superuser.
- Restoring to an online cluster only. The target cluster for recovery must be up and running.
- One destination. A backup can be saved to only one location; copying it elsewhere requires manual actions.
- WAL volume growth. Incremental backups prevent the deletion of WAL files until the next backup, which can increase the amount of data stored in HDFS. Monitor the backup schedule and storage usage closely (see the example after this list).
- Transparent data encryption (TDE). The functionality has not been tested on clusters with data encryption enabled.
- Restoring on another cluster. A backup copied to another cluster can be restored there, but after that a new full backup is required to start a backup history on that cluster.
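To keep the WAL volume growth under control, you can periodically check the size of the archived WAL directory. The path below assumes the default HBase root directory of /hbase:

  $ hdfs dfs -du -h /hbase/oldWALs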
Basic commands
The HBase backup functionality relies on the following basic commands:
- Create sets of tables for granular backup configuration:
  $ hbase backup set add <set_name1> <table1>,<table2>
  $ hbase backup set add <set_name2> <table3>
- Create full and incremental backups for different sets:
  $ hbase backup create full hdfs://<nn>:8020/tmp/backups-path1/ -w 3 -s <set_name1> -d
  $ hbase backup create full hdfs://<nn>:8020/tmp/backups-path2/ -w 3 -s <set_name2> -d
  $ hbase backup create incremental hdfs://<nn>:8020/tmp/backups-path1/ -w 3 -s <set_name1> -d
  $ hbase backup create incremental hdfs://<nn>:8020/tmp/backups-path2/ -w 3 -s <set_name2> -d
- View the backup history:
  $ hbase backup history
- Merge incremental backups:
  $ hbase backup merge <backup_increment_id1>,<backup_increment_id2>,<backup_increment_id3>
  The IDs can be found in the history or in the root directory of the backups on the filesystem.
- Restore (see also the table mapping example after this list):
  $ hbase restore hdfs://<nn>:8020/tmp/backups-path2/ <backup_id> -o -s <set_name2> -d
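In addition to restoring a whole backup set, the restore command accepts an explicit table list with an optional mapping to new table names, which is convenient when you want to restore alongside live tables instead of overwriting them. The table names below are illustrative, and the -t/-m options are assumed to behave as in the upstream HBase backup CLI:

  $ hbase restore hdfs://<nn>:8020/tmp/backups-path1/ <backup_id> -t <table1>,<table2> -m <table1>_restored,<table2>_restored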
Examples
HBase
This example demonstrates the backup functionality with pure HBase, without Phoenix.
- Create the original table:
  $ hbase shell
  create 'user_activity', 'cf1', 'cf2'
  put 'user_activity', 'user1', 'cf1:name', 'Alice'
  put 'user_activity', 'user1', 'cf1:email', 'alice@email.com'
  put 'user_activity', 'user1', 'cf2:last_login', '2024-01-15'
  put 'user_activity', 'user1', 'cf2:login_count', '5'
  put 'user_activity', 'user2', 'cf1:name', 'Bob'
  put 'user_activity', 'user2', 'cf1:email', 'bob@email.com'
  put 'user_activity', 'user2', 'cf2:last_login', '2024-01-14'
  put 'user_activity', 'user2', 'cf2:login_count', '3'
  put 'user_activity', 'user3', 'cf1:name', 'Charlie'
  put 'user_activity', 'user3', 'cf1:email', 'charlie@email.com'
  put 'user_activity', 'user3', 'cf2:last_login', '2024-01-13'
  put 'user_activity', 'user3', 'cf2:login_count', '7'
- Exit the HBase shell and create a backup set:
  exit
  $ hbase backup set add user_backup_set user_activity
- Create a full backup:
  $ hbase backup create full hdfs:///tmp/hbase-backup -s user_backup_set -w 3
- Add new data:
  $ hbase shell
  put 'user_activity', 'user4', 'cf1:name', 'Diana'
  put 'user_activity', 'user4', 'cf1:email', 'diana@email.com'
  put 'user_activity', 'user4', 'cf2:last_login', '2024-01-16'
  put 'user_activity', 'user4', 'cf2:login_count', '2'
  put 'user_activity', 'user1', 'cf2:login_count', '6'
  put 'user_activity', 'user2', 'cf2:last_login', '2024-01-16'
- Exit the HBase shell and make an incremental backup:
  exit
  $ hbase backup create incremental hdfs:///tmp/hbase-backup -s user_backup_set -w 3
- Add more data and create a second incremental backup:
  $ hbase shell
  put 'user_activity', 'user5', 'cf1:name', 'Eve'
  put 'user_activity', 'user5', 'cf1:email', 'eve@email.com'
  deleteall 'user_activity', 'user3'
  put 'user_activity', 'user1', 'cf2:login_count', '7'
  exit
  $ hbase backup create incremental hdfs:///tmp/hbase-backup -s user_backup_set -w 3
- Check the history and the description of a specific backup:
  $ hbase backup history -s user_backup_set
  $ hbase backup describe <backup_id>
- Simulate data loss:
  $ hbase shell
  disable 'user_activity'
  drop 'user_activity'
- Exit the HBase shell and restore the data:
  exit
  $ hbase restore hdfs:///tmp/hbase-backup <backup_id> -s user_backup_set
  You can restore the data from any point in time at which one of the previous backups was made.
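To verify the result, you can scan the restored table; the rows you see depend on which backup ID you restored from:

  $ hbase shell
  scan 'user_activity'
  count 'user_activity'
  exit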
Phoenix
This example demonstrates the backup functionality with Phoenix.
- Run the /usr/lib/phoenix/bin/sqlline.py script and create the original table:
  CREATE TABLE USER_ACTIVITY (
      USER_ID VARCHAR PRIMARY KEY,
      NAME VARCHAR,
      EMAIL VARCHAR,
      LAST_LOGIN DATE,
      LOGIN_COUNT INTEGER
  ) COMPRESSION='SNAPPY', SALT_BUCKETS=4;
- Load values into the table:
  UPSERT INTO USER_ACTIVITY VALUES ('user1', 'Alice', 'alice@email.com', TO_DATE('2024-01-15'), 5);
  UPSERT INTO USER_ACTIVITY VALUES ('user2', 'Bob', 'bob@email.com', TO_DATE('2024-01-14'), 3);
  UPSERT INTO USER_ACTIVITY VALUES ('user3', 'Charlie', 'charlie@email.com', TO_DATE('2024-01-13'), 7);
- Press Ctrl+D and create a backup set. Phoenix tables require additional system tables in the backup, and you may need more of them if, for example, indexes are used. For this reason, it is better to manage backups for Phoenix tables in a single set:
  $ hbase backup set add phoenix_backup_set USER_ACTIVITY,SYSTEM.CATALOG,SYSTEM.SEQUENCE,SYSTEM.STATS
- Create a full backup:
  $ hbase backup create full hdfs:///tmp/hbase-backup -s phoenix_backup_set -w 3
- Run the /usr/lib/phoenix/bin/sqlline.py script again and add new data:
  UPSERT INTO USER_ACTIVITY VALUES ('user4', 'Alice', 'alice@email.com', TO_DATE('2024-01-15'), 5);
  UPSERT INTO USER_ACTIVITY VALUES ('user5', 'Bob', 'bob@email.com', TO_DATE('2024-01-14'), 3);
  UPSERT INTO USER_ACTIVITY VALUES ('user6', 'Charlie', 'charlie@email.com', TO_DATE('2024-01-13'), 7);
- Press Ctrl+D and make an incremental backup:
  $ hbase backup create incremental hdfs:///tmp/hbase-backup -s phoenix_backup_set -w 3
- Run the /usr/lib/phoenix/bin/sqlline.py script again and add more new data:
  UPSERT INTO USER_ACTIVITY VALUES ('user7', 'Alice', 'alice@email.com', TO_DATE('2024-01-15'), 5);
  UPSERT INTO USER_ACTIVITY VALUES ('user8', 'Bob', 'bob@email.com', TO_DATE('2024-01-14'), 3);
  UPSERT INTO USER_ACTIVITY VALUES ('user9', 'Charlie', 'charlie@email.com', TO_DATE('2024-01-13'), 7);
- Press Ctrl+D and make a second incremental backup:
  $ hbase backup create incremental hdfs:///tmp/hbase-backup -s phoenix_backup_set -w 3
- Check the history and the description of a specific backup:
  $ hbase backup history -s phoenix_backup_set
  $ hbase backup describe <backup_id>
- Run the /usr/lib/phoenix/bin/sqlline.py script again and simulate data loss:
  DROP TABLE USER_ACTIVITY;
- Press Ctrl+D and recover the data:
  $ hbase restore hdfs:///tmp/hbase-backup <backup_id> -s phoenix_backup_set
  You can restore the data from any point in time at which one of the previous backups was made.
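To verify the recovery, run the /usr/lib/phoenix/bin/sqlline.py script once more and query the restored table; the row count depends on the backup ID you chose:

  SELECT COUNT(*) FROM USER_ACTIVITY;
  SELECT * FROM USER_ACTIVITY LIMIT 5;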