HBase replication
Overview
Replication in HBase is a mechanism that allows you to copy the contents of a table (or individual column families) to another ADH cluster. It can be used for backup and recovery, geographic data distribution, data aggregation, and other purposes.
The cluster that contains the original data is the source cluster, and the cluster that receives the data is the destination cluster. Any cluster can be a source, a destination, or both, and a source cluster can send data to any number of destination clusters. The replication process is incremental: as changes are made to the source data, they are transmitted to the destination. If the destination cluster is unavailable, the changes made since the last successful transfer are stored on the source side until the destination cluster becomes available again. For replication to work, the names of the source and destination tables and their respective column families must be the same.
NOTE
Detailed information on the HBase shell commands used for the replication process setup is provided in the HBase shell commands → Replication commands section.
Replication setup
Table replication can involve any number of clusters: one acts as the source cluster and the others act as destination clusters. Once your clusters are configured and running, do the following:
-
Go to the ADCM UI and select your source ADH cluster.
-
Navigate to Services → HBase → Primary configuration and toggle Show advanced.
-
Open the Custom hbase-site.xml section and click Add property.
-
Enter hbase.replication for the property name and set its value to true.
-
Save the configuration by clicking Save → Create, then restart the service by clicking Actions → Reconfig and graceful restart.
Now you can set up the replication in various scenarios.
From one cluster to another
The test table used to illustrate the replication setup is obtained by importing the people.csv file. The import procedure is described in the Bulk loading via built-in MapReduce jobs article.
To set up the table replication from the source cluster to the destination cluster, do the following:
-
Log in to the HBase shells on both clusters.
-
On the destination cluster, create the table with the same structure and properties as the one being replicated:
create 'people', {NAME => 'basic', VERSIONS => 5}, {SPLITS => ['F', 'K', 'P', 'W']}, {NAME => 'location', VERSIONS => 5}
You can use the describe command in the source cluster HBase shell to obtain the properties of the table.
-
On the source cluster, create a peer record using the add_peer command:
add_peer '1', CLUSTER_KEY => 'av-adh-backup-1.ru-central1.internal,av-adh-backup-2.ru-central1.internal,av-adh-backup-3.ru-central1.internal:2181:/hbase'
-
Disable the table using the disable command:
disable 'people'
-
Set the column families up for replication using the alter command:
alter 'people', {NAME => 'basic', REPLICATION_SCOPE => '1'}, {NAME => 'location', REPLICATION_SCOPE => '1'}
-
Re-enable the table using the enable command:
enable 'people'
This should be enough for the replication process to start. You can make some changes in the table (e.g. add a value) and then check the replication process status using the status command:
status 'replication'
The output will look like this:
version 2.5.10
3 live servers
    av-adh-1.ru-central1.internal:
       SOURCE: PeerID=1
         Normal Queue: 1
           No Ops shipped since last restart, SizeOfLogQueue=1, EditsReadFromLogQueue=0, OpsShippedToTarget=0, No edits for this source since it started, Replication Lag=0
       SINK: TimeStampStarted=1729231941314, Waiting for OPs...
    av-adh-2.ru-central1.internal:
       SOURCE: PeerID=1
         Normal Queue: 1
           No Ops shipped since last restart, SizeOfLogQueue=1, EditsReadFromLogQueue=0, OpsShippedToTarget=0, No edits for this source since it started, Replication Lag=0
       SINK: TimeStampStarted=1729231977710, Waiting for OPs...
    av-adh-3.ru-central1.internal:
       SOURCE: PeerID=1
         Normal Queue: 1
           AgeOfLastShippedOp=799, TimeStampOfLastShippedOp=Fri Oct 18 12:27:53 UTC 2024, SizeOfLogQueue=1, EditsReadFromLogQueue=1, OpsShippedToTarget=1, TimeStampOfNextToReplicate=Fri Oct 18 12:27:53 UTC 2024, Replication Lag=0
The OpsShippedToTarget property indicates how many change operations have been transmitted to the destination cluster table. A nonzero value on at least one server means that replication is working.
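To check this value from a script rather than by eye, the status text can be filtered. The following is a minimal sketch that assumes the output format shown above. The function name total_ops_shipped is hypothetical and not part of HBase, and the commented invocation assumes that the hbase launcher is on the PATH and that the shell's non-interactive mode (-n) is available in your version.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of HBase): sums every OpsShippedToTarget=<n>
# value found in the text read from standard input and prints the total.
total_ops_shipped() {
  grep -o 'OpsShippedToTarget=[0-9][0-9]*' \
    | cut -d= -f2 \
    | awk '{ sum += $1 } END { print sum + 0 }'
}

# Assumed usage on a source cluster node:
#   echo "status 'replication'" | hbase shell -n | total_ops_shipped
```

A total greater than zero corresponds to the condition described above: at least one server has shipped operations to the destination.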
If the source cluster table already contained data before replication was set up, you may need to copy its contents to the destination cluster. To do so, connect to a source cluster HBase node via SSH and run commands similar to the following:
$ cd /bin
$ ./hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=av-adh-backup-1.ru-central1.internal,av-adh-backup-2.ru-central1.internal,av-adh-backup-3.ru-central1.internal:2181:/hbase people
where:
-
--peer.adr: the cluster key you used earlier when composing the add_peer command;
-
people: the name of the table being copied.
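The cluster key used in the commands above consists of the comma-separated ZooKeeper quorum hosts, the ZooKeeper client port, and the HBase znode, joined by colons. Because the same key appears in both add_peer and CopyTable, it can help to assemble it once. The following is a minimal sketch under that assumption; the function name cluster_key and the variable key are hypothetical, and the command is printed rather than executed so that it can be reviewed first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: joins the ZooKeeper hosts, client port, and znode
# into a cluster key of the form host1,host2,...:port:/znode.
cluster_key() {
  local port="$1" znode="$2"
  shift 2
  local IFS=','   # "$*" joins the remaining arguments with commas
  printf '%s:%s:%s\n' "$*" "$port" "$znode"
}

# Host names, port, and znode are taken from the example above.
key="$(cluster_key 2181 /hbase \
  av-adh-backup-1.ru-central1.internal \
  av-adh-backup-2.ru-central1.internal \
  av-adh-backup-3.ru-central1.internal)"

# Print the CopyTable command for review instead of running it.
echo "./hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=${key} people"
```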
From one cluster to several others
To set up replication to several destination clusters, follow the same procedure as for a single cluster, but create a separate peer record for each destination cluster and give each peer a unique ID.
Example:
add_peer '1', CLUSTER_KEY => 'av-adh-backup1-1.ru-central1.internal,av-adh-backup1-2.ru-central1.internal,av-adh-backup1-3.ru-central1.internal:2181:/hbase'
add_peer '2', CLUSTER_KEY => 'av-adh-backup2-1.ru-central1.internal,av-adh-backup2-2.ru-central1.internal,av-adh-backup2-3.ru-central1.internal:2181:/hbase'
add_peer '3', CLUSTER_KEY => 'av-adh-backup3-1.ru-central1.internal,av-adh-backup3-2.ru-central1.internal,av-adh-backup3-3.ru-central1.internal:2181:/hbase'
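With many destination clusters, the peer records can be generated instead of typed by hand. The following is a minimal sketch that prints one add_peer command per cluster key, assigning sequential IDs starting at 1. The function name print_add_peers is hypothetical, and the output is meant to be pasted into the HBase shell on the source cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints an add_peer command for each cluster key
# passed as an argument, assigning peer IDs 1, 2, 3, ...
print_add_peers() {
  local id=1 key
  for key in "$@"; do
    printf "add_peer '%s', CLUSTER_KEY => '%s'\n" "$id" "$key"
    id=$((id + 1))
  done
}

# Cluster keys are taken from the example above.
print_add_peers \
  'av-adh-backup1-1.ru-central1.internal,av-adh-backup1-2.ru-central1.internal,av-adh-backup1-3.ru-central1.internal:2181:/hbase' \
  'av-adh-backup2-1.ru-central1.internal,av-adh-backup2-2.ru-central1.internal,av-adh-backup2-3.ru-central1.internal:2181:/hbase' \
  'av-adh-backup3-1.ru-central1.internal,av-adh-backup3-2.ru-central1.internal,av-adh-backup3-3.ru-central1.internal:2181:/hbase'
```

After the peers are added, the list_peers command in the HBase shell can be used to confirm that each record exists with the expected ID.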