NameNode recovery process in HDFS
Before starting recovery
CAUTION
Manual NameNode recovery can cause data loss. Do not attempt it if another valid copy of the edit log and fsimage exists, for example, on the other NameNode. Use that copy rather than trying to recover the corrupted one. If you decide to recover the NameNode manually, back up your metadata first.
In Hadoop, NameNode recovery is the process of attempting to recover lost or corrupted metadata when there is no other way to restore it, for example, in clusters with no Standby NameNode and no checkpointing nodes.
For High Availability clusters, manual NameNode recovery is almost never necessary. The JournalNodes and the Standby NameNode are usually enough to ensure metadata integrity.
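The caution above recommends saving the metadata before a manual recovery. A minimal sketch of such a backup is shown here, assuming the metadata lives in a local directory that you pass in yourself. The helper name backup_nn_metadata and the archive naming scheme are hypothetical and not part of Hadoop.

```shell
# Sketch: archive one NameNode metadata directory before recovery.
# backup_nn_metadata is a hypothetical helper, not part of Hadoop.
backup_nn_metadata() {
    local name_dir backup_dir archive
    name_dir="$1"     # a directory listed in dfs.namenode.name.dir
    backup_dir="$2"   # destination directory for the archive

    # Refuse to run against a missing metadata directory.
    if [ ! -d "$name_dir" ]; then
        echo "metadata directory not found: $name_dir" >&2
        return 1
    fi

    mkdir -p "$backup_dir" || return 1
    archive="$backup_dir/nn-metadata-$(date +%Y%m%d-%H%M%S).tar.gz"

    # Archive everything under the directory (fsimage, edits, VERSION).
    tar -czf "$archive" -C "$name_dir" . || return 1

    # Print the archive path so the caller can record it.
    echo "$archive"
}
```

A call might look like backup_nn_metadata /path/to/name/dir /path/to/backups, where both paths are placeholders for your own locations. The configured metadata directories can be listed with hdfs getconf -confKey dfs.namenode.name.dir.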
To avoid metadata corruption, you can:
- Keep multiple copies of NameNode metadata on different disks.
- Configure HA mode or use checkpointing.
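Keeping multiple metadata copies is usually done by listing several directories, separated by commas, in the dfs.namenode.name.dir property. As a rough check, the sketch below counts the entries in such a value. The helper name count_name_dirs is hypothetical, and the sketch only counts entries; it does not verify that they are on different disks.

```shell
# Sketch: count the directories in a dfs.namenode.name.dir value.
# count_name_dirs is a hypothetical helper, not part of Hadoop.
count_name_dirs() {
    # Split the comma-separated value into lines and count non-empty ones.
    printf '%s\n' "$1" | tr ',' '\n' | grep -c '[^[:space:]]'
}
```

On a cluster node, the value can be read from the active configuration, for example count_name_dirs "$(hdfs getconf -confKey dfs.namenode.name.dir)". A result of 1 suggests that the metadata is stored in a single location only.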
Recovery process
To manually recover after a NameNode failure:
- Connect to the NameNode host via SSH.
- As root, create the log directory for running the recovery command and give the hdfs user the necessary permissions:
  mkdir /usr/lib/hadoop/logs
  chown -R hdfs:hadoop /usr/lib/hadoop/logs
  chmod -R 755 /usr/lib/hadoop/logs
- Log in as the hdfs user and turn on NameNode safe mode:
  su hdfs
  $ hdfs dfsadmin -safemode enter
- As root, stop the NameNode:
  systemctl stop hadoop-hdfs-namenode
- As the hdfs user, run the recovery command (add the -force option to skip the interactive prompts):
  $ hdfs namenode -recover
  The following message appears:
  You have selected Metadata Recovery mode. This mode is intended to recover lost metadata on a corrupt filesystem. Metadata recovery mode often permanently deletes data from your HDFS filesystem. Please back up your edit log and fsimage before trying this!
  Are you ready to proceed? (Y/N)
  (Y or N)
- Press Y to start the recovery process. The system prints the status of the recovery process to the console and prompts for action if necessary. Once the recovery is complete, the system prints the following:
  2023-12-06 11:27:00,283 INFO namenode.MetaRecoveryContext: RECOVERY COMPLETE
  2023-12-06 11:27:00,318 INFO namenode.FSNamesystem: Stopping services started for active state
  2023-12-06 11:27:00,421 INFO namenode.FSNamesystem: Stopping services started for standby state
  2023-12-06 11:27:00,423 INFO namenode.NameNode: SHUTDOWN_MSG:
  /************************************************************
  SHUTDOWN_MSG: Shutting down NameNode at elenas-adh2.ru-central1.internal/127.0.0.1
- As root, start the NameNode:
  systemctl start hadoop-hdfs-namenode
- As the hdfs user, turn off safe mode:
  $ hdfs dfsadmin -safemode leave
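The sequence above can be summarized as one script. This is only a sketch, assuming it is run as root on the NameNode host and that su hdfs -c is an acceptable way to run the hdfs commands. The run and recover_namenode helpers and the DRY_RUN variable are hypothetical, and mkdir -p is used instead of plain mkdir so that a repeated run does not fail. By default the sketch only prints the commands.

```shell
# Sketch of the manual recovery sequence; intended to be run as root.
# run, recover_namenode and DRY_RUN are hypothetical names.

# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Run the steps in order and stop at the first failure.
recover_namenode() {
    run mkdir -p /usr/lib/hadoop/logs &&
    run chown -R hdfs:hadoop /usr/lib/hadoop/logs &&
    run chmod -R 755 /usr/lib/hadoop/logs &&
    run su hdfs -c "hdfs dfsadmin -safemode enter" &&
    run systemctl stop hadoop-hdfs-namenode &&
    run su hdfs -c "hdfs namenode -recover" &&
    run systemctl start hadoop-hdfs-namenode &&
    run su hdfs -c "hdfs dfsadmin -safemode leave"
}
```

Calling recover_namenode prints the planned commands for review. Setting DRY_RUN=0 before the call executes them; the recovery step still asks for confirmation and may prompt for actions interactively.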
Recovery actions
During recovery, the system may prompt you to choose how to proceed. When it encounters an error, it describes the error and asks you to choose one of four actions:
- c (Continue): ignore the error and attempt to save the rest of the data.
- s (Stop): stop reading the edit log and give up on saving the rest of the data. In this case, the edits that have not been read are permanently lost.
- q (Quit): quit recovery without saving.
- a (Always): always select option c. The system continues automatically without prompting the user again.
When the -force option is used, the system always chooses the first action, which is c.
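Because -force removes the prompts, a forced run is easy to leave unattended, so it can help to capture the console output and check it afterwards for the RECOVERY COMPLETE line shown in the sample output. A minimal sketch follows; the helper name recovery_completed is hypothetical, and the log file is whatever file you captured the output into.

```shell
# Sketch: check a captured recovery log for the completion marker.
# recovery_completed is a hypothetical helper, not part of Hadoop.
recovery_completed() {
    local log_file
    log_file="$1"   # file holding the console output of the recovery run

    # Succeeds only if the completion line from MetaRecoveryContext is present.
    grep -q 'MetaRecoveryContext: RECOVERY COMPLETE' "$log_file"
}
```

For example, the output could be captured with hdfs namenode -recover -force 2>&1 | tee /tmp/nn-recover.log (the path is a placeholder) and then checked with recovery_completed /tmp/nn-recover.log. A successful check does not show how many errors were skipped, so the captured log is still worth reading.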