Snapshots in HDFS

Overview

A snapshot is a read-only image of the filesystem metadata at a specific point in time. Creating a snapshot copies only metadata: the list of blocks for each file and the file size. Although the data blocks themselves are not copied, snapshots can still be used for backups and emergency data recovery, because HDFS keeps the blocks that a snapshot references.

You can create a snapshot of a specific directory or of the entire filesystem. Creating a snapshot takes a negligible amount of system resources and completes almost immediately.

Snapshottable directories

Before a snapshot can be created, an administrator must mark the directory as snapshottable. This is done with the allowSnapshot command.

Only one directory in a path can be marked as snapshottable; nested snapshottable directories are not allowed. This means that a directory cannot be marked as snapshottable if one of its ancestors or descendants is already a snapshottable directory.
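The nesting rule can be checked before calling allowSnapshot. The sketch below is only an illustration: find_nesting_conflict is a hypothetical helper, not part of the HDFS CLI, and it assumes that the existing snapshottable directories (for example, taken from the output of hdfs lsSnapshottableDir) are passed as plain absolute paths. The NameNode enforces the rule itself, so this is just a local pre-check.

```shell
# find_nesting_conflict CANDIDATE EXISTING...
# Hypothetical helper: prints the first existing snapshottable directory that
# is an ancestor or a descendant of CANDIDATE and returns 0; returns 1 when
# there is no nesting conflict.
find_nesting_conflict() {
  local candidate="${1%/}/"          # normalize to exactly one trailing slash
  shift
  local existing dir
  for existing in "$@"; do
    dir="${existing%/}/"
    if [ "$candidate" = "$dir" ]; then
      continue                       # same directory, not a nesting conflict
    fi
    case "$candidate" in "$dir"*) echo "$existing"; return 0 ;; esac  # ancestor
    case "$dir" in "$candidate"*) echo "$existing"; return 0 ;; esac  # descendant
  done
  return 1
}

# Example with made-up paths: /data is an ancestor of /data/logs.
if conflict=$(find_nesting_conflict /data/logs /data /tmp/test); then
  echo "cannot allow snapshots: conflicts with $conflict"
else
  echo "no nesting conflict"
fi
```

Comparing paths with a trailing slash keeps sibling directories such as /data and /data2 from being treated as nested.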

Once a snapshot is made, the system creates the .snapshot directory that contains the snapshots. A snapshottable directory can hold up to 65536 snapshots simultaneously. A directory that contains snapshots cannot be deleted or renamed until all of its snapshots are deleted.

NOTE

Snapshots are not supported in versions of Hadoop prior to version 3.1.2. When upgrading from an older version, make sure that no existing directories are named .snapshot to avoid conflicts with this reserved path.

The list of snapshottable directories and available snapshots can be found in the NameNode UI, on the Snapshots page.

The page is available at:

http://<HOST>:9870/dfshealth.html#tab-snapshot/

Where <HOST> is the IP address or FQDN of the NameNode host.

NameNode UI: Snapshots page

Snapshot paths

To access files in a snapshot using CLI and API calls, specify the path as follows:

/<PATH>/.snapshot/<NAME>/<FILE>

Where:

  • <PATH> — the path to the snapshottable directory;

  • <NAME> — the name of the snapshot;

  • <FILE> — the path to the file in the snapshot.
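As an illustration of the path layout above, the following sketch assembles a snapshot path from its three parts. The snapshot_path function is a hypothetical helper introduced here for the example, and the values /tmp/test, image-1, and text.txt are sample names.

```shell
# snapshot_path PATH NAME FILE
# Hypothetical helper: prints /<PATH>/.snapshot/<NAME>/<FILE>.
snapshot_path() {
  local dir="${1%/}"     # snapshottable directory, trailing slash removed
  local name="$2"        # snapshot name
  local file="${3#/}"    # path inside the snapshot, leading slash removed
  printf '%s/.snapshot/%s/%s\n' "$dir" "$name" "$file"
}

snapshot_path /tmp/test image-1 text.txt
# prints /tmp/test/.snapshot/image-1/text.txt
```

The resulting string can be passed to any command that accepts an HDFS path, such as hdfs dfs -ls or hdfs dfs -cat.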

Commands

Allow snapshots

Allows snapshots for a directory. The operation requires HDFS superuser privilege.

$ hdfs dfsadmin -allowSnapshot <path>

Where <path> is the path to the directory for which to allow snapshots.

Disable snapshots

Disables snapshots for a directory. All snapshots of the directory must be deleted before disabling snapshots. The operation requires HDFS superuser privilege.

$ hdfs dfsadmin -disallowSnapshot <path>

Where <path> is the path to the directory for which to disable snapshots.

List snapshottable directories

Returns the list of snapshottable directories owned by the current user. When run as a superuser, it returns all snapshottable directories.

$ hdfs lsSnapshottableDir

List snapshots

Lists all snapshots in a directory.

$ hdfs dfs -ls /<DIR>/.snapshot

Where <DIR> is the name of the snapshottable directory.

List snapshotted files

Lists all files and directories in a snapshot.

$ hdfs dfs -ls /<DIR>/.snapshot/<NAME>/

Where:

  • <DIR> — the name of the snapshottable directory;

  • <NAME> — the name of the snapshot.

Restore from a snapshot

To restore a file or a directory from a snapshot, use a regular copy command such as cp or distcp, with the snapshot path as the source.

The following example restores the deleted text.txt file by copying it from the image-1 snapshot:

$ hdfs dfs -cp -ptopax hdfs://127.0.0.1:8020/tmp/test/.snapshot/image-1/text.txt hdfs://127.0.0.1:8020/tmp/test/text.txt

The -ptopax option preserves timestamps, ownership, permissions, ACLs, and XAttrs.
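To put the restore command in context, here is a minimal sketch of a whole sequence: allow snapshots, create one, lose a file, and copy it back. It reuses the sample names from the example above (/tmp/test, image-1, text.txt) and assumes that the paths resolve against the default filesystem, so the hdfs:// prefix is omitted. The run wrapper and the DRY_RUN variable are hypothetical conveniences, not HDFS features; by default the script only prints the commands.

```shell
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to
# execute them against a real cluster.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Sample values, matching the example above.
dir=/tmp/test
snap=image-1
file=text.txt

run hdfs dfsadmin -allowSnapshot "$dir"        # requires HDFS superuser privilege
run hdfs dfs -createSnapshot "$dir" "$snap"    # take the snapshot
run hdfs dfs -rm "$dir/$file"                  # simulate an accidental deletion
run hdfs dfs -cp -ptopax "$dir/.snapshot/$snap/$file" "$dir/$file"   # restore
```

Printing the commands first makes it easy to review the source and destination paths before anything is changed on the cluster.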

Create snapshots

Creates a snapshot of the specified directory. The directory must be snapshottable.

$ hdfs dfs -createSnapshot <path> <snapshotName>

Where:

  • <path> — the path to the snapshottable directory;

  • <snapshotName> — the name of the snapshot. This is an optional argument. If not provided, the system will generate the name using a timestamp.
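When snapshots are taken regularly, an explicit name with a readable prefix can be easier to find later than a generated one. The sketch below builds such a name from a prefix and a UTC timestamp. The snapshot_name helper, the nightly prefix, and the /tmp/test path are assumptions made for this example, and the command is printed rather than executed so that the sketch works without a cluster.

```shell
# snapshot_name PREFIX
# Hypothetical helper: prints PREFIX-YYYYMMDD-HHMMSS using the current UTC time.
snapshot_name() {
  printf '%s-%s\n' "$1" "$(date -u +%Y%m%d-%H%M%S)"
}

name=$(snapshot_name nightly)

# Print the command instead of running it.
echo "hdfs dfs -createSnapshot /tmp/test $name"
```

Names built this way sort chronologically in the output of hdfs dfs -ls on the .snapshot directory.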

Delete snapshots

Deletes the specified snapshot.

$ hdfs dfs -deleteSnapshot <path> <snapshotName>

Where:

  • <path> — the path to the snapshottable directory;

  • <snapshotName> — the name of the snapshot.

Rename snapshots

Renames the specified snapshot.

$ hdfs dfs -renameSnapshot <path> <oldName> <newName>

Where:

  • <path> — the path to the snapshottable directory;

  • <oldName> — the snapshot's current name;

  • <newName> — the snapshot's new name.

Get snapshots difference report

Outputs a report on the differences between two snapshots, or between a snapshot and the current state of the directory.

$ hdfs snapshotDiff <path> <start> <end>

Where:

  • <path> — the path to the snapshottable directory;

  • <start> — the name of the starting snapshot or . for the current state;

  • <end> — the name of the ending snapshot or . for the current state.

The report uses the following markers:

  • + — the file or directory has been created;

  • - — the file or directory has been deleted;

  • M — the file or directory has been modified;

  • R — the file or directory has been renamed.

Example output:

Difference between snapshot image-1 and current directory under directory /tmp/test:
M       .
+       ./text-2.txt
+       ./text.txt
-       ./text.txt
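For larger reports, it can help to count the entries of each type. The following sketch reads a report from standard input and prints the totals. The summarize_diff function is a hypothetical helper, and it assumes that the marker is the first whitespace-separated field of each line, as in the example output above.

```shell
# summarize_diff
# Hypothetical helper: reads a snapshotDiff report from stdin and counts the
# entries per marker. Lines that do not start with a marker (such as the
# header) are ignored.
summarize_diff() {
  awk '
    $1 == "+" { created++ }
    $1 == "-" { deleted++ }
    $1 == "M" { modified++ }
    $1 == "R" { renamed++ }
    END {
      printf "created=%d deleted=%d modified=%d renamed=%d\n",
             created, deleted, modified, renamed
    }'
}

# The report text is the example output above, fed in as a here-document.
summarize_diff <<'EOF'
Difference between snapshot image-1 and current directory under directory /tmp/test:
M       .
+       ./text-2.txt
+       ./text.txt
-       ./text.txt
EOF
# prints created=2 deleted=1 modified=1 renamed=0
```

On a live cluster the same function could be fed through a pipe, for example from hdfs snapshotDiff with a directory and two snapshot names.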