HDFS data replication with SSM
This article describes how information is synchronized between HDFS and SSM if the Enable SmartFileSystem for Hadoop option in ADCM is used and in cases where SSM rules use the sync action. It explains how SSM collects HDFS metadata, tracks file changes, and schedules replication actions to keep data consistent between source and target clusters.
Fetching files from HDFS
To replicate HDFS files from one cluster to another, SSM first needs to collect information about them. For this purpose, SSM includes a dedicated HDFS fetcher service that starts together with SSM and asynchronously retrieves metadata about HDFS files and directories. This metadata is stored in a dedicated file table in the SSM Metastore.
The fetcher uses a composite mechanism for extracting file metadata from the NameNode, consisting of two phases: an optional namespace snapshot phase and a mandatory event listening phase.
Initial namespace fetching
If SSM is replicating an HDFS cluster for the first time, or if the last handled EditLog transaction ID is no longer available (the NameNode periodically rolls and purges old EditLogs), SSM performs a full namespace fetch.
During this process, SSM clears the existing file table and builds a complete snapshot of the HDFS namespace by listing all files in parallel using HDFS client calls. Metadata for each discovered file is then stored in the file table.
The level of parallelism for this operation can be configured using the smart.namespace.fetcher.producers.num option, which defines the number of concurrent connections used for file listing.
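For example, the parallelism could be raised in the SSM configuration. The snippet below is a Hadoop-style XML property sketch; the exact configuration file name depends on the deployment, and the value shown is illustrative:

```xml
<property>
  <name>smart.namespace.fetcher.producers.num</name>
  <!-- Illustrative value: number of concurrent connections for file listing -->
  <value>8</value>
</property>
```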
inotify event fetching
If the SSM Metastore contains a valid last handled EditLog transaction ID, or once the initial snapshot has been taken, the fetcher switches to the event listening phase.
HDFS provides an API for subscribing to file system events (inotify events). These notification events are delivered in batches and include a unique, monotonically increasing transaction_id. SSM uses this field to track HDFS fetcher progress.
HDFS allows clients to fetch either all events or only those that occurred after a specific transaction ID. By storing the transaction_id of the last processed batch in SSM Metastore, SSM effectively subscribes only to new HDFS events.
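The resume-from-last-txid logic can be sketched as follows. This is a simplified in-memory model, not actual SSM code: `Metastore`, `fetch_events_after`, and `process_batch` are illustrative names standing in for the Metastore table and the inotify API.

```python
class Metastore:
    """Minimal stand-in for the SSM Metastore: persists the last handled txid."""
    def __init__(self):
        self.last_txid = -1

def fetch_events_after(all_events, last_txid):
    """Models the HDFS inotify API: return only events newer than last_txid."""
    return [e for e in all_events if e["txid"] > last_txid]

def process_batch(metastore, all_events):
    batch = fetch_events_after(all_events, metastore.last_txid)
    for event in batch:
        pass  # apply the event to the file table here
    if batch:
        # Persist the txid of the last processed event so the next run
        # subscribes only to new events.
        metastore.last_txid = max(e["txid"] for e in batch)
    return batch

events = [{"txid": 1, "type": "CREATE"}, {"txid": 2, "type": "CLOSE"}]
store = Metastore()
process_batch(store, events)                                  # handles txids 1 and 2
new = process_batch(store, events + [{"txid": 3, "type": "RENAME"}])
# only the event with txid 3 is returned by the second call
```

Because the stored txid survives restarts, a crash between two batches at worst re-reads the last batch rather than losing events.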
Events are retrieved via the DFSClient.getInotifyEventStream() method, which returns a stream of event batches.
SSM can process the following inotify event types:
- `CREATE`: generated after a file is created;
- `CLOSE`: generated after data is fully written and the file is closed;
- `RENAME`: generated when a file is renamed;
- `METADATA`: generated when any of the following file metadata changes:
  - modification time;
  - access time;
  - owner;
  - group;
  - permissions;
  - extended attributes (`xattrs`);
- `UNLINK`: generated when a file is deleted.
Depending on the event type, SSM either creates a new entry in the file table or updates the existing record for the affected file.
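The mapping from event types to file table operations can be modeled like this. This is a simplified in-memory sketch (a plain dictionary stands in for the Metastore file table, and the event field names are illustrative):

```python
file_table = {}  # path -> metadata record (stand-in for the Metastore file table)

def apply_event(event):
    etype, path = event["type"], event["path"]
    if etype == "CREATE":
        file_table[path] = {"len": 0, "owner": event.get("owner")}
    elif etype == "CLOSE":
        file_table[path]["len"] = event["len"]        # data fully written
    elif etype == "RENAME":
        file_table[event["dst"]] = file_table.pop(path)
    elif etype == "METADATA":
        file_table[path].update(event["changes"])     # owner, group, permissions, ...
    elif etype == "UNLINK":
        file_table.pop(path, None)                    # file deleted

apply_event({"type": "CREATE", "path": "/a", "owner": "hdfs"})
apply_event({"type": "CLOSE", "path": "/a", "len": 42})
apply_event({"type": "RENAME", "path": "/a", "dst": "/b"})
# the table now holds a single record under /b with length 42
```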
File filtering by path
Users can exclude specific files or directories from being fetched by providing comma-separated Java regular expressions in the smart.ignore.path.templates option.
If only certain directories should be monitored, users can specify them in the smart.cover.dirs option as a comma-separated list of directory paths.
Both options can be used simultaneously.
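The combined effect of the two options can be sketched as follows. This is an illustrative model, assuming full-path matching for the ignore templates (Python's `re` is close enough to Java regular expressions for simple templates); the function name and sample values are not part of SSM:

```python
import re

IGNORE_TEMPLATES = [r"/tmp/.*", r".*\.staging.*"]   # smart.ignore.path.templates
COVER_DIRS = ["/user/data", "/warehouse"]           # smart.cover.dirs

def should_fetch(path):
    # A path is fetched only if it lies under one of the covered directories...
    if COVER_DIRS and not any(path == d or path.startswith(d.rstrip("/") + "/")
                              for d in COVER_DIRS):
        return False
    # ...and matches none of the ignore templates.
    return not any(re.fullmatch(t, path) for t in IGNORE_TEMPLATES)

should_fetch("/user/data/file.csv")       # True: covered, not ignored
should_fetch("/tmp/scratch")              # False: outside the covered directories
should_fetch("/user/data/job.staging/x")  # False: matches an ignore template
```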
HDFS replication execution
Sync action
HDFS replication can be triggered manually by starting a rule with the sync action. Note that sync can only be used within rules and cannot be executed as a standalone cmdlet.
The sync action supports the following arguments:
- `dest $address`: required; the absolute HDFS path of the target directory;
- `preserve`: optional; a comma-separated list of file attributes to transfer:
  - `owner` (transferred by default);
  - `group` (transferred by default);
  - `permissions` (transferred by default);
  - `replication`;
  - `modification-time`.
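Put together, a rule that keeps a source directory synchronized to a remote cluster might look like the following. This is a hypothetical example: the path pattern, the NameNode address, and the attribute list are placeholders, not values from this article:

```
file: path matches "/src/*" | sync -dest hdfs://remote-nn:8020/dest -preserve owner,group,replication
```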
Rule plugin
When a user starts a rule with this action, the SSM sync rule plugin creates a replication metadata object and stores it in the backup_file table of the SSM Metastore.
This object contains information about the replication source and target directories (or source directory patterns).
The plugin also creates a FileDiff object for the replication source directory. FileDiff is a data structure that stores all information required to replicate file events to the target cluster. This information is later used by the CopyScheduler when scheduling sync actions.
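Conceptually, a FileDiff can be modeled as a small record. The sketch below is illustrative; the field names are not the actual Metastore schema:

```python
from dataclasses import dataclass, field

@dataclass
class FileDiff:
    diff_type: str              # e.g. "append", "rename", "delete", "metadata"
    src: str                    # source file path the diff applies to
    parameters: dict = field(default_factory=dict)  # e.g. {"-dest": "/backup/a"}
    state: str = "PENDING"      # PENDING -> APPLIED / FAILED

diff = FileDiff("rename", "/src/a", {"-dest": "/src/b"})
# a freshly created diff starts in the PENDING state
```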
Saving FileDiffs during event fetching
If at least one enabled rule contains a sync action (and therefore at least one replication metadata object exists in SSM Metastore), the HDFS event fetcher also begins generating FileDiff objects.
These objects are created for files located in the source replication directories and stored in the file_diff table with PENDING status.
Sync action scheduling
When SSM triggers a rule with a sync action, it selects all files that match the user-defined filters and have associated FileDiff objects with PENDING status.
For each selected file, SSM wraps the file path as a src argument of the sync action and attempts to schedule it for processing.
First, the action scheduler asynchronously generates FileDiff objects for all files under the source replication directories and saves them to SSM Metastore. Then, it periodically retrieves PENDING FileDiff records and groups them into file chains.
A file chain is essentially a queue of operations grouped by file path. Operations within a chain can be merged when possible to improve efficiency and handle edge cases.
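Grouping and merging can be sketched like this. The model below is simplified: it only collapses consecutive metadata diffs for the same file into one, whereas SSM's real merge logic covers more cases:

```python
from collections import defaultdict, deque

def build_chains(pending_diffs):
    """Group PENDING FileDiffs by file path into per-file operation queues."""
    chains = defaultdict(deque)
    for diff in pending_diffs:
        chain = chains[diff["src"]]
        # Merge consecutive metadata updates: only the latest one matters.
        if chain and chain[-1]["type"] == "metadata" and diff["type"] == "metadata":
            chain[-1] = diff
        else:
            chain.append(diff)
    return chains

diffs = [
    {"src": "/a", "type": "append"},
    {"src": "/a", "type": "metadata", "owner": "u1"},
    {"src": "/a", "type": "metadata", "owner": "u2"},
    {"src": "/b", "type": "delete"},
]
chains = build_chains(diffs)
# /a keeps two operations (append + latest metadata), /b keeps one (delete)
```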
The scheduler accepts a sync action only if a file chain already exists for the corresponding src path. It then polls the oldest FileDiff from the chain and translates the sync action into an operation such as:

- `copy`
- `dircopy`
- `delete`
- `rename`
- `metadata`
The specific action depends on the type of change described by the FileDiff.
Once an action completes successfully, the corresponding FileDiff is marked as applied. The scheduler also handles retries for failed actions and applies throttling to copy operations, which helps to avoid overloading the system.
To prevent concurrent modifications to the same file, SSM uses a file locking mechanism based on a concurrent set of file paths (fileLocks). When a file path is locked, no other sync actions can be scheduled for it until the lock is released. Locks are acquired when actions are submitted and released after completion or failure.
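The locking scheme can be sketched as a set of locked paths. This is a Python model of the idea behind fileLocks (in SSM itself this would be a thread-safe set in Java); an explicit mutex stands in for the concurrent set:

```python
import threading

file_locks = set()
_guard = threading.Lock()  # protects file_locks; stands in for a concurrent set

def try_schedule(path):
    """Acquire the per-file lock on submission; refuse if it is already held."""
    with _guard:
        if path in file_locks:
            return False            # another sync action owns this file
        file_locks.add(path)
        return True

def release(path):
    """Release the lock after the action completes or fails."""
    with _guard:
        file_locks.discard(path)

ok_first = try_schedule("/src/a")    # True: lock acquired
ok_second = try_schedule("/src/a")   # False: path already locked
release("/src/a")
ok_third = try_schedule("/src/a")    # True again after release
```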