distcp

The distcp tool copies files and directories between clusters or within a single cluster.

Usage:

$ mapred distcp [args] <src> <dst>
Arguments

-append

Reuses existing data in destination files and appends new data to them if possible. Requires -update and cannot be combined with -skipcrccheck

-async

Runs the distcp command asynchronously. The command exits as soon as the Hadoop job is started

-atomic

Instructs distcp to copy source data to a temporary target location, and then move the temporary target to the final location atomically

-bandwidth <arg>

Specifies the bandwidth limit per map (in MB/second)

-blocksperchunk <arg>

The number of blocks per chunk. When specified, splits files into chunks to copy in parallel

-copybuffersize <arg>

The size of the copy buffer to use (in bytes). Defaults to 8192 bytes

-delete

Deletes files that exist in <dst> but not in <src>. Requires -update or -overwrite

-diff <oldSnapshot> <newSnapshot>

Uses the snapshot diff report between <oldSnapshot> and <newSnapshot> to identify the differences between source and target, and applies the diff to the target to bring it in sync with the source. Requires -update
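
A minimal sketch of an incremental copy with -diff. The snapshot names s1 and s2 and the URIs are placeholders. It assumes that both snapshots exist on the source, that s1 also exists on the target, and that the target has not been changed since s1 was created. The command is only assembled and printed here so it can be reviewed first.

```shell
# Placeholder URIs and snapshot names (assumptions, not taken from a real cluster).
SRC=hdfs://nn1:8020/foo/bar
DST=hdfs://nn2:8020/bar/foo
OLD_SNAPSHOT=s1
NEW_SNAPSHOT=s2

# -diff is used together with -update; options go before the paths.
DIFF_CMD="mapred distcp -update -diff $OLD_SNAPSHOT $NEW_SNAPSHOT $SRC $DST"

# Print the command for review.
echo "$DIFF_CMD"

# On a cluster node, run it with:
#   $DIFF_CMD
```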

-f <urilist_uri>

Specifies a path to a file with a list of URIs to be copied
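
As an illustration, the sketch below prepares a URI list and assembles a command that uses it. The file name srclist and all URIs are placeholders. With -f, the sources come from the list, so only the destination is given on the command line.

```shell
# Write a list of source URIs, one per line (placeholder paths).
LIST_FILE="${TMPDIR:-/tmp}/srclist"
cat > "$LIST_FILE" <<'EOF'
hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b
EOF

# Upload the list so that the URI passed to -f resolves (run on a cluster node):
#   hdfs dfs -put -f "$LIST_FILE" hdfs://nn1:8020/srclist

# Only the destination follows the -f option.
LIST_CMD="mapred distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo"
echo "$LIST_CMD"
```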

-filelimit <n>

Limits the total number of files to copy to at most n

-filters <path>

The path to a file containing a list of patterns, one per line. Paths that match any of the patterns are excluded from the copy
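
A possible filters file is sketched below. It assumes that the patterns are Java regular expressions matched against the full path, and that the file is read from the local filesystem of the machine that launches the job. The file name and the two patterns are illustrative only.

```shell
# Create a filters file with one pattern per line (hypothetical patterns).
FILTERS_FILE="${TMPDIR:-/tmp}/distcp-filters.txt"
cat > "$FILTERS_FILE" <<'EOF'
.*\.tmp
.*/_temporary/.*
EOF

# Assemble the command; paths matching either pattern would be skipped.
FILTER_CMD="mapred distcp -filters $FILTERS_FILE hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo"
echo "$FILTER_CMD"
```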

-i

Ignores failures during the copy

-log <path/to/logdir>

Saves logs to <path/to/logdir>

-m <arg>

Defines the maximum number of simultaneous copies (map tasks)
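
Since -bandwidth applies per map and -m caps the number of maps, their product gives a rough upper bound on the aggregate transfer rate. The sketch below works through that arithmetic with illustrative values (20 maps, 10 MB/second each) and placeholder URIs.

```shell
# Illustrative limits (assumed values, tune them for the actual cluster).
MAPS=20
MB_PER_MAP=10

# Rough upper bound on aggregate throughput: maps x per-map bandwidth.
MAX_TOTAL_MB=$((MAPS * MB_PER_MAP))
echo "Aggregate throughput is capped at about ${MAX_TOTAL_MB} MB/second"

# Assemble the throttled copy command and print it for review.
THROTTLE_CMD="mapred distcp -m $MAPS -bandwidth $MB_PER_MAP hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo"
echo "$THROTTLE_CMD"
```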

-numListstatusThreads <arg>

The number of threads to use for building file listings

-overwrite

Overwrites files that already exist at the destination

-p <arg>

Preserves file status. <arg> is any combination of the letters rbugpcaxt: replication (r), block size (b), user (u), group (g), permission (p), checksum type (c), ACL (a), XATTR (x), timestamps (t). For example, -pugp preserves user, group, and permission. If -p is specified with no <arg>, replication, block size, user, group, permission, checksum type, and timestamps are preserved. raw.* XATTRs are preserved when both the source and destination paths are in the /.reserved/raw hierarchy (HDFS only)

-rdiff <newSnapshot> <oldSnapshot>

Uses the snapshot diff report between <newSnapshot> and <oldSnapshot> to identify what has changed on the target since <oldSnapshot> was created there, applies the diff to the target in reverse, and copies modified files from the source's <oldSnapshot>, so that the target becomes the same as <oldSnapshot>

-sizelimit <n>

Deprecated. Limits the total size of the copy to at most n bytes

-skipcrccheck

Skips CRC checks between the source and target paths

-strategy <arg>

The copy strategy to be used in distcp. Possible values are dynamic and uniformsize

-tmp <path/to/dir>

Specifies the intermediate work path to be used for atomic commits (see -atomic)
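
A short sketch of an atomic copy that combines -atomic with -tmp. The work directory is a placeholder and is assumed to be on the target cluster, so that the final move stays within one filesystem.

```shell
# Placeholder URIs; TMP_DIR is assumed to be on the same cluster as DST.
SRC=hdfs://nn1:8020/foo/bar
DST=hdfs://nn2:8020/bar/foo
TMP_DIR=hdfs://nn2:8020/tmp/distcp-work

# Data is first copied under TMP_DIR and then moved to DST in one step.
ATOMIC_CMD="mapred distcp -atomic -tmp $TMP_DIR $SRC $DST"

# Print the command for review.
echo "$ATOMIC_CMD"

# On a cluster node, run it with:
#   $ATOMIC_CMD
```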

-update

Updates the target, copying only files and directories that are missing from it or that differ from the source
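
A common pairing is -update with -delete, which keeps the target a mirror of the source. The sketch below uses placeholder URIs and only prints the assembled command, because -delete removes data on the target and is worth reviewing before it runs.

```shell
# Placeholder source and target URIs.
SRC=hdfs://nn1:8020/foo/bar
DST=hdfs://nn2:8020/bar/foo

# -update copies what is missing or changed; -delete removes target files
# that no longer exist on the source.
SYNC_CMD="mapred distcp -update -delete $SRC $DST"

# Print the command for review.
echo "$SYNC_CMD"

# On a cluster node, run it with:
#   $SYNC_CMD
```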

-v

Logs additional info (path, size) in the SKIP/COPY log

-xtrack <path>

Saves information about missing source files to the specified <path>

Example:

$ mapred distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo