The distcp tool copies files and directories within a cluster or between clusters.

The tool usage is as follows:

$ mapred distcp <src> <dst> [args]


-append

Reuses existing data in destination files and appends new data to them if possible


-async

Runs the distcp command asynchronously. Returns as soon as the Hadoop job is started


-atomic

Instructs distcp to copy source data to a temporary target location and then atomically move the temporary target to the final location

-bandwidth <arg>

Specifies the bandwidth per map (in MB/second)

-blocksperchunk <arg>

The number of blocks per chunk. When specified, splits files into chunks to copy in parallel


-copybuffersize <arg>

The size of the copy buffer to use (in bytes). Defaults to 8192 bytes


-delete

Deletes files that exist in <dst> but not in <src>

-diff <oldSnapshot> <newSnapshot>

Uses the snapshot diff report between <oldSnapshot> and <newSnapshot> to identify the difference between source and target, and applies the diff to the target to bring it in sync with the source

-f <urilist_uri>

Specifies a path to a file with a list of URIs to be copied
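A source list for -f might be prepared as in the following sketch. The list location hdfs://nn1:8020/srclist and the two source paths are hypothetical, and the cluster commands are only printed here, not executed.

```shell
# Build a list of source URIs, one per line (hypothetical paths).
srclist="$(mktemp)"
cat > "$srclist" <<'EOF'
hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b
EOF

# Commands that would upload the list and start the copy.
# They are printed rather than run, so the sketch works without a cluster.
put_cmd="hdfs dfs -put $srclist hdfs://nn1:8020/srclist"
distcp_cmd="mapred distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo"
echo "$put_cmd"
echo "$distcp_cmd"
```

When -f is used, the list takes the place of the <src> argument, so only the destination is given on the command line.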

-filelimit <n>

Limits the total number of files to copy to <= n


-filters <arg>

The path to a file containing a list of pattern strings, one per line. Paths that match any of the patterns are excluded from the copy
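A filters file might look like the sketch below. It assumes that each pattern is a regular expression that must match the whole path; the patterns and sample paths are hypothetical, and grep -E -x is used only as a rough local preview of which paths would remain.

```shell
# Write two exclusion patterns, one per line (hypothetical patterns).
filters="$(mktemp)"
cat > "$filters" <<'EOF'
.*\.tmp
.*/_temporary/.*
EOF

# Rough local preview: list the sample paths that none of the patterns match.
# grep -x requires a whole-line match; grep's regex dialect is only an
# approximation of the one distcp uses.
kept="$(printf '%s\n' \
  'hdfs://nn1:8020/foo/bar/data.csv' \
  'hdfs://nn1:8020/foo/bar/part.tmp' \
  'hdfs://nn1:8020/foo/bar/_temporary/0/x' \
  | grep -E -x -v -f "$filters")"
echo "$kept"

# The command that would use the file; printed, not executed.
echo "mapred distcp -filters $filters hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo"
```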


-i

Ignores failures during the copy

-log <path/to/logdir>

Saves logs to <path/to/logdir>


-m <arg>

Defines the maximum number of simultaneous copies


-numListstatusThreads <arg>

The number of threads to use for building file listings


-overwrite

If provided, overwrites the destination

-p <arg>

Preserves status (replication, block size, user, group, permission, checksum type, ACL, XATTR, timestamps). If -p is specified with no <arg>, preserves the replication, block size, user, group, permission, checksum type and timestamps. raw.* XATTRs are preserved when both the source and destination paths are in the /.reserved/raw hierarchy (HDFS only)

-rdiff <newSnapshot> <oldSnapshot>

Identifies what has changed on the target since <oldSnapshot> was created there, applies the diff to the target in reverse, and copies modified files from the source's <oldSnapshot>, so that the target becomes the same as <oldSnapshot>

-sizelimit <n>

Deprecated. Limits the total size to be <= n (in bytes)


-skipcrccheck

Defines whether to skip CRC checks between the source and target paths

-strategy <arg>

The copy strategy to be used in distcp. Possible values are dynamic and uniformsize

-tmp <path/to/dir>

An intermediate work path to be used for atomic commits


-update

Updates the target, copying only missing files or directories


-v

Logs additional info (path, size) in the SKIP/COPY log

-xtrack <path>

Saves information about missing source files to the specified <path>


$ mapred distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
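Several of the options above can be combined. The sketch below assembles a command that brings the destination in line with the source, using -update and -delete together with -m and -bandwidth. The values 20 and 50 are hypothetical, and the command is only printed, not executed.

```shell
# Source and destination from the example above.
src="hdfs://nn1:8020/foo/bar"
dst="hdfs://nn2:8020/bar/foo"

maps=20        # hypothetical: at most 20 simultaneous copies (-m)
bandwidth=50   # hypothetical: 50 MB/second per map (-bandwidth)

# -update copies only what is missing in the destination;
# -delete removes files that exist in the destination but not in the source.
sync_cmd="mapred distcp -update -delete -m $maps -bandwidth $bandwidth $src $dst"
echo "$sync_cmd"
```

Since -delete removes data from the destination, it may be worth checking the printed command before running it on a cluster.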