distcp
DistCp (distributed copy) is a tool for large inter-cluster and intra-cluster copying. It uses MapReduce to effect its distribution, error handling, recovery, and reporting. It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
More information can be found in the Hadoop DistCp Guide.
The usage is as follows:
$ hadoop distcp [OPTIONS] <src> <dst>
-append |
Reuses existing data in target files and appends new data to them if possible |
-async |
Defines whether the |
-atomic |
Commits all changes or none |
-bandwidth <arg> |
Specifies the bandwidth per map in MB/second; fractional values are accepted |
-blocksperchunk <arg> |
If set to a positive value, files with more blocks than this value will be split into chunks of <blocksperchunk> blocks to be transferred in parallel, and reassembled on the destination.
By default, <blocksperchunk> is 0 and the files will be transmitted in their entirety without splitting.
This switch is only applicable when the source file system implements the |
-copybuffersize <arg> |
The size of the copy buffer to use (in bytes). Defaults to 8192 |
-delete |
Deletes files on the target that are missing from the source.
Delete is applicable only with |
-diff <arg> |
Uses the snapshot diff report to identify the difference between source and target |
-f <arg> |
Specifies a list of files to copy |
-filters <arg> |
The path to a file containing a list of strings for paths to be excluded from the copy |
-i |
Ignores failures during copy |
-log <arg> |
Specifies a directory on DFS where |
-m <arg> |
The maximum number of concurrent maps to use for copy |
-numListstatusThreads <arg> |
The number of threads to use for building file listing (max 40) |
-overwrite |
Overwrites target files unconditionally, even if they exist |
-p <arg> |
Preserves status (rbugpcaxt): replication, block size, user, group, permission, checksum type, ACL, XATTR, timestamps.
If |
-rdiff <arg> |
Uses target snapshot diff reports to identify changes made on the target |
-skipcrccheck |
Skips CRC checks between source and target paths |
-strategy <arg> |
The copy strategy to use. By default, work is divided among maps based on file sizes |
-tmp <arg> |
Intermediate work path to be used for atomic commits |
-update |
Updates the target, copying only missing files and overwriting the files that are different from source |
-v |
Logs additional info (path, size) to the SKIP/COPY log |
-xtrack <arg> |
Saves information about missing source files to the specified directory |
Examples:
$ hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
$ hadoop distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs://nn2:8020/bar/foo
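Several of the options listed above are commonly combined. The following is a minimal sketch, not a recommended configuration: it reuses the NameNode addresses and paths from the examples above, and the values 20 and 10 as well as the RUN variable are assumptions chosen purely for illustration. It builds the command, prints it for review, and only launches the job when RUN=1 is set.

```shell
# Hypothetical NameNode addresses and paths; adjust for your cluster.
SRC="hdfs://nn1:8020/foo/bar"
DST="hdfs://nn2:8020/bar/foo"

# Sync DST with SRC: copy missing or changed files (-update), remove files
# that are absent from the source (-delete), use at most 20 concurrent
# maps (-m), and limit each map to a bandwidth value of 10 (-bandwidth).
set -- hadoop distcp -update -delete -m 20 -bandwidth 10 "$SRC" "$DST"
SYNC_CMD="$*"

# Print the command for review; set RUN=1 to actually launch the job.
echo "$SYNC_CMD"
if [ "${RUN:-0}" = "1" ]; then
  "$@"
fi
```

Printing the command first is a deliberate choice here: -delete removes data on the target, so reviewing the exact arguments before launching the job is a cheap safeguard.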