DistCp

DistCp is short for Distributed Copy in the context of Apache Hadoop. It is a tool for copying large amounts of data/files in inter/intra-cluster setups. In the background, DistCp uses MapReduce to distribute and copy the data, which means the operation is spread across the available nodes in the cluster. This makes it a more efficient and effective copy tool.

DistCp takes a list of files (in the case of multiple files) and distributes the data between multiple map tasks; each map task then copies the portion of data assigned to it to the destination.

DistCp Version 2 (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.

Usage

The most common invocation of DistCp is an inter-cluster copy

bash$ hadoop distcp2 hdfs://nn1:8020/foo/bar \
                     hdfs://nn2:8020/bar/foo

This will expand the namespace under /foo/bar on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each TaskTracker from nn1 to nn2.

One can also specify multiple source directories on the command line:

bash$ hadoop distcp2 hdfs://nn1:8020/foo/a \
                     hdfs://nn1:8020/foo/b \
                     hdfs://nn2:8020/bar/foo

Or, equivalently, from a file using the -f option:

bash$ hadoop distcp2 -f hdfs://nn1:8020/srclist \
                     hdfs://nn2:8020/bar/foo

Where srclist contains

hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b

When copying from multiple sources, DistCp will abort the copy with an error message if two sources collide, but collisions at the destination are resolved per the options specified. By default, files already existing at the destination are skipped (i.e. not replaced by the source file). A count of skipped files is reported at the end of each job, but this count may be inaccurate if a copier failed for some subset of its files but succeeded on a later attempt.
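For instance, with the hosts from the examples above, a re-run of the same copy skips anything already present at the destination, while adding -overwrite (detailed under Command Line Options below) forces replacement. A minimal sketch:

bash$ hadoop distcp2 hdfs://nn1:8020/foo/a \
                     hdfs://nn1:8020/foo/b \
                     hdfs://nn2:8020/bar/foo
# files already at the destination are skipped by default

bash$ hadoop distcp2 -overwrite \
                     hdfs://nn1:8020/foo/a \
                     hdfs://nn1:8020/foo/b \
                     hdfs://nn2:8020/bar/foo
# existing files at the destination are replaced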

It is important that each TaskTracker can reach and communicate with both the source and destination file systems. For HDFS, both the source and destination must be running the same version of the protocol or use a backwards-compatible protocol.

After a copy, it is recommended that one generates and cross-checks a listing of the source and destination to verify that the copy was truly successful. Since DistCp employs both Map/Reduce and the FileSystem API, issues in or between any of the three could adversely and silently affect the copy. Some have had success running with -update enabled to perform a second pass, but users should be acquainted with its semantics before attempting this.
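One hedged way to perform such a cross-check, assuming both clusters are reachable from the client, is to compare recursive listings of the two trees using only the standard hadoop fs shell. Note that this sketch compares names and sizes only, not checksums:

bash$ hadoop fs -ls -R hdfs://nn1:8020/foo/bar \
      | awk '{print $5, $8}' | sed 's|hdfs://nn1:8020/foo/bar||' | sort > /tmp/src.lst
bash$ hadoop fs -ls -R hdfs://nn2:8020/bar/foo \
      | awk '{print $5, $8}' | sed 's|hdfs://nn2:8020/bar/foo||' | sort > /tmp/dst.lst
bash$ diff /tmp/src.lst /tmp/dst.lst
# any output flags entries that are missing or differ in size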

It’s also worth noting that if another client is still writing to a source file, the copy will likely fail. Attempting to overwrite a file being written at the destination should also fail on HDFS. If a source file is (re)moved before it is copied, the copy will fail with a FileNotFoundException.

Components

The components of the new DistCp may be classified into the following categories:

DistCp Driver – The DistCp Driver components are responsible for:

  • Parsing the arguments passed to the DistCp command on the command-line, via OptionsParser, and DistCpOptionsSwitch
  • Assembling the command arguments into an appropriate DistCpOptions object, and initializing DistCp. These arguments include: Source-paths, Target location, and Copy options (e.g. whether to update-copy, overwrite, which file-attributes to preserve, etc.)
  • Orchestrating the copy operation by:
      • Invoking the copy-listing-generator to create the list of files to be copied.
      • Setting up and launching the Hadoop Map-Reduce Job to carry out the copy.
      • Based on the options, either returning a handle to the Hadoop MR Job immediately, or waiting till completion.

The parser-elements are exercised only from the command-line (or if DistCp::run() is invoked). The DistCp class may also be used programmatically, by constructing the DistCpOptions object, and initializing a DistCp object appropriately.
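From the command line, the choice between waiting and returning immediately corresponds to the -async flag (covered under Command Line Options below). A minimal sketch:

bash$ hadoop distcp2 -async hdfs://nn1:8020/foo/bar \
                     hdfs://nn2:8020/bar/foo
# returns as soon as the Hadoop Job is launched; the Job-id is logged for tracking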

Copy-listing Generator – The copy-listing-generator classes are responsible for creating the list of files/directories to be copied from source. They examine the contents of the source-paths (files/directories, including wild-cards), and record all paths that need copy into a sequence-file, for consumption by the DistCp Hadoop Job. The main classes in this module include:

  • CopyListing: The interface that should be implemented by any copy-listing-generator implementation. Also provides the factory method by which the concrete CopyListing implementation is chosen.
  • SimpleCopyListing: An implementation of CopyListing that accepts multiple source paths (files/directories), and recursively lists all the individual files and directories under each, for copy.
  • GlobbedCopyListing: Another implementation of CopyListing that expands wild-cards in the source paths.
  • FileBasedCopyListing: An implementation of CopyListing that reads the source-path list from a specified file.

Based on whether a source-file-list is specified in the DistCpOptions, the source-listing is generated in one of the following ways:

  • If there’s no source-file-list, the GlobbedCopyListing is used. All wild-cards are expanded, and all the expansions are forwarded to the SimpleCopyListing, which in turn constructs the listing (via recursive descent of each path).
  • If a source-file-list is specified, the FileBasedCopyListing is used. Source-paths are read from the specified file, and then forwarded to the GlobbedCopyListing. The listing is then constructed as described above.
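Both routes can be exercised from the command line. In the first sketch below, the wild-card (an illustrative pattern) is expanded by the GlobbedCopyListing; in the second, -f hands the paths to the FileBasedCopyListing first:

bash$ hadoop distcp2 'hdfs://nn1:8020/foo/*' hdfs://nn2:8020/bar/foo
# wild-card sources: GlobbedCopyListing, then SimpleCopyListing

bash$ hadoop distcp2 -f hdfs://nn1:8020/srclist hdfs://nn2:8020/bar/foo
# file-based source list: FileBasedCopyListing, then GlobbedCopyListing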

One may customize the method by which the copy-listing is constructed by providing a custom implementation of the CopyListing interface. The behaviour of DistCp differs here from the legacy DistCp, in how paths are considered for copy.

The legacy implementation only lists those paths that must definitely be copied on to target. E.g. if a file already exists at the target (and -overwrite isn’t specified), the file isn’t even considered in the Map-Reduce Copy Job. Determining this during setup (i.e. before the Map-Reduce Job) involves file-size and checksum-comparisons that are potentially time-consuming.

The new DistCp postpones such checks until the Map-Reduce Job, thus reducing setup time. Performance is enhanced further since these checks are parallelized across multiple maps.

Input-formats and Map-Reduce Components – The Input-formats and Map-Reduce components are responsible for the actual copy of files and directories from the source to the destination path. The listing-file created during copy-listing generation is consumed at this point, when the copy is carried out. The classes of interest here include:

  • UniformSizeInputFormat: This implementation of org.apache.hadoop.mapreduce.InputFormat provides equivalence with Legacy DistCp in balancing load across maps. The aim of the UniformSizeInputFormat is to make each map copy roughly the same number of bytes. Apropos, the listing file is split into groups of paths, such that the sum of file-sizes in each InputSplit is nearly equal to every other map. The splitting isn’t always perfect, but its trivial implementation keeps the setup-time low.
  • DynamicInputFormat and DynamicRecordReader: The DynamicInputFormat implements org.apache.hadoop.mapreduce.InputFormat, and is new to DistCp. The listing-file is split into several “chunk-files”, the exact number of chunk-files being a multiple of the number of maps requested for in the Hadoop Job. Each map task is “assigned” one of the chunk-files (by renaming the chunk to the task’s id), before the Job is launched. Paths are read from each chunk using the DynamicRecordReader, and processed in the CopyMapper. After all the paths in a chunk are processed, the current chunk is deleted and a new chunk is acquired. The process continues until no more chunks are available. This “dynamic” approach allows faster map-tasks to consume more paths than slower ones, thus speeding up the DistCp job overall.
  • CopyMapper: This class implements the physical file-copy. The input-paths are checked against the input-options (specified in the Job’s Configuration), to determine whether a file needs copy. A file will be copied only if at least one of the following is true:
      • A file with the same name doesn’t exist at target.
      • A file with the same name exists at target, but has a different file size.
      • A file with the same name exists at target, but has a different checksum, and -skipcrccheck isn’t mentioned.
      • A file with the same name exists at target, but -overwrite is specified.
      • A file with the same name exists at target, but differs in block-size (and block-size needs to be preserved).
  • CopyCommitter: This class is responsible for the commit-phase of the DistCp job, including:
      • Preservation of directory-permissions (if specified in the options)
      • Clean-up of temporary-files, work-directories, etc.
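Choosing between the two InputFormats described above is done with the -strategy flag from the options table below. A minimal sketch that opts into the dynamic behaviour and caps the map count (the value 20 is arbitrary):

bash$ hadoop distcp2 -strategy dynamic -m 20 \
                     hdfs://nn1:8020/foo/bar \
                     hdfs://nn2:8020/bar/foo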

Command Line Options

Each flag below is listed with its description, followed by usage notes.

  • -p[rbugp] – Preserve status: r (replication number), b (block size), u (user), g (group), p (permission). Modification times are not preserved. Also, when -update is specified, status updates will not be synchronized unless the file sizes also differ (i.e. unless the file is re-created).
  • -i – Ignore failures. This option will keep more accurate statistics about the copy than the default case. It also preserves logs from failed copies, which can be valuable for debugging. Finally, a failing map will not cause the job to fail before all splits are attempted.
  • -log <logdir> – Write logs to <logdir>. DistCp keeps logs of each file it attempts to copy as map output. If a map fails, the log output will not be retained if it is re-executed.
  • -m <num_maps> – Maximum number of simultaneous copies. Specify the number of maps to copy data. Note that more maps may not necessarily improve throughput.
  • -overwrite – Overwrite destination. If a map fails and -i is not specified, all the files in the split, not only those that failed, will be recopied. As discussed in the Usage documentation, it also changes the semantics for generating destination paths, so users should use this carefully.
  • -update – Overwrite if src size differs from dst size. As noted in the preceding, this is not a “sync” operation. The only criterion examined is the source and destination file sizes; if they differ, the source file replaces the destination file. As discussed in the Usage documentation, it also changes the semantics for generating destination paths, so users should use this carefully.
  • -f <urilist_uri> – Use list at <urilist_uri> as src list. This is equivalent to listing each source on the command line. The urilist_uri list should be a fully qualified URI.
  • -delete – Delete the files existing in the dst but not in src. The deletion is done by FS Shell, so the trash will be used if it is enabled.
  • -strategy {dynamic|uniformsize} – Choose the copy-strategy to be used in DistCp. By default, uniformsize is used (i.e. maps are balanced on the total size of files copied by each map, similar to legacy DistCp). If “dynamic” is specified, DynamicInputFormat is used instead. (This is described in the Architecture section, under InputFormats.)
  • -bandwidth – Specify bandwidth per map, in MB/second. Each map will be restricted to consume only the specified bandwidth. This is not always exact. The map throttles back its bandwidth consumption during a copy, such that the net bandwidth used tends towards the specified value.
  • -atomic {-tmp <tmp_dir>} – Specify atomic commit, with an optional tmp directory. -atomic instructs DistCp to copy the source data to a temporary target location, and then move the temporary target to the final location atomically. Data will either be available at the final target in a complete and consistent form, or not at all. Optionally, -tmp may be used to specify the location of the tmp-target. If not specified, a default is chosen. tmp_dir must be on the final target cluster.
  • -mapredSslConf <ssl_conf_file> – Specify an SSL config file, to be used with an HSFTP source. When using the hsftp protocol with a source, the security-related properties may be specified in a config-file and passed to DistCp. <ssl_conf_file> needs to be in the classpath.
  • -async – Run DistCp asynchronously; quits as soon as the Hadoop Job is launched. The Hadoop Job-id is logged, for tracking.
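As a closing illustration, the flags compose. The sketch below mirrors the source into the destination (copying files whose sizes differ and deleting extraneous ones) while limiting each map to roughly 10 MB/second; the flag values are illustrative:

bash$ hadoop distcp2 -update -delete -bandwidth 10 \
                     hdfs://nn1:8020/foo/bar \
                     hdfs://nn2:8020/bar/foo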
