Hadoop & Mapreduce Tutorial | Components & Command Line Interface

Map-Reduce Components & Command Line Interface

  • This “dynamic” approach allows faster map-tasks to consume more paths than slower ones, thus speeding up the DistCp job overall.
  • CopyMapper: This class implements the physical file-copy. The input-paths are checked against the input-options (specified in the Job’s Configuration) to determine whether a file needs to be copied. A file will be copied only if at least one of the following is true:
      • A file with the same name doesn’t exist at target.
      • A file with the same name exists at target, but has a different file size.
      • A file with the same name exists at target, but has a different checksum, and -skipcrccheck isn’t specified.
      • A file with the same name exists at target, but -overwrite is specified.
      • A file with the same name exists at target, but differs in block-size (and block-size needs to be preserved).
  • CopyCommitter: This class is responsible for the commit-phase of the DistCp job, including:
      • Preservation of directory-permissions (if specified in the options)
      • Clean-up of temporary-files, work-directories, etc.
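
The CopyMapper conditions listed above can be read as a single yes/no decision per file. The following is a minimal Python sketch of that decision, not DistCp's actual code: the `FileInfo` class, the `needs_copy` function, and its keyword arguments are hypothetical names invented for illustration, and the checksum is modelled as a plain string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileInfo:
    """Hypothetical stand-in for the file attributes the rules refer to."""
    size: int          # file length in bytes
    checksum: str      # checksum, modelled here as an opaque string
    block_size: int    # block size in bytes


def needs_copy(src: FileInfo,
               dst: Optional[FileInfo],
               *,
               overwrite: bool = False,
               skip_crc_check: bool = False,
               preserve_block_size: bool = False) -> bool:
    """Return True if at least one of the listed copy conditions holds.

    `dst` is None when no file with the same name exists at the target.
    """
    if dst is None:
        return True                      # nothing at target yet
    if overwrite:
        return True                      # -overwrite was specified
    if src.size != dst.size:
        return True                      # file sizes differ
    if preserve_block_size and src.block_size != dst.block_size:
        return True                      # block size must be preserved and differs
    if not skip_crc_check and src.checksum != dst.checksum:
        return True                      # checksums differ and are being compared
    return False                         # target is considered up to date
```

For example, `needs_copy(FileInfo(10, "abc", 128), FileInfo(10, "xyz", 128), skip_crc_check=True)` returns False under this model, because the sizes match and the checksum comparison is skipped.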

Command Line Options

| Flag | Description | Notes |
| --- | --- | --- |
| -p[rbugp] | Preserve: r: replication number; b: block size; u: user; g: group; p: permission | Modification times are not preserved. Also, when -update is specified, status updates will not be synchronized unless the file sizes also differ (i.e. unless the file is re-created). |
| -i | Ignore failures | This option will keep more accurate statistics about the copy than the default case. It also preserves logs from failed copies, which can be valuable for debugging. Finally, a failing map will not cause the job to fail before all splits are attempted. |
| -log <logdir> | Write logs to <logdir> | DistCp keeps logs of each file it attempts to copy as map output. If a map fails, the log output will not be retained if it is re-executed. |
| -m <num_maps> | Maximum number of simultaneous copies | Specify the number of maps to copy data. Note that more maps may not necessarily improve throughput. |
| -overwrite | Overwrite destination | If a map fails and -i is not specified, all the files in the split, not only those that failed, will be recopied. As discussed in the Usage documentation, it also changes the semantics for generating destination paths, so users should use this carefully. |
| -update | Overwrite if src size different from dst size | As noted earlier, this is not a “sync” operation. The only criterion examined is the source and destination file sizes; if they differ, the source file replaces the destination file. As discussed in the Usage documentation, it also changes the semantics for generating destination paths, so users should use this carefully. |
| -f <urilist_uri> | Use list at <urilist_uri> as src list | This is equivalent to listing each source on the command line. The urilist_uri list should be a fully qualified URI. |
| -delete | Delete the files existing in the dst but not in src | The deletion is done by FS Shell, so the trash will be used if it is enabled. |
| -strategy {dynamic\|uniformsize} | Choose the copy-strategy to be used in DistCp. | By default, uniformsize is used (i.e. maps are balanced on the total size of files copied by each map, similar to legacy). If “dynamic” is specified, DynamicInputFormat is used instead. (This is described in the Architecture section, under InputFormats.) |
| -bandwidth | Specify bandwidth per map, in MB/second. | Each map will be restricted to consume only the specified bandwidth. This is not always exact. The map throttles back its bandwidth consumption during a copy, such that the net bandwidth used tends towards the specified value. |
| -atomic {-tmp <tmp_dir>} | Specify atomic commit, with optional tmp directory. | -atomic instructs DistCp to copy the source data to a temporary target location, and then move the temporary target to the final location atomically. Data will either be available at the final target in a complete and consistent form, or not at all. Optionally, -tmp may be used to specify the location of the tmp-target. If not specified, a default is chosen. tmp_dir must be on the final target cluster. |
| -mapredSslConf <ssl_conf_file> | Specify SSL Config file, to be used with HSFTP source | When using the hsftp protocol with a source, the security-related properties may be specified in a config-file and passed to DistCp. <ssl_conf_file> needs to be in the classpath. |
| -async | Run DistCp asynchronously. Quits as soon as the Hadoop Job is launched. | The Hadoop Job-id is logged, for tracking. |
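
To make the difference between the two -strategy values more concrete, here is a toy Python simulation. It is only a sketch under simplifying assumptions, not how DistCp is implemented: each map is reduced to a single copy speed, the "uniformsize" model balances bytes per map up front, and the "dynamic" model lets whichever map is free take the next file. The function names `uniform_makespan` and `dynamic_makespan` are hypothetical.

```python
import heapq


def uniform_makespan(file_sizes, map_speeds):
    """Toy 'uniformsize' model: balance total bytes per map up front.

    Returns the time at which the last map finishes.
    """
    loads = [0] * len(map_speeds)
    # Greedy balancing: largest files first, each to the least-loaded map.
    for size in sorted(file_sizes, reverse=True):
        i = loads.index(min(loads))
        loads[i] += size
    return max(load / speed for load, speed in zip(loads, map_speeds))


def dynamic_makespan(file_sizes, map_speeds):
    """Toy 'dynamic' model: a free map pulls the next file from a shared queue.

    Returns the time at which the last map finishes.
    """
    # Heap of (time at which the map becomes free, map index).
    free_at = [(0.0, i) for i in range(len(map_speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for size in file_sizes:
        t, i = heapq.heappop(free_at)      # earliest-available map
        t += size / map_speeds[i]          # time needed to copy this file
        finish = max(finish, t)
        heapq.heappush(free_at, (t, i))
    return finish


# Twelve 100 MB files, two fast maps (10 MB/s) and one slow map (1 MB/s).
files = [100] * 12
speeds = [10, 10, 1]
print("uniformsize:", uniform_makespan(files, speeds), "s")
print("dynamic:    ", dynamic_makespan(files, speeds), "s")
```

In this model the slow map holds back the statically balanced job, while under the dynamic model the two fast maps take most of the files. Real clusters vary in many more ways, so the numbers are only illustrative.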