MapReduce Framework
MapReduce is a framework in which you have to fit your solution into the two phases of map and reduce, which in some situations can be challenging. It is similar to learning recursion for the first time: finding the recursive solution to a problem is hard, but once it comes to you, it is clear, concise, and elegant. You also have to be conscious of the system resources a MapReduce job uses, especially intra-cluster network utilization. The tradeoff of being confined to the framework is the ability to process your data with distributed computing, without having to deal with concurrency, robustness, scale, and the other common challenges of distributed systems.
The input to a MapReduce job is a set of files in the data store that are spread out over the Hadoop Distributed File System (HDFS). In Hadoop, these files are split with an input format, which defines how to separate a file into input splits. An input split is a byte-oriented view of a chunk of the file to be loaded by a map task.
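As a small illustration (a fragment that assumes the driver shown in full later in this section), the input format can be set explicitly on the job. TextInputFormat is Hadoop's default; it aligns input splits with HDFS blocks and presents each split to a map task as line-oriented records.

    // Fragment for the driver's main() method: select the input format
    // explicitly (TextInputFormat is the default, so this call is usually
    // implicit). Each record's key is the line's byte offset in the file
    // and its value is the text of the line itself.
    job.setInputFormatClass(
        org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);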
Each map task in Hadoop is broken into the following phases: record reader, mapper, combiner, and partitioner. The output of the map tasks, called the intermediate keys and values, are sent to the reducers. The reduce tasks are broken into the following phases: shuffle, sort, reducer, and output format. The nodes in which the map tasks run are optimally on the nodes in which the data rests. This way, the data typically does not have to move over the network and can be computed on the local machine.
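To make the partitioner phase concrete, the sketch below shows roughly what Hadoop's default hash partitioning does with each intermediate (key, value) pair; the class name is our own and the body is a simplification, not the library's exact source.

    import org.apache.hadoop.mapreduce.Partitioner;

    // Simplified sketch of default hash partitioning: every map task runs this
    // on each intermediate pair, so all values for a given key end up at the
    // same reduce task no matter which mapper emitted them.
    public class SketchHashPartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            // Mask off the sign bit so the result is a valid partition index.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

When the default key distribution is not appropriate, a custom partitioner like this can be plugged into the driver with job.setPartitionerClass(...).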
Example
The "Word Count" program is the canonical example, and for good reason: it is a straightforward application of MapReduce, and MapReduce handles it extremely efficiently. In this particular example, we're going to do a word count over user-submitted comments on StackOverflow. The content of the Text field will be pulled out and pre-processed a bit, and then we'll count up how many times we see each word. An example record from this data set is:
    <row Id="8189677" PostId="6881722" Text="Have you looked at Hadoop?"
         CreationDate="2011-07-30T07:29:33.343" UserId="831878" />
This record is the 8,189,677th comment on Stack Overflow, associated with post number 6,881,722 and written by user number 831,878. The PostId and UserId values are foreign keys into other portions of the data set. We'll show how to join these data sets together in the chapter on join patterns.
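Before looking at the driver, note that the mapper will need to pull the Text attribute out of each <row .../> line. One hypothetical way to do that (this helper class is our own illustration, not part of Hadoop or the Stack Overflow tooling) is to split the line into an attribute map:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical helper: turn one <row .../> line into a map of attribute
    // name -> value, e.g. {"Id" -> "8189677", "Text" -> "Have you looked at Hadoop?"}.
    public class RowParser {
        public static Map<String, String> parseAttributes(String xml) {
            Map<String, String> map = new HashMap<String, String>();
            // Drop the "<row" prefix and "/>" suffix, then split on the quote
            // character so attribute names and values alternate in the array.
            String body = xml.trim().replace("<row", "").replace("/>", "").trim();
            String[] tokens = body.split("\"");
            for (int i = 0; i + 1 < tokens.length; i += 2) {
                String name = tokens[i].replace("=", "").trim();
                map.put(name, tokens[i + 1]);
            }
            return map;
        }
    }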
The first chunk of code we'll look at is the driver. The driver takes all of the components we've built for our MapReduce job and pieces them together for submission to the cluster. This code is usually fairly generic and considered "boilerplate." You'll find that in all of our patterns the driver stays mostly the same. This code is adapted from the "Word Count" example that ships with Hadoop Core.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import java.util.Map;
    import java.util.HashMap;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;
    import org.apache.commons.lang.StringEscapeUtils;

    public class CommentWordCount {

        public static class WordCountMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            ...
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            ...
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String[] otherArgs =
                    new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
                System.err.println("Usage: CommentWordCount <in> <out>");
                System.exit(2);
            }

            Job job = new Job(conf, "StackOverflow Comment Word Count");
            job.setJarByClass(CommentWordCount.class);
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
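The mapper and reducer bodies are elided above. Based on the standard Word Count example that ships with Hadoop and the imports in the driver, they could look something like the following sketch; the pre-processing of the Text attribute (and the reuse of the hypothetical RowParser helper shown earlier) is our own illustration and may differ from the original code.

    // Sketch of the two nested classes of CommentWordCount, assuming the
    // imports already present in the driver above.
    public static class WordCountMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input value is one <row .../> line; pull out the Text attribute.
            Map<String, String> parsed = RowParser.parseAttributes(value.toString());
            String txt = parsed.get("Text");
            if (txt == null) {
                return; // Not a comment record; skip it.
            }
            // Unescape HTML entities (&quot;, &amp;, ...), lowercase, and strip
            // punctuation so that "Hadoop?" and "hadoop" count as the same word.
            txt = StringEscapeUtils.unescapeHtml(txt).toLowerCase()
                    .replaceAll("[^a-z0-9\\s]", "");
            StringTokenizer itr = new StringTokenizer(txt);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // Emit (word, 1) as an intermediate pair.
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // All counts for one word arrive together; add them up and emit the total.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

Because the driver also registers IntSumReducer as the combiner, partial sums are computed on the map side before the shuffle, which cuts down the amount of intermediate data sent over the network.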