MapReduce is a processing technique and programming model for distributed computing; its best-known open-source implementation, Hadoop, is written in Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers.
A MapReduce program is composed of a Map() procedure (method) that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The “MapReduce System” (also called “infrastructure” or “framework”) orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). MapReduce can take advantage of locality of data, processing it on or near the storage assets in order to reduce the distance over which it must be transmitted.
- “Map” step: Each worker node applies the “map()” function to the local data, and writes the output to temporary storage. A master node ensures that only one copy of each piece of redundant input data is processed.
- “Shuffle” step: Worker nodes redistribute data based on the output keys (produced by the “map()” function), such that all data belonging to one key is located on the same worker node.
- “Reduce” step: Worker nodes now process each group of output data, per key, in parallel.
Another way to look at MapReduce is as a five-step parallel and distributed computation:
- Prepare the Map() input – the “MapReduce system” designates Map processors, assigns the input key value K1 that each processor would work on, and provides that processor with all the input data associated with that key value.
- Run the user-provided Map() code – Map() is run exactly once for each K1 key value, generating output organized by key values K2.
- “Shuffle” the Map output to the Reduce processors – the MapReduce system designates Reduce processors, assigns the K2 key value each processor should work on, and provides that processor with all the Map-generated data associated with that key value.
- Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value produced by the Map step.
- Produce the final output – the MapReduce system collects all the Reduce output, and sorts it by K2 to produce the final outcome.
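These five steps can be sketched as a toy, single-process simulation in Python. This is illustrative only: a real MapReduce system runs the steps across many nodes, and the function names here are our own, not part of any framework API.

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    """Toy single-process simulation of the five MapReduce steps."""
    # Steps 1-2: run map_fn once per (k1, v1) input pair,
    # collecting the (k2, v2) pairs it emits.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(map_fn(k1, v1))

    # Step 3: "shuffle" -- group all intermediate values by their k2 key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Steps 4-5: run reduce_fn once per k2 key, then sort by k2.
    return sorted((k2, reduce_fn(k2, values)) for k2, values in groups.items())

# Word count: the mapper emits (word, 1) for each word in a line.
lines = [(0, "the cat sat"), (1, "the cat ran")]
mapper = lambda k1, line: [(word, 1) for word in line.split()]
reducer = lambda k2, counts: sum(counts)
print(map_reduce(lines, mapper, reducer))
# [('cat', 2), ('ran', 1), ('sat', 1), ('the', 2)]
```

Note how the user supplies only the mapper and reducer; the grouping and ordering are handled by the surrounding machinery, which is exactly the division of labor MapReduce is built on.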
The model is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms. We introduce map and reduce using Python.
Map
Map applies a function to each element in a list, returning a list of the results. For example, here is Python code which uses map to triple each element in a list (in Python 3, map returns a lazy iterator, so we wrap it in list() to display the values):
def triple(n):
    return n * 3

print(list(map(triple, [1, 2, 3, 4])))
This code prints the following list: [3, 6, 9, 12]
Reduce
Reduce also applies a function to elements in a list, but instead of being applied to each element separately, the function is applied to two arguments, the “current result” and the next element in the list. The current result is initialized to the result of calling the function on the first two elements of the list. This allows you to build a single result (which can be another list but is often a scalar value) from a list. This is best illustrated in another simple Python example (in Python 3, reduce lives in the functools module; note that this definition shadows the built-in sum):
from functools import reduce

def sum(n1, n2):
    return n1 + n2

print(reduce(sum, [1, 2, 3, 4]))
You can think of this call as making three nested function calls like this:
sum(sum(sum(1, 2), 3), 4)
Parallelizing Map and Reduce with Hadoop
In the MapReduce programming abstraction, the map function takes a single <key, value> pair, and produces zero or more <key, value> pairs as a result. This differs slightly from the functional programming construct map which always produces one and only one result for each invocation of map. The MapReduce style of map allows you to produce many intermediate pairs which can then be further analyzed with reduce.
In MapReduce, the reducer is also more flexible than its functional programming counterpart. While reduce is similar in spirit to the reduce described in the Python example above, it is not limited to processing the list of pairs two-at-a-time, but rather is given an iterator over all pairs that have the same key, and that list can be walked over in any way the programmer chooses. Also like the MapReduce map, the MapReduce reduce can emit an arbitrary number of pairs, although applications often will want to just reduce to a single output pair.
The reduce phase in MapReduce joins together in some way those pairs which have the same key. For example, say that you do a word count on N documents, and each document is on a separate node. Your map function will process each document separately, producing many pairs of the following form: <word, count>. The documents will most likely have many words in common, so those counts will need to be combined. The MapReduce algorithm automatically starts a reduce process for each set of pairs with the same key (in this case, counts for the same word), and the reducer can simply sum those counts together, producing a single pair of the form <word, total_count>. For example, reduce might transform one set of pairs into another like this:
(a, 37), (the, 20), (a, 10), (a, 3), (and, 16), (zygote, 1), (the, 12)
  --reduce-->
(a, 50), (the, 32), (and, 16), (zygote, 1)
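The reduce phase on these pairs can be simulated in a few lines of Python. This is a toy sketch: itertools.groupby stands in for the framework's shuffle-and-group step, which on a real cluster moves pairs between nodes.

```python
from itertools import groupby
from operator import itemgetter

# Intermediate (word, count) pairs as emitted by the mappers,
# matching the input pairs of the example above.
pairs = [("a", 37), ("the", 20), ("a", 10), ("a", 3),
         ("and", 16), ("zygote", 1), ("the", 12)]

# Shuffle: sort by key so that equal keys become adjacent, then group.
pairs.sort(key=itemgetter(0))

# Each reducer sees all pairs sharing one key and sums their counts
# into a single (word, total_count) pair.
results = [(word, sum(count for _, count in group))
           for word, group in groupby(pairs, key=itemgetter(0))]
print(results)
# [('a', 50), ('and', 16), ('the', 32), ('zygote', 1)]
```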
By separating the map task (where the computation on each input element is independent of the others) from the reduce task (where pairs with the same key must be processed together on the same node), the MapReduce algorithm can increase parallelism; this is why map and reduce are split into two separate phases.
MapReduce works in a master-worker (historically, master-slave) fashion: the JobTracker acts as the master and the TaskTrackers act as the workers. MapReduce has two major phases, a Map phase and a Reduce phase. The Map phase processes parts of the input data using mappers, based on the logic defined in the map() function. The Reduce phase aggregates the data using reducers, based on the logic defined in the reduce() function. Depending upon the problem at hand, we can have one reduce task, multiple reduce tasks, or no reduce tasks at all. MapReduce has built-in fault tolerance and hence can run on commodity hardware. It takes care of distributing the data across nodes, assigning tasks to each node, collecting the results back, re-running tasks in case of node failures, consolidating results, and so on. MapReduce processes data in the form of (key, value) pairs, so we need to fit our business problem into this key-value arrangement.
Logical View – The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain:
Map(k1,v1) → list(k2,v2)
The Map function is applied in parallel to every pair in the input dataset. This produces a list of pairs for each call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, creating one group for each key.
The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:
Reduce(k2, list (v2)) → list(v3)
Each Reduce call typically produces either one value v3 or an empty return, though one call is allowed to return more than one value. The returns of all calls are collected as the desired result list.
Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values. This behavior is different from the typical functional programming map and reduce combination, which accepts a list of arbitrary values and returns one single value that combines all the values returned by map.
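These two signatures can be mirrored with Python type hints. Below is a sketch for word count, where k1 is a line number and v1 a line of text; the function names are illustrative, not part of any framework.

```python
def map_fn(k1: int, v1: str) -> list[tuple[str, int]]:
    # Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in the line.
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2: str, v2s: list[int]) -> list[int]:
    # Reduce(k2, list(v2)) -> list(v3): here, a single total per key.
    return [sum(v2s)]

print(map_fn(0, "to be or not to be"))
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
print(reduce_fn("to", [1, 1]))
# [2]
```

Note that map_fn may emit zero, one, or many pairs per call, and reduce_fn returns a list of output values, matching the flexibility described above.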
An example of MapReduce
Let’s look at a simple example. Assume you have five files, and each file contains two columns (a key and a value in Hadoop terms) that represent a city and the corresponding temperature recorded in that city on various measurement days. Of course, we’ve made this example very simple so it’s easy to follow. A real application won’t be quite so tidy: it’s likely to contain millions or even billions of rows, and they might not be neatly formatted at all. But no matter how big or small the amount of data you need to analyze, the key principles we’re covering here remain the same. Either way, in this example, city is the key and temperature is the value.
Toronto, 20
Whitby, 25
New York, 22
Rome, 32
Toronto, 4
Rome, 33
New York, 18
Out of all the data we have collected, we want to find the maximum temperature for each city across all of the data files (note that each file might have the same city represented multiple times). Using the MapReduce framework, we can break this down into five map tasks, where each mapper works on one of the five files, goes through the data, and returns the maximum temperature for each city. For example, the results produced from one mapper task for the data above would look like this:
(Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33)
Let’s assume the other four mapper tasks (working on the other four files not shown here) produced the following intermediate results:
(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)
(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)
All five of these output streams would be fed into the reduce tasks, which combine the input results and output a single value for each city, producing a final result set as follows:
(Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)
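The whole example can be simulated in Python. This is a toy sketch: only the first file's rows are shown in the text, so the other four mappers' outputs are taken verbatim from the intermediate results listed above rather than recomputed from their (unknown) files.

```python
from collections import defaultdict

# The first file from the example, as (city, temperature) rows.
file1 = [("Toronto", 20), ("Whitby", 25), ("New York", 22), ("Rome", 32),
         ("Toronto", 4), ("Rome", 33), ("New York", 18)]

def mapper(rows):
    """Per-file map task: emit the max temperature seen for each city."""
    best = {}
    for city, temp in rows:
        best[city] = max(temp, best.get(city, temp))
    return list(best.items())

def reducer(intermediate):
    """Reduce step: combine all mapper outputs, keeping the max per city."""
    best = defaultdict(lambda: float("-inf"))
    for city, temp in intermediate:
        best[city] = max(best[city], temp)
    return dict(best)

# Intermediate pairs from all five mappers (the last four taken from the text).
intermediate = mapper(file1) + [
    ("Toronto", 18), ("Whitby", 27), ("New York", 32), ("Rome", 37),
    ("Toronto", 32), ("Whitby", 20), ("New York", 33), ("Rome", 38),
    ("Toronto", 22), ("Whitby", 19), ("New York", 20), ("Rome", 31),
    ("Toronto", 31), ("Whitby", 22), ("New York", 19), ("Rome", 30),
]
print(reducer(intermediate))
# {'Toronto': 32, 'Whitby': 27, 'New York': 33, 'Rome': 38}
```

Here the mapper doubles as a combiner, pre-aggregating within each file so that less data has to cross the network before the final reduce.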
As an analogy, you can think of map and reduce tasks as the way a census was conducted in Roman times, where the census bureau would dispatch its people to each city in the empire. Each census taker in each city would be tasked to count the number of people in that city and then return their results to the capital city. There, the results from each city would be reduced to a single count (sum of all cities) to determine the overall population of the empire. This mapping of people to cities, in parallel, and then combining the results (reducing) is much more efficient than sending a single person to count every person in the empire in a serial fashion.