Hadoop & MapReduce Tutorial | Parallelizing Map and Reduce with Hadoop

Map

Map applies a function to each element in a list, returning a list of the results. For example, here is code in Python which uses map to triple each element in a list:

def triple(n):
    return n * 3

# In Python 3, map returns an iterator, so wrap it in list() to print it.
print(list(map(triple, [1, 2, 3, 4])))

This code prints the following list: [3, 6, 9, 12]

Reduce

Reduce also applies a function to the elements of a list, but instead of being applied to each element separately, the function is applied to two arguments: the "current result" and the next element in the list. The current result is initialized by applying the function to the first two elements of the list. This lets you build up a single result (which can be another list, but is often a scalar value) from a list. This is best illustrated with another simple Python example:

from functools import reduce  # in Python 3, reduce lives in functools

def add(n1, n2):  # named add to avoid shadowing the built-in sum
    return n1 + n2

print(reduce(add, [1, 2, 3, 4]))

This code prints 10. You can think of reduce as making three nested function calls, like this:

add(add(add(1, 2), 3), 4)

which evaluates step by step as add(3, 3), then add(6, 4), giving 10.

Parallelizing Map and Reduce with Hadoop

In the MapReduce programming abstraction, the map function takes a single <key, value> pair and produces zero or more <key, value> pairs as a result. This differs slightly from the functional-programming map, which always produces exactly one result per invocation. The MapReduce style of map lets you produce many intermediate pairs, which can then be further analyzed with reduce.
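To make this concrete, here is a minimal sketch in plain Python of a MapReduce-style map function. This is not Hadoop's actual API; the name mr_map and the <doc_id, text> input pair are illustrative. It takes one input pair and emits zero or more <word, 1> pairs:

# A sketch of a MapReduce-style map: one input pair in, zero or more
# output pairs out. An empty document would emit no pairs at all.
def mr_map(doc_id, text):
    for word in text.split():
        yield (word, 1)  # one intermediate pair per word

print(list(mr_map("doc1", "a the a")))
# prints [('a', 1), ('the', 1), ('a', 1)]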

In MapReduce, the reducer is also more flexible than its functional-programming counterpart. While it is similar in spirit to the reduce described in the Python example above, it is not limited to processing the pairs two at a time; instead, it is given an iterator over all values that share the same key, and the programmer can walk that iterator in any way they choose. And like the MapReduce map, the MapReduce reduce can emit an arbitrary number of pairs, although applications often want to reduce to a single output pair.
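Continuing the same toy sketch (again, not Hadoop's real API; mr_reduce is our own name): a reduce function receives a key and an iterator over all values emitted for that key, and may emit any number of output pairs, here exactly one:

def mr_reduce(word, counts):
    # counts is an iterator over every value emitted for this key;
    # we walk it once and emit a single <word, total> pair.
    yield (word, sum(counts))

print(list(mr_reduce("a", iter([37, 10, 3]))))
# prints [('a', 50)]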

The reduce phase in MapReduce joins together, in some way, the pairs that have the same key. For example, say you run a word count over N documents, with each document on a separate node. Your map function processes each document separately, producing many pairs of the form <word, count>. The documents will most likely have many words in common, so those counts need to be combined. The MapReduce framework automatically starts a reduce task for each set of pairs with the same key (in this case, counts for the same word), and the reducer can simply sum those counts, producing a single pair of the form <word, total_count>. For example, reduce might transform one set of pairs into another like this:

a       37
the     20                    a       50
a       10                    and     16
a        3    — reduce —>     the     32
and     16                    zygote   1
zygote   1
the     12
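The grouping step between map and reduce (the "shuffle") can be simulated in a few lines of plain Python. This is a toy, single-process sketch of what Hadoop performs across many nodes, using the pairs from the table above:

from collections import defaultdict

pairs = [("a", 37), ("the", 20), ("a", 10), ("a", 3),
         ("and", 16), ("zygote", 1), ("the", 12)]

# Shuffle: group values by key, as Hadoop does before calling reduce.
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce: sum each group, producing one <word, total_count> pair per key.
for word, counts in groups.items():
    print(word, sum(counts))
# prints: a 50, the 32, and 16, zygote 1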

By separating the map task (where the computation on each input element is independent of the others) from the reduce task (where all pairs with the same key must be processed together on the same node), MapReduce makes the parallelism explicit: map tasks can run simultaneously on many nodes, and reduce tasks for different keys can likewise run in parallel. This is why MapReduce splits the computation into two separate phases.
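To see why this separation helps, here is a toy illustration in plain Python with multiprocessing (not Hadoop itself), in which the independent map tasks run on separate worker processes and only the final reduce step brings the results together:

from multiprocessing import Pool
from collections import Counter

def count_words(text):
    # The "map task" for one document: independent of all other documents.
    return Counter(text.split())

if __name__ == "__main__":
    docs = ["a the a", "the zygote", "a and"]
    with Pool() as pool:
        partials = pool.map(count_words, docs)  # map tasks run in parallel
    total = sum(partials, Counter())            # the "reduce" step
    print(total)
    # prints Counter({'a': 3, 'the': 2, 'zygote': 1, 'and': 1})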
