Hadoop & MapReduce Tutorial | MapReduce

MapReduce

MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, in which individual elements are broken down into tuples (key/value pairs). The Reduce task then takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the name implies, the reduce task is always performed after the map task.
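As a minimal sketch of this idea in plain Python (the data and function names are illustrative, not part of any framework API), the map phase below breaks lines of text into (word, 1) tuples, and the reduce phase combines the tuples that share a key into per-word counts:

    from itertools import groupby
    from operator import itemgetter

    def map_fn(line):
        # Map: break the input down into (key, value) tuples, one per word
        return [(word, 1) for word in line.split()]

    def reduce_fn(word, counts):
        # Reduce: combine the tuples sharing a key into a smaller result
        return (word, sum(counts))

    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    tuples = [pair for line in lines for pair in map_fn(line)]
    tuples.sort(key=itemgetter(0))  # bring tuples with the same key together
    result = [reduce_fn(key, [v for _, v in group])
              for key, group in groupby(tuples, key=itemgetter(0))]
    print(result)  # [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ...]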

The major advantage of MapReduce is that it makes it easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers.

MapReduce Basics

A MapReduce program is composed of a Map() procedure (method) that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The “MapReduce System” (also called “infrastructure” or “framework”) orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
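That students example can be sketched directly in Python (the student names below are made up for illustration):

    from collections import defaultdict

    students = ["Alice Smith", "Bob Jones", "Alice Brown", "Carol White"]

    # Map(): sort students into queues, one queue for each first name
    queues = defaultdict(list)
    for student in students:
        queues[student.split()[0]].append(student)

    # Reduce(): count the students in each queue, yielding name frequencies
    frequencies = {name: len(queue) for name, queue in queues.items()}
    print(frequencies)  # {'Alice': 2, 'Bob': 1, 'Carol': 1}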

MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). MapReduce can take advantage of data locality, processing data on or near the storage assets in order to reduce the distance over which it must be transmitted.

  • “Map” step: Each worker node applies the “map()” function to the local data and writes the output to temporary storage. A master node ensures that only one copy of any redundant input data is processed.
  • “Shuffle” step: Worker nodes redistribute data based on the output keys (produced by the “map()” function), such that all data belonging to one key is located on the same worker node.
  • “Reduce” step: Worker nodes now process each group of output data, per key, in parallel (a single-process simulation of these three steps follows the list).
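The sketch below simulates the three steps in one Python process; the “workers” are just dictionary entries, and hash partitioning stands in for the network redistribution a real framework performs:

    from collections import defaultdict

    NUM_WORKERS = 3
    lines = ["the quick brown fox", "the lazy dog", "the fox"]

    # Map step: apply map() to the local data, emitting (key, value) tuples
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle step: redistribute tuples so all data belonging to one key
    # lands on the same worker (hash() is stable within a single run,
    # which is all this simulation needs)
    workers = defaultdict(lambda: defaultdict(list))
    for key, value in mapped:
        workers[hash(key) % NUM_WORKERS][key].append(value)

    # Reduce step: each worker processes its groups of output data, per key
    for worker_id, groups in sorted(workers.items()):
        for key, values in groups.items():
            print(worker_id, key, sum(values))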

Another way to look at MapReduce is as a five-step parallel and distributed computation (a small end-to-end sketch in Python follows the list):

  • Prepare the Map() input – the system designates Map processors, assigns the input key value K1 that each processor would work on, and provides that processor with all the input data associated with that key value.
  • Run the user-provided Map() code – Map() is run exactly once for each K1 key value, generating output organized by key values K2.
  • “Shuffle” the Map output to the Reduce processors – the MapReduce system designates Reduce processors, assigns the K2 key value each processor should work on, and provides that processor with all the Map-generated data associated with that key value.
  • Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value produced by the Map step.
  • Produce the final output – the MapReduce system collects all the Reduce output, and sorts it by K2 to produce the final outcome.
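The five steps can be traced end to end in the sketch below. Here K1 is a station name and K2 is a city (both invented for illustration), and Reduce() computes the maximum reading per city:

    from collections import defaultdict

    # Step 1 - prepare the Map() input: each K1 (a station name, here
    # made up for illustration) is paired with its associated input data
    inputs = {
        "station-1": [("london", 12), ("paris", 15)],
        "station-2": [("london", 9), ("rome", 21)],
    }

    # Step 2 - run Map() exactly once per K1, emitting output keyed by K2
    def map_fn(k1, readings):
        for city, temperature in readings:
            yield city, temperature

    # Step 3 - shuffle: group all Map-generated data by its K2 key value
    groups = defaultdict(list)
    for k1, readings in inputs.items():
        for k2, value in map_fn(k1, readings):
            groups[k2].append(value)

    # Step 4 - run Reduce() exactly once per K2 key value
    def reduce_fn(k2, values):
        return (k2, max(values))

    # Step 5 - collect the Reduce output and sort it by K2
    final = sorted(reduce_fn(k2, values) for k2, values in groups.items())
    print(final)  # [('london', 12), ('paris', 15), ('rome', 21)]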

The model is inspired by the map and reduce functions commonly used in functional programming and found in many programming languages, although their purpose in the MapReduce framework is not the same as in their original forms. We introduce map and reduce using Python.
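In their original functional forms, Python's built-in map() and functools.reduce() look like this:

    from functools import reduce

    numbers = [1, 2, 3, 4]

    # map: apply a function to every element, producing a new sequence
    squares = list(map(lambda x: x * x, numbers))    # [1, 4, 9, 16]

    # reduce: combine the elements pairwise into a single result
    total = reduce(lambda acc, x: acc + x, squares)  # 30

    print(squares, total)

In the MapReduce framework, by contrast, the map function emits key/value pairs, and the reduce function is applied once per key, over all the values that share that key, as the earlier examples show.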

Apply for Big Data and Hadoop Developer Certification

https://www.vskills.in/certification/certified-big-data-and-apache-hadoop-developer
