MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI, 2004. [PDF]

Summary

MapReduce is a programming model and associated implementation for processing and generating large data sets in a parallel, fault-tolerant, distributed, and load-balanced manner. The model has two main user-provided functions. The map function takes an input pair and produces a set of intermediate key/value pairs. The reduce function accepts an intermediate key and the set of values for that key, and processes them to produce some user-defined result. The biggest gain of MapReduce is that the user only has to worry about these two functions; everything else, including distribution of tasks, scheduling, fault tolerance, and scalability, is taken care of by the framework.
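To make the model concrete, here is the paper's canonical word-count example sketched in Python (the paper presents it in C++-style pseudocode); the emit callback is a stand-in for the framework's output interface, not an actual API:

```python
def map_fn(key, value, emit):
    # key: document name (unused here); value: document contents.
    # Emits one intermediate (word, 1) pair per word occurrence.
    for word in value.split():
        emit(word, 1)

def reduce_fn(key, values, emit):
    # key: a word; values: all counts emitted for that word.
    emit(key, sum(values))
```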

MapReduce is implemented on large clusters of commodity hardware. Map invocations (mappers) automatically partition the input data into M splits, which are processed in parallel on many different machines. Reduce invocations (reducers) are distributed by partitioning the intermediate data into R pieces (R is set by the user) using a partitioning function. A single master coordinates the whole computation. Optional combiner functions can run after the mappers to reduce the amount of intermediate key/value data shipped to the reducers, cutting network usage.
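A rough sketch of these two hooks in Python, continuing the word-count example; the hash choice and the exact signatures are my own illustrative assumptions, though hash(key) mod R is the default partitioning scheme the paper describes:

```python
import hashlib

R = 4  # number of reduce tasks (output pieces), chosen by the user

def partition(key, r=R):
    # Default-style partitioner: hash(key) mod R. All values for a key
    # land in the same reduce task; a stable hash keeps the assignment
    # deterministic across machines.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % r

def combine(key, values, emit):
    # For word count the combiner is just the reduce function applied
    # locally to each mapper's output, pre-summing counts before the
    # intermediate data crosses the network.
    emit(key, sum(values))
```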

Mappers read their input from, and reducers write their output back to, a distributed file system (e.g., GFS). Intermediate data produced by the mappers is buffered in memory, periodically written to local disk, and pulled by the reducers via RPC. Failures in MapReduce are handled by rescheduling and restarting tasks. MapReduce also tries to maximize data locality to minimize data transfers over the network.
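The end-to-end data flow can be mimicked in a single process. The sketch below (reusing map_fn, reduce_fn, and partition from above) runs the map, partition, shuffle, and reduce stages sequentially; in the real system each stage runs on many machines, intermediate data sits on the mappers' local disks, and reducers pull it over RPC:

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn, partition, R):
    # Map, partition into R buckets, group by key (the "shuffle"),
    # then reduce each bucket.
    buckets = [defaultdict(list) for _ in range(R)]

    for key, value in inputs:  # one (key, value) record per input split
        map_fn(key, value, lambda k, v: buckets[partition(k, R)][k].append(v))

    results = []
    for bucket in buckets:            # one reduce task per bucket
        for k in sorted(bucket):      # reducers see keys in sorted order
            reduce_fn(k, bucket[k], lambda rk, rv: results.append((rk, rv)))
    return results

# Usage: word count over two toy documents.
docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog the fox")]
print(run_mapreduce(docs, map_fn, reduce_fn, partition, R))
# Prints (word, count) pairs grouped by reduce bucket, e.g. the: 3, fox: 2.
```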

Comments

MapReduce is one of the most influential papers, as well as computing paradigms, of this century. It has fundamentally changed how people think about processing large data and has given rise to many systems, including Hadoop, an open-source implementation of MapReduce. While MapReduce is not the best at everything, it excels at what it does: embarrassingly parallel, data-intensive computing.

The fundamental tradeoff here is between the simplicity, generality, and efficiency of a high-level programming abstraction and the fine-grained control of a system like MPI. It is surprising how many computations MapReduce can express despite giving up that control.
