Tag Archives: MapReduce

Dremel: Interactive Analysis of Web-Scale Datasets

Google, “Dremel: Interactive Analysis of Web-Scale Datasets,” VLDB, 2010. [PDF]

Summary

Dremel is Google’s interactive, ad hoc query system for analyzing read-only nested data. Unlike MapReduce, Dremel is aimed at data exploration, monitoring, and debugging, where near-real-time performance is of utmost importance. To achieve scalability and performance, Dremel builds on three key ideas:

  1. It uses a column-striped storage representation on top of GFS, which stores nested data in a compressed but easily searchable form and lets queries read far less data from secondary storage (see the first sketch after this list). Dremel uses finite state machines (FSMs) to quickly reassemble records from this compact representation. The paper shows that the columnar representation reduces completion times by an order of magnitude even for regular MapReduce jobs.
  2. It uses a serving-tree architecture to rewrite queries during work distribution and to aggregate results at multiple levels (second sketch below). This minimizes data movement and speeds up queries, accounting for roughly another order of magnitude of speedup over MapReduce.
  3. It provides a high-level (if limited) SQL-like query language that is translated to native execution rather than to a sequence of MapReduce jobs.
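To make idea (1) concrete, here is a minimal Python sketch of column-striping (hypothetical helper names, not Dremel's actual format). It shreds nested records into one value stream per field path, so a query touching only a few fields reads only those streams. It deliberately omits the repetition and definition levels that Dremel stores alongside each value to make lossless reassembly of nested records possible.

```python
# Toy column-striper: nested records -> one value stream per field path.
# (Hypothetical sketch; omits Dremel's repetition/definition levels.)

def stripe(record, prefix="", columns=None):
    """Flatten one nested record into per-field column streams."""
    if columns is None:
        columns = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            stripe(value, path, columns)
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    stripe(item, path, columns)
                else:
                    columns.setdefault(path, []).append(item)
        else:
            columns.setdefault(path, []).append(value)
    return columns

# Two documents with nested, repeated fields:
docs = [
    {"doc_id": 10, "name": {"lang": ["en", "en-gb"]}},
    {"doc_id": 20, "name": {"lang": ["fr"]}},
]
columns = {}
for d in docs:
    stripe(d, columns=columns)

# A query touching only 'name.lang' now reads a single stream:
print(columns["name.lang"])  # ['en', 'en-gb', 'fr']
```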
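And a toy version of idea (2): the serving tree pushes partial aggregation down to the leaf servers and merges partial results on the way up (again a hypothetical sketch; Dremel actually rewrites the SQL query at every level of the tree).

```python
# Toy serving tree: COUNT aggregation computed at leaves over their
# tablets, partials merged at each internal level up to the root.
from collections import Counter

def leaf_count(tablet):
    """Leaf server: partially aggregate over its local tablet."""
    return Counter(row["lang"] for row in tablet)

def merge(partials):
    """Intermediate/root server: combine partial aggregates."""
    total = Counter()
    for p in partials:
        total += p
    return total

tablets = [
    [{"lang": "en"}, {"lang": "fr"}],
    [{"lang": "en"}],
    [{"lang": "de"}, {"lang": "en"}],
]
# Leaves aggregate in parallel; the root only merges small partials.
print(merge(leaf_count(t) for t in tablets))
# Counter({'en': 3, 'fr': 1, 'de': 1})
```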

Comments

Dremel is fast, but I wonder how much faster it could go if it allowed caching of intermediate results for use in subsequent queries; this should have even more impact on data exploration workloads. The paper is very terse (perhaps due to the VLDB page limit), and I found it hard to read even though none of the concepts is that complicated.

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly, “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks,” EuroSys, 2007. [PDF]

Summary

Dryad is Microsoft’s answer to the MapReduce paradigm, albeit at a (slightly) lower level and with greater flexibility. Like MapReduce, Dryad lets developers think about what to do with the data, while Dryad itself takes care of distribution, fault tolerance, stragglers, etc. Unlike MapReduce, Dryad enables expressive dataflow models in the form of directed acyclic graphs (DAGs). Dryad also adds flexibility in how computation vertices in a DAG communicate: via disk files, TCP pipes, and shared-memory queues, as opposed to the disk-only communication promoted by MapReduce. This could allow fully in-memory dataflows for faster data mining and iterative jobs.
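To give a rough feel for the model, here is a hypothetical Python sketch (not Dryad's actual C++ graph-composition API, which builds graphs with operators like >= and >>): a job is a DAG of vertices running arbitrary sequential code, wired together by channels whose transport is chosen independently of the graph's shape.

```python
# Toy DAG in the spirit of Dryad's model (illustrative only). Each
# vertex runs sequential user code; each edge records a transport
# (file, TCP pipe, shared-memory FIFO) chosen per channel.

class Vertex:
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
        self.inputs = []  # upstream (vertex, transport) pairs

def connect(src, dst, transport="file"):
    dst.inputs.append((src, transport))

read1 = Vertex("R1", lambda _: ["a b", "b c"])
read2 = Vertex("R2", lambda _: ["c d"])
parse = Vertex("P",  lambda lines: [w for l in lines for w in l.split()])
count = Vertex("C",  lambda words: {w: words.count(w) for w in set(words)})

connect(read1, parse, transport="file")            # spill to disk
connect(read2, parse, transport="file")
connect(parse, count, transport="shared-memory")   # stay in memory

def run(vertex):
    """Naive topological execution of the DAG on one machine."""
    data = [x for src, _ in vertex.inputs for x in run(src)]
    out = vertex.fn(data or None)
    return out if isinstance(out, list) else [out]

print(run(count)[0])  # counts: a:1, b:2, c:2, d:1 (dict order may vary)
```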

Comments

By giving more power to developers, Dryad sacrificed simplicity. This is the exact opposite of the tradeoff made in the MapReduce design. Dryad is great as a substrate for higher-level, more usable systems. This became apparent when Microsoft later built DryadLINQ and SCOPE on top of Dryad, which are easy to use and powerful but take away unnecessary flexibility from developers. Dryad's influence is significant, at least within Microsoft: as far as we know, they heavily use DryadLINQ and SCOPE for most of their data-intensive workloads.

The paper itself has some obvious flaws, one of them being its evaluation. The comparison against Microsoft SQL Server is nothing to get too excited about. It would have been nice if they had compared against some implementation of MapReduce (Hadoop was not publicly available when Dryad was being built, but they could easily have used Dryad to imitate MapReduce). Dryad is expected to beat MapReduce on multi-stage workloads, however. I found the paper neither easy to read nor easy to follow, and often boring. Nevertheless, it is an interesting piece of work and the obvious next step beyond MapReduce.

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean, Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI, 2004. [PDF]

Summary

MapReduce is a programming model and associated implementation for processing and generating large data sets in a parallel, fault-tolerant, distributed, and load-balanced manner. The programming model has two main user-provided functions. The map function takes an input pair and produces a set of intermediate key/value pairs. The reduce function accepts an intermediate key and the set of values for that key and processes them into some user-defined result. The biggest gain in MapReduce is that the user is concerned only with these two functions; everything else, including distribution of tasks, scheduling, fault tolerance, and scalability, is taken care of by the framework.
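The canonical example from the paper is word counting. Here is a minimal runnable sketch in Python (the paper gives similar pseudocode; the tiny in-process driver below stands in for the framework's shuffle between the two phases):

```python
# Word count, the canonical MapReduce example.
from collections import defaultdict

def map_fn(doc_name, contents):
    """Emit (word, 1) for every word in the document."""
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    """Sum the partial counts for one word."""
    yield word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):   # map phase
            groups[k].append(v)           # shuffle: group by key
    return [out for k in sorted(groups)   # reduce phase
                for out in reduce_fn(k, groups[k])]

docs = [("d1", "to be or not to be"), ("d2", "to do")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('be', 2), ('do', 1), ('not', 1), ('or', 1), ('to', 3)]
```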

MapReduce is implemented on large clusters of commodity hardware. The input data is automatically partitioned into M splits, which map invocations (mappers) process in parallel on many different machines. Reduce invocations (reducers) are distributed by partitioning the intermediate key space into R pieces (set by the user) using a partitioning function. A single master coordinates everything. Optional combiner functions can run after the mappers to reduce the amount of intermediate key/value data shipped to the reducers and thus decrease network usage.
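The paper's default partitioning function is simply hash(key) mod R. A sketch (zlib.crc32 stands in for the paper's unspecified hash, since Python's built-in hash() is randomized across runs):

```python
# Default partitioning: hash(key) mod R. All values for a given key
# land in the same reduce partition, regardless of which mapper
# emitted them.
import zlib

def partition(key: str, R: int) -> int:
    return zlib.crc32(key.encode()) % R

R = 4
for key in ["apple", "banana", "apple"]:
    print(key, "-> reducer", partition(key, R))
# 'apple' maps to the same reducer both times.
```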

Mappers read their input from, and reducers write their output back to, a distributed file system (e.g., GFS). Intermediate data produced by the mappers is buffered in memory and saved to local disks, then pulled by the reducers via RPC. Failures in MapReduce are handled by rescheduling and restarting tasks. MapReduce tries to maximize data locality to minimize data transfer over the network.

Comments

MapReduce is one of the most influential papers, as well as one of the most influential computing paradigms, of this century. It has fundamentally changed how people think about processing large data and has given rise to many systems (including Hadoop, its open-source implementation). While MapReduce is not the best at everything, it is great at what it does: embarrassingly parallel, data-intensive computing.

The fundamental tradeoff here is between the simplicity, generality, and efficiency of a high-level programming abstraction and fine-grained control. It is surprising how many things MapReduce can do even without the fine-grained control of something like MPI.

Spark short paper has been accepted at HotCloud’10

An initial overview of our ongoing work on Spark, an iterative and interactive framework for cluster computing, has been accepted at HotCloud’10. I joined the project last February, while Matei has been working on it since last fall. I will upload the paper to the publications page once we have taken care of the reviewers’ comments and suggestions; meanwhile, you can read the technical report version.

This year HotCloud accepted 18 papers (24% of submissions), and the PC is thinking about extending the workshop to a second day starting next year.