MapReduce
Dremel: Interactive Analysis of Web-Scale Datasets
Google, “Dremel: Interactive Analysis of Web-Scale Datasets,” VLDB, 2010. [PDF] Summary Dremel is Google’s interactive ad hoc query system for analysis of read-only nested data. Unlike MapReduce, Dremel is aimed toward data exploration, monitoring, and debugging, where near real-time performance is of utmost importance. To achieve scalability and performance, Dremel builds upon three key ideas: ...
Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly, “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks,” EuroSys, 2007. [PDF] Summary Dryad is Microsoft’s answer to the MapReduce paradigm, albeit at a (slightly) lower level with greater flexibility. Like MapReduce, Dryad allows developers to think about what to do with the data, and Dryad ...
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean, Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI, 2004. [PDF] Summary MapReduce is a programming model and associated implementation for processing and generating large data sets in a parallel, fault-tolerant, distributed, and load-balanced manner. There are two main functions (both user provided) in this programming model. The map function takes an input ...
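To make the two user-provided functions concrete, here is a minimal single-process sketch of the model using word count as the job. It is only an illustration, not Google's implementation: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the driver runs everything in one process, so it shows none of the parallelism, fault tolerance, or load balancing that the real system provides.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_fn(doc_name: str, text: str) -> Iterator[Tuple[str, int]]:
    """User-provided map: emit an intermediate (word, 1) pair per word."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """User-provided reduce: merge all values that share one key."""
    return word, sum(counts)


def run_mapreduce(
    inputs: Dict[str, str],
    mapper: Callable[[str, str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Hypothetical in-process driver (assumption: no distribution at all)."""
    # Map phase, followed by grouping intermediate values by key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in inputs.items():
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: one reducer call per distinct intermediate key.
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


if __name__ == "__main__":
    docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog the end"}
    print(run_mapreduce(docs, map_fn, reduce_fn))
```

The user writes only `map_fn` and `reduce_fn`; the grouping step in between is what the framework handles in a real deployment.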
Spark short paper has been accepted at HotCloud’10
An initial overview of our ongoing work on Spark, an iterative and interactive framework for cluster computing, has been accepted at HotCloud’10. I joined the project last February, while Matei has been working on it since last fall. I will upload the paper to the publications page once we have taken care of ...