Google, “Dremel: Interactive Analysis of Web-Scale Datasets,” VLDB, 2010. [PDF]
Dremel is Google’s interactive ad hoc query system for analysis of read-only nested data. Unlike MapReduce, Dremel is aimed toward data exploration, monitoring, and debugging, where near real-time performance is of utmost importance. To achieve scalability and performance, Dremel builds upon three key ideas:
- It uses a column-striped storage representation on top of GFS, which lets it store nested data in a compressed but easily searchable form and read far less data from secondary storage. Dremel uses finite state machines (FSMs) to quickly assemble records from this compact representation. The paper shows that this columnar representation reduces completion times by an order of magnitude even for regular MapReduce jobs.
- It uses a serving tree architecture to rewrite queries during work distribution and to aggregate results at multiple levels. This minimizes data movement and speeds up query results. This optimization accounts for roughly another order of magnitude of speedup over MapReduce.
- It provides a high-level (limited) SQL-like query language that translates to native execution as opposed to getting translated to a sequence of MapReduce jobs.
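To make the serving tree idea concrete, here is a minimal sketch of multi-level aggregation for a query shaped like SELECT key, COUNT(*) ... GROUP BY key. It is not Dremel's implementation: the names leaf_scan, merge_partials, and serving_tree, the fanout parameter, and the in-memory lists standing in for tablets are all assumptions made for illustration. The point is only that each level ships small partial aggregates upward instead of raw rows.

```python
from collections import Counter
from typing import List


def leaf_scan(tablet: List[str]) -> Counter:
    """Hypothetical leaf server: partial COUNT(*) GROUP BY key over one tablet."""
    return Counter(tablet)


def merge_partials(partials: List[Counter]) -> Counter:
    """Hypothetical intermediate server: combine partial counts from its children."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # adds counts key by key
    return total


def serving_tree(tablets: List[List[str]], fanout: int = 2) -> Counter:
    """Aggregate bottom-up; each pass of the loop models one level of the tree."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [leaf_scan(t) for t in tablets]
    while len(level) > 1:
        # Each parent merges the results of up to `fanout` children.
        level = [
            merge_partials(level[i:i + fanout])
            for i in range(0, len(level), fanout)
        ]
    return level[0] if level else Counter()


# Four toy "tablets" (assumed data), each a list of group-by keys.
tablets = [["a", "b", "a"], ["b", "c"], ["a"], ["c", "c", "b", "a"]]
result = serving_tree(tablets, fanout=2)
print(dict(result))
```

With four tablets and a fanout of 2, the sketch runs two merge levels above the leaves. Only per-key counts cross each level, which is the data-movement saving the paper attributes to the serving tree.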
Dremel is fast, but I wonder how much faster it could go if it cached intermediate results for reuse in subsequent queries; this should have more impact for data exploration workloads. The paper is very terse (maybe due to the VLDB page limit), and I found it hard to read even though none of the concepts were that complicated.