
PNUTS: Yahoo!’s Hosted Data Serving Platform

Yahoo! Research, “PNUTS: Yahoo!’s Hosted Data Serving Platform,” PVLDB, 2008. [PDF]

Summary

PNUTS is a scalable, highly available, and geographically distributed (yet low-latency) data store used by most Yahoo! online properties. To remain available in the face of partitions, it relaxes consistency to a novel notion called per-record timeline consistency; under this model, all replicas of a given record apply all updates to that record in the same order. However, the guarantee applies only to a single record, so transactions involving multiple records require considerable application logic. PNUTS also allows its users to fall back to eventual consistency, which is easier to maintain and often acceptable for many online services.

The PNUTS system is divided into regions, where each region contains a full complement of system components and a complete copy of each table. Data tables are horizontally partitioned, either by key order or by hash, into groups of records called tablets; each tablet is stored on a single server within a region. To serve a read or a write, a router first determines which tablet contains the record and which server hosts that tablet. Routers hold only a cached copy of this mapping, which is owned by the tablet controller, and they periodically poll the tablet controller to pick up any changes.
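
To make the router's role concrete, here is a minimal sketch (my own illustration, not PNUTS code; the class and names are invented) of how a router could resolve a key for an ordered table using its cached interval mapping:

    # A hypothetical router: the cached mapping is a sorted list of tablet
    # boundaries, and bisect finds the tablet whose key range covers the key.
    import bisect

    class Router:
        def __init__(self, boundaries, tablet_servers):
            # boundaries[i] is the (exclusive) upper bound of tablet i;
            # tablet_servers[i] is the storage unit currently hosting tablet i.
            # Both come from a cached copy of the tablet controller's mapping.
            self.boundaries = boundaries
            self.tablet_servers = tablet_servers

        def lookup(self, key):
            tablet_id = bisect.bisect_right(self.boundaries, key)
            return tablet_id, self.tablet_servers[tablet_id]

    # Example: three tablets split at "g" and "p".
    router = Router(boundaries=["g", "p"], tablet_servers=["su1", "su2", "su3"])
    print(router.lookup("hello"))   # -> (1, 'su2')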

Underneath it all sits the Yahoo! Message Broker (YMB), a pub/sub system that simultaneously acts as a write-ahead log for committing updates and asynchronously replicates committed data to subscribers in other regions. A message is not purged from the YMB log until PNUTS has verified that the update has been applied to all replicas of the database. YMB provides partial ordering of published messages: messages published to a particular YMB cluster are delivered to all subscribers in the order they were published, but messages published to different YMB clusters may be delivered in any order. To provide per-record timeline consistency on top of this, PNUTS implements record-level mastering: each record has a master copy in some cluster from which update propagation must start. Different records of the same table can have their masters in different regions.
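
The sketch below is my own toy illustration of the resulting guarantee (the publish() interface is invented, not the real YMB API): the record's master assigns monotonically increasing version numbers, and every replica applies updates to that record strictly in version order, buffering anything delivered out of order.

    class RecordMaster:
        def __init__(self):
            self.version = {}          # key -> latest version number

        def write(self, key, value, broker):
            self.version[key] = self.version.get(key, 0) + 1
            broker.publish((key, self.version[key], value))   # hypothetical API

    class Replica:
        def __init__(self):
            self.data = {}             # key -> (version, value)
            self.pending = {}          # key -> {version: value}, buffered updates

        def deliver(self, msg):
            key, version, value = msg
            self.pending.setdefault(key, {})[version] = value
            # Apply buffered updates strictly in version order.
            applied = self.data.get(key, (0, None))[0]
            while applied + 1 in self.pending[key]:
                applied += 1
                self.data[key] = (applied, self.pending[key].pop(applied))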

Comments

PNUTS trades off power for simplicity in design by offloading more complex tasks to application designers. Support for ordered tables is one of the key contributions of PNUTS, allowing it to implement efficient range operations. The notion of per-record timeline consistency is interesting, but it cannot support real transactions (the authors argue, however, that transactions involving multiple records are not that common). PNUTS also does not support complex queries.

Data-parallel pipelines using high-level languages

Microsoft, “DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language,” OSDI, 2008. [PDF]

Google, “FlumeJava: Easy, Efficient Data-Parallel Pipelines,” PLDI, 2010. [LINK]

Background

Data-parallel computing systems expose high-level abstractions that let users reason about distributed computations while the system handles the low-level tasks of scheduling and automated fault tolerance without any user input. At one end of the spectrum, MapReduce exposes a limited interface consisting of only a map, a shuffle, and a reduce step, which makes it hard to write workflows or data-parallel pipelines; the usual workaround is a scripting language to stitch together a sequence of MapReduce jobs. At the other end of the spectrum lies Dryad: it allows creating arbitrary DAGs to represent long and complex workflows, but without proper language support.

The key missing part from all these frameworks is an even higher-level abstraction that will

  1. be expressive enough to create arbitrary workflows/pipelines,
  2. hide data representations and underlying implementation details,
  3. permit dynamic optimization across multiple MapReduce jobs, and
  4. be accessible to developers by providing integration with familiar sequential languages.

DryadLINQ and FlumeJava

To address these challenges in Dryad and MapReduce, Microsoft and Google introduced DryadLINQ and FlumeJava, respectively. The two are very similar in design and operation but differ in implementation: DryadLINQ is integrated into the .NET family of languages through LINQ, whereas FlumeJava is a library written in Java. In both cases, developers write pipelines that look like normal C# or Java programs but internally operate on distributed datasets using a set of core primitives (including map, shuffle, combine, and reduce). Given a workflow, the compiler creates a corresponding DAG representation, optimizes that DAG into a compact set of kernels, and runs those kernels efficiently on the underlying Dryad or MapReduce framework.
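
The deferred-evaluation idea is easy to illustrate with a toy Python sketch (my own illustration; the class and method names are invented and are not the real FlumeJava or DryadLINQ APIs). Operations only record nodes in a DAG; run() first fuses adjacent element-wise stages, a simple stand-in for the real optimizers, and then executes the fused pipeline.

    class PCollection:
        def __init__(self, data=None, parent=None, fn=None):
            self.data, self.parent, self.fn = data, parent, fn   # DAG node

        def parallel_do(self, fn):
            return PCollection(parent=self, fn=fn)               # deferred map

        def run(self):
            # Collect the chain of deferred stages back to the source.
            stages, node = [], self
            while node.parent is not None:
                stages.append(node.fn)
                node = node.parent
            stages.reverse()
            # "Optimization": fuse all element-wise stages into one function,
            # so the data is traversed once instead of once per stage.
            def fused(x):
                for fn in stages:
                    x = fn(x)
                return x
            return [fused(x) for x in node.data]

    words = PCollection(data=["to", "be", "or", "not", "to", "be"])
    lengths = words.parallel_do(str.upper).parallel_do(len)
    print(lengths.run())   # -> [2, 2, 2, 3, 2, 2]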

Comments

Higher-level abstractions will become more and more common as we move toward the future of data-parallel computing. They increase productivity, efficiency, and usability while decreasing code length and the possibility of bugs. The success of FlumeJava has given rise to Crunch, its counterpart in the open-source world.

Also, these frameworks presented a rare surprise in that Google apparently came after Microsoft into the business!

Dremel: Interactive Analysis of Web-Scale Datasets

Google, “Dremel: Interactive Analysis of Web-Scale Datasets,” VLDB, 2010. [PDF]

Summary

Dremel is Google’s interactive ad hoc query system for analysis of read-only nested data. Unlike MapReduce, Dremel is aimed at data exploration, monitoring, and debugging, where near real-time performance is of utmost importance. To achieve scalability and performance, Dremel builds upon three key ideas:

  1. It uses a column-striped storage representation on top of GFS, which enables it to store nested data in a compressed but easily searchable form and to read much less data from secondary storage. Dremel uses finite state machines (FSMs) to quickly reassemble records from this compact representation. The paper shows that the columnar representation reduces completion times even for regular MapReduce jobs by an order of magnitude.
  2. It utilizes a serving tree architecture to rewrite queries during work distribution and to aggregate partial results at multiple levels (see the sketch after this list). This minimizes data movement and speeds up query results, roughly accounting for another order of magnitude of speedup over MapReduce.
  3. It provides a high-level (limited) SQL-like query language that translates to native execution as opposed to getting translated to a sequence of MapReduce jobs.
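
The serving-tree rewrite can be illustrated with a toy Python sketch (my own illustration, not Dremel code): an aggregation query is rewritten so that leaf servers compute partial aggregates over their own tablets and each intermediate level merges those partial results, instead of shipping raw rows to the root.

    def leaf_partial_count(tablet_rows, predicate):
        # Each leaf scans only its own tablet and emits a partial aggregate.
        return sum(1 for row in tablet_rows if predicate(row))

    def merge(partials):
        # Intermediate and root servers only merge small partial results.
        return sum(partials)

    tablets = [[1, 5, 7], [2, 8], [9, 3, 4, 6]]          # rows spread over leaves
    leaf_results = [leaf_partial_count(t, lambda x: x > 4) for t in tablets]
    intermediate = [merge(leaf_results[:2]), merge(leaf_results[2:])]
    print(merge(intermediate))                            # -> 5 matching rows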

Comments

Dremel is fast, but I wonder how much faster it could go if it allowed caching of intermediate results for use in subsequent queries; this should have even more impact for data-exploration workloads. The paper is very terse (maybe due to the VLDB page limit), and I found it hard to read even though none of the concepts is that complicated.

Dynamo: Amazon’s Highly Available Key-value Store

Amazon, “Dynamo: Amazon’s Highly Available Key-value Store,” SOSP, 2007. [PDF]

Summary

Dynamo is a highly available key-value store that sacrifices traditional consistency models for eventual consistency in order to meet stringent availability and latency targets (measured at the 99.9th percentile). Dynamo exposes a simple query model, where read/write (get() and put()) operations are performed on data items uniquely identified by their keys. There is no relational schema, and no operation spans multiple data items. Stored objects in Dynamo are relatively small (less than 1 MB).

Dynamo defers the resolution of conflicting versions of data to read time so that writes are always accepted. Conflict resolution is pushed to applications, since they have more semantic information about the data they are using and can resolve conflicts more effectively.

Dynamo uses consistent hashing to partition data on a ring structure and replicates each key on the N-1 clockwise successor nodes of its coordinator. However, unlike traditional DHTs, it uses zero-hop routing, i.e., it does not route requests through other nodes on the ring, in order to reduce latency. Updates are versioned using vector clocks, and conflicts as well as temporary failures are handled using sloppy quorums and hinted handoff. To handle permanent failures, Dynamo uses a Merkle tree-based anti-entropy protocol to keep replicas in sync. Failures are detected and handled in a local, decentralized manner using gossip protocols. Since Dynamo runs in a controlled environment, there are no strong security requirements.
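
Here is a minimal sketch of vector-clock versioning (my own illustration, not Dynamo's code): each node increments its own entry on a write, and two versions conflict when neither clock descends from the other, in which case the application must reconcile them.

    def increment(clock, node):
        clock = dict(clock)
        clock[node] = clock.get(node, 0) + 1
        return clock

    def descends(a, b):
        # True if the version with clock `a` is a descendant of (or equal to) `b`.
        return all(a.get(node, 0) >= count for node, count in b.items())

    v1 = increment({}, "Sx")              # write handled by node Sx
    v2 = increment(v1, "Sy")              # later update handled by Sy
    v3 = increment(v1, "Sz")              # concurrent update handled by Sz

    print(descends(v2, v1))                    # True: v2 supersedes v1
    print(descends(v2, v3), descends(v3, v2))  # False, False: conflicting
                                               # versions, left to the client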

Comments

All in all, Dynamo creates a great mix of well-known protocols and techniques to achieve high availability. One very interesting tradeoff here is never missing a write, even if that means writing twice, at the expense of possibly inconsistent states. It is also a great example of how far business goals can drive a system design; to paraphrase the authors: “additions to shopping carts are never lost, however, deleted items can resurface!”

The versioning capability of Dynamo is limited by the size of the vector clocks it can use. The authors do mention a truncation scheme, but they sidestep the question of what happens if Dynamo requires larger clocks. Unfortunately, if it fails to support the required clock length, the whole eventual-consistency setup will break down.

Dynamo is the poster child of eventual consistency and of the NoSQL movement. Its influence on the open-source community is undeniable: Cassandra and Voldemort are two prominent examples of Dynamo-inspired systems.

Bigtable: A Distributed Storage System for Structured Data

Google, “Bigtable: A Distributed Storage System for Structured Data,” OSDI, 2006. [PDF]

Summary

Bigtable is a large-scale (petabytes of data across thousands of machines) distributed storage system for managing structured data. It is built on top of several existing Google technologies (e.g., GFS, Chubby, and Sawzall) and used by many of Google’s online services. The authors state flexibility and high performance as the two primary goals of Bigtable, while supporting applications with diverse requirements (e.g., from small to large data sizes, and from backend bulk processing to real-time data serving).

Essentially, Bigtable is a “sparse, distributed, persistent multi-dimensional sorted map” that maps each <row, column, timestamp> tuple to an array of bytes. Data in Bigtable are maintained in tables that are partitioned into row ranges called tablets. Tablets are the units of data distribution and load balancing in Bigtable, and each tablet server manages some number of tablets. Each table can have hundreds of column families, and each column family can have an unbounded number of columns. Different versions of the data in each cell are sorted by timestamp. Bigtable supports single-row transactions, which can be used to perform atomic read-modify-write sequences on data stored under a single row key; unlike a standard RDBMS, it does not support general multi-row transactions.
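
A tiny sketch of this data model (my own illustration, not Google code; the row and column names are the examples from the paper): a sparse, sorted map from (row, column, timestamp) to an uninterpreted byte string, with the versions in each cell ordered so that the newest timestamp comes first.

    class Table:
        def __init__(self):
            self.cells = {}                      # (row, column) -> [(ts, bytes)]

        def put(self, row, column, timestamp, value):
            versions = self.cells.setdefault((row, column), [])
            versions.append((timestamp, value))
            versions.sort(reverse=True)          # newest version first

        def get(self, row, column, timestamp=None):
            for ts, value in self.cells.get((row, column), []):
                if timestamp is None or ts <= timestamp:
                    return ts, value             # most recent version <= timestamp
            return None

    t = Table()
    t.put("com.cnn.www", "contents:", 3, b"<html>v3</html>")
    t.put("com.cnn.www", "contents:", 5, b"<html>v5</html>")
    print(t.get("com.cnn.www", "contents:"))        # -> (5, b'<html>v5</html>')
    print(t.get("com.cnn.www", "contents:", 4))     # -> (3, b'<html>v3</html>')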

GFS is used underneath Bigtable to store the actual data in the Google SSTable file format. An SSTable provides a persistent, ordered, immutable map from keys to values and allows iteration over all key/value pairs. Writes in Bigtable go to a redo log in GFS, and recent writes are cached in an in-memory memtable. Reads are served from a merged view of the memtable and the SSTables in GFS. Bigtable uses Chubby to ensure there is a single active master, to discover tablet servers, to store Bigtable metadata, and, above all, as the root of a three-level tablet location hierarchy. Client-side caching of tablet locations ensures that finding a tablet server rarely requires the up to six network round-trips of the worst case.
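
The read path can be summarized with a rough sketch (my own simplification of an LSM-style merge, not Bigtable code): a point read consults the memtable first and then the immutable SSTables from newest to oldest, returning the most recently written value for the key.

    def read(key, memtable, sstables):
        if key in memtable:                 # recent writes not yet flushed
            return memtable[key]
        for sstable in reversed(sstables):  # newest immutable file first
            if key in sstable:
                return sstable[key]
        return None

    memtable = {"row2": b"new-value"}
    sstables = [{"row1": b"a", "row2": b"old-value"}, {"row3": b"c"}]
    print(read("row2", memtable, sstables))   # -> b'new-value'
    print(read("row3", memtable, sstables))   # -> b'c'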

While the basic design is not too complicated, Bigtable includes a large number of refinements/optimizations, including column family locality groups, compression, caching, Bloom filters, and shared commit logs, among others, to achieve high performance.

Comments

One of the key tradeoffs made by the Bigtable designers was going for a general design and leaving many performance decisions to its users. Bigtable positioned itself between a database and a distributed key/value store, which was a first step toward adopting database concepts in general key/value stores. The authors promised further improvements (e.g., secondary indices) to their design, but there was never a follow-up paper.

Judging by the numbers, Bigtable was highly influential inside Google when this paper was published. I believe it is general enough to survive to this day as the back-end for many of their newer services. In the open-source world, HBase on HDFS is its counterpart. Given Facebook’s wide adoption of HBase, we can safely say Bigtable has had plenty of influence on the world outside Google as well.

SCADS: Scale-Independent Storage for Social Computing Applications

Michael Armbrust, Armando Fox, David A. Patterson, Nick Lanham, Beth Trushkowsky, Jesse Trutna, Haruki Oh, “SCADS: Scale-Independent Storage for Social Computing Applications,” CIDR, 2009. [PDF]

Summary

SCADS (Scalable Consistency Adjustable Data Storage) is a proposal for a collection of components leveraging database, control theory, and machine learning techniques to achieve data scale independence for rapidly growing (or shrinking) Web 2.0 services. It has three key components:

  1. A performance-insightful query language (PIQL) that provides strict scalability guarantees and predictable performance;
  2. A declarative way for developers to explicitly define their performance-consistency tradeoff requirements; and
  3. Machine learning models to add and remove capacity to meet SLA requirements.

Critique

I believe that restricting queries to have bounded performance and allowing developers to explicitly dictate/specify their requirements/deadlines are the key contributions of this proposal. There are several other interesting concepts embedded in different parts of the proposal; it is not clear, however, how influential the whole architecture will be.

The key tradeoff in SCADS is whatever-is-required for predictable performance. The authors are willing to restrict queries, use CPU/disks to build additional indices, consider developer inputs and give feedback to them, and do several other things to ensure predictability.

Since this is a position paper, the authors only provide high-level ideas without any concrete solution. Many of the components proposed in this paper have so far been developed, and there is a good chance that the overall architecture will see the light of day at some point, in some form.

High-level platforms on top of Hadoop

Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins, “Pig Latin: A Not-So-Foreign Language for Data Processing,” SIGMOD, 2008. [PDF]

Facebook Data Team, “Hive: Data Warehousing and Analytics on Hadoop.” [LINK]

Summary

Pig and Hive are higher-level programming interfaces to Hadoop, with corresponding data management tools and related optimizations, developed by Yahoo! and Facebook, respectively. Pig looks more like a scripting language for creating workflows composed of multiple MapReduce jobs, and its authors claim it occupies a sweet spot between SQL and MapReduce. Hive is closer to SQL in look and feel, and it logically arranges data in tables with schemas, similar to an RDBMS.

Pig vs Hive

In both cases, the end goal is to enable complex workflows that require multiple MapReduce jobs without having to change the underlying execution engine, Hadoop, and the common storage system, HDFS. However, Pig seems to have a few more tricks than Hive in terms of query optimization and debugging. A nice side-by-side comparison can be found here (probably a little outdated).

Critique

Pig and Hive are open-source answers to Microsoft’s Dryad, DryadLINQ, and SCOPE that allow creating extended workflows. Both want to keep the underlying systems, Hadoop and HDFS, untouched, but each makes different tradeoffs/assumptions in its design. Pig assumes its jobs are ad hoc and does not stress building indices or playing other DB tricks to make repetitive jobs faster. Hive, on the other hand, assumes a large number of small queries over the same data, and hence it builds indices and enforces schemas.

Both systems are, and continue to be, influential due to their heavy usage at Yahoo! and Facebook. There are still plenty of opportunities for optimization, though, which can inspire (and are inspiring) academic research. The question is whether we run the risk of reinventing everything that the DB community invented a while ago; my guess is (unfortunately) yes.

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly, “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks,” EuroSys, 2007. [PDF]

Summary

Dryad is Microsoft’s answer to the MapReduce paradigm, albeit at a (slightly) lower level with greater flexibility. Like MapReduce, Dryad lets developers think about what to do with the data while it takes care of distribution, fault tolerance, stragglers, etc. Unlike MapReduce, Dryad enables creating extensive dataflow models as directed acyclic graphs (DAGs). Dryad also adds flexibility in how computation vertices in a DAG communicate: via disk files, TCP pipes, or shared-memory queues, as opposed to the disk-only communication promoted by MapReduce. This could possibly allow fully in-memory dataflows for faster data mining and iterative jobs.
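
The flavor of the model can be captured in a toy Python sketch (my own illustration; the real system builds graphs with C++ composition operators and these names are invented): vertices run arbitrary functions, and each edge carries an annotation for the channel type used to move data between them; here everything runs in memory as a stand-in for shared-memory FIFOs.

    class Graph:
        def __init__(self):
            self.vertices = {}      # name -> function of a list of inputs
            self.edges = []         # (src, dst, channel)

        def add_vertex(self, name, fn):
            self.vertices[name] = fn

        def add_edge(self, src, dst, channel="memory"):
            self.edges.append((src, dst, channel))

        def run(self, inputs):
            # Execute vertices in insertion order (assumed topological here).
            outputs = dict(inputs)
            for name, fn in self.vertices.items():
                ins = [outputs[s] for s, d, _ in self.edges if d == name]
                outputs[name] = fn(ins)
            return outputs

    g = Graph()
    g.add_vertex("split", lambda ins: ins[0].split())
    g.add_vertex("count", lambda ins: len(ins[0]))
    g.add_edge("source", "split", channel="memory")
    g.add_edge("split", "count", channel="memory")
    print(g.run({"source": "the quick brown fox"})["count"])   # -> 4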

Comments

By giving more power to developers, Dryad sacrificed simplicity. This is the exact opposite of the tradeoff made in the MapReduce design. Dryad is great as a substrate for higher-level systems that are more usable; this became apparent when Microsoft later built DryadLINQ and SCOPE on top of Dryad, which are easy to use and powerful but take away unnecessary flexibility from developers. The influence of Dryad is significant, at least within Microsoft. As far as we know, they heavily use DryadLINQ and SCOPE for most of their data-intensive workloads.

The paper itself has some obvious flaws, one of them being its evaluation. A comparison against MS SQL Server is nothing to get too excited about. It would have been nice if they had compared against some implementation of MapReduce (Hadoop was not publicly available when Dryad was being built, but they could easily have used Dryad to imitate MapReduce), where Dryad would be expected to win on multi-stage workloads. I found the paper hard to read and follow, and often boring. Nevertheless, it is an interesting piece of work and the obvious next step beyond MapReduce.

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean, Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI, 2004. [PDF]

Summary

MapReduce is a programming model and associated implementation for processing and generating large data sets in a parallel, fault-tolerant, distributed, and load-balanced manner. There are two main user-provided functions in this programming model. The map function takes an input pair and produces a set of intermediate key/value pairs. The reduce function accepts an intermediate key and the set of values for that key and processes them to produce some user-defined result. The biggest gain of MapReduce is that the user only writes these two functions; everything else, including distribution of tasks, scheduling, fault tolerance, and scalability, is taken care of by the framework.
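
The canonical word-count example, written here as a plain-Python sketch of the programming model (my own illustration, not Google or Hadoop code): the user supplies only the map and reduce functions, while the framework would handle partitioning, shuffling, scheduling, and fault tolerance.

    from itertools import groupby
    from operator import itemgetter

    def map_fn(_, line):                       # (key, value) -> [(key', value')]
        return [(word, 1) for word in line.split()]

    def reduce_fn(word, counts):               # (key', [value']) -> result
        return word, sum(counts)

    def run(documents):
        intermediate = [kv for doc in documents for kv in map_fn(None, doc)]
        intermediate.sort(key=itemgetter(0))   # stand-in for the shuffle phase
        return [reduce_fn(k, [v for _, v in group])
                for k, group in groupby(intermediate, key=itemgetter(0))]

    print(run(["the quick brown fox", "the lazy dog"]))
    # -> [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]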

MapReduce is implemented on large clusters of commodity hardware. The input data are automatically partitioned into M splits that are processed in parallel by map invocations (mappers) on many different machines. Reduce invocations (reducers) are distributed by partitioning the intermediate key space into R pieces (set by the user) using a partitioning function. A single master coordinates everything. Optional combiner functions can run after the mappers to decrease the amount of intermediate key/value data transported to the reducers and thus reduce network usage.

Mappers read input from, and reducers write results back to, a distributed file system (e.g., GFS). Intermediate data produced by the mappers are buffered in memory, saved to local disks, and pulled by reducers using RPC calls. Failures in MapReduce are handled by rescheduling and restarting tasks. MapReduce tries to increase data locality to minimize data transfers over the network.

Comments

MapReduce is one of the most influential papers as well as computing paradigms of this century. It has fundamentally changed how people think about processing large data and given rise to many systems (including Hadoop, which is the open-source implementation of MapReduce). While MapReduce is not the best at everything, it is great in doing what it does: embarrassingly-parallel data-intensive computing.

The fundamental tradeoff here is between the simplicity, generality, and efficiency of a high-level programming abstraction and fine-grained control. It is surprising how many things MapReduce can do even without the fine-grained control of something like MPI.

Megastore: Providing Scalable, Highly Available Storage for Interactive Services

Google, “Megastore: Providing Scalable, Highly Available Storage for Interactive Services,” CIDR, 2011. [PDF]

Summary

Megastore is a highly available, scalable storage system built on top of Google’s Bigtable for scalable storage and Chubby for locks and configuration data. It supports full ACID semantics and is especially suited for interactive services, even though Bigtable itself does not provide ACID guarantees. Megastore supports fine-grained partitioning of data based on the concept of entity groups, provides strong ACID properties within each entity group but not necessarily across them, and performs synchronous replication of partitions across multiple datacenters using Paxos underneath for seamless failover.
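
A toy sketch of the entity-group idea (my own illustration, not Megastore code): transactions are serialized through a per-group write-ahead log, so appending a single log entry commits a whole transaction atomically within that group, while operations spanning groups get no such guarantee (Megastore itself resorts to asynchronous messaging or two-phase commit for those).

    class EntityGroup:
        def __init__(self):
            self.log = []                      # per-group replicated WAL
            self.data = {}

        def transaction(self, updates):
            # A single appended log entry commits all updates atomically.
            self.log.append(updates)
            self.data.update(updates)

    groups = {"user:alice": EntityGroup(), "user:bob": EntityGroup()}

    # Atomic within one entity group (e.g., a user and all of her photos):
    groups["user:alice"].transaction({"profile": "...", "photo:1": "..."})

    # Across groups there is no single log entry, hence no cross-group atomicity.
    groups["user:bob"].transaction({"profile": "..."})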

Megastore has been designed with its administrators, the developers of services built on it, and the end users of those services in mind. It tries to ensure low latency for end users; performance, scalability, and ease of use for developers; and ease of maintenance for the admins. However, as the authors point out, each application must carefully choose its partitioning into entity groups. Megastore involves some interesting tweaks to ensure a fast Paxos implementation: one WAN RTT on writes and zero WAN RTTs for reads on average (i.e., reads are local). It replicates transaction log entries across datacenters on each write.

Comments

While Megastore has been successful in achieving many of its goals, it does not support a query language (i.e., applications must implement their own query plans) and allows only a limited update rate per entity group. Their assumption that high-latency writes (due to synchronous replication) are acceptable probably holds for Google’s workloads, but it is not clear how important it is to have failover support at the expense of performance.

One key difference between Megastore and other Google systems (e.g., GFS, Bigtable, MapReduce) is that it sits closer to the applications it wants to support and is more tied to their requirements; consequently, the authors had to make decisions that make it less general than those systems. While I believe that Megastore is useful for Google’s business, it is less likely to gain as much of a following in the open-source world as GFS and MapReduce have.