Tag Archives: UCB Cloud Computing Course F11

Memory Management in the Cloud

Stanford, “The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM,” SIGOPS Operating Systems Review, Vol. 43, No. 4, December 2009, pp. 92-105. [PDF]

AMP Lab, “PACMan: Coordinated Memory Caching for Parallel Jobs,” Secret Draft.

Update: PACMan has been accepted at NSDI’2012. Secret draft won’t remain secret anymore :)


Cloud applications require storage systems that provide low latency and high throughput for large amounts of data. While traditional disks cannot meet such requirements, given the trend in DRAM price and capacity, it is possible to envision a future where most storage needs can be fulfilled by DRAM; RAMCloud is such a system. PACMan, on the other hand, suggests that even today, most workloads can be kept in DRAM using better caching mechanisms.


The core idea in RAMCloud is to keep everything in DRAM, with disks used only as backups. The biggest challenge is to make sure that the storage system can be recovered quickly upon failure. RAMCloud uses buffered logging. The authors claim that replication is not necessary to achieve high performance; rather, replicas are used only for parallel recovery. In steady state, there is a single copy of the data in DRAM. Recovery is performed using a massively parallel read of data from disks.
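A rough sketch of the parallel-recovery idea above (my own illustrative model, not RAMCloud's actual algorithm): a failed master's log entries are scattered across many backup disks, and recovery is parallelized by hashing keys into partitions, each replayed by a separate recovery master.

```python
# Hypothetical sketch of RAMCloud-style parallel recovery. The hashing
# scheme and function names are assumptions for illustration only.

def partition_keys(keys, num_recovery_masters):
    """Assign each key to a recovery master by hashing (assumed scheme)."""
    partitions = [[] for _ in range(num_recovery_masters)]
    for key in keys:
        partitions[hash(key) % num_recovery_masters].append(key)
    return partitions

def recover(backup_segments, num_recovery_masters):
    """Divide the failed master's data across many recovery masters.
    In the real system the replays run in parallel on separate machines;
    here we just simulate the division of work."""
    all_keys = [k for segment in backup_segments for k in segment]
    partitions = partition_keys(all_keys, num_recovery_masters)
    return {i: len(p) for i, p in enumerate(partitions)}
```

The point of the sketch is that no single machine has to re-read the whole dataset: with enough recovery masters and backup disks, the read is spread wide enough to finish in seconds.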


PACMan is a caching mechanism and corresponding system for HDFS and similar distributed file systems. The key ideas are that current clusters have a large amount of unused memory that can be used to cache frequently used data blocks, and that traditional caching strategies like LRU or LFU do not work well for cluster jobs. The authors propose the all-or-nothing property: either all data blocks for a given job should be cached across the cluster, or none at all.
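The all-or-nothing property can be made concrete with a toy model (mine, not PACMan's actual policy): a job only speeds up if *every* one of its input blocks is cached, so an eviction policy should prefer victims from inputs that are already incomplete in the cache.

```python
# Illustrative sketch of the all-or-nothing idea. The functions and
# eviction heuristic are assumptions, not PACMan's published algorithm.

def effective_hits(jobs, cache):
    """Count jobs whose entire input block set is cached; partially
    cached inputs give a job no wave-level speedup."""
    return sum(1 for blocks in jobs if all(b in cache for b in blocks))

def pick_victim(jobs, cache):
    """Prefer evicting a block of an already-incomplete input, since
    that input no longer helps any job anyway."""
    for blocks in jobs:
        cached = [b for b in blocks if b in cache]
        if cached and len(cached) < len(blocks):
            return cached[0]          # incomplete input: safe to evict
    return next(iter(cache))          # fall back to an arbitrary block
```

Under LRU, evictions can leave many inputs 90% cached, which by this model is as useless as 0% cached; concentrating evictions on a few inputs preserves more complete ones.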


RAMCloud is a more general system than PACMan, but clearly, it is more expensive as well. RAMCloud trades off price for speed, but it is likely to be used in many future systems if prices of DRAM and high-speed network equipment keep going down. PACMan, at a high level, may seem to be a more short-term fix for existing clusters. However, the all-or-nothing insight is important and will be useful even in the future. Also, PACMan can have a quicker impact because it requires no new investment to reap the possible gains.

Confidentiality and Security in the Cloud

Raluca Ada Popa, Catherine M. S. Redfield, Nickolai Zeldovich, Hari Balakrishnan, “CryptDB: Protecting Confidentiality with Encrypted Query Processing,” SOSP, 2011. [PDF]

Thomas Ristenpart, Eran Tromer, Hovav Shacham, Stefan Savage, “Hey, You, Get Off of My Cloud: Exploring Information Leakage in Third-Party Compute Clouds,” CCS, 2009. [PDF]


With the increase in popularity of cloud computing as a scalable, elastic, and cost-effective infrastructure solution, concerns about the security, privacy, and confidentiality of user data hosted on public clouds are also increasing. Curious administrators might breach trust, malicious entities can try to restrict/deny services, and adversaries might gain access to confidential data.


CryptDB stores user data in an SQL-aware encrypted form using multi-layered encryption onions. Each layer provides a different level of security and supports a limited set of SQL operations. Depending on user queries, layers are dynamically peeled off one after another. Eventually, the database reaches a steady state that strikes a balance between confidentiality of the data and usability of the database. Encryption keys are chained together with user passwords to survive security breaches of both database and application servers.
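The onion mechanism can be illustrated with a toy model (the layer names follow the paper, but the "encryption" here is a string stand-in, not real crypto, and the functions are mine): a value is wrapped in nested layers, and outer layers are peeled only when a query needs the weaker-but-more-functional encryption underneath.

```python
# Hypothetical model of a CryptDB equality onion: RND (randomized,
# strongest, supports nothing) wraps DET (deterministic, supports
# equality checks). Peeling stops at the weakest layer a query needs.

ONION_EQ = ["RND", "DET"]  # outermost first

def wrap(value, layers):
    """Apply layers innermost-first so the first listed layer ends up outside."""
    for layer in reversed(layers):
        value = f"{layer}({value})"
    return value

def strip_to(ciphertext, target, layers):
    """Peel layers until `target` is outermost, mimicking how the proxy
    lowers an onion before running, e.g., an equality predicate."""
    remaining = list(layers)
    while remaining and remaining[0] != target:
        outer = remaining.pop(0)
        assert ciphertext.startswith(outer + "(") and ciphertext.endswith(")")
        ciphertext = ciphertext[len(outer) + 1:-1]
    return ciphertext
```

Once a layer is stripped it stays off, which is why the database converges to the steady state mentioned above rather than re-encrypting after each query.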

Hey You, Get Off of My Cloud!

This paper discusses the risks of shared public clouds by demonstrating how an attacker can map the network topology of a cloud provider (e.g., Amazon EC2), obtain a VM that co-resides with a victim VM, and extract information from the victim. The goal is more to show that these risks existed in 2009 (it is questionable how big a risk they are, and how hard it is to avoid them) than to address them.


CryptDB is undoubtedly the more practical of the two papers, with a usable solution to a real problem. However, it has its weaknesses: an onion with N layers can require up to N times more space; creating/updating onions when user passwords and the corresponding encryption key chains change will be expensive; and for databases with mostly long-running, persistent connections, most users' information will be exposed when the database and application servers are compromised.

Graph-parallel frameworks

Google, “Pregel: A System for Large-Scale Graph Processing,” SIGMOD, 2010. [PDF]

Carnegie Mellon, “GraphLab: A New Framework for Parallel Machine Learning,” arXiv:1006.4990, 2010. [PDF]


Data-parallel frameworks such as MapReduce and Dryad are good at performing embarrassingly parallel jobs. These frameworks are not ideal for iterative jobs or for jobs where data dependencies across stages are sparse (in MapReduce, for example, every reducer potentially depends on every mapper). However, there are many problems, especially in machine learning, that can be intuitively expressed using graphs with sparse computational dependencies, require multiple iterations to converge, and have variable convergence rates for different parameters. Pregel and GraphLab are two frameworks optimized for this type of graph-based problem.

A typical graph-parallel problem is expressed using a graph of vertices and edges, each with associated data. In every iteration, vertex and edge data are updated and messages are exchanged between neighboring entities. The update function is typically the same for every vertex, and it is written by the user. There may or may not be a synchronization step at the end of every iteration. In a distributed setting, the graph is cut and divided across multiple nodes, and updates from a collection of vertices on one node are communicated to other nodes using message passing.
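The vertex-centric model above can be simulated on a single machine. Here is a toy bulk-synchronous version in Pregel's style (my sketch, not either system's API), using single-source shortest paths as the user-written update: each superstep runs the same compute function on every vertex, messages travel along edges, and a barrier separates supersteps.

```python
# Toy Pregel-style BSP simulation: shortest paths from `source`.
# In a real system the per-vertex loop runs in parallel across workers.

def sssp(edges, source):
    """edges: {vertex: [(neighbor, weight), ...]}"""
    dist = {v: float("inf") for v in edges}
    inbox = {v: [] for v in edges}
    inbox[source] = [0]                             # seed the source vertex
    while any(inbox.values()):                      # halt when no messages
        outbox = {v: [] for v in edges}
        for v in edges:                             # the "compute" function
            if inbox[v] and min(inbox[v]) < dist[v]:
                dist[v] = min(inbox[v])             # improved: update and
                for nbr, w in edges[v]:             # notify neighbors
                    outbox[nbr].append(dist[v] + w)
        inbox = outbox                              # barrier: superstep ends
    return dist
```

Vertices with no incoming messages do no work in a superstep, which is exactly the sparse-dependency structure that MapReduce-style frameworks fail to exploit.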

Pregel vs GraphLab

The key difference between Pregel and GraphLab is that Pregel has a barrier at the end of every iteration, whereas GraphLab is completely asynchronous. Asynchrony in GraphLab allows it to prioritize more complex vertices over others, but it also calls for consistency models to maintain sanity of results. GraphLab proposes three consistency models: full, edge, and vertex consistency, to allow different levels of parallelism. Another difference is that Pregel allows dynamic modifications to the graph structure, whereas GraphLab does not.


Pregel and GraphLab sit at two ends of the “power of framework” vs “ease of use” tradeoff space. Allowing asynchrony makes GraphLab more general and powerful than Pregel, but it is more complex and requires users to understand which consistency model is suitable for them. Pregel is simpler (common for most frameworks in Google’s arsenal) but still capable of handling a wide variety of problems. Given its origin at Google and open-source clones like Giraph, Pregel’s model is more likely to succeed in the near future.

Datacenter transport layer protocols

Stanford and Microsoft, “DCTCP: Efficient Packet Transport for the Commoditized Data Center,” SIGCOMM, 2010. [PDF]

Raiciu et al, “Improving Datacenter Performance and Robustness with Multipath TCP,” SIGCOMM, 2011. [PDF]

MSR Asia, “ICTCP: Incast Congestion Control for TCP in Data Center Networks,” CoNEXT, 2010. [PDF]


Datacenters pose a different set of challenges than the Internet, such as microsecond RTTs, synchronized workloads that cause incast, and a decreased level of multiplexing. TCP, as we know it, with millisecond feedback loops and dependence on packet drops for congestion detection, works mostly alright but leaves one wondering whether we could design a better transport protocol. DCTCP, MPTCP, and ICTCP are three recent proposals that try to address this question. The proliferation of such proposals stems from the unique opportunities that only a datacenter network can provide, e.g., complete knowledge of the topology and workloads, a single administrative domain that allows enforcing changes to the network elements, and uniform network behavior almost all over the network. Each of the three protocols summarized below exploits one or more datacenter-specific network characteristics.


DCTCP aims for smaller occupancy in switch buffers through explicit rate throttling at end hosts in order to ensure low latency for short flows and high throughput for long flows. Switches set ECN bits to signal the senders to cut back their window sizes, while the senders estimate the level of congestion and reduce their window sizes proportionally (as opposed to multiplicative decrease in TCP).
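The proportional reduction can be written out explicitly, following the paper's formulas (the constants and function names here are my rendering of them): the sender maintains a running estimate alpha of the fraction of ECN-marked packets and cuts its window by a factor of (1 - alpha/2) instead of always halving.

```python
# Sketch of DCTCP's sender-side reaction, per the paper's update rules.
# g = 1/16 is the EWMA gain suggested in the paper.

G = 1.0 / 16

def update_alpha(alpha, marked, total, g=G):
    """alpha <- (1 - g) * alpha + g * F, where F is the fraction of
    packets marked with ECN in the last window of data."""
    frac = marked / total if total else 0.0
    return (1 - g) * alpha + g * frac

def new_cwnd(cwnd, alpha):
    """cwnd <- cwnd * (1 - alpha / 2): a mild cut under light marking,
    converging to TCP-style halving when every packet is marked."""
    return cwnd * (1 - alpha / 2)
```

So a sender seeing only occasional marks barely slows down, while a sender in a persistently congested queue backs off like ordinary TCP; this is how DCTCP keeps queues short without sacrificing the throughput of long flows.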


ICTCP is a specialized TCP variation to solve the incast problem in the last hop. The key idea is to adjust the receiving window of each connection by estimating the available bandwidth.


MPTCP is kind of orthogonal to DCTCP and ICTCP in that it tries to address the problem of underutilization of bisection bandwidth and relevant unfairness when flows follow only a single path. Their solution is, unsurprisingly, using multiple paths. Transparent to the applications, MPTCP divides each source-destination flow into several sub-flows and employs a congestion control mechanism that pushes toward using up as much available bandwidth as possible.


Out of the three, I found DCTCP and MPTCP more interesting because of the breadth of problems they try to solve. ICTCP is geared toward solving the incast problem only; however, one thing I found interesting about it is that it takes a flow control approach to the problem instead of the more common congestion control approach. In general, all three suffer from the this-may-not-be-real syndrome: DCTCP and ICTCP are possibly too biased by Microsoft workloads, while MPTCP has no evaluation on real workloads. It would be nice to see a more general evaluation of all three. Also, my personal opinion on the order of long-run impacts of these papers is MPTCP>DCTCP>>ICTCP.

Cloudy operating systems

MIT, “An Operating System for Multicore and Clouds: Mechanisms and Implementation,” SOCC, 2010. [PDF]

Barret Rhoden, Kevin Klues, David (Yu) Zhu, Eric Brewer, “Improving Per-Node Efficiency in the Datacenter with New OS Abstractions,” SOCC, 2011. [PDF]


Factored Operating System

The Factored Operating System (FOS) proposes an OS architecture where each core runs an individual microkernel, and microkernels communicate using message passing. The OS has function-specific services, each implemented as a parallel, distributed service. The authors argue that such an architecture can simultaneously address many-core and cloud OS requirements. Scheduling in FOS moves from time multiplexing to space multiplexing, where individual functions/programs occupy their own cores for longer periods of time than they would under time multiplexing.


Akaros has the more manageable goal of handling individual many-core machines instead of tackling distributed coordination. It aims for high efficiency by giving a bare-metal feel to applications running on top: applications have more information about, and greater control over, the resources they are using. Akaros introduces the concept of a many-core process (MCP), where all cores of an MCP are gang scheduled. MCPs are aware of the cores they are running on and are never preempted without warning. Using two-level scheduling, Akaros assigns cores only through space multiplexing, and MCPs are allowed to manage their own threads using time multiplexing. Akaros also provides asynchronous IO, and MCPs have more information about the devices they interact with.


The need for an operating system for the datacenter is real. It is not clear which is the best OS abstraction though. To me FOS seems a long stretch, whereas Akaros is probably doable. However, neither has any reasonable performance benchmark so far.

Multi-framework resource managers for datacenters

AMPLab, “Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center,” NSDI, 2011. [PDF]

Apache Software Foundation, “Hadoop NextGen”, 2011. [LINK]


Traditional cluster resource schedulers fall into two broad categories: some do fine-grained management of resources for individual frameworks (e.g., in Hadoop), but this requires multiple frameworks to run on multiple isolated clusters; others perform coarse-grained resource management across multiple frameworks at the cost of underutilization (e.g., MPI schedulers). However, fine-grained sharing of cluster resources across multiple, possibly diverse, data- and compute-intensive frameworks is important for several reasons: better utilization and multiplexing of resources, ease of cluster management, and faster innovation without worrying about underlying physical resources. Mesos and Hadoop NextGen aim to achieve just that.

Without subscribing to either approach’s terminology: a typical resource manager has a central coordinator that keeps track of all the resources in the cluster by periodically communicating with its daemons on individual machines. Instead of interfacing with actual physical resources, frameworks now use a library provided by the resource manager to interact with the coordinator. Once a framework expresses its requirements and later accepts some resources, it is on its own to schedule those resources among its workers.
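The coordinator/framework interaction can be sketched in Mesos's offer-based style (class and method names here are illustrative, not the real Mesos API): the coordinator offers its free resources, the framework accepts a subset, and the coordinator records what was claimed.

```python
# Toy sketch of offer-based resource sharing, Mesos-style. The
# allocation and acceptance logic are assumptions for illustration.

class Coordinator:
    def __init__(self, free_cpus):
        self.free_cpus = free_cpus            # per-machine free CPUs

    def offer(self, framework):
        offers = dict(self.free_cpus)         # offer everything currently free
        accepted = framework.accept(offers)   # framework decides, not us
        for machine, cpus in accepted.items():
            assert cpus <= self.free_cpus[machine]
            self.free_cpus[machine] -= cpus
        return accepted

class GreedyFramework:
    def __init__(self, demand):
        self.demand = demand

    def accept(self, offers):
        """Take CPUs machine by machine until demand is met; placement
        decisions live in the framework, keeping the coordinator minimal."""
        taken, need = {}, self.demand
        for machine, cpus in offers.items():
            if need <= 0:
                break
            grab = min(cpus, need)
            if grab:
                taken[machine] = grab
                need -= grab
        return taken
```

Note how the allocation logic sits entirely inside `GreedyFramework.accept`; a Hadoop NextGen-style design would instead move that decision into the coordinator after the framework declares its demand.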

Mesos vs Hadoop NextGen

The primary, and possibly the only major, difference between Mesos (which came earlier) and Hadoop NextGen (which spun out of the basic Hadoop framework) is the way the coordinator and frameworks interact while expressing and accepting (or rejecting) resources. Mesos provides resource offers to individual frameworks, which can then accept or reject them. Consequently, resource allocation becomes a distributed problem, and Mesos itself remains minimal. Hadoop NextGen, on the contrary, requires each framework to explicitly express its requirements and then runs a centralized algorithm to allocate resources.


Both resource managers are pretty much the same. Maybe I am biased as an AMPLab member, but it seems that the Hadoop NextGen design was highly influenced by Mesos. In either case, the central coordinator can become the bottleneck. But with increasing cluster size, Mesos is likely to scale better than Hadoop NextGen due to its distributed approach to allocation. Given Hadoop’s popularity, however, Hadoop NextGen is likely to become more widespread than Mesos.


Distributed in-memory datasets

AMPLab, “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” UCB/EECS-2011-82, 2011. [PDF]

Russell Power, Jinyang Li, “Piccolo: Building Fast, Distributed Programs with Partitioned Tables,” OSDI, 2010. [PDF]


MapReduce and similar frameworks, while widely applicable, are limited to directed acyclic data flow models, do not expose global state, and are generally slow due to the lack of support for in-memory computation. MPI, while extremely powerful, is hard to use for non-experts. An ideal solution would be a compromise between the two approaches. Spark and Piccolo try to approximate that ideal within the MapReduce-to-MPI spectrum using in-memory data abstractions.


Piccolo provides a distributed key-value-store-like abstraction, where applications/tasks can read from and write to shared storage. Users write partition functions to divide the data across multiple machines, control functions to decide the workflow, kernel functions to perform distributed operations on mutable state, and conflict resolution functions to resolve write-write conflicts. Piccolo uses the Chandy-Lamport snapshot algorithm for periodic checkpointing and, when required, rolls back all tasks of a failed job to the last checkpoint.


Spark is a distributed programming model based on a distributed in-memory data abstraction called Resilient Distributed Datasets (RDDs). RDDs are immutable, support coarse-grained transformations, and keep track of which transformations have been applied to them so far using lineages that can be used for RDD reconstruction. As a result, checkpointing requirements/overheads are low in Spark.
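The lineage idea can be captured in a few lines (a toy model of the concept, not Spark's API): each dataset remembers its parent and the coarse-grained transformation that produced it, so a lost partition can be recomputed from the lineage instead of restored from a checkpoint.

```python
# Minimal model of RDD-style lineage. Names are illustrative.

class RDD:
    def __init__(self, data=None, parent=None, fn=None):
        self._data = data              # None until (re)computed
        self.parent, self.fn = parent, fn

    def map(self, fn):
        """Record the transformation; compute nothing yet. This single
        function object describes the change to *every* element, which
        is what makes the lineage cheap to store."""
        return RDD(parent=self, fn=fn)

    def compute(self):
        """Recompute from lineage if the in-memory data was lost."""
        if self._data is None:
            self._data = [self.fn(x) for x in self.parent.compute()]
        return self._data
```

Because a transformation applies uniformly to a whole dataset, the lineage is a short chain of function references rather than a log of per-element writes, which is exactly why Spark's checkpointing overhead stays low.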

Spark vs Piccolo

There are two key differences between Spark and Piccolo.

  1. RDDs only support coarse-grained writes (transformations), as opposed to the finer-grained writes supported by Piccolo’s distributed tables. This allows efficient storage of lineage information, which reduces checkpointing overhead and enables fast fault recovery. However, it also makes Spark unsuitable for applications that depend on fine-grained updates.
  2. RDDs are immutable, which enables straggler mitigation by speculative execution in Spark.


Piccolo is closer to MPI, while Spark is closer to MapReduce on the MapReduce-to-MPI spectrum. The key tradeoff in both cases, however, is between a framework’s usability and its applicability/power (framework complexity follows power). Both frameworks are much faster than Hadoop (but remember that Hadoop is not the best implementation of MapReduce), a large fraction of which comes from the use of memory. Maybe I am biased as a member of the Spark project, but Spark should be good enough for most applications unless they absolutely require fine-grained updates.

Cloud databases

MIT, “Relational Cloud: A Database-as-a-Service for the Cloud,” CIDR, 2011. [PDF]

Divyakant Agrawal, Amr El Abbadi, Sudipto Das, Aaron J. Elmore, “Database Scalability, Elasticity, and Autonomy in the Cloud,” DASFAA, 2011. [PDF]

Relational Cloud

The key idea of the Relational Cloud project is to define the concept of transactional Database-as-a-Service (DBaaS), identify the key challenges toward materializing it, and finally to address each one individually (in separate papers). The authors identify workload awareness as the key ingredient in addressing these challenges. Since this is only an overview paper, they do not go into the details, but they do identify three high-level goals:

  1. Efficient Multi-tenancy: This deals with packing databases from different tenants into a single machine while maintaining individual SLAs. The paper suggests that just creating a VM for each database is not ideal due to the lack of proper isolation. Instead, they propose using accurate resource models for colocation.
  2. Elastic Scalability: The goal here is to dynamically scale up/down databases based on current load from one machine to more and vice versa. The challenge is partitioning the data while avoiding cross-machine dependency as much as possible.
  3. Database Privacy: The last key challenge, according to the authors, is adjustable security. This will allow querying encrypted data without decryption on the cloud. Btw, this IS cool!

Database Scalability, Elasticity, and Autonomy in the Cloud

Instead of rooting for one particular solution, this paper more or less surveys the design space of DBaaS in terms of scalability, elasticity, and autonomy. The authors believe that there can be two different approaches to simultaneously achieving scalability and atomicity in databases: one is to add some level of atomicity to existing key-value stores (Data Fusion), and the other is to scale traditional transactional databases by intelligent partitioning (Data Fission). For elasticity, they argue for live migration of instances at runtime. However, it is not clear how this technique will affect SLAs. The paper also self-cites some of the systems the authors have built at different points of the design space.


I like the proposed techniques presented in both papers, especially the Relational Cloud ones, from a technical perspective. However, I’m not sure whether DBaaS will be a successful business model. Yes, Amazon and Microsoft have DBaaS or similar products, but who is using them? If users are big enough to care about performance and security isolation, they might not be comfortable sharing DBs with random entities. If they are not, maybe they can do without all the complicated solutions anyway.

Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS

Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, and David G. Andersen, “Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS,” SOSP, 2011. [PDF]


This paper introduces a new consistency model, causal+, that extends the causal consistency model and lies between sequential and causal consistency models. The authors claim that causal+ is the strongest consistency model achievable for ALPS systems (i.e., systems that require availability, low latency, partition-tolerance, and high scalability), but they do not prove why something stronger cannot be achieved.

The claimed novel part of the proposal, which made the authors create a new consistency model instead of using causal itself, is convergent conflict handling. Essentially, it requires all conflicting puts to be handled in the same manner at all replicas using the same handler function, h, which must be associative and commutative. The authors propose COPS (Clusters of Order-Preserving Servers), which executes all puts/gets in the underlying key-value store in a linearizable fashion within the same datacenter and replicates across datacenters in causal+ consistent order. To support non-blocking get transactions that are consistent across multiple keys without locking, they also propose COPS-GT. Understandably, COPS-GT is more expensive; however, it is not more expensive than comparable systems.
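A concrete example of a convergent handler (last-writer-wins by version tag is one valid choice of h, used here for illustration; COPS allows any associative, commutative handler): because h is order-insensitive, all replicas converge to the same value no matter in what order the conflicting puts arrive.

```python
# Sketch of convergent conflict handling. A put is a pair
# ((timestamp, node_id), value); the (timestamp, node_id) tag breaks
# ties deterministically, making h commutative and associative.

def h(p1, p2):
    """Last-writer-wins: keep the put with the larger version tag."""
    return p1 if p1[0] >= p2[0] else p2

def apply_puts(puts):
    """Fold conflicting puts with h; arrival order does not matter."""
    value = puts[0]
    for p in puts[1:]:
        value = h(value, p)
    return value
```

The commutativity requirement is what distinguishes causal+ from plain causal consistency: replicas may see concurrent puts in different orders, yet still end up identical.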


The convergent conflict handling mechanism is related to the way Dynamo handles conflicts using user-defined functions. The cross-datacenter replication mechanism is similar to that in PNUTS. While COPS-GT provides transaction guarantees across multiple keys without much application logic, it is quite complicated. Maybe it is better to leave this to application developers, as PNUTS argued from real-world experience (i.e., online services do not require this functionality often enough to make it a part of the system).

Btw, I liked the summary of consistency models in Section 3.2, where the authors put causal+ in a partial order of different consistency models.

PNUTS: Yahoo!’s Hosted Data Serving Platform

Yahoo! Research, “PNUTS: Yahoo!’s Hosted Data Serving Platform,” PVLDB, 2008. [PDF]


PNUTS is a scalable, highly available, and geographically distributed (but low latency) data store used by most Yahoo! online properties. To achieve both availability and partition tolerance, it uses a novel notion of consistency called per-record timeline consistency; under this model, all replicas of a given record apply all updates to the record in the same order. However, it is applicable to only a single record, and hence, it is not suitable for transactions involving multiple records without considerable application logic. PNUTS also allows its users to switch to eventual consistency, which is easier to maintain and often acceptable for many online services.

The PNUTS system is divided into regions, where each region contains a full complement of system components and a complete copy of each table. Data tables can be (horizontally) ordered or hash partitioned into groups of records called tablets. Each tablet is stored on a single server within a region. To route a read or write, the router first determines which tablet contains the record and which server hosts that tablet. Routers contain only a cached copy of the mapping, which is owned by the tablet controller; routers periodically poll the tablet controller to pick up any changes to the mapping.

Underneath it all, Yahoo! Message Broker or YMB is a pub/sub system that simultaneously acts as a write-ahead log to allow committing of updates and asynchronously replicates committed data to subscribers in other regions. A message is not purged from the YMB log until PNUTS has verified that the update is applied to all replicas of the database. YMB provides partial ordering of published messages, meaning that messages published to a particular YMB cluster will be delivered to all subscribers in the order they were published; however, messages published to different YMB clusters may be delivered in any order. In order to provide record-level timeline consistency, PNUTS implements record-level mastering: each record has a master copy in some cluster from where message propagation must start. Different records of the same table can have masters in different regions.
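Per-record timeline consistency via record-level mastering can be modeled in a few lines (my toy model, not PNUTS internals): every update to a record goes through that record's single master, which stamps it with a sequence number, and replicas apply updates strictly in sequence order even if YMB delivers them out of order across clusters.

```python
# Toy model of per-record mastering: one master per record assigns a
# total order; replicas buffer out-of-order deliveries.

class RecordMaster:
    def __init__(self):
        self.seq = 0

    def commit(self, value):
        """Stamp the update; publication to replicas happens via the
        message broker in the real system."""
        self.seq += 1
        return (self.seq, value)

class Replica:
    def __init__(self):
        self.applied, self.value, self.pending = 0, None, {}

    def deliver(self, update):
        """Apply updates in sequence order; hold back any gap."""
        seq, value = update
        self.pending[seq] = value
        while self.applied + 1 in self.pending:
            self.applied += 1
            self.value = self.pending.pop(self.applied)
```

Since the order is fixed per record rather than per table, different records can have masters in different regions without any cross-record coordination, which is exactly what keeps the scheme cheap.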


PNUTS trades off power for simplicity of design by offloading more complex tasks to application developers. Support for ordered tables is one of the key contributions of PNUTS, allowing it to implement efficient range operations. The notion of per-record timeline consistency is interesting, but it cannot support real transactions (the authors argue, however, that transactions involving multiple records are not that common). PNUTS also does not support complex queries.