Tag Archives: UCB Cloud Computing Course F11

The Chubby Lock Service for Loosely-Coupled Distributed Systems

Mike Burrows, “The Chubby Lock Service for Loosely-Coupled Distributed Systems,” OSDI, 2006. [PDF]

Summary

Chubby provides coarse-grained locking and reliable small-file storage for loosely-coupled distributed systems running in Google datacenters. Its primary use case is leader election in the Google File System (GFS) and BigTable. Over time, however, developers have also used Chubby to implement synchronization points in their distributed systems (e.g., in MapReduce) and as a name service (instead of DNS). At its core, Chubby solves the distributed consensus problem using Paxos in an asynchronous environment.

The primary goals of Chubby are availability and reliability, along with usability (ease of use for developers) and deployability (minimal code changes to existing systems). Performance (in terms of throughput) and storage capacity were considered secondary or non-goals. Chubby exposes an interface and an API similar to a simple UNIX-like file system. It allows systems to take read/write advisory locks on any directory or file. These locks are coarse-grained, meaning they are held for hours or days rather than seconds or minutes.
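To make that interface concrete, here is a minimal sketch of the usage pattern with an in-process stand-in for the cell; all names in it (ChubbyCell, try_acquire, set_contents, the lock path) are my own illustrative assumptions, not Google's actual client API.

    # A minimal sketch (not Google's client library) of how an application might
    # use a Chubby-like interface for leader election: try to take an exclusive
    # advisory lock on a file in the UNIX-like namespace and, if successful,
    # advertise itself as master by writing the file's contents.

    class ChubbyCell:
        """In-process stand-in for a Chubby cell, for illustration only."""
        def __init__(self):
            self._locks = {}      # path -> current lock holder
            self._contents = {}   # path -> small file contents

        def try_acquire(self, path, holder):
            # Advisory, coarse-grained lock: first holder wins and keeps it for a long time.
            if path not in self._locks:
                self._locks[path] = holder
                return True
            return False

        def set_contents(self, path, data):
            self._contents[path] = data

        def get_contents(self, path):
            return self._contents.get(path)


    def elect_master(cell, my_address):
        lock_path = "/ls/cell/bigtable/master-lock"   # hypothetical path
        if cell.try_acquire(lock_path, my_address):
            # The winner records its identity so clients can find the master
            # by reading this small file instead of consulting DNS.
            cell.set_contents(lock_path, my_address)
            return True
        return False


    if __name__ == "__main__":
        cell = ChubbyCell()
        print(elect_master(cell, "server-a:9000"))   # True  -> server-a is master
        print(elect_master(cell, "server-b:9000"))   # False -> lock already held
        print(cell.get_contents("/ls/cell/bigtable/master-lock"))  # "server-a:9000"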

Chubby has two main components that communicate via RPC: a server and a client-side library that applications link against. A Chubby cell typically consists of five server replicas that use a distributed consensus protocol to elect a master. Each replica keeps a simple database of metadata; the master holds the primary copy, which is replicated to the other servers. Clients locate the master through DNS. All read requests are satisfied directly by the master. Write requests, however, do not complete until a majority of the server replicas have been updated. Chubby includes an event notification mechanism to notify clients of changes to their locks, and it uses extensive caching (including negative caching) for performance.
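As a toy illustration of this write path (my sketch, not Chubby's implementation), a write is acknowledged only once a majority of the replicas in the cell have applied it:

    # Toy illustration of majority-quorum writes; the failure simulation and
    # data structures are illustrative assumptions, not Chubby's internals.
    import random

    def apply_on_replica(db, key, value):
        """Stand-in for the RPC to one replica; may fail (simulated here)."""
        if random.random() < 0.1:       # pretend ~10% of replicas are unreachable
            return False
        db[key] = value
        return True

    def replicated_write(replicas, key, value):
        acks = sum(1 for db in replicas if apply_on_replica(db, key, value))
        majority = len(replicas) // 2 + 1
        return acks >= majority         # commit only with a majority of acks

    if __name__ == "__main__":
        cell = [dict() for _ in range(5)]   # a typical five-replica Chubby cell
        print(replicated_write(cell, "/ls/foo", b"bar"))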

Comments

Chubby has been highly successful inside Google and has also influenced systems like ZooKeeper in the open-source community. From the numbers presented in the paper, Chubby is highly scalable (it could handle up to 90,000 client connections, with provisions for further scaling), available (only 61 outages, of which 52 were resolved within 30 seconds, before most applications could notice its absence), and reliable (only 6 data-loss incidents: 4 due to software bugs that have since been fixed and 2 due to operator error while fixing those bugs). In other words, Chubby has achieved its goals of providing an available, reliable, and scalable lock service.

One important lesson of this paper is that it is hard to anticipate how, and for what, developers will use a large, multifunctional system. It is also hard to force developers to use a service in a particular way. Instead, the designers must evolve the service itself to prevent accidental harmful usage and adapt the system to developers’ needs and feedback.

Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services

Seth Gilbert, Nancy Lynch, “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services,” ACM SIGACT News, Volume 33 Issue 2 (2002), pg. 51-59. [PDF]

Summary

In his PODC 2000 keynote speech, Eric Brewer conjectured that it is impossible for a distributed system to provide the following three guarantees simultaneously:

  • Consistency,
  • Availability, and
  • Partition-tolerance.

Even though all three are desirable properties, the claim is that at most two can be fully satisfied at once; the third must be weakened or given up entirely.

In this paper, Gilbert and Lynch formalize the conjecture and prove it for asynchronous networks. They also show a similar result for partially synchronous networks, using timeouts and local clocks.
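For reference, the core asynchronous-model result can be paraphrased roughly as follows (my wording and proof summary, not the paper's exact statement):

    % Rough paraphrase of the asynchronous-model result (my wording).
    \noindent\textbf{Theorem (paraphrased).}
    In the asynchronous network model, no read/write data object can guarantee
    both availability and atomic (linearizable) consistency in all executions,
    including those in which all messages between two groups of nodes are lost.

    % Proof idea: split the nodes into disjoint sets $G_1$ and $G_2$ and drop
    % every message between them. Availability forces $G_1$ to accept a write
    % and $G_2$ to answer a later read, but the write can never reach $G_2$,
    % so the read returns a stale value and atomicity is violated.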

Comments

The CAP theorem is, by itself, somewhat intuitive. Consequently, the proof for the asynchronous case is very straightforward. It becomes slightly more involved for the (partially) synchronous model, though.

While this theorem has been extremely influential in reasoning about the design decisions made by many system developers, I think it can easily be abused as an excuse to give up quickly. Getting 2 out of 3 does not mean that some shade of the third cannot be achieved at all. For example, there exist weaker consistency models, such as causal consistency, that can be achieved together with availability and partition-tolerance.

Cluster-Based Scalable Network Services

Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, Paul Gauthier, “Cluster-Based Scalable Network Services,” Sixteenth ACM Symposium on Operating Systems Principles (SOSP), (October, 1997). [PDF]

Summary

This paper identifies scalability (elasticity), availability, and cost effectiveness as the three fundamental requirements of cluster computing. It proposes a layered architecture that separates high-level policies from the underlying mechanisms, using a service programming model of composable workers that perform Transformation, Aggregation, Caching, and Customization (TACC) to build scalable (online) network services. In addition, the authors propose BASE (Basically Available, Soft State, Eventual Consistency), a weaker-than-ACID semantics that trades consistency for availability and relies on soft state for robustness. In light of these concepts, the authors discuss and evaluate the designs of two online services (TranSend and HotBot) in terms of load balancing, fault management, and caching. Both services have an ACID component but primarily manipulate BASE data.
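As a toy sketch of the TACC composition idea (my illustration, not the paper's code), a service is just a chain of simple workers, each transforming or aggregating the data flowing through it:

    # Illustrative only: a transformation worker distills each item and an
    # aggregation worker combines the partial results into one response.

    def transform_distill(item: bytes) -> bytes:
        # e.g., downscale/distill content for a slow client (Transformation)
        return item[: len(item) // 2]

    def aggregate_concat(items: list) -> bytes:
        # e.g., merge partial results produced by several workers (Aggregation)
        return b"|".join(items)

    def tacc_service(request_items: list) -> bytes:
        transformed = [transform_distill(i) for i in request_items]
        return aggregate_concat(transformed)

    if __name__ == "__main__":
        print(tacc_service([b"aaaa", b"bbbbbb", b"cc"]))   # b'aa|bbb|c'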

Comments

This work predates the CAP theorem, but it points to some of the observations that I think eventually led to it (e.g., the trade-off between availability and consistency, later extended to include partition tolerance). The focus on cost-effectiveness was an important observation as well, as it anticipated the later shift toward datacenters built from commodity components. While user-facing services and the relevant content types have changed over time, I think the authors got the basics right and predicted well what modern cluster computing would look like.

The Datacenter Needs an Operating System

Matei Zaharia, Benjamin Hindman, Andy Konwinski, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica, “The Datacenter Needs an Operating System,” USENIX HotCloud, (June, 2011). [PDF]

Summary

In recent years, many have pondered whether the datacenter is the new computer. This paper answers that question somewhat positively and goes one step further by proposing operating system (OS) primitives for this new computer. Essentially, the authors want a common software layer to manage resources and provide shared services for the datacenter as a whole. They expect four key primitives from this OS-like abstraction:

  • Resource sharing across diverse applications (e.g., MapReduce, Dryad, GFS, Hadoop) and users.
  • Data sharing between programs written by these users.
  • Programming abstractions to address the datacenter as a whole.
  • Debugging facilities throughout the software stack across the datacenter.

To this end, the authors survey a collection of existing work (much of it from the AMPLab) that partly addresses problems across all four dimensions. They put forth a vision of an OS that will allow users to build Hadoop-like systems in a week, share data and datacenter resources across different programming models and users, and debug such systems as a whole instead of logging into individual machines.

Critique

This paper makes a valid point that there is a need for ‘some’ software layer to manage and schedule resources in a datacenter. It is, however, not clear whether that software layer is an OS in the traditional sense. Specifically, it is hard to tell whether such an OS should include programming abstractions and debuggers, or whether those should come as software bundled with it.

In general, this is a good paper for generating discussion in the community and for mining citations, which it will, but I think a lot more back-and-forth discussion is needed to flesh out what is expected of such an OS, what the bare minimum is for calling something an OS, what the hardware/software layers will look like, and how they will interact.

Hardware trends and future datacenter designs

John L. Hennessy, David A. Patterson, “Graphic Processing Units (GPUs),” Chapter 4, Computer Architecture, Fifth Edition: A Quantitative Approach, 2011. [LINK]

Mark D. Hill, Michael R. Marty, “Amdahl’s Law in the Multicore Era,” IEEE Computer, (July, 2008). [PDF]

H. Howie Huang, Shan Li, Alex Szalay, Andreas Terzis, “Performance Modeling and Analysis of Flash-based Storage Devices,” 27th IEEE Symposium on Massive Storage Systems and Technologies (MSST), 2011. [PDF]

Summary

We are in the middle of a massive shift in the hardware landscape. All aspects of computing, be it processing, memory, storage, or network, are changing very fast. This set of articles touches upon some of them. Hennessy and Patterson discuss the utility and applicability of SIMD (Single Instruction, Multiple Data) architectures that come with modern Graphics Processing Units (GPUs). Hill and Marty extend Amdahl’s Law to the era of many-core processors and essentially promote asymmetric multicore chips over symmetric ones, with a preference for a more dynamic design. Lastly, Huang et al. make a case for Flash-based storage devices (read: SSDs) by comparing their performance against rotational HDDs using a black-box model.
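For concreteness, here is a small sketch of the speedup models Hill and Marty analyze, under their simplifying assumption that a core built from r base core equivalents (BCEs) delivers sqrt(r) sequential performance; the parameter values in the example run are my own choices.

    # Speedup models from Hill and Marty, with a chip budget of n BCEs, a
    # parallelizable fraction f of the work, and perf(r) = sqrt(r).
    import math

    def perf(r):
        return math.sqrt(r)

    def speedup_symmetric(f, n, r):
        # n/r identical cores, each built from r BCEs
        return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

    def speedup_asymmetric(f, n, r):
        # one big r-BCE core plus (n - r) single-BCE cores
        return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

    def speedup_dynamic(f, n, r):
        # r BCEs fuse into one fast core for sequential work; all n BCEs run the parallel part
        return 1.0 / ((1 - f) / perf(r) + f / n)

    if __name__ == "__main__":
        f, n = 0.975, 256
        for r in (1, 4, 16, 64, 256):
            print(r, round(speedup_symmetric(f, n, r), 1),
                     round(speedup_asymmetric(f, n, r), 1),
                     round(speedup_dynamic(f, n, r), 1))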

Before spitting out my two cents, I have to say that none of the articles cover networking trends. Things are getting faster on the wire too: 10Gbps and 40Gbps links are becoming increasingly common at the higher layers of the datacenter network hierarchy, and some facilities are even trying out 100Gbps interconnects. Also, while discussing many-core/multi-core chips, the articles ignore the question of memory bandwidth to and from those chips.

Possible Impacts on Future Datacenter Designs

CPU

The advance of many-core chips is unavoidable. We can expect to see asymmetric designs in the future that combine one (or more) strong cores with many weak cores. In fact, there is a similarity between the asymmetric design described by Hill and Marty and the current pairing of CPU and GPU in our desktop machines. Recent moves by many vendors to put both on the same chip point toward asymmetric designs. Cluster schedulers will have to be made aware of the presence of such strong and weak cores as well.

However, GPU-based parallelism requires too much attention from software designers/developers. Unless something fundamentally different happens (perhaps in terms of programming abstractions), it is likely to stay that way and will be avoided in general-purpose cluster software systems as much as possible.

Storage/Memory

Given the weak write/update/erase performance of Flash-based devices, along with their cost overhead, we are likely to see one more level in the storage/memory hierarchy. Between the in-memory cache and the underlying HDDs, a new layer of SSDs will be introduced for read-heavy workloads. Writes should go directly from memory down to the HDDs and remain there until the corresponding data are identified as read-heavy and promoted to the SSD layer.
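Here is a toy sketch of the tiering policy I have in mind (illustrative only; the class, threshold, and promotion rule are all my assumptions): writes land on the HDD tier, and a block is promoted to the SSD tier only after it proves to be read-heavy.

    # Reads check memory, then SSD, then HDD; repeated reads promote a block.
    PROMOTE_AFTER = 3   # reads before a block counts as read-heavy

    class TieredStore:
        def __init__(self):
            self.memory, self.ssd, self.hdd = {}, {}, {}
            self.read_counts = {}

        def write(self, key, value):
            self.hdd[key] = value          # writes bypass the SSD tier entirely
            self.memory.pop(key, None)     # invalidate any cached copies
            self.ssd.pop(key, None)
            self.read_counts[key] = 0

        def read(self, key):
            for tier in (self.memory, self.ssd, self.hdd):
                if key in tier:
                    value = tier[key]
                    break
            else:
                return None
            self.read_counts[key] = self.read_counts.get(key, 0) + 1
            if self.read_counts[key] >= PROMOTE_AFTER:
                self.ssd[key] = value      # promote read-heavy data to flash
            self.memory[key] = value       # keep the hottest copy in memory
            return value

    if __name__ == "__main__":
        store = TieredStore()
        store.write("block-1", b"data")
        for _ in range(4):
            store.read("block-1")
        print("block-1" in store.ssd)      # True: promoted after repeated reads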

We are already seeing calls for putting everything (or a lot) in memory (e.g., in RAMCloud). This trend will continue to grow to the point it becomes prohibitively expensive. Cluster memory/cache management systems like Memento will become more and more common.

Network

Network designs will move toward a minimum of 10Gbps links. Full bisection bandwidth will become a fact of life. The problems, however, will not completely go away: full bisection bandwidth does not mean infinite bandwidth, and network management systems (e.g., Orchestra) within and across datacenters will become increasingly common.

Above the Clouds: A Berkeley View of Cloud Computing

RAD Lab, “Above the Clouds: A Berkeley View of Cloud Computing,” UC Berkeley Technical Report UCB/EECS 2009-28, 2009. [PDF]

Summary

The term ‘Cloud Computing’ has been in extensive use for the last few years to suit many different, often unrelated, needs. This report tries to pin down what Cloud Computing might mean by identifying its characteristics in terms of hardware and software, quantifying economic tradeoffs, and analyzing the top ten obstacles (to its adoption and growth, as well as its policy and business implications) and the corresponding opportunities to overcome those obstacles.

The authors define Cloud Computing as a more general instantiation of Software as a Service (SaaS) that can provide Utility Computing to third parties that do not own datacenters. They refer to the datacenter hardware and software as a Cloud and differentiate between a Public Cloud (available for others to rent) and a Private Cloud (belonging to individual entities such as large Internet service providers). The authors also identify three hardware aspects that are completely new in Cloud Computing: the illusion of infinite computing resources available on demand, the elimination of up-front commitment by Cloud users, and the ability to pay for computing resources on a short-term basis as needed.

This report compares Cloud Computing vendors across three different axes: computation, storage, and networking, keeping their economic models separate. Each of the existing Cloud providers (e.g., Amazon EC2, Microsoft Azure, and Google AppEngine) takes a different route in how it exposes these resources to its customers. It is very hard to say whether one is better than another, and the authors try not to make such a statement; however, there are implicit nods toward the model with the fewest restrictions (read: Amazon EC2). I personally agree, because for my (systems) research I want as much control as possible over the resources I am using.

The most interesting part of the report is the list of obstacles and corresponding research opportunities (Table 6). It has been only a little over two years since this report was published, yet several of the opportunities mentioned here have already given rise to numerous research papers. The >1100 citations (according to Google Scholar) this technical report has gathered are a testament to its timing and broad appeal, which will likely remain steady in the coming years.

Critique

This is not a technical work; rather, it is a comprehensive study that tries to sort out a workable definition of Cloud Computing. Instead of trying to impose yet another definition, the authors dig deeper to understand the key characteristics of the Cloud and lay out the tradeoffs and opportunities for the reader to weigh. This is a must-read for all techies and many non-techies out there.

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines

Luiz Andre Barroso and Urs Holzle, “The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines,” Chapters 1-4 and 7, Morgan & Claypool Publishers. [PDF]

Summary

With the advent of large Internet service providers (e.g., Google, Facebook, Microsoft, Yahoo!), clusters full of commodity machines are becoming increasingly common. The authors of this book introduce the term Warehouse-Scale Computer (WSC) to refer to such clusters. According to them, the key differentiating factor between a traditional datacenter and a WSC is the presence of a homogeneous hardware and software stack throughout the WSC, with a focus on serving a common service/goal. Throughout the reviewed chapters, they touch upon key characteristics of a WSC, such as infrastructure, networking, power, and failure/recovery.

The first chapter focuses on the tradeoffs in designing a WSC in terms of storage choice, network fabric design, storage hierarchy, and power characteristics. Overall, the choices strive for scalability and fault-tolerance as cheaply as possible.

In the second chapter, the authors focus on the workload and corresponding software stack in a WSC (read: Google). There are typically three software layers:

  1. Platform-level software: The firmware and OS kernel in individual machines.
  2. Cluster-level infrastructure: The collection of common distributed systems/services that include distributed file systems, schedulers, programming models etc.
  3. Application-level software: Top-level (mostly user-facing) software.

They also discuss the key elements of Google’s performance and availability toolbox, which include replication, sharding (partitioning), load balancing, eventual consistency, etc. There is a huge demand for performance- and correctness-debugging tools for distributed applications in WSCs. Google uses Dapper, a lightweight annotation-based tracing tool, as opposed to taking a black-box approach to debugging.

In Chapter 3, the authors try to find a sweet spot in hardware choice between low-end machines and high-end shared-memory servers. While searching for the right design, they ask the natural question: how cheap can the hardware be while still maintaining good performance? The rule of thumb they come up with is that a low-end server building block must have a healthy cost-efficiency advantage over a higher-end alternative to be competitive. In the end, they argue for a balanced design that depends on the expected type, size, churn, and other characteristics of the workload, while achieving a good tradeoff between price and performance.

Chapter 4 discusses the physical design of modern datacenters with a focus on power distribution, cooling systems, and layout of machines. The authors note that self-sufficient container-based datacenters are gaining popularity in recent years.

Finally, in Chapter 7, the authors discuss arguably the most important aspect of working at warehouse scale: failures and repairs. Hardware failure is unavoidable in a datacenter, so the authors argue for putting more effort into designing a resilient software infrastructure layer instead. This philosophy, however, still requires prompt and correct detection of hardware failures. The authors outline several categories of failures and the sources that cause them (>60% software, >20% hardware, ~10% human error). They note that while failure prediction would be a much-celebrated feature, it is very hard to achieve with existing models.

The authors also give a high-level overview of Google’s System Health monitoring and analysis infrastructure (which collects data from machines and batch-processes it using MapReduce) in Chapter 7 and make a case for achieving cost efficiency in WSCs through an efficient failure discovery and repair process.

Critique

One minor comment: TPC-* are probably not ideal for benchmarking cloud systems or datacenter hardware/software.

Overall, this book provides a short and useful introduction to the underpinnings of modern datacenters/WSCs. I would recommend it to any newcomer to this field.