Category Archives: Reviews

The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System,” SOSP, (October, 2003). [PDF]

Summary

The Google File System (GFS) is a scalable, fault-tolerant distributed file system custom-designed to handle Google’s data-intensive workloads. GFS provides high aggregate throughput for a large number of readers and append-only writers (i.e., no overwrites) while running on inexpensive commodity hardware. It is specially optimized for large files, sequential reads/writes, and high sustained throughput rather than low latency. GFS uses a single master with minimal involvement in regular operations and a relaxed consistency model to simplify concurrent operations by many clients on the same files.

GFS sits on top of the local file systems of individual machines to create a meta-file system, where each participating machine is called a chunkserver. The master keeps track of metadata using soft state that is periodically refreshed by talking to the chunkservers. GFS divides files into large 64MB chunks and replicates these chunks across machines and racks for fault-tolerance and availability. It uses daisy-chained writes to better utilize network bandwidth. It is not POSIX compliant but supports a reasonable set of file system operations (e.g., create, delete, open, close, read, and write) useful for Google’s operations. Unlike traditional file systems, GFS supports atomic record appends that allow multiple clients to append records to the same file concurrently.
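
Because the chunk size is fixed, a client can compute which chunk(s) a read touches before contacting the master at all, then fetch data directly from chunkservers. The sketch below is my own illustration of that bookkeeping, not GFS’s actual client library:

```python
# Illustrative sketch (not GFS's actual client API): translate a byte offset
# into a chunk index, then plan which chunks a read spans. The master is
# consulted only for chunk locations; data flows directly from chunkservers.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed-size chunks, as in the paper

def chunk_index(offset: int) -> int:
    """Map a byte offset within a file to its chunk index."""
    return offset // CHUNK_SIZE

def plan_read(offset: int, length: int):
    """Return (chunk_index, offset_in_chunk, bytes_from_chunk) tuples covering a read."""
    plan = []
    remaining = length
    while remaining > 0:
        idx = chunk_index(offset)
        chunk_off = offset % CHUNK_SIZE
        n = min(remaining, CHUNK_SIZE - chunk_off)
        plan.append((idx, chunk_off, n))
        offset += n
        remaining -= n
    return plan

# Example: an 8 MB read starting at the 60 MB offset spans chunks 0 and 1.
print(plan_read(60 * 2**20, 8 * 2**20))
```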

Key tradeoffs in GFS’ design include throughput vs. latency, buying an expensive storage system vs. building a distributed file system from commodity hardware, extra write traffic for replication vs. savings in read traffic due to data locality, and the simplicity of a single master and a relaxed consistency model vs. a more complex but strongly consistent design. Notably, the authors almost always go for the simpler design at the expense of many highly regarded traditional file system properties.

Comments

GFS has successfully achieved the goals of performance, availability, reliability, and scale that its designers set out to achieve. In addition, it has directly motivated the design of the Hadoop Distributed File System (HDFS), its open-source equivalent, which is widely used in industry. Higher-level storage systems have also been developed on top of HDFS, which is a testament to GFS’ success. As long as MapReduce-like, large, read/append-only workloads are around, GFS and HDFS are expected to remain relevant as well.

The Chubby Lock Service for Loosely-Coupled Distributed Systems

Mike Burrows, “The Chubby Lock Service for Loosely-Coupled Distributed Systems,” OSDI, 2006. [PDF]

Summary

Chubby provides coarse-grained locking and reliable small-file storage for loosely-coupled distributed systems running in Google datacenters. The primary use case of Chubby is leader election in the Google File System (GFS) and BigTable. However, over time, developers have used Chubby to implement synchronization points in their distributed systems (e.g., in MapReduce) and as a name service (instead of using DNS). At its core, Chubby solves the distributed consensus problem using Paxos in an asynchronous environment.

The primary goals of Chubby are availability and reliability, as well as usability (making it easy for developers to use) and deployability (requiring minimal code changes to existing systems). Performance (in terms of throughput) and storage capacity were considered either secondary or non-goals. Chubby provides an interface and an API similar to a simple UNIX-like file system. It allows systems to acquire read/write advisory locks on any directory or file. These locks are coarse-grained, meaning they are held for hours or days instead of seconds or minutes.

Chubby has two main components that communicate via RPC: a server and a client-side library that applications use to talk to Chubby. A Chubby cell typically consists of five server replicas that use a distributed consensus protocol to elect a master. Each replica has a simple database to store metadata; the master holds the primary copy, which is replicated to the other servers. Client libraries find the master through DNS. All read requests are satisfied directly by the master. Write requests, however, do not finish until a majority of the server replicas have been updated. Chubby includes event notification mechanisms to notify clients of any changes to their locks, and it uses extensive caching (including negative caching) for performance reasons.
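
To make the primary use case concrete, here is a toy, in-process sketch of Chubby-style leader election via an advisory lock plus small-file storage. The class and method names are my own illustration, not Chubby’s actual API:

```python
# Hypothetical sketch of Chubby's canonical use case, leader election:
# whichever candidate acquires the advisory lock on a well-known file becomes
# the primary and writes its identity there for others to read.
import threading

class AdvisoryLockFile:
    def __init__(self):
        self._lock = threading.Lock()
        self.contents = None  # small-file storage that travels with the lock

    def try_acquire_and_publish(self, candidate_id: str) -> bool:
        """Non-blocking acquire; the winner advertises itself as primary."""
        if self._lock.acquire(blocking=False):
            self.contents = candidate_id
            return True
        return False

election = AdvisoryLockFile()
print(election.try_acquire_and_publish("replica-A"))  # True: A is primary
print(election.try_acquire_and_publish("replica-B"))  # False: B stays standby
print(election.contents)                              # "replica-A"
```

In real Chubby the lock lives in a replicated cell and survives client failures via leases and sequencers; the sketch only shows the shape of the interaction.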

Comments

Chubby has been highly successful inside Google and has also influenced systems like ZooKeeper in the open-source community. From the numbers presented in the paper, Chubby is highly scalable (it could handle up to 90,000 connections, with provisions for further scaling), available (only 61 outages, 52 of which were resolved within 30 seconds, before most applications could notice Chubby’s absence), and reliable (only six data-loss incidents: four due to software bugs that have since been fixed and two due to operator error while fixing those bugs). This means that Chubby has successfully achieved its goals of providing an available, reliable, and scalable lock service.

One important lesson of this paper is that it is hard to predict how and for what developers will use a large, multifunctional system. It is also hard to force developers to use a service in a particular way. Instead, the designers must update the service itself to prevent accidental harmful usage and adapt the system based on developers’ needs and feedback.

Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services

Seth Gilbert, Nancy Lynch, “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services,” ACM SIGACT News, Volume 33, Issue 2 (2002), pp. 51-59. [PDF]

Summary

In his PODC 2000 keynote speech, Eric Brewer claimed that it is impossible for a distributed system to provide all three of the following guarantees simultaneously:

  • Consistency,
  • Availability, and
  • Partition-tolerance.

Even though all three are desirable properties, the claim is that at most two can be fully satisfied at once; the third has to be weakened or given up entirely.

In this paper, Gilbert and Lynch formalize the problem and then prove the conjecture for asynchronous networks. They also show the same result for partially synchronous networks using timeouts and local clocks.
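
As a reminder of why the asynchronous case is so direct, here is a compressed paraphrase of the partition argument; the notation is mine, not the paper’s verbatim proof:

```latex
% Compressed paraphrase of Gilbert and Lynch's asynchronous argument.
Partition the nodes into two non-empty sets $G_1$ and $G_2$ and assume every
message between them is lost. Availability forces a write of a new value $v_1$
submitted to $G_1$ to terminate in some execution $\alpha_1$, and a later read
submitted to $G_2$ to terminate in some execution $\alpha_2$. Since no message
crosses the partition, the read in the combined execution $\alpha_1 \cdot \alpha_2$
cannot distinguish it from an execution in which the write never happened, so it
must return the old value $v_0 \neq v_1$, violating atomic consistency. Hence
$C$, $A$, and $P$ cannot all hold simultaneously.
```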

Comments

The CAP theorem, by itself, is somewhat intuitive. Consequently, the proof is very straightforward for the asynchronous case; it becomes slightly more involved for the (partially) synchronous model.

While this theorem has been extremely influential in reasoning about the design decisions made by many system developers, I think it can easily be abused to give up on something too quickly. Getting 2 out of 3 does not mean that some shade of the third cannot be achieved at all. For example, there exist weaker consistency models, such as causal consistency, that can be achieved alongside availability and partition-tolerance.

Cluster-Based Scalable Network Services

Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, Paul Gauthier, “Cluster-Based Scalable Network Services,” Sixteenth ACM Symposium on Operating Systems Principles (SOSP), (October, 1997). [PDF]

Summary

This paper identifies scalability (elasticity), availability, and cost-effectiveness as the three fundamental requirements of cluster computing. It proposes a layered architecture that separates high-level policies from the underlying mechanisms using a service programming model of composable workers that perform Transformation, Aggregation, Caching, and Customization (TACC) to build scalable (online) network services. In addition, the authors propose BASE (Basically Available, Soft State, Eventual Consistency), a weaker-than-traditional-ACID semantics that trades consistency for availability and relies on soft state for robustness. In light of these concepts, the authors discuss the designs of two online services (TranSend and HotBot) in terms of load balancing, fault management, and caching, and evaluate them. Both services have an ACID component but primarily manipulate BASE data.
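
As a rough illustration of the TACC idea, a service can be assembled from small composable workers. The function names and composition style below are my own, not the paper’s programming interface:

```python
# Illustrative sketch of TACC composition: Transformation, Aggregation,
# Caching (soft state), and Customization workers chained into one service.

def transform(item):            # T: distill/convert a single data item
    return item.lower()

def aggregate(items):           # A: combine results from many items
    return " | ".join(items)

class Cache:                    # C: soft state -- safe to lose and recompute
    def __init__(self):
        self.store = {}
    def get_or_compute(self, key, fn):
        if key not in self.store:
            self.store[key] = fn(key)
        return self.store[key]

def customize(result, user):    # C: per-user customization of the output
    return f"[{user}] {result}"

cache = Cache()
items = ["Alpha", "Beta", "Gamma"]
result = aggregate(cache.get_or_compute(i, transform) for i in items)
print(customize(result, user="demo-user"))
```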

Comments

This work predates the CAP theorem, but it points to some of the observations that I think eventually led to it (e.g., the trade-off between availability and consistency, later extended to include partition tolerance). The focus on cost-effectiveness was a very important observation as well, as it presaged the later shift toward datacenters built from commodity components. While user-facing services and the relevant content types have changed over time, I think the authors got the basics right and predicted well what modern cluster computing would look like.

The Datacenter Needs an Operating System

Matei Zaharia, Benjamin Hindman, Andy Konwinski, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica, “The Datacenter Needs an Operating System,” USENIX HotCloud, (June, 2011). [PDF]

Summary

In recent years, many have pondered whether the datacenter is the new computer. This paper answers that question somewhat positively and goes one step further by proposing operating system (OS) primitives for this new computer. Essentially, the authors want a common software layer to manage resources and provide shared services for the datacenter as a whole. They expect four key primitives from this OS-like abstraction:

  • Resource sharing across diverse applications (e.g., MapReduce, Dryad, GFS, Hadoop) and users.
  • Data sharing between programs written by these users.
  • Programming abstractions to address the datacenter as a whole.
  • Debugging facilities throughout the software stack across the datacenter.

To this end, the authors survey a collection of existing work (much of it from the AMPLab) that partly addresses problems across all four dimensions. They put forth a vision for an OS that will allow users to build Hadoop-like systems in a week, share data and datacenter resources across different programming models and users, and debug such new systems as a whole instead of logging into individual machines.

Critique

This paper makes a valid point that there is a need for ‘some’ software layer to manage and schedule resources in a datacenter. It is, however, not clear whether that software layer is an OS in the traditional sense. Specifically, it is hard to tell whether such an OS should include programming abstractions and debuggers, or whether those should come as software bundled with it.

In general, this is a good paper to generate discussions in the community and to mine citations, which it will, but I think a lot more back-and-forth discussion is needed to flesh out what is expected of such an OS, what the bare minimum is for calling it an OS, what the hardware/software layers will look like, and how they will interact.

Hardware trends and future datacenter designs

John L. Hennessy, David A. Patterson, “Graphic Processing Units (GPUs),” Chapter 4, Computer Architecture, Fifth Edition: A Quantitative Approach, 2011. [LINK]

Mark D. Hill, Michael R. Marty, “Amdahl’s Law in the Multicore Era,” IEEE Computer, (July, 2008). [PDF]

H. Howie Huang, Shan Li, Alex Szalay, Andreas Terzis, “Performance Modeling and Analysis of Flash-based Storage Devices,” 27th IEEE Symposium on Massive Storage Systems and Technologies (MSST), 2011. [PDF]

Summary

We are in the middle of a massive shift in the hardware landscape. All aspects of computing, be it processing, memory, storage, or networking, are changing really fast. This set of articles touches upon some of them. Hennessy and Patterson discuss the utility and applicability of the SIMD (Single Instruction, Multiple Data) architectures that come with modern Graphics Processing Units (GPUs). Hill and Marty extend Amdahl’s Law to the era of many-core processors and essentially promote asymmetric multicore chips over symmetric ones, with a preference for a more dynamic design. Lastly, Huang et al. make a case for Flash-based storage devices (read: SSDs) by comparing their performance against rotational HDDs using a black-box model.
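
For reference, here is a small sketch of the Hill-Marty speedup formulas under their usual assumptions: a chip budget of n base-core equivalents (BCEs), a core built from r BCEs delivering perf(r) = sqrt(r), and a parallelizable fraction f. The parameter values at the bottom are illustrative, not taken from the article:

```python
# Hill-Marty extension of Amdahl's Law for a fixed budget of n BCEs.
import math

def perf(r: float) -> float:
    return math.sqrt(r)  # the paper's usual assumption for single-core perf

def speedup_symmetric(f: float, n: int, r: int) -> float:
    # n/r identical cores, each built from r BCEs
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def speedup_asymmetric(f: float, n: int, r: int) -> float:
    # one big core of r BCEs plus (n - r) single-BCE cores
    return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

f, n = 0.975, 256  # illustrative parallel fraction and BCE budget
print(max(speedup_symmetric(f, n, r) for r in range(1, n + 1) if n % r == 0))
print(max(speedup_asymmetric(f, n, r) for r in range(1, n + 1)))
```

Even for this toy setting, the best asymmetric configuration beats the best symmetric one, which is the core of Hill and Marty’s argument.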

Before spitting out my 2 cents, I have to say that none of the articles cover networking trends. Things are getting faster on the wire too: 10Gbps and 40Gbps links are becoming increasingly common in the higher layers of the datacenter hierarchy, and some facilities are even trying out 100Gbps interconnects. Also, while discussing many-core/multi-core chips, the articles ignore the question of memory bandwidth to and from those chips.

Possible Impacts on Future Datacenter Designs

CPU

The advance of many-core chips is unavoidable. We can expect to see asymmetric designs in the future that combine one (or more) strong cores with many weak cores. In fact, there is a similarity between the asymmetric design discussed by Hill and Marty and the current pairing of CPU and GPU in our desktop machines. Recent calls by many vendors to put both on the same chip point toward a move to asymmetric designs. Cluster schedulers will have to be made aware of the presence of such strong and weak cores as well.

However, GPU-based parallelism requires too much attention from software designers/developers. Unless something fundamentally different happens (perhaps in terms of programming abstractions), it is likely to stay that way and will be avoided in general cluster software systems as much as possible.

Storage/Memory

Given the weak write/update/erase performance of Flash-based devices, along with their cost overhead, we are likely to see one more level in the storage/memory hierarchy. Between the in-memory cache and the underlying HDDs, a new layer of SSDs will be introduced for read-heavy workloads. Writes should go directly from memory down to the HDDs and remain there until the corresponding data are identified as read-heavy and brought up to the SSD layer.
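
A toy sketch of such a promote-on-read-heavy policy follows; the threshold and structure are my own illustration, not a proposal from any of the articles:

```python
# Tiering sketch: writes land on HDD; a key is promoted to the SSD layer only
# after its read count crosses a threshold, keeping the flash tier read-heavy.
READ_HEAVY_THRESHOLD = 100

class TieredStore:
    def __init__(self):
        self.hdd, self.ssd, self.reads = {}, {}, {}

    def write(self, key, value):
        self.hdd[key] = value          # writes bypass the SSD tier entirely
        self.ssd.pop(key, None)        # invalidate any stale SSD copy
        self.reads[key] = 0

    def read(self, key):
        self.reads[key] = self.reads.get(key, 0) + 1
        if key in self.ssd:
            return self.ssd[key]       # fast path for promoted data
        value = self.hdd[key]
        if self.reads[key] >= READ_HEAVY_THRESHOLD:
            self.ssd[key] = value      # promote read-heavy data to flash
        return value
```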

We are already seeing calls for putting everything (or a lot) in memory (e.g., in RAMCloud). This trend will continue to grow to the point it becomes prohibitively expensive. Cluster memory/cache management systems like Memento will become more and more common.

Network

Network designs will move toward a minimum of 10Gbps links. Full bisection bandwidth will become a fact of life. The problems, however, will not completely go away: full bisection bandwidth does not mean infinite bandwidth, and network management systems (e.g., Orchestra) within and across datacenters will become increasingly common.

Above the Clouds: A Berkeley View of Cloud Computing

RAD Lab, “Above the Clouds: A Berkeley View of Cloud Computing,” UC Berkeley Technical Report UCB/EECS 2009-28, 2009. [PDF]

Summary

The term ‘Cloud Computing’ has been in extensive use for the last few years to suit many different, often unrelated, needs. This report tries to define what Cloud Computing might mean by identifying its characteristics in terms of hardware and software, quantifying economic tradeoffs, and analyzing the top ten obstacles (to its adoption and growth, as well as its policy and business implications) and the corresponding opportunities to overcome those obstacles.

The authors define Cloud Computing as a more general instantiation of Software as a Service (SaaS) that can provide Utility Computing to third parties that do not own their own datacenters. They refer to the datacenter hardware and software as a Cloud and differentiate between a Public Cloud (available for others to rent) and a Private Cloud (belonging to individual entities such as large Internet service providers). The authors also identify three hardware aspects that are completely new in Cloud Computing: the illusion of infinite computing resources available on demand, the elimination of any up-front commitment by users, and the ability to pay for computing resources on a short-term basis as needed.

This report compares Cloud Computing vendors across three different axes (computation, storage, and networking) but keeps their economic models separate. Each of the existing Cloud providers (e.g., Amazon EC2, Microsoft Azure, and Google AppEngine) takes a different route in the way it exposes these resources to its customers. It is very hard to say whether one is better than another, and the authors try not to make such a statement; however, there are implicit nods toward the model with the least amount of restrictions (read: Amazon EC2). I personally agree with this, because for my (systems) research I want to have as much control as possible over the resources I am using.

The most interesting part of the report is the list of obstacles and corresponding research opportunities (Table 6). It has been only a little more than two years since this report was published, but several of the opportunities mentioned here have already given rise to numerous research papers. The more than 1100 citations (according to Google Scholar) this technical report has gathered are a testament to its timing and broad appeal, which should hold steady in the coming years.

Critique

This is not a technical work; rather, it is a comprehensive study that tries to sort out a workable definition of Cloud Computing. Instead of trying to impose yet another definition, the authors dig deeper to understand the key characteristics of the Cloud and put the tradeoffs and opportunities in front of readers for them to decide. This is a must-read for all techies and many non-techies out there.

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines

Luiz Andre Barroso and Urs Holzle, “The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines,” Chapters 1-4 and 7, Morgan & Claypool Publishers. [PDF]

Summary

With the advent of large Internet service providers (e.g., Google, Facebook, Microsoft, Yahoo!), clusters full of commodity machines are becoming increasingly common. The authors of this book introduce the term Warehouse-Scale Computer (WSC) to refer to such clusters. According to them, the key factor differentiating a WSC from a traditional datacenter is a homogeneous hardware and software stack throughout the WSC, with a focus on serving a common service/goal. Throughout the reviewed chapters, they touch upon key characteristics of a WSC such as infrastructure, networking, power, and failure/recovery.

The first chapter focuses on the tradeoffs in designing a WSC in terms of storage choice, network fabric design, storage hierarchy, and power characteristics. Overall, the choices strive for scalability and fault-tolerance as cheaply as possible.

In the second chapter, the authors focus on the workload and corresponding software stack in a WSC (read Google). There are typically three software layers:

  1. Platform-level software: The firmware and OS kernel in individual machines.
  2. Cluster-level infrastructure: The collection of common distributed systems/services that include distributed file systems, schedulers, programming models etc.
  3. Application-level software: Top-level (mostly user-facing) software.

They also discuss the key elements of Google’s performance and availability toolbox, which include replication, sharding (partitioning), load balancing, and eventual consistency. There is a huge demand for performance- and correctness-debugging tools for distributed applications in WSCs. Google uses Dapper, a lightweight, annotation-based tracing tool, as opposed to taking a black-box approach to debugging.
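
As a minimal illustration of two of those toolbox items, sharding and replication, consider the following hash-mod placement with successor replicas. The scheme is a deliberate simplification of mine, not how Google’s systems actually place data:

```python
# Sharding plus replication sketch: a key hashes to a primary shard, and
# copies are placed on the next REPLICATION_FACTOR - 1 shards.
import hashlib

NUM_SHARDS = 8
REPLICATION_FACTOR = 3

def shard_of(key: str) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def replicas_of(key: str) -> list[int]:
    primary = shard_of(key)
    return [(primary + i) % NUM_SHARDS for i in range(REPLICATION_FACTOR)]

print(replicas_of("user:42"))  # primary shard plus two successor replicas
```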

In Chapter 3, the authors try to find a sweet spot in hardware choice between low-end machines and high-end shared-memory servers. While trying to find the right design, they ask the natural question: how cheap can the hardware be while maintaining good performance? The rule of thumb they come up with is that a low-end server building block must have a healthy cost-efficiency advantage over a higher-end alternative to be competitive. In the end, they argue for a balanced design that depends on the expected type, size, churn, and other characteristics of the workload, while achieving a good tradeoff between price and performance.

Chapter 4 discusses the physical design of modern datacenters with a focus on power distribution, cooling systems, and layout of machines. The authors note that self-sufficient container-based datacenters are gaining popularity in recent years.

Finally, in Chapter 7, the authors discuss arguably the most important aspect of working at warehouse scale: failures and repairs. Hardware failure is unavoidable in a datacenter, so the authors argue for putting more effort into designing a resilient software infrastructure layer instead. However, this philosophy still requires prompt and correct detection of hardware failures. The authors outline several categories of failures and the sources that cause them (>60% software, >20% hardware, ~10% human error). They note that while failure prediction would be a much-celebrated feature, it is very hard to achieve with existing models.

The authors also give a high-level overview of Google’s System Health monitoring and analysis infrastructure (which collects data from machines and batch-processes it using MapReduce) in Chapter 7 and make a case for achieving cost efficiency in WSCs through an efficient failure discovery and repair process.

Critique

One minor comment: TPC-* are probably not ideal for benchmarking cloud systems or datacenter hardware/software.

Overall, this book provides a short and useful introduction to the underpinnings of modern datacenters/WSCs. I would recommend it to any newcomer to this field.

Skilled in the Art of Being Idle: Reducing Energy Waste in Networked Systems

S. Nedevschi, J. Chandrashekar, J. Liu, B. Nordman, S. Ratnasamy, N. Taft, “Skilled in the Art of Being Idle: Reducing Energy Waste in Networked Systems,” NSDI’09, (April 2009). [PDF]

Summary

This paper argues that putting networked end-systems into low-power sleep modes, instead of keeping them in higher-power idle states, can result in significant energy savings. However, a sleeping device loses its network presence and may fail to run scheduled tasks. There are two main approaches to these issues: first, wake-on-LAN (WoL) mechanisms that wake machines up on the arrival of specific packets; and second, a proxy that handles packets on behalf of the sleeping machine unless it is absolutely necessary to wake the machine up. The authors study the pros and cons of the two approaches based on data collected from 250 enterprise machines and present a proxy architecture with a narrow API to support different idle states.

From their measurement data, the authors found that the machines under study are active for only 10% of the time and idle for about 50% of the time on average; only a small fraction are ever put to sleep. This observation suggests that there is an enormous opportunity to save energy by exploiting sleep states. The authors also notice that there is always a steady flow of packets, which makes a naive wake-on-packet approach infeasible.

In the proxy-based solution, packets destined for a sleeping host are intercepted by its proxy. The proxy decides whether to ignore/drop the packet, respond on the host’s behalf, or wake up the machine it is representing. To make judicious decisions, the authors identify different classes of packets (e.g., incoming/outgoing, unicast/broadcast/multicast) and find that broadcast and multicast traffic are largely responsible for preventing sleep (handling them at the proxy would allow roughly 80% more sleep time in home environments and 50% more in office environments). They deconstruct the different classes of traffic and end up with some ground rules about what a proxy should do when it sees different types of packets.

Idle-time traffic is differentiated along two dimensions. The first classifies traffic based on the need to proxy the protocol in question, into three categories: don’t-wake protocols, don’t-ignore protocols, and policy-dependent protocols. The second classifies it by the complexity of the required handling: ignorable (drop), handled via a mechanical response, or requiring specialized processing. The end result is a firewall-like architecture in which the power-proxy table consists of a list of rules, each a <trigger, action, timeout> tuple. For incoming packets, the proxy matches the rules and performs the necessary actions. The authors also built a simple Click-based prototype of their proposed solution.
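
A small sketch of how such a rule table might be evaluated follows; the field names and example rules are illustrative, not taken from the paper’s Click prototype:

```python
# First-match evaluation of a <trigger, action, timeout> rule table.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    trigger: Callable[[dict], bool]  # predicate over packet fields
    action: str                      # "drop", "respond", or "wake"
    timeout: float                   # seconds the rule stays installed

rules = [
    Rule(lambda p: p["proto"] == "ARP", action="respond", timeout=3600),
    Rule(lambda p: p["cast"] == "multicast", action="drop", timeout=3600),
    Rule(lambda p: p["proto"] == "SSH", action="wake", timeout=3600),
]

def handle(packet: dict) -> str:
    for rule in rules:
        if rule.trigger(packet):
            return rule.action
    return "wake"  # conservative default: unclassified traffic wakes the host

print(handle({"proto": "ARP", "cast": "broadcast"}))  # respond on the host's behalf
print(handle({"proto": "SSH", "cast": "unicast"}))    # wake the sleeping machine
```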

Comments

This is yet another very good measurement paper. In particular, the study of the various types of traffic in different network environments is an eye-opener.

The solution presented in the paper is very similar to a firewall, so an educated guess would be that much of the firewall literature might come in handy in tackling the complexities of rule matching.

Cutting the Electric Bill for Internet-Scale Systems

A. Qureshi, R. Weber, H. Balakrishnan, J. Guttag, B. Maggs, “Cutting the Electric Bill for Internet-Scale Systems,” ACM SIGCOMM Conference, (August 2009). [PDF]

Summary

Large organizations like Google, Microsoft, and Yahoo! annually consume tens of millions of dollars worth of electricity. The traditional approach of reducing energy costs by reducing energy consumption has not been that successful. The authors of this paper propose a new method based on two key observations: first, electricity prices exhibit both temporal and geographical variation; and second, large distributed systems already incorporate request routing and replication. Based on these observations, they posit that cost-aware routing of computation to low-price electricity zones can save a significant amount in energy expenses per year without increasing other costs (e.g., bandwidth). It should be noted that the paper deals with reducing energy cost, not energy consumption.

Electricity is generally produced far from where it is consumed and then transmitted to consumer locations. The whole process is regulated by Regional Transmission Organizations, or RTOs (there are eight reliability regions in the US). RTOs use auction mechanisms to match buyers with sellers in multiple parallel wholesale markets, such as day-ahead, hour-ahead, and real-time markets. This paper is concerned only with the real-time market and depends on its variation over time and across locations. Through an empirical study of average prices from January 2006 to April 2009, the authors found significant uncorrelated variation of prices across time, geographic locations, and wholesale markets. These variations are due to differences in generation sources and regional time differences (peak hours occur at different absolute times).

Routing computation to different, possibly farther, geographical places can increase the latency experienced by clients as well as bandwidth costs. Based on their study of Akamai data, the authors found that for Akamai-like large corporations such problems are manageable, i.e., they do have some impact, but the total operating cost still decreases. In the end, they provide a simple strategy to reduce energy costs by moving computation to the minimum-priced location within a certain radius of the original location.
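
A sketch of that strategy as I read it; the sites, prices, and distances below are made-up inputs, and only the shape of the decision reflects the paper:

```python
# Pick the cheapest-electricity site within a fixed distance budget of the
# request's default location.
DISTANCE_BUDGET_KM = 1500  # the radius the paper eventually settles on

sites = {
    "virginia":   {"price_mwh": 52.0},
    "chicago":    {"price_mwh": 38.0},
    "california": {"price_mwh": 61.0},
}
# distance from the request's default site to each candidate (illustrative)
distance_km = {"virginia": 0, "chicago": 950, "california": 3700}

def pick_site(default: str = "virginia") -> str:
    candidates = [s for s in sites if distance_km[s] <= DISTANCE_BUDGET_KM]
    return min(candidates, key=lambda s: sites[s]["price_mwh"])

print(pick_site())  # "chicago": cheaper power, still within the radius
```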

The results presented in this paper depend heavily on the energy-proportionality of datacenters/clusters, i.e., on energy consumption being proportional to load. Unfortunately, that is not the case with today’s hardware, but the authors hope it will be in the future. Even so, they did find significant temporal and geographic variation in prices to exploit, and through trace-based simulation they showed that 40% savings can be achieved in the ideal, fully energy-proportional scenario. On real hardware, however, the total savings drop to only ~5-6%.

Critique

The authors have presented an excellent overview of the contemporary electricity economy, and they have obviously done a great job in shedding new light on a (sort of) well-established problem.

While the paper is excellent in its observations and measurements, the solution is straightforward, with several simplifying assumptions. One can imagine a multi-variable optimization problem here that considers energy, bandwidth, and other costs together, with multiple constraints on latency, bandwidth, etc. On the other hand, such optimization problems are really hard to solve in real time, and it may well be that people end up using the straightforward approach in the end.

The choice of 1500km as the radius seemed pretty arbitrary. The authors did try to justify the number by pointing to significant jumps in cost and distance at that value, but I did not find it very convincing.

Also, one might ask: if such measures do save money, why is no one using them? The likely reason is that the overhead of practically implementing something like this outweighs its savings. But this is only a first step in this direction; there may be ways to find practical solutions in the future.

Also, it seemed to me that the paper does not have much networking or communication content per se. Anyhow, it was an interesting read.