Mike Burrows, “The Chubby Lock Service for Loosely-Coupled Distributed Systems,” OSDI, 2006. [PDF]
Chubby provides coarse-grained locking and reliable small-file storage for loosely-coupled distributed systems running in Google datacenters. The primary use case of Chubby is leader election in Google File System (GFS) and BigTable. However, over time, developers have used Chubby to implement synchronization points in their distributed systems (e.g., in MapReduce) and as a name service (instead of using DNS). At its core Chubby solves the distributed consensus problem using Paxos in an asynchronous environment.
The primary goal of Chubby is to ensure availability and reliability as well as usability (easy for developers to use it) and deployability (minimal code changes to existing systems). Performance (in terms of throughput) and storage capacity were considered either secondary or non-goals. Chubby provides an interface and an API similar to a simple UNIX-like file system. It allows systems to get read/write advisory locks on any directory or file. These locks are coarse-grained, meaning they are held for hours and days instead of seconds and minutes.
Chubby has two main components that communicate via RPC: a server and a client-side library that applications use to utilize Chubby. A Chubby cell typically consists of 5 server replicas and use a distributed consensus protocol to elect a master. Each replica has a simple database to store meta-data; the master has the primary copy, which is replicated to the other servers. Client libraries find master through the DNS. All read requests are directly satisfied by the master. Write requests, however, do not finish until a majority of the server replicas have been updated. Chubby includes event notification mechanisms to notify clients of any changes to their locks, and it uses extensive caching (including negative caching) for performance reasons.
Chubby has been highly successful inside Google and have also influenced systems like ZooKeeper in the open-source community. From the numbers presented in the paper, Chubby is highly scalable (could handle up to 90,000 connections with provisions for further scaling), available (only 61 outages and out of that 61, 52 were resolved within 30 seconds before applications could notice its absence), and reliable (only 6 data loss incidents: 4 due to software bugs that had been fixed and 2 were operator errors during software bug fixing). This means that Chubby has successfully achieved its goals of providing a available, reliable, and scalable lock service.
One important lesson of this paper is that it is hard to plan how and what developers might use a large, multifunctional system for. It is also hard to force developers to use a service in a particular way. Instead, the designers must update the service itself to prevent accidental harmful activities and adapt the system based on developers’ needs and feedback.