Luiz André Barroso and Urs Hölzle, “The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines,” Chapters 1–4 and 7, Morgan & Claypool Publishers. [PDF]
With the advent of large Internet service providers (e.g., Google, Facebook, Microsoft, Yahoo!), clusters full of commodity machines are becoming increasingly common. The authors of this book introduce the term Warehouse-Scale Computer (WSC) to refer to such clusters. According to them, the key differentiating factor between a traditional datacenter and a WSC is a homogeneous hardware and software stack throughout the WSC, all focused on serving a common service or goal. Throughout the reviewed chapters, they touch upon key characteristics of a WSC such as infrastructure, networking, power, and failure/recovery.
The first chapter covers the tradeoffs in designing a WSC: storage choice, network fabric design, storage hierarchy, and power characteristics. Overall, the choices strive for scalability and fault tolerance at the lowest possible cost.
In the second chapter, the authors focus on the workload and corresponding software stack in a WSC (read Google). There are typically three software layers:
- Platform-level software: The firmware and OS kernel in individual machines.
- Cluster-level infrastructure: The collection of common distributed systems/services that include distributed file systems, schedulers, programming models etc.
- Application-level software: Top-level (mostly user-facing) software.
They also discuss the key elements of Google’s performance and availability toolbox, which include replication, sharding (partitioning), load balancing, eventual consistency, etc. There is huge demand for performance- and correctness-debugging tools for distributed applications in WSCs. Google uses Dapper, a lightweight annotation-based tracing tool, as opposed to taking a black-box approach to debugging.
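To make the replication and sharding ideas concrete, here is a minimal sketch in Python. The server names, the hash-based placement, and the replication factor are all illustrative assumptions of mine, not details from the book:

```python
import hashlib

# Illustrative sketch of sharding + replication (not Google's scheme).
SERVERS = ["server-0", "server-1", "server-2", "server-3"]
REPLICATION_FACTOR = 2  # assumed value for the example

def shard_for(key: str) -> int:
    """Map a key deterministically to a shard via hashing."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % len(SERVERS)

def replicas_for(key: str) -> list:
    """Place REPLICATION_FACTOR copies on consecutive servers,
    so that the failure of one server never loses the only copy."""
    start = shard_for(key)
    return [SERVERS[(start + i) % len(SERVERS)]
            for i in range(REPLICATION_FACTOR)]
```

Sharding spreads load across machines; replication lets the service ride out the individual-machine failures that Chapter 7 treats as inevitable.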
In Chapter 3, the authors try to find a sweet spot in hardware choice between low-end machines and high-end shared-memory servers. While working toward the right design, they ask the natural question: “how cheap can the hardware be while still delivering good performance?” The rule of thumb they’ve come up with is the following: a low-end server building block must have a healthy cost-efficiency advantage over a higher-end alternative to be competitive. In the end, they argue for a balanced design that depends on the expected type, size, churn, and other characteristics of the workload, while achieving a good tradeoff between price and performance.
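The rule of thumb reduces to simple arithmetic: performance per dollar, discounted by the cluster-level overhead that hits small building blocks hardest. A sketch with entirely made-up numbers:

```python
# Hedged sketch of the Chapter 3 cost-efficiency comparison.
# All prices, performance figures, and overheads below are invented
# for illustration; they are not from the book.

def cost_efficiency(perf_per_server, price_per_server, cluster_overhead=1.0):
    """Performance per dollar, scaled by a factor for cluster-level
    losses (networking, parallelization inefficiency)."""
    return perf_per_server * cluster_overhead / price_per_server

low_end = cost_efficiency(perf_per_server=1_000, price_per_server=2_000,
                          cluster_overhead=0.8)   # loses efficiency at scale
high_end = cost_efficiency(perf_per_server=8_000, price_per_server=25_000)

# A ratio well above 1 is the "healthy cost-efficiency advantage"
# the low-end option needs before it is worth the added complexity.
print(low_end / high_end)
```

With these toy numbers the low-end block wins on raw cost-efficiency, but only by 25%, which is exactly the kind of thin margin the authors warn may not survive the extra software complexity.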
Chapter 4 discusses the physical design of modern datacenters with a focus on power distribution, cooling systems, and layout of machines. The authors note that self-sufficient container-based datacenters are gaining popularity in recent years.
Finally, in Chapter 7, the authors discuss arguably the most important aspect of working at warehouse scale: failures and repairs. Hardware failure is unavoidable in a datacenter, so the authors argue for putting more effort into designing a resilient software infrastructure layer instead. However, this philosophy still requires prompt and correct detection of hardware failures. The authors outline several categories of failures and their sources (>60% software, >20% hardware, ~10% human error). They note that while failure prediction would be a much-celebrated capability, it is very hard to achieve with existing models.
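The case for resilient software over ultra-reliable hardware can be illustrated with a back-of-the-envelope calculation (my numbers and the independence assumption are illustrative, not the book's):

```python
# Why software-level replication tames unreliable hardware:
# the chance that ALL replicas of an item are down at once
# shrinks exponentially with the replication factor.

def p_data_unavailable(p_machine_down: float, replicas: int) -> float:
    """Probability that every replica is down simultaneously,
    assuming independent machine failures (a simplification)."""
    return p_machine_down ** replicas

p_down = 0.01  # assume 1% of machines are down at any moment
single = p_data_unavailable(p_down, 1)   # no replication: 1 in 100
triple = p_data_unavailable(p_down, 3)   # 3-way: roughly 1 in a million
```

Even with failure-prone commodity machines, three-way replication drives unavailability down by four orders of magnitude, which is why the software layer, not the hardware, carries the reliability burden.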
The authors also give a high-level overview of Google’s System Health monitoring and analysis infrastructure in Chapter 7 (it collects data from machines and batch-processes it using MapReduce) and make a case for improving the cost efficiency of WSCs through efficient failure-discovery and repair processes.
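A map/reduce-style pass over health logs might look like the toy example below. The log format, field names, and event labels are my assumptions, not Google's actual System Health schema:

```python
from collections import defaultdict

# Toy batch analysis over health-monitoring logs, loosely mirroring
# the MapReduce-based pipeline described above. Format is invented.
logs = [
    "machine-1 disk_error",
    "machine-2 ok",
    "machine-1 disk_error",
    "machine-3 dram_error",
]

def map_phase(line):
    """Emit (event, 1) pairs for every non-healthy log line."""
    machine, event = line.split()
    if event != "ok":
        yield (event, 1)

def reduce_phase(pairs):
    """Sum the counts for each event type."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

pairs = [kv for line in logs for kv in map_phase(line)]
print(reduce_phase(pairs))  # {'disk_error': 2, 'dram_error': 1}
```

Aggregates like these feed the failure-discovery and repair decisions that the authors tie to WSC cost efficiency.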
One minor comment: the TPC-* benchmarks are probably not ideal for benchmarking cloud systems or datacenter hardware/software.
Overall, this book provides a short and useful introduction to the underpinnings of modern datacenters/WSCs. I would recommend it to any newcomer to this field.