Bigtable: A Distributed Storage System for Structured Data

Google, “Bigtable: A Distributed Storage System for Structured Data,” OSDI, 2006. [PDF]

Summary

Bigtable is a large-scale (petabytes of data across thousands of machines) distributed storage system for managing structured data. It is built on top of several existing Google technologies (e.g., GFS, Chubby, and Sawzall) and is used by many of Google’s online services. The authors state flexibility and high performance as the two primary goals of Bigtable, while supporting applications with diverse requirements (e.g., data sizes ranging from URLs to satellite imagery, and latency needs ranging from backend bulk processing to real-time data serving).

Essentially, Bigtable is a “sparse, distributed, persistent multi-dimensional sorted map” that maps each <row, column, timestamp> tuple to an uninterpreted array of bytes. Data in Bigtable are maintained in tables that are partitioned into row ranges called tablets. Tablets are the units of data distribution and load balancing in Bigtable, and each tablet server manages some number of tablets. Each table can have hundreds of column families, and each column family can have an unbounded number of columns. Each cell can hold multiple versions of its data, sorted by timestamp with the most recent version first. Bigtable supports single-row transactions, which can be used to perform atomic read-modify-write sequences on data stored under a single row key; unlike a standard RDBMS, it does not support general transactions across row keys.
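To make the data model concrete, here is a minimal in-memory sketch (hypothetical class and method names, not Bigtable’s actual API) of a sorted map keyed by <row, column, timestamp>; the webtable row and column values are taken from the paper’s example.

```python
from typing import Dict, Tuple

class TinyBigtableModel:
    """Hypothetical in-memory illustration of Bigtable's data model: a sorted
    map from (row key, column, timestamp) to an uninterpreted byte string."""

    def __init__(self):
        self._cells: Dict[Tuple[str, str, int], bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        self._cells[(row, column, timestamp)] = value

    def get_latest(self, row: str, column: str) -> bytes:
        # Cells can hold multiple timestamped versions; return the newest one.
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row and c == column]
        if not versions:
            raise KeyError((row, column))
        return max(versions)[1]

    def scan_row_range(self, start_row: str, end_row: str):
        # Rows sort lexicographically, so a scan over consecutive row keys
        # touches a contiguous slice of the map; a tablet is exactly such a
        # row range, which is what makes it a natural unit of distribution.
        for key in sorted(self._cells):
            if start_row <= key[0] < end_row:
                yield key, self._cells[key]


# Row/column values from the paper's webtable example: the row key is a
# reversed URL so that pages from the same domain sort near each other.
table = TinyBigtableModel()
table.put("com.cnn.www", "contents:", 3, b"<html>...</html>")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
print(table.get_latest("com.cnn.www", "anchor:cnnsi.com"))  # b'CNN'
```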

GFS is used underneath Bigtable to store the actual data in the Google SSTable file format. An SSTable provides a persistent, ordered, immutable map from keys to values and allows iteration over all key/value pairs. Writes in Bigtable first go to a redo log in GFS, and recent writes are buffered in an in-memory memtable. Reads are served from a merged view of the memtable and the SSTables in GFS. Bigtable uses Chubby to ensure there is a single active master, to discover tablet servers, to store Bigtable schema and access-control metadata, and, above all, as the root of a three-level tablet location hierarchy. Client-side caching of tablet locations ensures that locating a tablet does not cost up to three network round-trips (or six with a stale cache) on every access.
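The sketch below illustrates this write/read path in a greatly simplified, hedged form (hypothetical names, no GFS, no recovery logic): writes are logged and applied to the memtable, reads merge the memtable with the SSTables from newest to oldest, and a minor compaction freezes the memtable into a new SSTable.

```python
from typing import Dict, List, Optional

class SSTable:
    """Immutable, ordered key->value map, persisted in GFS in the real system."""
    def __init__(self, data: Dict[str, bytes]):
        self._data = dict(data)

    def lookup(self, key: str) -> Optional[bytes]:
        return self._data.get(key)


class TabletServerSketch:
    def __init__(self):
        self.memtable: Dict[str, bytes] = {}   # sorted buffer of recent writes
        self.sstables: List[SSTable] = []      # newest first

    def write(self, key: str, value: bytes) -> None:
        # Real Bigtable first appends the mutation to a redo (commit) log in
        # GFS; only then is the write applied to the memtable.
        self._append_to_redo_log(key, value)   # hypothetical stand-in helper
        self.memtable[key] = value

    def read(self, key: str) -> Optional[bytes]:
        # Reads see a merged view: memtable first, then SSTables newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in self.sstables:
            value = sstable.lookup(key)
            if value is not None:
                return value
        return None

    def minor_compaction(self) -> None:
        # When the memtable grows too large, it is frozen, written out as a
        # new SSTable, and replaced by a fresh memtable.
        self.sstables.insert(0, SSTable(self.memtable))
        self.memtable = {}

    def _append_to_redo_log(self, key: str, value: bytes) -> None:
        pass  # stand-in for the commit-log append to GFS
```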

While the basic design is not too complicated, Bigtable includes a large number of refinements/optimizations to achieve high performance, including column-family locality groups, compression, caching, Bloom filters, and a single commit log per tablet server, among others.
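As one example of these refinements, Bigtable can attach a Bloom filter to each SSTable so that reads of absent rows/columns usually skip the GFS lookup entirely. The following is a hedged, simplified sketch of such a filter (parameters and hashing scheme are illustrative assumptions, not Bigtable’s actual implementation).

```python
import hashlib

class BloomFilterSketch:
    """Simplified Bloom filter: answers 'definitely absent' or 'maybe present'."""
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # all zeros initially

    def _positions(self, key: str):
        # Derive num_hashes bit positions from salted hashes of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means the key is definitely not in the SSTable, so the read
        # never touches disk; True means the SSTable must actually be checked.
        return all(self.bits[pos] for pos in self._positions(key))
```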

Comments

One of the key tradeoffs made by the Bigtable designers was going for a general design and leaving many performance decisions to its users. Bigtable positioned itself between a full database and a distributed key/value store, an early step toward adopting database concepts in general-purpose key/value stores. The authors promised further improvements (e.g., secondary indices) to their design, but there was never a follow-up paper.

Judging by the numbers, Bigtable was highly influential inside Google when this paper was published. I believe it is general enough to survive to this day as the back-end for many of their newer services. In the open-source world, HBase on HDFS would be its counterpart. Given Facebook’s wide adoption of HBase, we can safely say Bigtable has had considerable influence outside Google as well.
