Availability in Globally Distributed Storage Systems

Daniel Ford
Francois Labelle
Florentina Popovici
Murray Stokely
Van-Anh Truong
Luiz Barroso
Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation, USENIX (2010)

Abstract

Highly available cloud storage is often implemented with
complex, multi-tiered distributed systems built on top of clusters of
commodity servers and disk drives. Sophisticated management, load
balancing, and recovery techniques are needed to achieve high
performance and availability amidst an abundance of failure sources
that include software, hardware, network connectivity, and power issues. While
there is a relative wealth of failure studies of individual components of
storage systems, such as disk drives, relatively little has been
reported so far on the overall availability behavior of large
cloud-based storage services.

We characterize the availability properties of cloud
storage systems based on an extensive one-year study of Google's
main storage infrastructure and present statistical models
that enable further insight into the impact of multiple
design choices, such as data placement and replication strategies.
With these models, we compare data availability under a variety of
system parameters given the real patterns of failures observed in our fleet.