The reliability of a large-scale storage system is influenced by a complex set of interdependent factors. This paper presents a comprehensive and extensible analytical framework that offers quantitative answers to many design tradeoffs. We apply the framework to a number of important design strategies that designers and administrators face in practice, including topology-aware replica placement and proactive replication, which uses a small amount of background network bandwidth and unused disk space to create additional copies. We also quantify the impact of slow (but potentially more accurate) failure detection and of lazy replacement of failed disks. We use detailed simulation to verify and refine our analytical model. These results demonstrate the versatility of the framework and serve as a solid step towards more quantitative studies of the fundamental tradeoffs between reliability, performance, and cost in large-scale distributed storage systems.