The popularity of large-scale Internet infrastructure services such as AOL, Google, and Hotmail has grown enormously. The scalability and availability requirements of these services have led to system architectures that diverge significantly from those of traditional systems like desktops, enterprise servers, or databases. Given the need for thousands of nodes, cost necessitates the use of inexpensive personal computers wherever possible, and efficiency often requires customized service software. Likewise, addressing the goal of zero downtime requires human operator involvement and pervasive redundancy within clusters and between globally distributed data centers. Despite these services' success, their architectures (hardware, software, and operational) have developed in an ad hoc manner that few have surveyed or analyzed. Moreover, the public knows little about why these services fail or about the operational practices used in an attempt to keep them running 24/7. As a first step toward formalizing the principles for building highly available and maintainable large-scale Internet services, we are surveying existing services' architectures and dependability. This article describes our observations to date.