Skip to Main Content
In this paper, we propose a two-phase methodology for systematically evaluating the performability (performance and availability) of cluster-based Internet services. In the first phase, evaluators use a fault-injection infrastructure to characterize the service's behavior in the presence of faults. In the second phase, evaluators use an analytical model to combine an expected fault load with measurements from the first phase to assess the service's performability. Using this model, evaluators can study the service's sensitivity to different design decisions, fault rates, and other environmental factors. To demonstrate our methodology, we study the performability of a multitier Internet service. In particular, we evaluate the performance and availability of three soft state maintenance strategies for an online bookstore service in the presence of seven classes of faults. Among other interesting results, we clearly isolate the effect of different faults, showing that the tier of Web servers is responsible for an often dominant fraction of the service unavailability. Our results also demonstrate that storing the soft state in a database achieves better performability than storing it in main memory (even when the state is efficiently replicated) when we weight performance and availability equally. Based on our results, we conclude that service designers may want an unbalanced system in which they heavily load highly available components and leave more spare capacity for components that are likely to fail more often.