Fault injection is typically used to characterize failures and to validate and compare fault-tolerant mechanisms. However, fault injection is rarely used for all these purposes to guide the design and implementation of a fault-tolerant system. We present a systematic and quantitative approach for using software-implemented fault injection to guide the design and implementation of a fault-tolerant system. Our system design goal is to build a write-back file cache on Intel PCs that is as reliable as a write-through file cache. We follow an iterative approach to improve robustness in the presence of operating system errors. In each iteration, we measure the reliability of the system, analyze the fault symptoms that lead to data corruption, and apply fault-tolerant mechanisms that address those symptoms. Our initial system is 13 times less reliable than a write-through file cache. The result of several iterations is a design that is both more reliable (1.9% vs. 3.1% corruption rate) and 5-9 times as fast as a write-through file cache.