Quality of life in the United States relies, in large measure, on the continuous operation of a complex infrastructure. This infrastructure comprises physical and information-based facilities, networks, and assets which, if disrupted, would seriously impact the health, safety, and security of citizens or the effective functioning of governments and industries. It includes telecommunications, energy, banking and finance, transportation, water, healthcare, government, and emergency systems. All of these systems are linked through vast physical and cyber networks that have become completely interdependent. These networks consist of a multitude of distributed, heterogeneous components so tightly interconnected that a focal disaster can lead to widespread failure almost instantaneously. A disaster is an event that can cause system-wide malfunction as a result of one or more failures within a system. Disasters may result from single-point or multi-point failures, and those failures may occur either simultaneously or sequentially. Disaster tolerance is a superset of fault tolerance in that a disaster may be caused by multiple points of failure that occur very close together in time, as well as by a single point of failure that escalates into a system-wide catastrophic failure. Ensuring continued system operation in the event of a disaster requires highly reliable and survivable design of distributed and interdependent systems. This paper evaluates specific methodologies for disaster-tolerant systems engineering for improved command and control of critical infrastructure systems. The current state of disaster-tolerant application systems is explored, including an investigation into the reliability and survivability requirements necessary to achieve disaster-tolerant system operation.