Reliability is an important aspect of any system. On-line diagnosis, parity check coding, triple modular redundancy, and other methods have been used to improve the reliability of computing systems. In this paper another aspect of reliable computing systems is explored: recovering error-free information when an error is detected at some stage in the processing of a program. If an error or fault is detected while a program is being processed and cannot be corrected immediately, it may be necessary to run the entire program again. The time spent rerunning the program may be substantial and, in some real-time applications, critical. Recovery time can be reduced by saving states of the program (all the information stored in registers, primary and secondary storage, etc.) at intervals as processing continues. If an error is detected, the program is restarted from its most recently saved state. However, saving a state has a price: the time spent storing all the relevant information in secondary storage. Hence it is expensive to save the state of the program too often, while saving no state at all may cause an unacceptably large recovery time. The problem that we solve is the following: determine the optimum points at which the state of the program should be stored so that it can recover after any malfunction.
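
The trade-off described above can be illustrated with a small sketch. It is not the model developed in this paper but a simplified stand-in under assumptions chosen for illustration: the program is a sequence of tasks with known durations, a state may be saved only between tasks at a fixed cost, failures arrive as a Poisson process, an error is detected immediately, and reloading a saved state takes no time. Under these assumptions the expected time to complete `w` units of uninterrupted work at failure rate `rate` is `(exp(rate * w) - 1) / rate`, and a dynamic program over the task boundaries picks the save points that minimize the expected total run time. The names `expected_segment_time`, `optimal_checkpoints`, `durations`, `save_cost`, and `rate` are hypothetical.

```python
import math


def expected_segment_time(work, rate):
    """Expected time to finish `work` units with no interruption, assuming
    Poisson failures at `rate` and a restart of the segment on each failure."""
    if rate == 0:
        return work
    return math.expm1(rate * work) / rate


def optimal_checkpoints(durations, save_cost, rate):
    """Return (expected total time, task indices after which to save state).

    Assumption: a save may follow any task except the last, costs
    `save_cost`, and is itself exposed to failures.
    """
    n = len(durations)
    prefix = [0.0]
    for d in durations:
        prefix.append(prefix[-1] + d)

    # best[j]: minimum expected time to finish tasks 1..j with state saved at j
    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        cost_j = 0.0 if j == n else save_cost  # nothing to save after the end
        for i in range(j):
            segment = prefix[j] - prefix[i] + cost_j
            total = best[i] + expected_segment_time(segment, rate)
            if total < best[j]:
                best[j] = total
                prev[j] = i

    # Walk the predecessor links back from the end to list the save points.
    points = []
    j = n
    while j > 0:
        j = prev[j]
        if j > 0:
            points.append(j)
    points.reverse()
    return best[n], points


if __name__ == "__main__":
    total, points = optimal_checkpoints([1.0] * 10, save_cost=0.2, rate=0.3)
    print("expected time:", total, "save after tasks:", points)
```

In this sketch a failure rate of zero yields no save points at all, while a save cost of zero yields a save after every task; intermediate values fall between these two extremes, which mirrors the cost balance described in the abstract.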