Summary form only given. Distributed systems are ubiquitous today and support many applications, including client-server systems, transaction processing, the World Wide Web, and scientific computing. They are not inherently fault-tolerant, however, and their vast computing potential is often hampered by their susceptibility to failures. Many techniques, such as transactions, group communication, and rollback recovery, have been developed to add reliability and high availability to distributed systems. This talk deals with rollback recovery protocols, which restore the system to a consistent state after a failure. Fault tolerance is achieved by periodically saving the state of a process during failure-free execution and, upon a failure, restarting from a saved state to reduce the amount of lost work. The speaker will present his recent results on checkpointing and failure recovery in distributed systems and wireless networks. Specifically, he will present a classification of checkpointing algorithms, a communication-induced checkpointing algorithm that prevents useless checkpoints by tracking and preventing potential Z-cycles, and the concept of mutable checkpoints for efficient checkpointing in wireless networks. He will conclude the talk with some open problems.
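The basic idea of periodic state saving and restart can be illustrated with a minimal sketch for a single process. This is not one of the algorithms from the talk: the class name `CheckpointedProcess`, the `interval` parameter, and the in-memory copy standing in for stable storage are all assumptions made for illustration, and the sketch ignores the problem of keeping checkpoints consistent across communicating processes.

```python
import copy


class CheckpointedProcess:
    """Toy process (hypothetical) that saves its state every `interval` steps."""

    def __init__(self, interval):
        self.interval = interval
        self.state = {"step": 0, "total": 0}
        # Initial checkpoint; an in-memory copy stands in for stable storage.
        self._saved = copy.deepcopy(self.state)

    def step(self, value):
        """Do one unit of work, then checkpoint if the interval has elapsed."""
        self.state["step"] += 1
        self.state["total"] += value
        if self.state["step"] % self.interval == 0:
            self.checkpoint()

    def checkpoint(self):
        """Save a copy of the current state."""
        self._saved = copy.deepcopy(self.state)

    def recover(self):
        """Roll back to the last checkpoint; return the number of lost steps."""
        lost = self.state["step"] - self._saved["step"]
        self.state = copy.deepcopy(self._saved)
        return lost


proc = CheckpointedProcess(interval=3)
for v in [1, 2, 3, 4, 5]:
    proc.step(v)

# Simulate a failure after step 5: only the work since step 3 is lost.
lost_steps = proc.recover()
```

A shorter interval reduces the work lost on recovery at the cost of more frequent state saving.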