This paper describes the architecture and implementation of Libra, a library for implementing efficient reliable distributed applications. Libra is designed to provide fault-tolerance transparency and a simple easy to use high-level message passing interface so that the development of reliable distributed applications can be significantly simplified. Fault-tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level network communication protocol. By employing novel mechanisms, Libra minimises communication overhead for taking a consistent distributed checkpoint and catching messages in transit. With efficient implementation techniques, the prototype of Libra has been implemented on a network of Sun workstations and supports reliable distributed computing at low run-time cost. The simplicity and efficiency of Libra make it a promising approach to construct reliable distributed applications
Published in:
Algorithms & Architectures for Parallel Processing, 1996. ICAPP 96. 1996 IEEE Second International Conference on
Date of Conference: 11-13 Jun 1996