Sender-based message logging (SBML) with checkpointing requires no specialized hardware and, by logging messages in the sender's volatile memory, greatly reduces the failure-free overhead of synchronous logging. These features make it applicable to many distributed systems as a low-cost, transparent rollback recovery technique. However, the original SBML recovery algorithm may fail to make progress in some cases of transient communication errors. This paper proposes a consistent recovery algorithm that solves this problem by piggybacking a small amount of log information about unstable received messages on each acknowledgement message that returns the receive sequence number (RSN) assigned to a message by its receiver. This piggybacking also allows message send operations that are delayed behind earlier message receive operations during failure-free execution to begin much earlier than in existing SBML algorithms.
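The piggybacking idea can be illustrated with a minimal Python sketch. It is one possible reading of the abstract, not the paper's algorithm: the class names (`Ack`, `Receiver`, `Sender`), the `(sender_id, ssn, rsn)` entry format, the `foreign` list, and the `confirm` step are all assumptions made for illustration. Here "unstable" is taken to mean a received message whose receive sequence number (RSN) has not yet been confirmed as recorded in its sender's volatile log.

```python
from dataclasses import dataclass, field


@dataclass
class Ack:
    """Hypothetical acknowledgement: returns the RSN plus piggybacked entries."""
    ssn: int    # send sequence number of the message being acknowledged
    rsn: int    # receive sequence number assigned by the receiver
    piggyback: list = field(default_factory=list)  # (sender_id, ssn, rsn) tuples


class Receiver:
    def __init__(self):
        self.next_rsn = 1
        self.unstable = []  # received messages not yet confirmed as logged

    def receive(self, sender_id, ssn):
        rsn = self.next_rsn
        self.next_rsn += 1
        # Assumption: the ack carries entries for all still-unstable messages.
        ack = Ack(ssn=ssn, rsn=rsn, piggyback=list(self.unstable))
        self.unstable.append((sender_id, ssn, rsn))
        return ack

    def confirm(self, sender_id, ssn):
        # The sender reported that it recorded the RSN; the message is stable.
        self.unstable = [e for e in self.unstable if e[:2] != (sender_id, ssn)]


class Sender:
    def __init__(self, sender_id):
        self.sender_id = sender_id
        self.next_ssn = 1
        self.log = {}      # volatile log: ssn -> {"data": ..., "rsn": ...}
        self.foreign = []  # piggybacked entries kept for other senders' messages

    def send(self, data):
        ssn = self.next_ssn
        self.next_ssn += 1
        self.log[ssn] = {"data": data, "rsn": None}  # RSN filled in on ack
        return ssn

    def on_ack(self, ack):
        self.log[ack.ssn]["rsn"] = ack.rsn
        self.foreign.extend(ack.piggyback)


# Usage: two senders, one receiver; the ack to q carries p's unstable entry.
p, q, r = Sender("p"), Sender("q"), Receiver()
ack_p = r.receive("p", p.send("a"))
ack_q = r.receive("q", q.send("b"))
q.on_ack(ack_q)
```

In this sketch the RSN information for a message is held in more than one volatile log, which is the kind of redundancy that could let a receiver stop waiting on a single lost acknowledgement. How the paper actually uses the piggybacked entries during recovery is not stated in the abstract.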
Published in:
High Performance Computing and Simulation (HPCS), 2011 International Conference on
Date of Conference: 4-8 July 2011