As the number of components in parallel computers grows, so does their fault frequency. In such systems, faults are no longer rare events but a common problem, so some form of fault tolerance should be provided. In general, fault tolerance protocols rely on checkpoints, and a common question in checkpointing is how to define the checkpoint interval. In this paper we propose modelling the relationship that message exchange establishes between the processes of a parallel application, so that this relationship can be incorporated into current checkpoint interval models. The experimental evaluation shows that our checkpoint interval model, which is based on the definition of the parallel application's inter-process dependency factor, is effective for calculating the checkpoint interval of parallel applications. Our results show that the overhead prediction error is less than 4% compared with the actual application execution.
Published in: 2011 31st International Conference on Distributed Computing Systems (ICDCS)
Date of Conference: 20-24 June 2011