Multicore processors are increasingly deployed in high-performance computing systems. As system complexity grows, the probability of failure rises substantially, so such systems need techniques that support fault tolerance. Checkpointing is widely used to reduce the execution time of long-running programs in the presence of failures and to improve the reliability of such systems. Optimizing the number of checkpoints in a parallel application running on a multicore processor is a challenging task: infrequent checkpointing results in long reprocessing time after a failure, while overly short checkpointing intervals lead to high checkpointing overhead. Because this is a multi-objective optimization problem, a search can easily become trapped in local optima. Bio-inspired algorithms, by contrast, are powerful function optimizers that have been used successfully to solve problems in many different areas. In this paper, we apply a genetic algorithm, a well-known bio-inspired computing algorithm, to find an optimal checkpoint placement in parallel applications. Under certain fault conditions, this new checkpoint placement strategy outperforms existing ones, with a significant reduction in total wasted time. Our experimental results show that the method, which can be implemented on any message-passing multicore system, can optimally locate the points at which checkpoints should be taken.
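To make the trade-off between reprocessing time and checkpointing overhead concrete, the following is a minimal sketch of a genetic algorithm searching over checkpoint placements. It is not the method evaluated in this paper. It assumes a single process whose work is split into equal segments, a fixed checkpoint cost, and a first-order waste model in which an interval of length T contributes an expected rework of `fail_rate * T**2 / 2`. It also collapses the objectives into one scalar. The function names, parameter values, and cost model are all illustrative assumptions.

```python
import random


def wasted_time(placement, seg=1.0, ckpt_cost=0.5, fail_rate=0.05):
    """Assumed first-order model: checkpoint overhead plus expected rework.

    placement[i] == 1 means a checkpoint is taken between segment i and i + 1.
    """
    waste = 0.0
    interval = seg  # work accumulated since the last checkpoint
    for bit in placement:
        if bit:
            # Pay for the checkpoint and for the expected rework of the
            # interval it closes, then start a new interval.
            waste += ckpt_cost + fail_rate * interval ** 2 / 2
            interval = 0.0
        interval += seg
    # Work after the last checkpoint is still exposed to failures.
    return waste + fail_rate * interval ** 2 / 2


def evolve(n_slots=39, pop_size=40, generations=150, mut_rate=0.03, seed=0):
    """Plain generational GA: tournament selection, one-point crossover,
    bit-flip mutation, and two-individual elitism."""
    rng = random.Random(seed)
    # Start from the two trivial baselines (never / always checkpoint)
    # plus random placements.
    pop = [[0] * n_slots, [1] * n_slots]
    pop += [[rng.randint(0, 1) for _ in range(n_slots)]
            for _ in range(pop_size - 2)]
    for _ in range(generations):
        pop.sort(key=wasted_time)
        nxt = [pop[0][:], pop[1][:]]  # elitism: the best never gets worse
        while len(nxt) < pop_size:
            a = min(rng.sample(pop, 3), key=wasted_time)
            b = min(rng.sample(pop, 3), key=wasted_time)
            cut = rng.randrange(1, n_slots)
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < mut_rate) for bit in child]
            nxt.append(child)
        pop = nxt
    return min(pop, key=wasted_time)


if __name__ == "__main__":
    best = evolve()
    print("checkpoints after segments:",
          [i + 1 for i, bit in enumerate(best) if bit])
    print("estimated wasted time:", round(wasted_time(best), 3))
```

Under these assumed parameters, never checkpointing costs 40 time units of expected waste and checkpointing after every segment costs 20.5. Because both baselines are in the initial population and the best individuals are carried over each generation, the returned placement is never worse than either. A message-passing application would additionally need to account for consistency between the checkpoints of communicating processes, which this sketch ignores.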