Designing and Modelling Selective Replication for Fault-Tolerant HPC Applications | IEEE Conference Publication | IEEE Xplore

Abstract:

Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. Some studies address fail-stop errors and others address SDCs, but few address both types of errors together. In this paper, we propose a software-based selective replication technique for HPC applications that handles both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic, and completely transparent to the user.
Date of Conference: 14-17 May 2017
Date Added to IEEE Xplore: 13 July 2017
Conference Location: Madrid, Spain
