Evaluating the Resilience of Parallel Applications | IEEE Conference Publication | IEEE Xplore

Evaluating the Resilience of Parallel Applications


Abstract:

Reliability is a significant design constraint for supercomputers and large-scale data centers. Modeling the effects of faults on applications targeted to such systems allows system architects and software designers to provision resilience features that improve the fidelity of results and reduce runtimes. In this paper, we propose mechanisms to improve existing techniques for modeling the effect of transient faults on realistic applications. First, we extend the existing Program Vulnerability Factor (PVF) metric to model multi-threaded applications. Then, we demonstrate how to measure the multi-threaded PVF of an application in simulation, and we introduce the ability to account for software detection of hardware faults, differentiating faults that cause detected, uncorrected errors (DUE) from faults that cause silent data corruption (SDC).
Date of Conference: 08-10 October 2018
Date Added to IEEE Xplore: 06 January 2019
Conference Location: Chicago, IL, USA
