Mitigating Silent Data Corruptions in HPC Applications across Multiple Program Inputs


Abstract:

With the ever-shrinking size of transistors, silent data corruptions (SDCs) are becoming a common yet serious issue in HPC. Selective instruction duplication (SID) is a widely used fault-tolerance technique that can obtain high SDC coverage with low performance overhead. However, existing SID methods are confined to a single program input in their assessment, assuming that the error resilience of a program remains similar across inputs. We observe that this assumption does not always hold, leading to a drastic loss in SDC coverage across different inputs and compromising HPC reliability. We notice that the SDC coverage loss correlates with a small set of instructions, which we call incubative instructions, that reveal elusive error propagation characteristics across multiple inputs. We propose Minpsid, an automated SID framework that identifies and re-prioritizes incubative instructions in a given program to enhance SDC coverage. Evaluation shows Minpsid can effectively mitigate the loss of SDC coverage across multiple inputs.
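
The coverage loss described above can be illustrated with a minimal sketch. It is not Minpsid or any published SID implementation. It assumes SID is framed as picking instructions to duplicate under an overhead budget, and it uses a simple greedy benefit-per-cost heuristic. The function names, the profile layout, and all numbers are hypothetical and chosen only for illustration. The instruction `i4` loosely plays the role of an instruction whose SDC contribution looks small on the reference input but large on another input.

```python
from typing import Dict, List, Tuple

# Hypothetical profile: instruction id -> (SDC-causing faults observed, duplication cost).
# Costs are assumed to be strictly positive.
Profile = Dict[str, Tuple[int, int]]


def select_instructions(profile: Profile, budget: int) -> List[str]:
    """Greedily pick instructions with the best SDC-benefit-per-cost ratio
    until the overhead budget is exhausted (an illustrative heuristic)."""
    ranked = sorted(profile, key=lambda i: profile[i][0] / profile[i][1], reverse=True)
    chosen: List[str] = []
    spent = 0
    for inst in ranked:
        cost = profile[inst][1]
        if spent + cost <= budget:
            chosen.append(inst)
            spent += cost
    return chosen


def sdc_coverage(profile: Profile, protected: List[str]) -> float:
    """Fraction of SDC-causing faults that land on protected instructions."""
    total = sum(sdc for sdc, _ in profile.values())
    if total == 0:
        return 1.0  # nothing to cover
    covered = sum(profile[i][0] for i in protected if i in profile)
    return covered / total


# Made-up profiles of the same program under two inputs.
reference_input: Profile = {"i1": (50, 10), "i2": (30, 10), "i3": (15, 10), "i4": (5, 10)}
other_input: Profile = {"i1": (20, 10), "i2": (10, 10), "i3": (10, 10), "i4": (60, 10)}

# Protection is chosen using the reference input only, then evaluated on both.
chosen = select_instructions(reference_input, budget=20)
cov_ref = sdc_coverage(reference_input, chosen)
cov_other = sdc_coverage(other_input, chosen)
print(chosen, cov_ref, cov_other)
```

In this toy setup the protection set chosen from the reference input covers most SDCs there but far fewer on the second input, because `i4` was ranked last. Re-prioritizing such instructions is the kind of adjustment the abstract attributes to Minpsid, although how the framework actually identifies them is not stated here.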
Date of Conference: 13-18 November 2022
Date Added to IEEE Xplore: 23 February 2023
Conference Location: Dallas, TX, USA
