Abstract:
Reliability is a significant design constraint for supercomputers and large-scale data centers. Modeling the effects of faults on applications targeted to such systems allows system architects and software designers to provision resilience features that improve the fidelity of results and reduce runtimes. In this paper, we propose mechanisms to improve existing techniques for modeling the effect of transient faults on realistic applications. First, we extend the existing Program Vulnerability Factor (PVF) metric to model multi-threaded applications. Then we demonstrate how to measure the multi-threaded PVF of an application in simulation and introduce the ability to account for software detection of hardware faults, differentiating faults that cause detected, uncorrected errors (DUE) from faults that cause silent data corruption (SDC).
Published in: 2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT)
Date of Conference: 08-10 October 2018
Date Added to IEEE Xplore: 06 January 2019
IEEE Keywords (Index Terms):
- System Software
- Corrupted Data
- Transient Faults
- Infrastructure
- Operating System
- Unit Time
- Backpropagation
- Read Data
- Matrix Multiplication
- Error Detection
- Shared Resource
- Operator Of Order
- Computing Units
- Relaxed Model
- Local Memory
- Program Execution
- Single Thread
- Output Error
- Multiple Threads
- Program Output
- Private Resources