HPC Hardware Design Reliability Benchmarking With HDFIT

Abstract:

Chips pack ever more, ever smaller transistors. Fault rates increase in turn and become more concerning, particularly at the scale of High-Performance Computing (HPC) systems. On one hand, hardware fault protection is costly: more than 10% of silicon area for floating-point units. On the other, HPC users expect correct application output after the anticipated time of computation, but workloads are seldom bit-reproducible and tolerances in output are allowed for. Benign hardware faults causing errors within these tolerances are therefore acceptable. However, with abstract reliability targets such as ‘undetected failures per time,’ current HPC system design does not allow for pursuing trade-offs between reliability and performance with respect to faults. To address this, we propose a user-centric reliability benchmark to specify HPC system reliability targets, allowing for better performance optimizations in hardware design while meeting HPC user expectations. Our open-source Hardware Design Fault Injection Toolkit (HDFIT) enables, for the first time, end-to-end hardware design reliability experiments: from netlist-level fault injection to application output error. As a proof of concept, we present an HPC general matrix multiply (GEMM) reliability study, targeting a series of popular applications and using HDFIT to benchmark an open-source GEMM accelerator.
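
The idea of a fault being benign when the resulting output error stays within the user's tolerance can be illustrated with a small software-level sketch. This is not HDFIT and does not use its API: HDFIT injects faults at the netlist level of a hardware design, whereas the sketch below only flips one bit of one intermediate product inside a naive Python GEMM. The names `flip_bit`, `gemm`, `max_rel_error`, and the tolerance value are all hypothetical and chosen for illustration.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 binary64 encoding inverted."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def gemm(a, b, fault=None):
    """Naive matrix multiply C = A * B on lists of lists.

    `fault` is an optional (row, col, k, bit) tuple (an assumption of this
    sketch): the product a[row][k] * b[k][col] gets bit `bit` flipped before
    it is accumulated.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for k in range(inner):
                prod = a[i][k] * b[k][j]
                if fault is not None and fault[:3] == (i, j, k):
                    prod = flip_bit(prod, fault[3])
                acc += prod
            c[i][j] = acc
    return c


def max_rel_error(ref, out):
    """Largest element-wise relative error; infinite if any output is NaN/Inf."""
    worst = 0.0
    for ref_row, out_row in zip(ref, out):
        for r, o in zip(ref_row, out_row):
            if not math.isfinite(o):
                return math.inf
            worst = max(worst, abs(o - r) / max(abs(r), 1e-300))
    return worst


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
REF = gemm(A, B)   # fault-free reference output
TOLERANCE = 1e-9   # hypothetical user-accepted relative error

for bit in (0, 30, 62):
    err = max_rel_error(REF, gemm(A, B, fault=(0, 0, 0, bit)))
    verdict = "benign" if err <= TOLERANCE else "failure"
    print(f"bit {bit}: relative error {err:.3e} -> {verdict}")
```

A flip in a low mantissa bit tends to vanish in rounding, while a flip in a high exponent bit changes the result drastically. This is the distinction a tolerance-based reliability target is meant to capture.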

Published in: IEEE Transactions on Parallel and Distributed Systems (Volume: 34, Issue: 3, 01 March 2023)

Page(s): 995 - 1006

Date of Publication: 17 January 2023

DOI: 10.1109/TPDS.2023.3237777
