
Understanding Mixed Precision GEMM with MPGemmFI: Insights into Fault Resilience

Abstract:

Emerging deep learning workloads urgently need fast general matrix multiplication (GEMM). Thus, one of the critical features of machine-learning-specific accelerators such as NVIDIA Tensor Cores, AMD Matrix Cores, and Google TPUs is support for mixed-precision GEMM. For DNN models, lower-precision FP data formats and computation offer acceptable correctness along with significant improvements in performance, area, and memory footprint. While promising, the error resilience of mixed-precision computation remains unexplored. To this end, we develop a fault injection framework that systematically injects faults into mixed-precision computation results. We investigate how the faults affect the accuracy of machine learning applications. Based on the observed error resilience characteristics, we offer lightweight error detection and correction solutions that significantly improve the overall model accuracy, by 75%, when the models experience hardware faults. The solutions can be efficiently integrated into the accelerator's pipelines.
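To make the idea of injecting a fault into a mixed-precision GEMM result concrete, the following is a minimal sketch and not the MPGemmFI implementation, whose details are not given here. It assumes FP16 inputs with FP32 accumulation and a single-bit-flip fault model applied to one output element. All function names (`to_fp16`, `to_fp32`, `flip_bit_fp32`, `mixed_precision_gemm`, `inject_fault`) are hypothetical and chosen for illustration only.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def flip_bit_fp32(x: float, bit: int) -> float:
    """Flip one bit of x viewed as FP32 (0 = mantissa LSB, 31 = sign bit)."""
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in the range 0..31")
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))[0]


def mixed_precision_gemm(a, b):
    """Compute C = A x B with inputs rounded to FP16 and an FP32 accumulator.

    Assumption: this emulates one common mixed-precision scheme in software;
    real accelerators differ in rounding and accumulation order.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for k in range(inner):
                product = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + product)  # keep the running sum in FP32
            c[i][j] = acc
    return c


def inject_fault(c, row: int, col: int, bit: int):
    """Return a copy of C with a single bit flipped in element (row, col)."""
    faulty = [r[:] for r in c]
    faulty[row][col] = flip_bit_fp32(c[row][col], bit)
    return faulty


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    clean = mixed_precision_gemm(a, identity)
    for bit in (0, 23, 31):
        faulty = inject_fault(clean, 0, 0, bit)
        print(f"bit {bit}: {clean[0][0]} -> {faulty[0][0]}")
```

Under this fault model, the position of the flipped bit determines the size of the error: a flip in the lowest mantissa bit changes the value only slightly, while a flip in the sign or exponent bits changes it drastically. A study of this kind would sweep such injections across many output elements and bit positions and record the effect on model accuracy.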

Published in: 2024 IEEE International Conference on Cluster Computing (CLUSTER)

Date of Conference: 24-27 September 2024

Date Added to IEEE Xplore: 07 November 2024

DOI: 10.1109/CLUSTER59578.2024.00022

Conference Location: Kobe, Japan
