Understanding Mixed Precision GEMM with MPGemmFI: Insights into Fault Resilience


Abstract:

Emerging deep learning workloads urgently need fast general matrix multiplication (GEMM). Thus, one of the critical features of machine-learning-specific accelerators such as NVIDIA Tensor Cores, AMD Matrix Cores, and Google TPUs is support for mixed-precision GEMM. For DNN models, lower-precision FP data formats and computation offer acceptable correctness along with significant improvements in performance, area, and memory footprint. While promising, the error resilience of mixed-precision computation remains unexplored. To this end, we develop a fault injection framework that systematically injects faults into the mixed-precision computation results. We investigate how the faults affect the accuracy of machine learning applications. Based on error resilience characteristics, we offer lightweight error detection and correction solutions that significantly improve the overall model accuracy by 75% when the models experience hardware faults. The solutions can be efficiently integrated into the accelerator's pipelines.
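
To make the idea of injecting faults into mixed-precision GEMM results concrete, the following is a minimal sketch and not the MPGemmFI framework itself. It assumes FP16 inputs with FP32 accumulation and a single-bit-flip fault model applied to one output element; both choices, and all function names (`mp_gemm`, `inject_fault`, `flip_bit_fp32`), are illustrative assumptions rather than details taken from the paper.

```python
import random
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def flip_bit_fp32(x: float, bit: int) -> float:
    """Flip one bit of x's FP32 encoding (0 = mantissa LSB, 31 = sign)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))[0]


def mp_gemm(a, b):
    """Compute C = A @ B with FP16-rounded inputs and FP32-rounded accumulation.

    Assumed precision scheme, chosen for illustration only.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc = to_fp32(acc + to_fp16(a[i][p]) * to_fp16(b[p][j]))
            c[i][j] = acc
    return c


def inject_fault(c, rng):
    """Return a copy of C with one random bit flipped in one random element.

    Hypothetical fault model: a single bit flip in the FP32 result.
    Also returns the fault site as (row, column, bit).
    """
    i = rng.randrange(len(c))
    j = rng.randrange(len(c[0]))
    bit = rng.randrange(32)
    faulty = [row[:] for row in c]
    faulty[i][j] = flip_bit_fp32(c[i][j], bit)
    return faulty, (i, j, bit)
```

Under these assumptions, a resilience study would repeat `inject_fault` over many fault sites and compare downstream model accuracy with and without the fault. The bit position matters: a flip in a high exponent bit typically changes the value far more than a flip in a low mantissa bit.
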
Date of Conference: 24-27 September 2024
Date Added to IEEE Xplore: 07 November 2024
Conference Location: Kobe, Japan