From Correctable Memory Errors to Uncorrectable Memory Errors: What Error Bits Tell | IEEE Conference Publication | IEEE Xplore

Abstract:

Uncorrectable memory errors are one of the major causes of failure in datacenters. In this paper, we present an empirical study correlating correctable errors (CEs) and uncorrectable errors (UEs), using large-scale field data covering three major dual in-line memory module (DIMM) manufacturers from a contemporary ByteDance server farm. Unlike previous studies, ours is the first to take into account both the error-bit information of CEs and the DIMM part numbers. Unlike the traditional chipkill error correction code (ECC), the ECC on contemporary Intel server platforms is weakened and cannot tolerate some error-bit patterns from a single chip. Using obtainable coarse-grained ECC knowledge, we derive a new indicator from the error-bit information: risky CE occurrence in terms of the coverage that the ECC guarantees. From the data, we show that the new indicator has consistently high sensitivity and specificity as a test for future UE occurrence across DIMMs from different manufacturers. This leads us to conjecture that the weakened ECC contributes substantially to many UEs today. We then apply the new risky CE indicator to predict future UE occurrence from the CE history. We empirically demonstrate how practically useful predictors can be constructed by combining it with other useful attributes, such as certain micro-level fault indicators and DIMM part numbers, achieving state-of-the-art performance.
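
The abstract evaluates the risky CE indicator by its sensitivity and specificity as a test for future UEs. The following is a minimal sketch of what those two quantities measure for a binary per-DIMM indicator. The function name and the toy lists are hypothetical illustrations; they are not code, data, or results from the paper.

```python
def sensitivity_specificity(indicator, outcome):
    """Return (sensitivity, specificity) for two equal-length boolean sequences.

    indicator[i]: whether DIMM i showed the indicator (e.g. a risky CE).
    outcome[i]:   whether DIMM i later experienced a UE.
    """
    if len(indicator) != len(outcome):
        raise ValueError("indicator and outcome must have the same length")

    pairs = list(zip(indicator, outcome))
    tp = sum(1 for i, o in pairs if i and o)          # flagged, UE followed
    fn = sum(1 for i, o in pairs if not i and o)      # not flagged, UE followed
    tn = sum(1 for i, o in pairs if not i and not o)  # not flagged, no UE
    fp = sum(1 for i, o in pairs if i and not o)      # flagged, no UE

    # Sensitivity: fraction of UE DIMMs that the indicator flagged.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of non-UE DIMMs that the indicator left unflagged.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Toy values for illustration only (assumed, not field data).
    risky_ce_seen = [True, True, False, False, True, False]
    ue_followed = [True, False, False, False, True, True]
    sens, spec = sensitivity_specificity(risky_ce_seen, ue_followed)
    print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

An indicator is useful as a test when both values are high: it flags most DIMMs that go on to fail while leaving most healthy DIMMs unflagged.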
Date of Conference: 13-18 November 2022
Date Added to IEEE Xplore: 23 February 2023
Conference Location: Dallas, TX, USA
