A Systematic Study of DDR4 DRAM Faults in the Field | IEEE Conference Publication | IEEE Xplore

A Systematic Study of DDR4 DRAM Faults in the Field


Abstract:

This paper presents a study of DDR4 DRAM faults in a large fleet of commodity servers, covering several billion memory device-hours of data. The goal of this study is to understand faults in DDR4 DRAM devices to measure the efficacy of existing hardware resilience techniques and aid in designing more resilient systems for future large-scale systems. The study has several key findings about the fault characteristics of DDR4 DRAMs and adds several novel insights about system reliability to the existing literature. Specifically, the data show sixteen unique fault modes in the DDR4 DRAM under study, including several that have not been previously reported. Over 45% of the faults that occurred affected multiple DRAM bits. The time-to-failure characteristics of faults internal to the DRAM die differ from those external to the DRAM die. We also examine faults from multiple DRAM vendors, finding that fault rates vary by more than 1.34x among vendors. Finally, we use the data to compare chipkill ECC and an ECC that covers a DDR5 "bounded fault." Given the fault rates in this data, a bounded fault ECC increases the rate of faults that cause uncorrectable errors by up to 5.71 FIT per DRAM device compared to chipkill ECC.
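To give a sense of scale for the FIT figure quoted above, the following is a minimal sketch of the standard unit conversion (1 FIT = 1 failure per billion device-hours). The fleet size of one million DRAM devices and the function name are illustrative assumptions, not values from the paper, and 5.71 FIT is the upper bound stated in the abstract rather than a measured average.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def expected_faults_per_year(fit_per_device: float, num_devices: int) -> float:
    """Expected faults per year across a fleet, given a per-device FIT rate.

    1 FIT = 1 failure per 10**9 device-hours.
    """
    device_hours = num_devices * HOURS_PER_YEAR
    return fit_per_device * device_hours / 1e9


# Hypothetical fleet of 1,000,000 DRAM devices (an assumption for illustration).
# 5.71 FIT is the "up to" increase reported in the abstract.
extra = expected_faults_per_year(5.71, 1_000_000)
print(f"Up to {extra:.1f} additional uncorrectable-error faults per year")
```

Under these assumptions the increase works out to roughly 50 additional faults per year across the hypothetical fleet; the result scales linearly with the number of devices.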
Date of Conference: 25 February 2023 - 01 March 2023
Date Added to IEEE Xplore: 24 March 2023
Conference Location: Montreal, QC, Canada
