Skip to Main Content
While attention in the realm of computer design has shifted away from the classic DRAM soft-error rate (SER) and focused instead on SRAM and microprocessor latch sensitivities as sources of potential errors, DRAM SER nonetheless remains a challenging problem. This is true even though both cosmic ray-induced and alpha-particle-induced DRAM soft errors have been well modeled and, to a certain degree, well understood. However, the often-overlooked alignment of a DRAM hard error and a random soft error can have major reliability, availability, and serviceability (RAS) implications for systems that require an extremely long mean time between failures. The net of this effect is that what appears to be a well-behaved, single-bit soft error ends up overwhelming a seemingly state-of-the-art mitigation technique. This paper describes some of the history of DRAM soft-error discovery and the subsequent development of mitigation strategies. It then examines some architectural considerations that can exacerbate the effect of DRAM soft errors and may have system-level implications for today's standard fault-tolerance schemes.
Note: The Institute of Electrical and Electronics Engineers, Incorporated is distributing this Article with permission of the International Business Machines Corporation (IBM) who is the exclusive owner. The recipient of this Article may not assign, sublicense, lease, rent or otherwise transfer, reproduce, prepare derivative works, publicly display or perform, or distribute the Article.