Resilience is becoming an increasingly critical performance requirement for future large-scale computing systems. In data center and high-performance computing systems with many thousands of nodes, errors in main memory can be a significant source of failures. As a result, large-scale memory systems must employ advanced error detection and correction techniques to mitigate failures. Memory devices are primarily designed for density, optimizing memory capacity and throughput, rather than resilience. A strict focus on memory performance instead of resilience risks undermining the overall stability of next-generation computers. In this work, we leverage an optically connected memory system to optimize both memory performance and resilience. A multicast-capable optical interconnection network replaces the traditional electronic bus between a processor and its main memory, allowing for a novel error-correction technique based on dynamic bit-steering. As compared to an electronically connected approach, we demonstrate significantly higher memory bandwidths and reduced latencies, in addition to a 700× improvement in resilience.