Current technology scaling is leading to increasingly fragile components, making hardware reliability a primary design consideration. Recently researchers have proposed low-cost reliability solutions that detect hardware faults through software-level symptom monitoring. SWAT (SoftWare Anomaly Treatment), one such solution, demonstrated with microarchitecture-level simulations that symptom-based solutions can provide high fault coverage and a low Silent Data Corruption (SDC) rate. However, more accurate evaluations are needed to validate such solutions for hardware faults in real-world processor designs. In this paper, we evaluate SWAT's symptom-based detectors on gate-level faults using an FPGA-based, full-system prototype. With this platform, we performed a gate-level accurate fault injection campaign of 51,630 fault injections in the OpenSPARC T1 core logic across five SPECInt 2000 benchmarks. With an overall SDC rate of 0.79%, our results are comparable to previous microarchitecture-level evaluations of SWAT, demonstrating the effectiveness of symptom-based software detectors for permanent faults in real-world designs.
Published in:
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2012
Date of Conference: 12-16 March 2012