A modern digital system integrates many components on a single chip: processing cores, large caches, memory controllers, and hardware accelerators. Future semiconductor technologies will enable even higher device integration, increasing overall system performance while reducing energy consumption. Unfortunately, prominent experts agree that such technologies will be prone to both permanent and transient faults within their lifetime. To address this issue, we propose Cardio: a low-cost architecture for reliable chip multiprocessors (CMPs). Our solution is based on a novel hardware/software co-design in which silicon failures are detected in hardware and system reconfiguration is managed in software. Comparing Cardio with Immunet, a state-of-the-art hardware-based resiliency solution, we found that our design achieves a comparable fault response time while requiring a much lower area overhead. The proposed solution relies on a distributed resource manager to collect information about the health of CMP components, and leverages a synchronized distributed control mechanism to recover from permanent failures. Such an architecture can operate as long as at least one general-purpose processor is still functional. Our experimental evaluation indicates that the overall performance impact of Cardio is as low as 4.5%, and that its dynamic reconfiguration time upon fault detection ranges from 20,000 to 50,000 cycles.
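The division of labor described above (hardware reports a fault, software retires the component, and the system keeps running while at least one general-purpose processor survives) can be illustrated with a toy model. This is only a minimal sketch under stated assumptions: the `HealthMap` class, the `Kind` categories, the component names, and `build_example` are hypothetical and are not taken from Cardio's actual resource manager or protocol, and the sketch models a single manager's view rather than the distributed, synchronized mechanism.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    # Component categories named in the abstract (labels are hypothetical).
    CORE = "core"            # general-purpose processor
    CACHE = "cache"
    MEM_CTRL = "mem_ctrl"
    ACCEL = "accelerator"


@dataclass
class HealthMap:
    """Hypothetical view of which on-chip components are still usable."""
    healthy: dict = field(default_factory=dict)  # component name -> Kind

    def add(self, name, kind):
        self.healthy[name] = kind

    def report_permanent_fault(self, name):
        # Software-side reaction to a hardware-detected fault:
        # retire the component so it is no longer scheduled or routed to.
        self.healthy.pop(name, None)

    def can_operate(self):
        # Condition stated in the abstract: at least one
        # general-purpose processor must remain functional.
        return any(kind is Kind.CORE for kind in self.healthy.values())


def build_example():
    # A small, made-up CMP configuration for illustration only.
    hm = HealthMap()
    hm.add("core0", Kind.CORE)
    hm.add("core1", Kind.CORE)
    hm.add("l2_bank0", Kind.CACHE)
    hm.add("mc0", Kind.MEM_CTRL)
    return hm
```

In this toy model, retiring `core0` leaves the system operable, while retiring both cores makes `can_operate()` return `False` even though the cache and memory controller remain healthy.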