Today's mainstream electronic systems typically assume that transistors and interconnects operate correctly over their useful lifetime. With enormous complexity and significantly increased vulnerability to failures compared to the past, future system designs cannot rely on such assumptions. For coming generations of silicon technologies, several causes of hardware reliability failures, largely benign in the past, are becoming significant at the system level. Robust system design is essential to ensure that future systems perform correctly despite rising complexity and increasing disturbances. This paper describes three techniques that can enable a sea change in robust system design through cost-effective tolerance and prediction of failures in hardware during system operation: 1) efficient soft error resilience; 2) circuit failure prediction; and 3) effective on-line self-test and diagnostics. The need for global optimization across multiple abstraction layers is also demonstrated.
Published in:
Emerging and Selected Topics in Circuits and Systems, IEEE Journal on
(Volume:1
,
Issue:
1
)
Date of Publication: March 2011