REPAIR: Hard-error recovery via re-execution


Abstract:

Processor reliability at upcoming technology nodes presents significant challenges to designers: increased manufacturing variability, parametric variation, and transistor wear-out all lead to permanent faults. We present a design to tolerate this impact at the microarchitectural level: a chip with n cores together with one or more shared instruction re-execution units (IRUs). Instructions using a faulty component are identified and re-executed on an IRU. This design incurs no slowdown in the absence of errors and allows continued operation of all n cores after multiple hard errors on one or all cores in the structures protected by our scheme. Experiments show that a single-core chip experiences only a 23% slowdown with 1 error, rising to 43% in the presence of 5 errors. In a 4-core scenario with 4 errors on every core and a shared IRU, REPAIR achieves 0.68× the performance of a fully functioning system.
Date of Conference: 12-14 October 2015
Date Added to IEEE Xplore: 09 November 2015
Conference Location: Amherst, MA, USA
