Abstract:
Processor reliability at upcoming technology nodes presents significant challenges to designers from increased manufacturing variability, parametric variation and transistor wear-out leading to permanent faults. We present a design to tolerate this impact at the microarchitectural level: a chip with n cores together with one or more shared instruction re-execution units (IRUs). Instructions using a faulty component are identified and re-executed on an IRU. This design incurs no slowdown in the absence of errors and allows continued operation of all n cores after multiple hard errors on one or all cores in the structures protected by our scheme. Experiments show that a single-core chip experiences only a 23% slowdown with 1 error, rising to 43% in the presence of 5 errors. In a 4-core scenario with 4 errors on every core and a shared IRU, REPAIR enables performance of 0.68× of a fully functioning system.
Published in: 2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS)
Date of Conference: 12-14 October 2015
Date Added to IEEE Xplore: 09 November 2015