Cardio: Adaptive CMPs for reliability through dynamic introspective operation

2 Author(s)
Andrea Pellegrini (University of Michigan); Valeria Bertacco

Current technology scaling enables the integration of tens of processing elements into a single chip, and future technology nodes will soon allow the integration of hundreds of cores per device. Although such systems are very powerful, many experts agree that they will be prone to a significant number of permanent and transient faults during their lifetime. If not properly handled, the effects of runtime failures can be dramatic. In this work, we propose Cardio, a distributed architecture for reliable chip multiprocessors. Cardio, a novel approach to on-chip reliability, is based on hardware detectors that spot failures and on software routines that reorganize the system to work around faulty components. Compared to previous online reliability solutions, Cardio provides failure reactivity comparable to that of hardware-only solutions while requiring a much lower area overhead. Cardio operates a distributed resource manager to collect health information about components and leverages a robust distributed control mechanism to manage system-level recovery. Our architecture remains operational as long as at least one general-purpose processor in the chip is still functional. We evaluated our design using a custom simulator and estimate its runtime impact on the SPECMPI benchmarks to be lower than 3%. We estimate its dynamic reconfiguration time to be between 20,000 and 50,000 cycles per failure.
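
To make the operating condition in the abstract concrete, the following is a minimal sketch of a health table that records detector reports and checks whether at least one general-purpose processor is still functional. It is not the Cardio implementation: the class `HealthRegistry`, its methods, and the component names are hypothetical, and the sketch collapses the distributed resource manager described above into a single in-memory table for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class HealthRegistry:
    """Hypothetical health table; names and structure are illustrative only."""

    # Maps a component id to True if it is a general-purpose processor.
    components: Dict[str, bool]
    # Ids of components that a detector has reported as failed.
    faulty: Set[str] = field(default_factory=set)

    def report_failure(self, component_id: str) -> None:
        """Record a failure flagged by an (assumed) hardware detector."""
        if component_id not in self.components:
            raise KeyError(f"unknown component: {component_id}")
        self.faulty.add(component_id)

    def healthy_processors(self) -> List[str]:
        """Return general-purpose processors not yet marked faulty."""
        return sorted(
            cid
            for cid, is_processor in self.components.items()
            if is_processor and cid not in self.faulty
        )

    def is_operational(self) -> bool:
        """Operational while at least one general-purpose processor works."""
        return len(self.healthy_processors()) > 0


# Example chip with two processors and one non-processor component.
registry = HealthRegistry({"cpu0": True, "cpu1": True, "router0": False})
registry.report_failure("cpu0")
registry.report_failure("router0")
```

In this example the registry still reports the system as operational after two failures, because `cpu1` remains healthy; a real design would also have to migrate work and reroute around the failed components, which the sketch does not model.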

Published in:

2011 IEEE International High Level Design Validation and Test Workshop (HLDVT)

Date of Conference:

9-11 Nov. 2011