Notification:
We are currently experiencing intermittent issues impacting performance. We apologize for the inconvenience.
By Topic

A Holistic Approach to System Reliability in Blue Gene

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

20 Author(s)
Blumrich, M. ; IBM T.J. Watson Res. Center, Yorktown Heights, NY ; Chen, D. ; Chiu, G.L.-T. ; Cipolla, T.
more authors

Optimizing supercomputer performance requires a balance between objectives for processor performance, network performance, power delivery and cooling, cost and reliability. In particular, scaling a system to a large number of processors poses challenges for reliability, availability and serviceability. Given the power and thermal constraints of data centers, the BlueGene/L supercomputer has been designed with a focus on maximizing floating point operations per second per Watt (FLOPS/Watt). This results in a drastic reduction in FLOPS/m2 floor space and FLOPS/dollar, allowing for affordable scale-up. The BlueGene/L system has been scaled to a total of 65,536 compute nodes in 64 racks. A system approach was used to minimize power at all levels, from the processor to the cooling plant. A BlueGene/L compute node consists of a single ASIC and associated memory. The ASIC integrates all system functions including processors, the memory subsystem and communication, thereby minimizing chip count, interfaces, and power dissipation. As the number of components increases, even a low failure rate per-component leads to an unacceptable system failure rate. Additional mechanisms have to be deployed to achieve sufficient reliability at the system level. In particular, the data transfer volume in the communication networks of a massively parallel system poses significant challenges on bit error rates and recovery mechanisms in the communication links. Low power dissipation and high performance, along with reliability, availability and serviceability were prime considerations in BlueGene/L hardware architecture, system design, and packaging. A high-performance software stack, consisting of operating system services, compilers, libraries and middleware, completes the system, while enhancing reliability and data integrity

Published in:

Innovative Architecture for Future Generation High Performance Processors and Systems, 2006. IWIA '06. International Workshop on

Date of Conference:

Jan. 2006