By Topic

Asymmetries in soft-error rates in a large cluster system

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

1 Author(s)
Harris, K.W. ; High Performance Comput. Div., Hewlett-Packard Co., Nashua, NH, USA

Early in the deployment of the ASC Q cluster supercomputer system, an unexpectedly high rate of soft errors were observed in the board-level cache subsystems of the constituent AlphaServer ES45 systems that make up the compute component of this large cluster. A series of tests and experiments was undertaken to validate the hypothesis that this frequency was consistent with the high level of terrestrial secondary cosmic-ray neutron flux resulting from the high elevation of its installation site. The overall success of this effort is reported elsewhere in this issue. This paper reports on three secondary phenomena that were observed during these tests and experiments: Error logs were collected from all servers during a representative period and examined for nonrandom event rates, which would indicate a systematic cause. The only significant result of this exploration was the discovery of a latent soft-error discovery effect, and a self-shielding effect, whereby the servers positioned physically higher in their racks suffered disproportionately higher soft-error rates. This excess was examined and found to be consistent with established shielding effect of the high-Z composition of the constituents of the overlying systems. Experiments with individual ES45 systems in an artificial neutron beam at the Los Alamos Neutron Science Center facility have established that the soft-error rates observed in the SRAM parts is significantly dependent on the incident direction of the neutrons in the beam. These asymmetries could be exploited as part of a strategy for mitigating the frequency of soft errors in future computer systems.

Published in:

Device and Materials Reliability, IEEE Transactions on  (Volume:5 ,  Issue: 3 )