Mass Storage on the Terascale Computing System

7 Author(s)
Stone, N.T.B. (Pittsburgh Supercomputing Center, Pittsburgh, PA 15213); Scott, J.R.; Kochmar, J.; Sommerfield, J.

On August 3, 2000, the National Science Foundation announced the award of $45 million to the Pittsburgh Supercomputing Center to provide "terascale" computing capability for U.S. researchers in all science and engineering disciplines. This Terascale Computing System (TCS) will be built using commodity components. The computational engines will be Compaq Alpha CPUs in a four-processor-per-node configuration. The final system will have over 682 quad-processor nodes, for a total of more than 2700 Alpha CPUs. Each node will have 4 gigabytes of memory, for a total system memory of over 2.5 terabytes. All the nodes will be interconnected by a high-speed, very-low-latency Quadrics switch fabric, constructed in a full "fat-tree" topology. Given the very high component count in this system, it is important to architect the solution to be tolerant of failures. Part of this architecture is the efficient saving of a program's memory state to disk. This checkpointing of the program memory should be easy for the programmer to invoke and fast enough to allow frequent checkpoints, yet it should not severely impact the performance of the compute nodes or file servers. Recovery will be flexible, so that spare nodes can be automatically swapped in for failed nodes and the job restarted from the most recent checkpoint without significant user or system management intervention. It is estimated that the file servers required to collect and store these checkpoints, along with other temporary storage for executing jobs, will collectively have 27 terabytes of disk storage. The file servers will maintain a disk cache that will be migrated to a Hierarchical Storage Manager serving over 300 terabytes. This paper will discuss the hardware and software architectural design of the TCS machine. The software architecture of this system rests upon Compaq's "AlphaServer SC" software, which includes administration, control, accounting, and scheduling software. We will describe our accounting, scheduling, and monitoring systems and their relation to the software included in AlphaServer SC.
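
The system totals quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: it takes the abstract's lower bound of 682 nodes at face value, assumes 1 terabyte = 1024 gigabytes, and uses names (`system_totals`, `NODES`, and so on) that are hypothetical rather than taken from the paper.

```python
# Figures taken from the abstract; the names are illustrative only.
NODES = 682            # lower bound on quad-processor nodes ("over 682")
CPUS_PER_NODE = 4      # four Alpha CPUs per node
MEM_PER_NODE_GB = 4    # 4 gigabytes of memory per node


def system_totals(nodes=NODES, cpus_per_node=CPUS_PER_NODE,
                  mem_per_node_gb=MEM_PER_NODE_GB):
    """Return (total CPUs, total memory in GB, total memory in TB)."""
    total_cpus = nodes * cpus_per_node
    total_mem_gb = nodes * mem_per_node_gb
    # Assumption: 1 TB = 1024 GB (binary convention).
    total_mem_tb = total_mem_gb / 1024
    return total_cpus, total_mem_gb, total_mem_tb


cpus, mem_gb, mem_tb = system_totals()
print(f"CPUs: {cpus}, memory: {mem_gb} GB ({mem_tb:.2f} TB)")
```

With these inputs the CPU count exceeds 2700 and the aggregate memory exceeds 2.5 terabytes, consistent with the figures stated in the abstract.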

Published in:

Mass Storage Systems and Technologies, 2001. MSS '01. Eighteenth IEEE Symposium on

Date of Conference:

17-20 April 2001