Transparent system-level migration of PGAS applications using Xen on InfiniBand

7 Author(s)
Scarpazza, D.P. ; Pacific Northwest Nat. Lab., Richland, WA ; Mullaney, P. ; Villa, O. ; Petrini, F.

Checkpoint-restart is considered one of the most natural approaches to achieving fault tolerance in a high-performance cluster. While experience has focused attention on user-level solutions, the advent of efficient system-level virtualization software, such as Xen and VMware, has opened the door to efficient and scalable cluster-level virtualization. In this paper we present an innovative approach to cluster fault tolerance that integrates Xen virtualization with the latest generation of the InfiniBand network. A major contribution of this approach is the automatic identification of global recovery lines to freeze the status of the machine. Our focus is on partitioned global address space (PGAS) programming models, which have been receiving an increasing amount of attention in recent years. We have developed a global coordination mechanism and deployed it in the Aggregate Remote Memory Copy Interface (ARMCI), a one-sided communication library that has been used as a run-time system for several PGAS languages and libraries. The experimental results show that it is possible to virtualize communication and computation with minimal overhead and to provide seamless migration capabilities.

Published in:

Cluster Computing, 2007 IEEE International Conference on

Date of Conference:

17-20 Sept. 2007