Utilizing the Multi-threading Techniques to Improve the Two-Level Checkpoint/Rollback System for MPI Applications

2 Author(s)
Yuan Tang ; Software Sch., Fudan Univ., Shanghai ; Yunquan Zhang

With the increasing number of processors in modern HPC (high performance computing) systems, there are two emerging problems to solve: scalability and fault tolerance. In our previous work, we proposed an MPI operation level checkpoint/rollback system. The main benefits of the system are that it offers the opportunity to employ in-memory (disk-less) checkpoint/rollback techniques, which have demonstrated much better performance than their on-disk counterparts, and the opportunity to have a concurrent two-level recover-and-continue MPI system, which has been proven to be highly efficient. To the best of our knowledge, this is the first concurrent two-level checkpoint/recovery system in use. With the coming of the multi-core era, it is time to utilize multi-threading techniques to improve the performance of the in-memory checkpointing algorithm. In this paper, we present two versions of the MPI operation level checkpoint/rollback system, one single-threaded and the other multi-threaded. We also provide an in-depth performance comparison of these two approaches to illustrate the benefits of multi-threading techniques on multi-core platforms. As our work progresses, a picture of the hierarchy of future-generation fault-tolerant HPC systems is gradually unfolding.

Published in:
High Performance Computing and Communications, 2008. HPCC '08. 10th IEEE International Conference on

Date of Conference: 25-27 Sept. 2008
