Abstract:
There exists a class of scientific applications for which utilizing distributed resources is critical for reducing the time-to-solution. In this paper, we discuss a specific class of applications, Replica-Exchange simulations, where the orchestration of many distributed jobs in a dynamic and inherently unreliable distributed environment is essential for successful completion. We describe the design, development and deployment of a unique framework for constructing fault-tolerant distributed simulations. The framework consists of two primary components: SAGA and Migol. SAGA is a high-level programmatic abstraction layer that provides a standardised interface for the primary distributed functionality required for application development. We present details of newly developed functionality in SAGA: the Checkpoint and Recovery (CPR) API. Migol is an adaptive middleware that supports the fault-tolerance of distributed applications by providing the capability to recover applications from checkpoint files transparently. In addition to describing the integration of SAGA-CPR with the Migol infrastructure, we outline our experiences with running a large-scale, general-purpose Replica-Exchange application in a production distributed environment.
Published in: 2008 IEEE Fourth International Conference on eScience
Date of Conference: 07-12 December 2008
Date Added to IEEE Xplore: 06 January 2009