Scientific computations are large-scale and long-running, and they execute on dynamic, unstable grid infrastructure, so fault tolerance in scientific workflow management systems is becoming increasingly important. To handle the inevitable failures of workflow activities, we present a three-level recovery strategy in this paper. At the service level, we provide a distributed Service Agent (SA) for each activity; the SA monitors the execution status of the activity and implements a retry-based recovery strategy by resubmitting the failed activity multiple times. At the workflow level, the workflow engine implements a replication-based strategy by requesting the Service Factory (SF) to create another service instance on a different node and invoking the new instance as a replacement. At the user level, we provide a user interface through which users can handle a failure on demand. Finally, we present a reliable Problem Solving Environment (PSE) for the climate domain called Ensemble Prediction Scientific Workflow (EPSWFlow). This approach can seamlessly embed complex, control-flow-intensive recovery strategies within the dataflow process network. Moreover, it makes the prediction process more robust and more reusable.
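The escalation order of the three recovery levels can be illustrated with a minimal Python sketch. It is not the EPSWFlow implementation: the names `ActivityFailed`, `run_with_retry`, `run_activity`, `service_factory`, and `ask_user` are hypothetical, a service instance is modeled as a plain callable, and the sketch assumes that each replacement instance is invoked once before moving on.

```python
from typing import Callable, Sequence

# Hypothetical model: a service instance is a callable that returns a result
# or raises ActivityFailed. None of these names come from EPSWFlow itself.
Service = Callable[[], str]


class ActivityFailed(Exception):
    """Raised when a service instance fails to execute an activity."""


def run_with_retry(service: Service, max_attempts: int) -> str:
    """Service level: resubmit the failed activity up to max_attempts times."""
    last_error = ActivityFailed("no attempts made")
    for _ in range(max_attempts):
        try:
            return service()
        except ActivityFailed as err:
            last_error = err  # remember the failure and resubmit
    raise last_error


def run_activity(
    nodes: Sequence[str],
    service_factory: Callable[[str], Service],
    ask_user: Callable[[], str],
    max_attempts: int = 3,
) -> str:
    """Escalate through the service, workflow, and user levels in order."""
    primary, *replicas = nodes

    # Level 1 (service): retry on the original service instance.
    try:
        return run_with_retry(service_factory(primary), max_attempts)
    except ActivityFailed:
        pass

    # Level 2 (workflow): create a replacement instance on a different node.
    # Assumption: each replacement is invoked once.
    for node in replicas:
        try:
            return service_factory(node)()
        except ActivityFailed:
            continue

    # Level 3 (user): hand the failure to the user to resolve on demand.
    return ask_user()
```

In this sketch a failure reaches the user only after both automatic levels are exhausted, which keeps the common, transient failures invisible to the person running the workflow.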
Published in:
Asia-Pacific Services Computing Conference, 2008. APSCC '08. IEEE
Date of Conference: 9-12 Dec. 2008