Skip to Main Content
Complex scientific workflows are now commonly executed on global grids. With the increasing scale complexity, heterogeneity and dynamism of grid environments the challenges of managing and scheduling these workflows are augmented by dependability issues due to the inherent unreliable nature of large-scale grid infrastructure. In addition to the traditional fault tolerance techniques, specific checkpoint-recovery schemes are needed in current grid workflow management systems to address these reliability challenges. Our research aims to design and develop mechanisms for building an autonomic workflow management system that will exhibit the ability to detect, diagnose, notify, react and recover automatically from failures of workflow execution. In this paper we present the development of a Fault Tolerance and Recovery component that extends the ActiveBPEL workflow engine. The detection mechanism relies on inspecting the messages exchanged between the workflow and the orchestrated Web Services in search of faults. The recovery of a process from a faulted state has been achieved by modifying the default behavior of ActiveBPEL and it basically represents a non-intrusive checkpointing mechanism. We present the results of several scenarios that demonstrate the functionality of the Fault Tolerance and Recovery component, outlining an increase in performance of about 50% in comparison to the traditional method of resubmitting the workflow.