This paper addresses the challenges of providing fault tolerance in scientific workflow management. The specification and handling of faults in scientific workflows should be defined precisely to ensure that execution remains consistent with process-specific requirements. We identified a number of typical failure patterns that occur in real-life scientific workflow executions. Following the intuitive recovery strategies that correspond to these patterns, we developed methodologies that integrate recovery fragments into fault-prone scientific workflow models. Compared to existing fault-tolerance mechanisms, the proposed methodologies reduce the effort required of workflow designers by defining recovery fragments automatically. Furthermore, the developed framework implements the mechanisms necessary to capture faults from the different layers of a scientific workflow management architecture. Experience indicates that the framework can be employed effectively to model, capture, and tolerate the typical failure patterns that we identified.