Skip to Main Content
Even though highly distributed environments such as Clouds and Grids are increasingly used for e-science high performance applications, they still cannot deliver the robustness and reliability needed for widespread acceptance as ubiquitous scientific tools. To overcome this problem, existing systems resort to fault tolerance mechanisms such as task replication and task resubmission. In this paper we propose a new heuristic called resubmission impact to enhance the fault tolerance support for scientific workflows in highly distributed systems. In contrast to related approaches, our method can be used effectively on systems even in the absence of historic failure trace data. Simulated experiments of three real scientific workflows in the Austrian Grid environment show that our algorithm drastically reduces the resource waste compared to conservative task replication and resubmission techniques, while having a comparable execution performance and only a slight decrease in the success probability.