Skip to Main Content
Scientific workflows are a topic of great interest in the grid community that sees in the workflow model an attractive paradigm for programming distributed wide-area grid infrastructures. Traditionally, the grid workflow execution is approached as a pure best effort scheduling problem that maps the activities onto the grid processors based on appropriate optimization or local matchmaking heuristics such that the overall execution time is minimized. Even though such heuristics often deliver effective results, the execution in dynamic and unpredictable grid environments is prone to severe performance losses that must be understood for minimizing the completion time or for the efficient use of high-performance resources. In this paper, we propose a new systematic approach to help the scientists and middleware developers understand the most severe sources of performance losses that occur when executing scientific workflows in dynamic grid environments. We introduce an ideal model for the lowest execution time that can be achieved by a workflow and explain the difference to the real measured grid execution time based on a hierarchy of performance overheads for grid computing. We describe how to systematically measure and compute the overheads from individual activities to larger workflow regions and adjust well-known parallel processing metrics to the scope of grid computing, including speedup and efficiency. We present a distributed online tool for computing and analyzing the performance overheads in real time based on event correlation techniques and introduce several performance contracts as quality-of-service parameters to be enforced during the workflow execution beyond traditional best effort practices. We illustrate our method through postmortem and online performance analysis of two real-world workflow applications executed in the Austrian grid environment.