Skip to Main Content
Recently, cloud computing has emerged as a promising computing infrastructure for performing scientific workflows by providing on-demand resources. Meanwhile, it is convenient for scientific collaboration since different cloud environments used by the researchers are connected through Internet. However, the significant latency arising from frequent access to large datasets and the corresponding data movements across geo-distributed data centers has been an obstacle to hinder the efficient execution of data-intensive scientific workflows. In this paper, we propose a novel graph-cut based data and task co scheduling strategy for minimizing the data transfer across geo-distributed data centers. Specifically, a dependency graph is firstly constructed from workflow provenance and cut into sub graphs according to the datasets which must appear in fixed data centers by a multiway cut algorithm. Then, the sub graphs might be recursively cut into smaller ones by a minimum cut algorithm referring to data correlation rules until all of them can well fit the capacity constraints of the data centers where the fixed location datasets reside. In this way, the datasets and tasks are distributed into target data centers while the total amount of data transfer between them is minimized. Additionally, a runtime scheduling algorithm is exploited to dynamically adjust the data placement during execution to prevent the data centers from overloading. Simulation results demonstrate that the total volume of data transfer across different data centers can be significantly reduced and the cost of performing scientific workflows on the clouds will be accordingly saved.