Cloud computing has recently emerged as a promising platform for executing scientific workflow applications because it offers performance comparable to the grid, lower cost, and elasticity. Collaborative cloud environments, which share the resources of multiple geographically distributed data centers owned by different organizations, enable researchers from all over the world to conduct large-scale, data-intensive research together over the Internet. However, since scientific workflows consume and generate huge amounts of data, it is essential to manage the data effectively to achieve high performance and cost effectiveness. In this paper, we propose an intelligent data placement strategy that improves workflow performance while minimizing data transfer among data centers. Specifically, at the startup stage, the whole dataset is divided into small data items, which are then distributed among multiple data centers by considering factors such as the data centers' computation capability, their storage budget, and the correlation between data items. During the runtime stage, when intermediate data is generated, it is placed on suitable data centers using linear discriminant analysis, taking into account the same metrics as at the startup stage as well as the data centers' past behavior (i.e., trustworthiness in terms of task delay). Simulation results demonstrate the promise of our strategy: compared to existing data placement strategies, our proposal places data so as to improve overall computation progress while minimizing the communication overhead incurred by data movement.
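To make the startup stage more concrete, the following is a minimal sketch of one way such a placement could be computed. It is not the algorithm of this paper: the greedy heuristic, the scoring formula, and all names (`DataCenter`, `place_items`, `compute_weight`) are assumptions chosen for illustration. It uses only the three factors named above (computation capability, storage budget, and data item correlation) and omits the runtime stage based on linear discriminant analysis.

```python
from dataclasses import dataclass, field


@dataclass
class DataCenter:
    # Hypothetical model of a data center; units are illustrative.
    name: str
    compute: float          # relative computation capability
    storage_budget: float   # remaining storage budget in GB
    items: list = field(default_factory=list)


def place_items(sizes, correlation, centers, compute_weight=1.0):
    """Greedily assign each data item to the feasible center with the best score.

    sizes: dict mapping item -> size in GB
    correlation: dict mapping frozenset({a, b}) -> number of tasks using both a and b
    """
    total_compute = sum(c.compute for c in centers)
    placement = {}
    # Place large items first so they are not squeezed out by small ones.
    for item in sorted(sizes, key=lambda i: (-sizes[i], i)):
        best, best_score = None, float("-inf")
        for c in centers:
            if sizes[item] > c.storage_budget:
                continue  # placing the item here would exceed the storage budget
            # Correlation with items already stored at this center.
            affinity = sum(
                correlation.get(frozenset((item, other)), 0) for other in c.items
            )
            # Assumed score: co-location benefit plus a share of compute capability.
            score = affinity + compute_weight * c.compute / total_compute
            if score > best_score:
                best, best_score = c, score
        if best is None:
            raise ValueError(f"no data center can store {item!r}")
        best.items.append(item)
        best.storage_budget -= sizes[item]
        placement[item] = best.name
    return placement


# Illustrative input (made-up values, not data from the paper).
centers = [
    DataCenter("dc1", compute=2.0, storage_budget=10.0),
    DataCenter("dc2", compute=1.0, storage_budget=10.0),
]
sizes = {"a": 6.0, "b": 5.0, "c": 3.0}
correlation = {frozenset(("a", "c")): 4, frozenset(("b", "c")): 1}
placement = place_items(sizes, correlation, centers)
```

In this example, item `c` is placed alongside `a` because the two are used together by more tasks, while `b` goes to the second data center because the first no longer has enough storage budget for it. A greedy pass like this is only a simple baseline; it makes no claim about the optimality or the exact criteria of the proposed strategy.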