The execution of data-intensive workflow applications in scientific and enterprise grids has gained popularity in recent times. Such applications process large and dynamic data sets, and often present scope for optimized data handling that can be exploited for performance. Traditionally, the core grid middleware technologies of scheduling and orchestration have treated data management as a background activity, decoupled from job management and handled at the storage and/or network protocol level. We believe that an important requirement for building data-aware grid technologies lies in managing data flows at the application level, in conjunction with their computation counterparts. To this end, we present Data-WISE, an end-to-end framework for managing data-intensive workflows as first-class citizens, which addresses data flow orchestration, co-scheduling, and runtime management. Its optimizations exploit application structure to enable data parallelism, replication, and runtime adaptation. We implement Data-WISE on a real testbed and demonstrate significant improvements in application response time, resource utilization, and adaptability to varying resource conditions. The proposed framework is an important step towards making distributed execution of data-intensive workflows a reality.