MapReduce is an attractive model for data-intensive applications because of its simple programming interface, high scalability, and fault tolerance. It is well suited to applications that process large data sets on distributed resources, in areas such as web data analysis, bioinformatics, and high-performance computing. Many studies have addressed job scheduling for MapReduce in shared clusters. However, there remains a need to schedule workflow services composed of multiple MapReduce tasks with precedence dependencies across multiple processing nodes. This paper proposes a scheduling mechanism for a workflow service containing multiple MapReduce jobs. A workflow application has precedence dependency constraints among its tasks, represented as a directed acyclic graph (DAG). In addition, to reduce data transfer cost under limited bisection bandwidth, data dependencies should be considered when scheduling the multiple MapReduce jobs in a workflow. The proposed mechanism provides 1) scheduling of MapReduce tasks with respect to precedence constraints and 2) a data pre-placement method that takes data dependency constraints into account to reduce data transfer cost over the network.
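To make the two constraints concrete, the following minimal sketch shows one generic way to schedule a DAG of jobs so that precedence is respected and nodes already holding a job's input are favored. It is not the mechanism proposed in this paper, whose details are not given in the abstract. It is a plain greedy list scheduler over a topological order, and every name in it (`topological_order`, `schedule`, `transfer_cost`, the job labels, and the durations) is a hypothetical choice made for illustration. It also assumes a single flat transfer penalty whenever a predecessor's output sits on a different node, which is a strong simplification of real network cost.

```python
from collections import defaultdict, deque


def topological_order(jobs, deps):
    """Kahn's algorithm. deps is a list of (before, after) pairs."""
    successors = defaultdict(list)
    indegree = {job: 0 for job in jobs}
    for before, after in deps:
        successors[before].append(after)
        indegree[after] += 1
    ready = deque(job for job in jobs if indegree[job] == 0)
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for nxt in successors[job]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(jobs):
        raise ValueError("dependency graph contains a cycle, so it is not a DAG")
    return order


def schedule(jobs, deps, duration, num_nodes, transfer_cost=1.0):
    """Greedy list scheduling. Returns {job: (node, start, finish)}.

    Assumption (illustrative only): moving a predecessor's output to a
    different node delays the successor by a flat transfer_cost.
    """
    predecessors = defaultdict(list)
    for before, after in deps:
        predecessors[after].append(before)

    node_free = [0.0] * num_nodes  # time at which each node becomes idle
    plan = {}
    for job in topological_order(jobs, deps):
        best = None
        for node in range(num_nodes):
            # Input is ready once every predecessor has finished and, if it
            # ran elsewhere, its output has been transferred to this node.
            data_ready = max(
                (plan[p][2] + (0.0 if plan[p][0] == node else transfer_cost)
                 for p in predecessors[job]),
                default=0.0,
            )
            start = max(node_free[node], data_ready)
            finish = start + duration[job]
            if best is None or finish < best[2]:
                best = (node, start, finish)
        plan[job] = best
        node_free[best[0]] = best[2]
    return plan


if __name__ == "__main__":
    # Hypothetical diamond-shaped workflow: A feeds B and C, which feed D.
    jobs = ["A", "B", "C", "D"]
    deps = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
    duration = {"A": 2, "B": 3, "C": 1, "D": 2}
    for job, (node, start, finish) in schedule(jobs, deps, duration, 2).items():
        print(job, "node", node, "start", start, "finish", finish)
```

In this toy workflow the scheduler keeps B on the same node as A to avoid the transfer penalty, while C moves to the idle second node because waiting for the first node would cost more than the transfer. A pre-placement scheme of the kind the abstract describes would presumably go further and move data before the consuming job is ready, which this sketch does not attempt.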