Data locality has recently been exploited extensively in cloud computing to improve system performance. However, when scheduling Map tasks in a Hadoop MapReduce framework running in a heterogeneous environment, existing methods either fail to reduce the occurrence of Map tasks whose input data is not local or harm fairness, thus degrading system performance. To address this problem, this paper proposes a data-locality-aware scheduling method that improves Hadoop MapReduce performance in heterogeneous computing environments. After receiving a task request from a node, our method preferentially schedules a task whose input data is stored on the requesting node. If no such task exists, the method selects the task whose input data is nearest to the requesting node and then decides whether to reserve the task for the node storing the input data or to schedule it on the requesting node, transferring the input data to that node on the fly. As a proof of concept, we implement the method in Hadoop-0.20.2. To evaluate it, we experimentally compare the proposed method with the default scheduling method used in Hadoop-0.20.2. The experimental results show that the proposed method improves data locality and reduces both the normalized execution time and the response time of jobs.
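The scheduling flow described above can be illustrated with a minimal sketch. This is not the Hadoop-0.20.2 implementation: the names `Task`, `Decision`, `schedule`, `rack_distance`, and the `RACK` topology are hypothetical, and the abstract does not state how the reserve-or-transfer decision is made, so that rule is passed in as a caller-supplied predicate `should_transfer`.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Task:
    task_id: str
    data_node: str  # node that stores this task's input data


@dataclass(frozen=True)
class Decision:
    task: Optional[Task]
    action: str  # "run_local", "run_with_transfer", "reserve", or "idle"


def schedule(
    requesting_node: str,
    pending: Sequence[Task],
    distance: Callable[[str, str], int],
    should_transfer: Callable[[Task, str], bool],
) -> Decision:
    """Pick a task for requesting_node, preferring local input data."""
    if not pending:
        return Decision(None, "idle")

    # Step 1: prefer a task whose input data is on the requesting node.
    for task in pending:
        if task.data_node == requesting_node:
            return Decision(task, "run_local")

    # Step 2: otherwise take the task whose input data is nearest.
    nearest = min(pending, key=lambda t: distance(requesting_node, t.data_node))

    # Step 3: reserve it for the data-holding node, or run it here and
    # transfer the input. The decision rule is an assumption supplied
    # by the caller; the abstract does not specify it.
    if should_transfer(nearest, requesting_node):
        return Decision(nearest, "run_with_transfer")
    return Decision(nearest, "reserve")


# Hypothetical two-rack topology used only for illustration.
RACK = {"n1": "r1", "n2": "r1", "n3": "r2"}


def rack_distance(a: str, b: str) -> int:
    """0 for the same node, 1 for the same rack, 2 otherwise."""
    if a == b:
        return 0
    return 1 if RACK[a] == RACK[b] else 2
```

In this sketch a caller could plug in any decision rule for `should_transfer`, for example one that weighs an estimated transfer cost against an estimated wait for the data-holding node; such estimates are assumptions here and are not described in the abstract.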