MapReduce has emerged as an important and widely used programming model for distributed and parallel computing, owing to its ease of use, generality, and scalability. The model was proposed mainly for large-scale data processing, i.e., data-intensive jobs, and it is optimized for homogeneous environments, in which computing nodes are identical and dedicated. Today, enterprise IT systems retain massive volumes of historical management and operational data, which require both data-intensive and computation-intensive analysis on heterogeneous computing resources. To support enterprise data analysis applications with the MapReduce model, it is important to improve MapReduce's task scheduling algorithm so that it reduces the overall completion time for multi-type jobs in heterogeneous environments. This paper formulates the scheduling problem as an optimization problem. Based on job shop scheduling theory and existing approximation algorithms, we propose a new online, dispatching-rule-based scheduling policy, LPT-θ. Under LPT-θ, tasks that have larger processing times and fall within a θ-space are assigned higher priorities. Numerical results show that LPT-θ achieves a performance gain of 12% to 45% over the original scheduling algorithm in MapReduce.
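To illustrate the dispatching-rule idea behind the proposed policy, the following minimal sketch implements the classical longest-processing-time-first (LPT) rule on machines of different speeds and compares it with dispatching in arrival order. It is not LPT-θ itself: the θ-space is not defined in this abstract and is therefore omitted. The function name `list_schedule`, the task times, and the machine speeds are hypothetical values chosen only for illustration.

```python
def list_schedule(task_times, speeds, longest_first):
    """Greedy list scheduling on machines with different speeds.

    task_times: nominal processing time of each task on a speed-1.0 machine
    speeds: relative speed of each machine (heterogeneous nodes)
    longest_first: if True, apply the classical LPT rule; otherwise
                   dispatch tasks in arrival (list) order
    Returns (makespan, assignment), where assignment[m] lists the task
    indices placed on machine m.
    """
    order = list(range(len(task_times)))
    if longest_first:
        # LPT rule: tasks with larger processing time get higher priority.
        order.sort(key=lambda i: task_times[i], reverse=True)

    finish = [0.0] * len(speeds)          # current finish time per machine
    assignment = [[] for _ in speeds]

    for t in order:
        # Place the task on the machine that would complete it earliest.
        best = min(
            range(len(speeds)),
            key=lambda m: finish[m] + task_times[t] / speeds[m],
        )
        finish[best] += task_times[t] / speeds[best]
        assignment[best].append(t)

    return max(finish), assignment


# Hypothetical workload: five tasks, one slow node and one node twice as fast.
tasks = [2, 3, 7, 4, 6]
nodes = [1.0, 2.0]

lpt_makespan, _ = list_schedule(tasks, nodes, longest_first=True)
fifo_makespan, _ = list_schedule(tasks, nodes, longest_first=False)
print(lpt_makespan, fifo_makespan)  # 8.0 9.0
```

On this toy instance, dispatching the longest tasks first shortens the overall completion time from 9.0 to 8.0 time units. The figures come from the made-up workload above and are unrelated to the numerical results reported in the paper.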