Skip to Main Content
MapReduce is a distributed computing programming framework which provides an effective solution to the data processing challenge. As an open-source implementation of MapReduce, Hadoop has been widely used in practice. The performance of Hadoop MapReduce heavily depends on its configuration settings, so tuning these configuration parameters could be an effective way to improve its performance. However, picking out the optimal configuration settings is not easy for the time consuming nature of MapReduce together with the high dimensional and nonlinear features of its configuration optimization. In this paper, we introduce Predator, an experience guided configuration optimizer, which does not treat the optimization problem as a pure black-box problem but utilizes useful experience learnt from Hadoop MapReduce configuration practice to assist the optimizing process. The optimizer uses job execution time estimated by a practical MapReduce cost model as the objective function, and classifies Hadoop MapReduce parameters into different groups by their different tunable levels to shrink search space. Furthermore, the optimization algorithm of the optimizer uses the idea of subspace division to prevent local optimum problem, and it could also reduce the searching time by cutting down the cost in visiting unpromising points in search space. Experiments on Hadoop clusters demonstrate the effectiveness and efficiency of the optimizer.