Abstract:
Many companies are piloting the use of Hadoop for advanced data analytics over large datasets. Typically, such MapReduce programs represent workflows of MapReduce jobs. Currently, a user must specify the number of reduce tasks for each MapReduce job. Choosing the right number of reduce tasks is non-trivial and depends on the cluster size, the input dataset of the job, and the amount of resources available for processing this job. In a workflow of MapReduce jobs, the output of one job becomes the input of the next job, and therefore the number of reduce tasks in the previous job may impact the performance and processing efficiency of the next job. In this work, we offer a novel performance evaluation framework that eases the user's effort in tuning the reduce task settings while achieving performance objectives. The proposed framework is based on two performance models: a platform performance model and a workflow performance model. The platform performance model characterizes the execution time of each generic phase in the MapReduce processing pipeline as a function of the processed data. The complementary workflow performance model evaluates the completion time of a given workflow as a function of i) the input dataset size(s) and ii) the reduce task settings in the jobs that comprise the workflow. We validate the accuracy, effectiveness, and performance benefits of the proposed framework using a set of realistic MapReduce applications and queries from the TPC-H benchmark.
Date of Conference: 27-31 May 2013
Date Added to IEEE Xplore: 01 August 2013
ISBN Information:
Print ISSN: 1573-0077
Conference Location: Ghent, Belgium