Characterization of Hadoop Jobs Using Unsupervised Learning

3 Author(s)

The MapReduce programming paradigm and its open source implementation, Apache Hadoop, are increasingly being used for data-intensive applications in cloud computing environments. An understanding of the characteristics of workloads running in MapReduce environments benefits both cloud service providers and their users. This work characterizes Hadoop jobs running on production clusters at Yahoo! using unsupervised learning. Unsupervised clustering techniques have been applied to many important problems, ranging from social network analysis to biomedical research. We use these techniques to cluster Hadoop MapReduce jobs that have similar characteristics. The Hadoop framework generates metrics for every MapReduce job, such as the number of map and reduce tasks and the number of bytes read from and written to the local file system and HDFS. We use these metrics, together with job configuration features such as the format of the input and output files and the type of compression used, to find similarity among Hadoop jobs. We study the centroids and densities of the resulting job clusters. We also compare a real production workload with the workload emulated by our benchmark tool, GridMix, by comparing the job clusters of both workloads.
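The clustering idea described in the abstract can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm, so plain k-means is assumed here purely for illustration. The job records, the choice of four features (map tasks, reduce tasks, HDFS bytes read, HDFS bytes written), the log scaling, and the reading of "density" as the fraction of jobs in each cluster are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical per-job metrics (NOT data from the paper):
# (map tasks, reduce tasks, HDFS bytes read, HDFS bytes written)
JOBS = [
    (4, 1, 2.0e6, 1.0e5),
    (6, 1, 3.0e6, 2.0e5),
    (5, 1, 2.5e6, 1.5e5),
    (2000, 100, 5.0e11, 1.0e11),
    (2500, 120, 6.0e11, 2.0e11),
    (1800, 90, 4.0e11, 9.0e10),
]


def log_scale(rows):
    """Apply log10(1 + x) so byte counts do not swamp task counts."""
    return [[math.log10(1.0 + x) for x in row] for row in rows]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, max_iter=100):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next seed: the point farthest from all seeds chosen so far.
        far = max(points,
                  key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(far))

    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are final
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, labels


features = log_scale(JOBS)
centroids, labels = kmeans(features, k=2)

# "Density" is assumed here to mean the share of jobs in each cluster.
densities = [labels.count(i) / len(labels) for i in range(len(centroids))]

for i, (centroid, density) in enumerate(zip(centroids, densities)):
    rounded = [round(v, 2) for v in centroid]
    print(f"cluster {i}: centroid(log10)={rounded} density={density:.2f}")
```

Running the same procedure on two workloads (for example, a production trace and an emulated one) and comparing the resulting centroids and densities is one way to picture the comparison the abstract describes; the paper's actual feature set and procedure may differ.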

Published in:

Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on

Date of Conference:

Nov. 30 2010-Dec. 3 2010