MapReduce-Hadoop has emerged as an effective framework for large-scale data analytics, providing support for executing jobs and storing data in a parallel and distributed manner. MapReduce has been shown to perform very well in large data centers running applications whose data can be divided into homogeneous chunks and processed on homogeneous hardware. However, the performance of MapReduce-Hadoop is far from ideal when the hardware, the datasets, or both are heterogeneous. Such heterogeneity is unavoidable in many academic computing environments that use multiple generations of hardware and share resources among users. Heterogeneity is also unavoidable in scientific applications that process a varying number of datasets of different sizes. In these cases, the performance of MapReduce-Hadoop can be a concern. In this paper, we implement MapReduce on top of CometCloud to address the issue of heterogeneity and support application classes that involve irregular datasets (e.g., a large number of small data files or datasets of varying sizes). Furthermore, we develop an autonomic manager that can schedule MapReduce tasks based on user objectives, provision resources accordingly, and support on-demand scale-up and cloudbursts. These resources can be selected from a hybrid infrastructure such as local clusters, data centers, and public clouds. The performance of the developed solution is evaluated using a protein data mining application operating on data from the Protein Data Bank. The application is deployed, based on deadline and budget constraints, on a cluster at Rutgers and/or Amazon EC2 resources. The experimental results show that the MapReduce-CometCloud framework can effectively support applications operating on large numbers of small data files in a heterogeneous and distributed environment, and can satisfy user objectives autonomically using cloudbursts.