Minimizing Remote Accesses in MapReduce Clusters | IEEE Conference Publication | IEEE Xplore

Abstract:

MapReduce, in particular Hadoop, is a popular framework for the distributed processing of large datasets on clusters of relatively inexpensive servers. Although Hadoop clusters are highly scalable and ensure data availability in the face of server failures, their efficiency is poor. We study data placement as a potential source of inefficiency. Despite networking improvements that have narrowed the performance gap between map tasks that access local or remote data, we find that nodes servicing remote HDFS requests see significant slowdowns of collocated map tasks due to interference effects, whereas nodes making these requests do not experience proportionate slowdowns. To reduce remote accesses, and thus avoid their destructive performance interference, we investigate an intelligent data placement policy we call 'partitioned data placement'. We find that, in an unconstrained cluster where a job's map tasks may be scheduled dynamically on any node over time, Hadoop's default random data placement is effective in avoiding remote accesses. However, when task placement is restricted by long-running jobs or other reservations, partitioned data placement substantially reduces remote access rates (e.g., by as much as 86% over random placement for a job allocated only one-third of a cluster).
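The effect of restricting a job to part of the cluster can be illustrated with a toy model. The sketch below is not the paper's policy or experimental setup: it assumes a 30-node cluster, a replication factor of 3 (the HDFS default), and a job confined to one-third of the nodes. The function names and the "partitioned" rule used here (force one replica of each block into the job's partition) are hypothetical stand-ins chosen for illustration. A block counts as remote when none of its replicas lives on a node the job may use; slot contention and scheduling dynamics are ignored, so the numbers will not match the paper's measurements.

```python
import random
from math import comb


def random_placement(num_nodes, replicas, rng):
    """Place each replica on a distinct node chosen uniformly at random."""
    return rng.sample(range(num_nodes), replicas)


def partitioned_placement(num_nodes, replicas, partition, rng):
    """Hypothetical rule: one replica inside the job's partition,
    the remaining replicas on any other distinct nodes."""
    first = rng.choice(partition)
    others = [n for n in range(num_nodes) if n != first]
    return [first] + rng.sample(others, replicas - 1)


def remote_fraction(placements, allowed_nodes):
    """Fraction of blocks with no replica on any allowed node."""
    allowed = set(allowed_nodes)
    remote = sum(1 for nodes in placements if allowed.isdisjoint(nodes))
    return remote / len(placements)


def expected_random_remote_fraction(num_nodes, replicas, allowed_count):
    """Closed form for random placement: all replicas miss the partition."""
    return comb(num_nodes - allowed_count, replicas) / comb(num_nodes, replicas)


def simulate(num_nodes=30, replicas=3, num_blocks=20000, seed=0):
    """Return (random, partitioned) remote fractions for a job that is
    restricted to the first third of the cluster (an assumed setup)."""
    rng = random.Random(seed)
    partition = list(range(num_nodes // 3))
    rand = [random_placement(num_nodes, replicas, rng)
            for _ in range(num_blocks)]
    part = [partitioned_placement(num_nodes, replicas, partition, rng)
            for _ in range(num_blocks)]
    return remote_fraction(rand, partition), remote_fraction(part, partition)


if __name__ == "__main__":
    rand_frac, part_frac = simulate()
    print(f"random placement:      {rand_frac:.3f} of blocks remote")
    print(f"partitioned placement: {part_frac:.3f} of blocks remote")
```

In this simplified model, random placement leaves a block without a local replica whenever all three replicas fall outside the job's nodes, which happens with probability C(20,3)/C(30,3), or about 28%. The partitioned rule drives that to zero by construction, which is why a real evaluation has to account for task slots and scheduling as well.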
Date of Conference: 20-24 May 2013
Date Added to IEEE Xplore: 31 October 2013
Electronic ISBN:978-0-7695-4979-8
Conference Location: Cambridge, MA, USA
