Abstract:
Recent years have witnessed a surge of new-generation applications involving big data. MapReduce, the de facto framework for big data processing, has been increasingly embraced by both academic and industrial users. Data locality seeks to co-locate computation with data, which effectively reduces remote data access and improves MapReduce's performance in physical machine clusters. State-of-the-art public clouds, however, rely heavily on virtualization to enable resource sharing and scaling for large numbers of users. In this article, through real-world experiments, we show strong evidence that the conventional notion of data locality is not always beneficial for MapReduce in a virtualized environment. The observations suggest that the measure of node-local must be extended to distinguish physical and virtual entities. We develop vLocality, a comprehensive and practical solution for data locality in virtualized environments. It incorporates a novel storage architecture that efficiently mitigates shared disk contention, and an enhanced task scheduling algorithm that prioritizes co-located VMs. We have implemented a prototype of vLocality based on Hadoop 1.2.1 and validated its effectiveness on a typical virtualized cloud platform consisting of 22 nodes. Our experimental results demonstrate that vLocality can improve the job finish time to around a quarter of that for typical Hadoop benchmark applications.
Published in: IEEE Network ( Volume: 31, Issue: 1, January/February 2017)
IEEE Keywords (Index Terms):
- Big Data
- Final Time
- Real-world Experiments
- Virtual Machines
- Task Scheduling
- Remote Access
- Big Data Processing
- Public Cloud
- Local Notions
- Running
- Data Center
- Shortest Path
- Multi-core
- Data Exchange
- PageRank
- Priority Level
- Amazon Web Services
- Remarkable Impact
- Degree Of Localization
- Map Tasks
- Hadoop Distributed File System
- Master Node
- Cloud Providers
- Configuration Time
- Default System
- Resource Contention