Since its introduction, MapReduce implementations have primarily targeted compute clusters of static size. In this paper, we introduce the concept of dynamic elasticity to MapReduce. We present the design decisions and implementation tradeoffs for DELMA (Dynamically Elastic MapReduce), a framework that follows the MapReduce paradigm, like Hadoop MapReduce, but that can grow and shrink its cluster size while jobs are underway. In our study, we test DELMA in a range of performance scenarios, from varied node additions to node additions at different points in the application run time, across various dataset sizes. The applicability of the MapReduce paradigm extends far beyond large-scale data-intensive applications: it can also be brought to bear on long-running distributed applications executing on small clusters. In this work, we focus both on the performance of processing hierarchical data in distributed scientific applications and on the processing of smaller but demanding inputs primarily used on small clusters. We run experiments on datasets that require CPU-intensive processing, ranging in size from millions of input data elements to over half a billion, and observe the positive scalability patterns exhibited by the system. We show that, for such sizes, performance scales with increases in data and cluster size. We conclude by discussing the benefits of giving MapReduce the capability to dynamically grow and shrink its cluster configuration by adding and removing nodes during jobs, and we explain the possibilities this model presents.