Abstract:
High-level parallel dataflow systems, such as Pig and Hive, have lately gained great popularity in the area of big data processing. These systems typically consist of a declarative query language and a set of compilers that transform queries into execution plans and submit them to a distributed engine for execution. Apart from useful abstractions and support for common analysis operations, high-level processing systems also offer great opportunities for automatic optimization. Existing studies of execution traces from big data centers and industrial clusters show that there is significant computational redundancy in analysis programs, i.e., similar or even identical queries over the same datasets appear in different jobs. Furthermore, workload characterization of MapReduce traces from large organizations suggests a strong need for caching job results, which would enable their reuse and improve execution times. In this paper, we propose m2r2, an extensible and language-independent framework for results materialization and reuse in high-level dataflow systems for big data analytics. Our prototype implementation is built on top of the Pig dataflow system and handles automatic results caching, common sub-query matching and rewriting, as well as garbage collection. We have evaluated m2r2 using the TPC-H benchmark for Pig and report a 65% average reduction in query execution time.
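The reuse mechanism summarized above hinges on recognizing when a sub-query has been computed before and substituting its materialized output. The sketch below (in Java, since Pig itself is Java-based) illustrates one plausible way to do this: fingerprint a sub-plan by hashing its input dataset and operator sequence, then look the fingerprint up in a repository of cached results; on a hit, the plan can be rewritten to load the stored output instead of recomputing it. The class and method names, HDFS paths, and hashing scheme are illustrative assumptions, not the actual m2r2 or Pig APIs.

    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Optional;

    // Hypothetical sketch of m2r2-style result reuse: a sub-plan is identified
    // by a fingerprint over its input dataset and operators, and the fingerprint
    // is looked up in a repository of previously materialized results.
    public class ResultCacheSketch {

        // Fingerprint a canonical description of a sub-query (input + operators).
        static String fingerprint(String inputDataset, List<String> operators) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(inputDataset.getBytes("UTF-8"));
            for (String op : operators) {
                md.update(op.getBytes("UTF-8"));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }

        // Maps a sub-plan fingerprint to the path of its materialized result.
        private final Map<String, String> materialized = new HashMap<>();

        // On a hit, the caller can rewrite the plan to load the cached result
        // instead of re-executing the sub-query.
        Optional<String> lookup(String fp) {
            return Optional.ofNullable(materialized.get(fp));
        }

        void register(String fp, String resultPath) {
            materialized.put(fp, resultPath);
        }

        public static void main(String[] args) throws Exception {
            ResultCacheSketch cache = new ResultCacheSketch();
            String fp = fingerprint("hdfs:///tpch/lineitem",
                    List.of("FILTER l_shipdate <= '1998-09-02'",
                            "GROUP BY l_returnflag, l_linestatus"));
            // First run: miss, so the sub-query is executed and its output registered.
            if (cache.lookup(fp).isEmpty()) {
                cache.register(fp, "hdfs:///m2r2/cache/" + fp);
            }
            // Later job containing the same sub-query: hit, so the plan loads the cached output.
            System.out.println("reuse from: " + cache.lookup(fp).orElse("<miss>"));
        }
    }

Garbage collection, as mentioned in the abstract, would then amount to evicting entries from the repository (and deleting their stored outputs) according to some retention policy; the details of that policy are part of the paper, not shown here.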
Date of Conference: 03-05 December 2013
Date Added to IEEE Xplore: 06 March 2014
Electronic ISBN: 978-0-7695-5096-1