Skip to Main Content
Existing MapReduce-based warehousing systems are not specially optimized for time-based big data analysis applications. Such applications have two characteristics: 1) data are continuously generated and are required to be stored persistently for a long period of time, 2) applications usually process data in some time period so that typical queries use time-related predicates. Time-based big data analytics requires both high data loading speed and high query execution performance. However, existing systems including current MapReduce-based solutions do not solve this problem well because the two requirements are contradictory. We have implemented a MapReduce-based system, called Mastiff, which provides a solution to achieve both high data loading speed and high query performance. Mastiff exploits a systematic combination of a column group store structure and a lightweight helper structure. Furthermore, Mastiff uses an optimized table scan method and a column-based query execution engine to boost query performance. Based on extensive experiments results with diverse workloads, we will show that Mastiff can significantly outperform existing systems including Hive, HadoopDB, and GridSQL.