Skip to Main Content
MapReduce is a programming framework introduced by Google for large-scale data processing. It is usually used in a scan-centric fashion where all the data are split into blocks and Maps are generated for each block to scan and process the data in the block, then Reduces merge outputs from all the Maps. When a query intends to process only a subset of the data selected by a predicate, this brute-force method may cause extra I/O overhead spent on irrelevant data, and the overhead for initiating so many Maps may be non-trivial given that the actually interesting data for the query is comparatively small in volume. We propose an approach to integrate the index into the MapReduce execution in which only an appropriate number of Maps are generated, each of which accesses the data using an index. This approach incurs random I/O and remote access to data, so the overall performance depends on both system parameters and the query characteristics. We build a cost model for both this index access execution and the traditional full scan execution. This cost model can be used to choose between the two execution modes before executing a query. Experiments show that the index access execution can greatly outperform full scan execution when the selectivity of the predicate is low, and the cost model predicts the actual execution cost very well so can be used to determine the execution plan for a query.