Map-reduce has attracted much interest over the last 2-3 years. While it is well accepted that the map-reduce APIs make programming significantly easier, the performance implications of using map-reduce are less well understood. This paper compares the map-reduce paradigm with FREERIDE (FRamework for Rapid Implementation of Datamining Engines), a system developed earlier at Ohio State. The API and functionality offered by FREERIDE have many similarities with the map-reduce API, but there are also some differences. Moreover, while FREERIDE was motivated by data mining computations, map-reduce was motivated by searching, sorting, and related applications in a data center. We compare the programming APIs and performance of the Hadoop implementation of map-reduce with those of FREERIDE. Our study uses three data mining algorithms: k-means clustering, apriori association mining, and k-nearest neighbor search. We also include a simple data scanning application, word-count. The main observations from our results are as follows. For the three data mining applications we considered, FREERIDE outperformed Hadoop by a factor of 5 or more. For word-count, Hadoop was better by a factor of up to 2. As dataset sizes increase, the relative performance of Hadoop improves. Overall, Hadoop appears to have significant overheads related to initialization, I/O, and sorting of (key, value) pairs. Thus, despite an easy-to-program API, Hadoop's map-reduce does not appear well suited to data mining computations on modest-sized datasets.
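To make the word-count application and the (key, value) sorting step concrete, the following is a minimal single-process sketch of the map-reduce pattern in Python. It is not Hadoop's API and is not taken from the paper; the function names `map_phase`, `shuffle_sort`, `reduce_phase`, and `word_count` are hypothetical, and lowercasing words is an assumption made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Emit one (word, 1) pair for every word occurrence."""
    for line in lines:
        for word in line.split():
            # Lowercasing is an illustrative assumption, not part of the paper.
            yield (word.lower(), 1)


def shuffle_sort(pairs):
    """Sort the pairs by key and group equal keys together.

    This stands in for the sorting of (key, value) pairs that happens
    between the map and reduce steps.
    """
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped}


def word_count(lines):
    """Run map, sort/group, and reduce in sequence on an iterable of lines."""
    return reduce_phase(shuffle_sort(map_phase(lines)))
```

Calling `word_count(["a b a", "B c"])` counts `a` and `b` twice each and `c` once. Note that every word occurrence becomes an intermediate pair that must be sorted before reduction, which is the kind of step the abstract identifies as a source of overhead.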