TY - GEN
T1 - Comparing map-reduce and FREERIDE for data-intensive applications
AU - Jiang, Wei
AU - Ravi, Vignesh T.
AU - Agrawal, Gagan
PY - 2009
Y1 - 2009
N2 - Map-reduce has been a topic of much interest in the last 2-3 years. While it is well accepted that the map-reduce APIs enable significantly easier programming, the performance aspects of the use of map-reduce are less well understood. This paper focuses on comparing the map-reduce paradigm with a system that was developed earlier at Ohio State, FREERIDE (FRamework for Rapid Implementation of Datamining Engines). The API and the functionality offered by FREERIDE have many similarities with the map-reduce API, though there are some differences. Moreover, while FREERIDE was motivated by data mining computations, map-reduce was motivated by searching, sorting, and related applications in a data center. We compare the programming APIs and performance of the Hadoop implementation of map-reduce with FREERIDE. For our study, we have taken three data mining algorithms: k-means clustering, apriori association mining, and k-nearest neighbor search. We have also included a simple data scanning application, word-count. The main observations from our results are as follows. For the three data mining applications we have considered, FREERIDE outperformed Hadoop by a factor of 5 or more. For word-count, Hadoop is better by a factor of up to 2. With increasing dataset sizes, the relative performance of Hadoop becomes better. Overall, it seems that Hadoop has significant overheads related to initialization, I/O, and sorting of (key, value) pairs. Thus, despite an easy-to-program API, Hadoop's map-reduce does not appear very suitable for data mining computations on modest-sized datasets.
AB - Map-reduce has been a topic of much interest in the last 2-3 years. While it is well accepted that the map-reduce APIs enable significantly easier programming, the performance aspects of the use of map-reduce are less well understood. This paper focuses on comparing the map-reduce paradigm with a system that was developed earlier at Ohio State, FREERIDE (FRamework for Rapid Implementation of Datamining Engines). The API and the functionality offered by FREERIDE have many similarities with the map-reduce API, though there are some differences. Moreover, while FREERIDE was motivated by data mining computations, map-reduce was motivated by searching, sorting, and related applications in a data center. We compare the programming APIs and performance of the Hadoop implementation of map-reduce with FREERIDE. For our study, we have taken three data mining algorithms: k-means clustering, apriori association mining, and k-nearest neighbor search. We have also included a simple data scanning application, word-count. The main observations from our results are as follows. For the three data mining applications we have considered, FREERIDE outperformed Hadoop by a factor of 5 or more. For word-count, Hadoop is better by a factor of up to 2. With increasing dataset sizes, the relative performance of Hadoop becomes better. Overall, it seems that Hadoop has significant overheads related to initialization, I/O, and sorting of (key, value) pairs. Thus, despite an easy-to-program API, Hadoop's map-reduce does not appear very suitable for data mining computations on modest-sized datasets.
UR - http://www.scopus.com/inward/record.url?scp=72049116296&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=72049116296&partnerID=8YFLogxK
U2 - 10.1109/CLUSTR.2009.5289199
DO - 10.1109/CLUSTR.2009.5289199
M3 - Conference contribution
AN - SCOPUS:72049116296
SN - 9781424450121
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
BT - 2009 IEEE International Conference on Cluster Computing and Workshops, CLUSTER '09
T2 - 2009 IEEE International Conference on Cluster Computing and Workshops, CLUSTER '09
Y2 - 31 August 2009 through 4 September 2009
ER -