An Efficient Hierarchical Clustering Method for Large Datasets with Map-Reduce

6 Author(s)
Tianyang Sun ; Chinese Acad. of Sci., Grad. Univ., Beijing, China ; Chengchun Shu ; Feng Li ; Haiyan Yu

Large datasets have become common in applications such as Internet services, genomic sequence analysis, and astronomical telescopes. Their demands on memory and computing power force data mining algorithms to be parallelized in order to process such datasets efficiently. This paper describes our experience of grouping Internet users by mining Web access logs of up to 100 gigabytes. The application is implemented using hierarchical clustering algorithms with Map-Reduce, a parallel processing framework that runs over clusters of machines. However, a straightforward implementation of the algorithms is inefficient, suffering from both inadequate memory and long execution times. This paper presents an efficient hierarchical clustering method for mining large datasets with Map-Reduce. The method includes two optimization techniques: "Batch Updating", which reduces the computational time and communication costs among cluster nodes, and "Co-occurrence based feature selection", which decreases the dimension of the feature vectors and eliminates noise features. The empirical study shows that the first technique significantly reduces the I/O and distributed communication overhead, cutting the total execution time to nearly 1/15. Experimentally, the second technique effectively simplifies the features while improving the accuracy of hierarchical clustering.
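
The abstract does not spell out how batch updating works, so the following is only a minimal single-machine sketch of one plausible reading: instead of merging one closest pair of clusters per pass, merge several disjoint closest pairs in the same pass, so that fewer passes (and, in a Map-Reduce setting, presumably fewer jobs and less I/O) are needed. The function names, the `batch_size` parameter, and the use of centroid distance are all assumptions made for illustration and are not taken from the paper.

```python
from itertools import combinations


def centroid(points):
    """Mean of a non-empty list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster(points, target_k, batch_size):
    """Agglomerative clustering with batched merges (illustrative only).

    Each pass ranks all cluster pairs by centroid distance and merges up to
    `batch_size` mutually disjoint pairs, never going below `target_k`
    clusters. Returns (clusters, number_of_passes).
    """
    clusters = [[p] for p in points]
    passes = 0
    while len(clusters) > target_k:
        passes += 1
        cents = [centroid(c) for c in clusters]
        pairs = sorted(
            combinations(range(len(clusters)), 2),
            key=lambda ij: sq_dist(cents[ij[0]], cents[ij[1]]),
        )
        # Hypothetical batching rule: take the closest pairs greedily,
        # skipping any pair that reuses an already-merged cluster.
        budget = min(batch_size, len(clusters) - target_k)
        used, merges = set(), []
        for i, j in pairs:
            if len(merges) == budget:
                break
            if i in used or j in used:
                continue
            used.update((i, j))
            merges.append((i, j))
        merged = [clusters[i] + clusters[j] for i, j in merges]
        rest = [c for idx, c in enumerate(clusters) if idx not in used]
        clusters = merged + rest
    return clusters, passes


points = [(0, 0), (0, 1), (10, 10), (10, 11),
          (20, 20), (20, 21), (30, 30), (30, 31)]
one_at_a_time, passes_single = cluster(points, target_k=4, batch_size=1)
batched, passes_batched = cluster(points, target_k=4, batch_size=4)
print(passes_single, passes_batched)
```

On this toy input both runs end with the same four clusters, but the batched run needs one pass where the one-at-a-time run needs four. In general, batched merging can produce a different hierarchy from strict one-pair-at-a-time merging, so this trade-off would need to be checked against the full paper.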

Published in:

Parallel and Distributed Computing, Applications and Technologies, 2009 International Conference on

Date of Conference:

8-11 Dec. 2009