Clustering is one of the basic tasks in data mining, and clustering massive, high-dimensional datasets is a particularly important problem in cluster analysis. However, some existing clustering algorithms are suitable only for small and medium-sized datasets. Clustering multi-density datasets is also very difficult for some clustering methods. To address these issues, this paper presents a novel parallel grid-based clustering algorithm for multi-density datasets, called PGMCLU, which is based on the ideas of data parallelism and merging local clusters. The proposed algorithm uses a new measure, called grid compactness, which reflects the degree of tightness between the data points within a grid. Furthermore, it introduces the notion of a grid feature for summarizing the information about a grid, and proposes novel approaches to data partitioning, local clustering, and merging local clusters. Extensive theoretical analysis and experimental results on both real and synthetic datasets show that the PGMCLU algorithm is effective and scalable, and achieves approximately linear speedup.
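As a rough illustration of the grid-based idea only, the following Python sketch assigns points to grid cells and computes a small per-cell summary. The abstract does not define grid compactness or the grid feature, so the names `assign_to_grid` and `summarize_cell`, the `cell_size` parameter, and the use of mean distance to the centroid as a tightness indicator are all assumptions made for this example, not the definitions used by PGMCLU.

```python
from collections import defaultdict
from math import dist, floor


def assign_to_grid(points, cell_size):
    """Map each point to the integer coordinates of the grid cell containing it."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(floor(x / cell_size) for x in p)
        cells[key].append(p)
    return dict(cells)


def summarize_cell(cell_points):
    """Hypothetical per-cell summary (a stand-in, not the paper's grid feature).

    Returns the point count, the centroid, and the mean distance of the
    points to the centroid; a smaller spread means a tighter cell.
    """
    n = len(cell_points)
    dim = len(cell_points[0])
    centroid = tuple(sum(p[i] for p in cell_points) / n for i in range(dim))
    spread = sum(dist(p, centroid) for p in cell_points) / n
    return {"count": n, "centroid": centroid, "spread": spread}


# Toy data: two small groups that fall into different unit cells.
points = [(0.1, 0.2), (0.3, 0.4), (2.5, 2.5), (2.6, 2.4), (2.7, 2.6)]
cells = assign_to_grid(points, cell_size=1.0)
summaries = {key: summarize_cell(pts) for key, pts in cells.items()}
```

Because each cell's summary depends only on the points in that cell, the summaries could in principle be computed independently on separate data partitions and combined afterwards, which is the general motivation for a data-parallel design.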