Clustering is an important technique in data mining. The objective of clustering is to group objects into clusters such that objects within a cluster are more similar to each other than to objects in different clusters. The similarity between two objects is defined by a distance function, e.g., the Euclidean distance, which satisfies the triangle inequality. Distance calculations are computationally expensive, and many algorithms have been proposed to reduce this cost. This paper considers the gradual clustering problem. In practice, we have observed that users often begin clustering on a small number of attributes, e.g., two. If the result is partially satisfactory, the user continues clustering on a larger number of attributes, e.g., ten. We refer to this as the gradual clustering problem. Gradual clustering can be regarded as vertically incremental clustering. We propose approaches to solve this problem. The main idea is to reduce the number of distance calculations by using the triangle inequality. Our method first stores in an index the distances between a representative object and the objects in n-dimensional space. These pre-computed distances are then used to avoid distance calculations in (n+m)-dimensional space. Two experiments on real data sets demonstrate the added value of our approaches. The implemented algorithms are based on the DBSCAN algorithm with an associated M-Tree as the index tree. However, the underlying principles can also be integrated with other tree structures, such as the MVP-Tree and R*-Tree, and with other clustering algorithms.
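The pruning idea can be illustrated with a minimal sketch. It is not the implementation described in the paper: it assumes Euclidean distance, assumes the m new attributes are appended to the n existing ones, uses a single representative object, and replaces the M-Tree with a plain list. The function names `build_index` and `range_query` are hypothetical. Under these assumptions, appending attributes can only increase the Euclidean distance, so for a representative r the stored n-dimensional distances give the lower bound |d_n(p, r) - d_n(q, r)| <= d_n(p, q) <= d_(n+m)(p, q). If this bound already exceeds the DBSCAN radius eps, the (n+m)-dimensional distance between p and q need not be computed.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_index(points, rep, n):
    """Distance from the representative to every point, using only the
    first n attributes (the attributes of the earlier clustering run)."""
    return [euclidean(p[:n], rep[:n]) for p in points]


def range_query(points, index, query_idx, eps):
    """eps-neighborhood of points[query_idx] in the full (n+m)-dim space.

    Returns the neighbor indices and the number of full-dimensional
    distance calculations that were actually performed.
    """
    q = points[query_idx]
    dq = index[query_idx]
    neighbors = []
    computed = 0
    for i, p in enumerate(points):
        # Lower bound from the triangle inequality in n dimensions;
        # it also bounds the (n+m)-dim distance from below.
        if abs(index[i] - dq) > eps:
            continue  # pruned without a full distance calculation
        computed += 1
        if euclidean(p, q) <= eps:
            neighbors.append(i)
    return neighbors, computed


# Hypothetical data: n = 2 original attributes, m = 1 added attribute.
points = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (0, 0, 5)]
index = build_index(points, rep=points[0], n=2)
neighbors, computed = range_query(points, index, query_idx=0, eps=1.5)
```

In this toy run the third point is pruned by the bound, while the fourth passes the bound and is rejected only after its full distance is computed. The bound is a filter, not a decision rule: it never discards a true neighbor, but it can let non-neighbors through.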