The DRk-M for Clustering Categorical Datasets With Uncertainty


Abstract:

Categorical clustering has attracted much attention in recent years, since many real-world applications produce or consume categorical data. The k-modes was among the first algorithms developed for categorical clustering, using the notion of modes as cluster centroids. However, its major drawback is the random update of the modes in each iteration. This article proposes identifying the most adequate modes among a list of candidates in the mode update step of the process. The proposed algorithm, called density rough k-modes (DRk-M), uses the modes' density to characterize the distribution of their observations and rough set theory (RST) to deal with the uncertainty involved in this process. The DRk-M was evaluated on UCI datasets and compared to many state-of-the-art methods such as the k-modes (1998), Ng's method (2007), Cao's method (2012), and their variants. The obtained results point to an average performance improvement reaching 17% in some cases and more than 25.5% of the total experiments with an average improvement of more than 7% between these methods.
Published in: IEEE Intelligent Systems (Volume: 36, Issue: 5, 01 Sept.-Oct. 2021)
Page(s): 113 - 121
Date of Publication: 17 November 2020


Clustering aims to partition a dataset composed of N observations embedded in a d-dimensional space into K distinct clusters, such that data points within the same cluster are more similar to each other than to data points in other clusters. Many methods have been proposed for categorical clustering, such as the k-modes and its variants. The k-modes was inspired by the partitional k-means: it uses the simple matching dissimilarity measure, computes modes instead of means to represent the cluster centroids, and implements a frequency-based method to update them. These modifications removed the numeric-only limitation of the k-means and permitted its use to efficiently cluster large categorical datasets. Besides, the k-modes uses the same partitional paradigm as the k-means to ensure high scalability and low complexity, since partitional clustering methods are well known for their low time and space complexity when compared to hierarchical methods.
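To make the three ingredients of the k-modes concrete, the following is a minimal Python sketch of the baseline algorithm only, not of the DRk-M. The function names (`matching_dissimilarity`, `update_mode`, `k_modes`), the random initialization, and the stopping rule are assumptions chosen for illustration. Ties between equally frequent categories are resolved here by first-seen order, which is one arbitrary choice among several.

```python
from collections import Counter
import random


def matching_dissimilarity(a, b):
    """Simple matching: number of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def update_mode(members):
    """Frequency-based update: most frequent category of each attribute.

    Ties are broken by first-seen order (an arbitrary, illustrative choice).
    """
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*members))


def k_modes(data, k, max_iter=100, seed=0):
    """Baseline k-modes sketch; data is a list of equal-length tuples."""
    rng = random.Random(seed)
    modes = rng.sample(data, k)  # assumed initialization: k random observations
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest mode.
        new_labels = [
            min(range(k), key=lambda j: matching_dissimilarity(x, modes[j]))
            for x in data
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        # Update step: recompute the mode of every non-empty cluster.
        for j in range(k):
            members = [x for x, label in zip(data, labels) if label == j]
            if members:
                modes[j] = update_mode(members)
    return labels, modes
```

Calling `k_modes(data, k=2)` on a list of categorical tuples returns one cluster label per observation together with the final modes. The tie-breaking inside `update_mode` is where several equally frequent candidate modes can arise in the update step.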
