Traditional text clustering algorithms typically rely on unsupervised feature selection. In this paper we propose a new text clustering algorithm, SFFCM, which instead uses a supervised feature selection method. SFFCM is based on the EM algorithm. In the E-step, to calculate the expectation, a supervised feature selection algorithm computes a relevance score for each term. In the M-step, the FCM (fuzzy c-means) algorithm produces the clustering result from the selected terms. Experimental results on three standard document clustering benchmark corpora, OHSUMED, 20-Newsgroups and Reuters-21578, show that SFFCM generates better clustering results than the other clustering methods used as controls, and that supervised feature selection improves the performance of text clustering. We also propose a supervised feature selection measure, CRF-CHI, which is based on the χ² statistic and the category relative frequency. The experimental results confirm that CRF-CHI is an effective supervised feature selection measure.
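The alternation described above (score terms against the current clusters, then re-cluster on the selected terms) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The abstract does not give the CRF-CHI formula, so the plain χ² statistic stands in for it. The function names (`chi2_scores`, `fcm`, `harden`, `sffcm_sketch`), the hardening of fuzzy memberships into pseudo-labels, the binary term-presence input, and the fixed number of rounds are all assumptions made for illustration.

```python
import random

def chi2_scores(docs, labels, n_clusters):
    """Chi-square score per term (max over clusters); docs are term-presence rows."""
    n_docs, n_terms = len(docs), len(docs[0])
    scores = []
    for t in range(n_terms):
        best = 0.0
        for c in range(n_clusters):
            pairs = list(zip(docs, labels))
            a = sum(1 for d, l in pairs if d[t] > 0 and l == c)    # term present, in cluster
            b = sum(1 for d, l in pairs if d[t] > 0 and l != c)    # term present, other cluster
            c_ = sum(1 for d, l in pairs if d[t] == 0 and l == c)  # term absent, in cluster
            d_ = n_docs - a - b - c_                               # term absent, other cluster
            denom = (a + c_) * (b + d_) * (a + b) * (c_ + d_)
            if denom > 0:
                best = max(best, n_docs * (a * d_ - b * c_) ** 2 / denom)
        scores.append(best)
    return scores

def fcm(points, n_clusters, m=2.0, n_iter=50, seed=0):
    """Standard fuzzy c-means; returns (memberships, centers)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    u = []
    for _ in range(n):  # random initial memberships, each row sums to 1
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        s = sum(row)
        u.append([v / s for v in row])
    centers = []
    for _ in range(n_iter):
        # Centers are means of the points weighted by membership^m.
        centers = []
        for c in range(n_clusters):
            w = [u[i][c] ** m for i in range(n)]
            total = sum(w)
            centers.append([sum(w[i] * points[i][j] for i in range(n)) / total
                            for j in range(dim)])
        # Membership update: u_ic = 1 / sum_k (d_ic / d_ik)^(2 / (m - 1)).
        for i in range(n):
            dist = [max(1e-12, sum((p - q) ** 2 for p, q in zip(points[i], ctr)) ** 0.5)
                    for ctr in centers]
            u[i] = [1.0 / sum((dist[c] / dist[k]) ** (2.0 / (m - 1.0))
                              for k in range(n_clusters))
                    for c in range(n_clusters)]
    return u, centers

def harden(u):
    """Assumption: use the highest-membership cluster as a pseudo-label."""
    return [max(range(len(row)), key=lambda c: row[c]) for row in u]

def sffcm_sketch(docs, n_clusters, n_keep, n_rounds=5, seed=0):
    """Alternate supervised term scoring and FCM on the selected terms."""
    labels = harden(fcm(docs, n_clusters, seed=seed)[0])  # initial clustering on all terms
    selected = list(range(len(docs[0])))
    for _ in range(n_rounds):
        # E-step analogue: score every term against the current pseudo-labels.
        scores = chi2_scores(docs, labels, n_clusters)
        ranked = sorted(range(len(scores)), key=lambda t: -scores[t])
        selected = sorted(ranked[:n_keep])
        # M-step analogue: re-cluster using only the selected terms.
        reduced = [[d[t] for t in selected] for d in docs]
        labels = harden(fcm(reduced, n_clusters, seed=seed)[0])
    return labels, selected

# Toy corpus: terms 0-2 mark one topic, terms 3-5 the other, term 6 is noise.
docs = ([[1, 1, 1, 0, 0, 0, i % 2] for i in range(4)]
        + [[0, 0, 0, 1, 1, 1, i % 2] for i in range(4)])
labels, selected = sffcm_sketch(docs, n_clusters=2, n_keep=3)
```

On this toy corpus the noise term is independent of the two topics, so its χ² score is zero and it drops out of the selected set. A real run would use weighted term vectors and a convergence test rather than a fixed number of rounds; both are left out to keep the sketch short.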