This paper presents a corpus-based approach for extracting keywords from text written in a language that has no explicit word boundaries. Based on the concept of the Thai character cluster (TCC), a Thai running text is first segmented into a sequence of inseparable units, called TCCs. To handle large-scale text, a sorted sistring (or suffix array) is used to calculate a number of statistics for each TCC. Using these statistics, we apply three alternative supervised machine learning techniques, naive Bayes, centroid-based classification and k-NN, to learn classifiers for keyword identification. Our method is evaluated on medical text extracted from the WWW. The results show that k-NN achieves the highest performance, with 79.5% accuracy.
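As an illustration of the sorted-sistring step, the following is a minimal sketch of how a suffix array over a sequence of units could be used to count occurrences of a unit sequence. It assumes the text has already been segmented into TCCs; the TCC segmentation rules and the specific statistics used in this work are not reproduced here, and the function names `build_suffix_array` and `count_occurrences` and the Latin placeholder units are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(units):
    """Return the start positions of all suffixes of `units`,
    sorted lexicographically by the suffix they begin."""
    # Naive construction for clarity; a large corpus would need a faster algorithm.
    return sorted(range(len(units)), key=lambda i: units[i:])


def count_occurrences(units, suffix_array, pattern):
    """Count how often `pattern` (a list of units) occurs in `units`.

    All suffixes that start with `pattern` are adjacent in the sorted
    suffix array, so the count is the width of that block.
    """
    n = len(pattern)
    # Truncating sorted suffixes to length n keeps them in sorted order.
    # The list is materialized here only to keep the sketch short.
    prefixes = [tuple(units[i:i + n]) for i in suffix_array]
    target = tuple(pattern)
    return bisect_right(prefixes, target) - bisect_left(prefixes, target)


# Placeholder units standing in for TCCs (assumption: already segmented).
units = ["ka", "n", "ka", "n", "ta"]
suffix_array = build_suffix_array(units)
pair_count = count_occurrences(units, suffix_array, ["ka", "n"])
```

Frequency counts of this kind are one example of a per-unit statistic that could be fed to a classifier such as k-NN; which statistics are actually used is not specified in this abstract.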