
An automatic indexing technique for Thai texts using frequent max substring

3 Author(s)
Todsanai Chumwatana (School of Information Technology, Murdoch University, South St, Murdoch, Western Australia 6150); Kok Wai Wong; Hong Xie

Thai is a non-segmented language: words are written as strings of symbols without explicit word boundaries, and the structure of written Thai is highly ambiguous. This makes indexing a central problem in Thai text retrieval. To construct an inverted index for Thai texts, an index term extraction technique is usually required to segment the texts into index terms. Although index terms can be specified manually by experts, that process is time-consuming and labor-intensive. Word segmentation is one of many techniques used to extract index terms from Thai texts automatically. However, most word segmentation techniques require linguistic knowledge, and preparing them is time-consuming. The n-gram based approach is another automatic index term extraction method, often used as an indexing technique for Asian languages, including Thai. It is language-independent and requires no linguistic knowledge or dictionary. Although the n-gram approach outperforms many indexing techniques for Asian languages in terms of retrieval effectiveness, it has the disadvantage of requiring large storage space and long retrieval time. In this paper we present frequent max substring mining as a way to extract index terms from Thai texts. Our method is language-independent and does not rely on any dictionary or grammatical knowledge of the language. Frequent max substring mining is based on text mining, the process of discovering useful information or knowledge from unstructured texts. The approach analyzes frequent max substring sets to extract all long, frequently occurring substrings. We aim to use the frequent max substring mining algorithm to address the drawbacks of the n-gram based approach: by keeping only frequent max substrings, it reduces the disk space required to store index terms and the retrieval time, in order to deal with the rapid growth of Thai texts.
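
To make the contrast with n-gram indexing concrete, here is a minimal brute-force sketch in Python. It is not the authors' algorithm, which the abstract does not spell out. It assumes one plausible reading of "frequent max substring": a substring that occurs at least `min_freq` times and that cannot be extended by one character on either side without its count dropping. The function names and the `min_freq` and `max_len` parameters are hypothetical, introduced only for illustration.

```python
from collections import Counter


def ngram_terms(text, n):
    """Return the set of distinct character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def substring_counts(text, max_len):
    """Count every substring of length 1..max_len (overlaps included)."""
    counts = Counter()
    for i in range(len(text)):
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            counts[text[i:j]] += 1
    return counts


def frequent_max_substrings(text, min_freq=2, max_len=None):
    """Assumed definition: keep substrings with count >= min_freq that have
    no one-character extension (left or right) with the same count."""
    if max_len is None:
        max_len = len(text)
    counts = substring_counts(text, max_len)
    frequent = {s: c for s, c in counts.items() if c >= min_freq}

    # A shorter substring is subsumed when a one-character-longer
    # frequent substring containing it occurs exactly as often.
    subsumed = set()
    for longer, count in frequent.items():
        if len(longer) > 1:
            for shorter in (longer[:-1], longer[1:]):
                if frequent.get(shorter) == count:
                    subsumed.add(shorter)

    return {s: c for s, c in frequent.items() if s not in subsumed}


if __name__ == "__main__":
    sample = "abcabc"
    print(sorted(ngram_terms(sample, 2)))       # ['ab', 'bc', 'ca']
    print(frequent_max_substrings(sample, 2))   # {'abc': 2}
```

On the toy string `abcabc`, bigram indexing stores three terms, while the sketch keeps the single term `abc`, which covers every repeated substring. This enumeration is quadratic in the text length and is meant only to show the idea; a practical index would use a more efficient structure.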

Published in:

Natural Language Processing, 2009. SNLP '09. Eighth International Symposium on

Date of Conference:

20-22 Oct. 2009