TFIDF, LSI and multi-word in information retrieval and text categorization

3 Author(s)
Wen Zhang ; Sch. of Knowledge Sci., Japan Adv. Inst. of Sci. & Technol., Tatsunokuchi ; Yoshida, T. ; Xijin Tang

Text representation, a fundamental and necessary process for text-based intelligent information processing, includes the tasks of determining the index terms for documents and producing the numeric vectors corresponding to the documents. In this paper, the multi-word, which is regarded as containing more contextual semantics than an individual word and as possessing favorable statistical characteristics, is proposed with theoretical support as an alternative index term in the vector space model for text representation. For comparison, we investigate the traditional indexing methods TF*IDF (term frequency inverse document frequency) and LSI (latent semantic indexing). The performance of TF*IDF, LSI and multi-word is examined on text classification tasks, which include information retrieval (IR) and text categorization (TC), on a Chinese and an English document collection respectively. We also tune the rescaling factor of LSI and observe its effect on text classification. The experimental results demonstrate that TF*IDF and multi-word are comparable when used for IR and TC, and that LSI performs the poorest of the three. Moreover, the rescaling factor of LSI has an insignificant influence on its effectiveness for both Chinese and English text classification.
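To make the TF*IDF baseline concrete, the following is a minimal sketch of TF*IDF weighting in a vector space model, not the authors' implementation. It assumes whitespace tokenization, raw term frequency, and the common `log(N / df)` form of inverse document frequency; the abstract does not state which variant the paper uses. The function names `tfidf_vectors` and `cosine` and the three sample documents are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with weight tf * log(N / df) per term.

    Assumption: whitespace tokenization and raw term counts; the paper
    may use a different TF*IDF variant.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy collection, for illustration only.
docs = [
    "text representation for information retrieval",
    "latent semantic indexing for text categorization",
    "information retrieval and text categorization",
]
vocab, vectors = tfidf_vectors(docs)
```

In this sketch a term that occurs in every document, such as "text" above, receives weight zero, which is the intended effect of the IDF factor. A multi-word representation would keep the same vector space model and replace the single-word vocabulary with extracted multi-word index terms.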

Published in:

Systems, Man and Cybernetics, 2008. SMC 2008. IEEE International Conference on

Date of Conference:

12-15 Oct. 2008