Query-relevant document representation for text clustering

Author(s):

Makrehchi, M.; Thomson Reuters, Eagan, MN, USA

In text categorization, one well-known document representation is bag-of-words. Although it is simple and popular, it ignores semantics, underlying linguistic information, and word correlations. This paper proposes a new representation for text data called Bag-Of-Queries (BOQ). First, a taxonomy of the terms in the local vocabulary is extracted by learning term dependencies with an information-theoretic inclusion index. Next, the taxonomy is partitioned to generate sets of correlated terms, the bag of queries. Because any two partitions belong to different concepts, they are treated as semantically orthogonal queries. This provides a new space of orthogonal features, which is necessary for efficient categorization. Finally, instead of using terms directly as features, the terms are used to build a set of queries. Documents are ranked in response to each query using a similarity measure, and the resulting similarity indices serve as the new features in a vector space model representation. The proposed approach outperforms bag-of-words-based clustering. It also extracts new non-redundant features while reducing dimensionality.
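The final step described in the abstract, scoring each document against each query and using the scores as features, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not name the similarity measure, so cosine similarity between raw term counts and a binary query vector stands in for it, and the two queries are written by hand rather than derived from a learned taxonomy. The names `cosine` and `boq_features` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def cosine(doc_counts, query_terms):
    """Cosine similarity between a term-count vector and a binary query vector.

    Assumption: the paper's similarity measure is not specified in the
    abstract; cosine similarity is used here only as a stand-in.
    """
    dot = sum(doc_counts[term] for term in query_terms)
    doc_norm = math.sqrt(sum(c * c for c in doc_counts.values()))
    query_norm = math.sqrt(len(query_terms))
    if doc_norm == 0 or query_norm == 0:
        return 0.0
    return dot / (doc_norm * query_norm)


def boq_features(documents, queries):
    """Map each document to one similarity score per query."""
    rows = []
    for text in documents:
        counts = Counter(text.lower().split())
        rows.append([cosine(counts, query) for query in queries])
    return rows


# Hypothetical queries: in the paper these come from partitioning a
# learned term taxonomy; here they are hand-written sets of related terms.
queries = [
    {"stock", "market", "shares"},
    {"match", "goal", "team"},
]

documents = [
    "stock market shares fell",
    "the team scored a late goal",
    "",
]

features = boq_features(documents, queries)
for row in features:
    print(row)
```

Each document becomes a vector with one entry per query, so the feature space has as many dimensions as there are queries rather than as many as there are vocabulary terms. That is the sense in which a representation of this kind reduces dimensionality.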

Published in:

Digital Information Management (ICDIM), 2010 Fifth International Conference on

Date of Conference:

5-8 July 2010