In text categorization, one well-known document representation is bag-of-words. Although it is simple and popular, it ignores semantics, underlying linguistic information, and word correlations. In this paper, a new representation for text data is proposed, called Bag-Of-Queries (BOQ). First, a taxonomy of the terms in the local vocabulary is extracted by learning term dependencies using an information-theoretic inclusion index. Next, the taxonomy is partitioned to generate sets of correlated terms, the bag of queries. Since any two partitions belong to different concepts, they are considered semantically orthogonal queries. This provides a new space of orthogonal features, which is necessary for efficient categorization. Finally, instead of using terms directly as features, we use them to build a set of queries. Documents are ranked in response to the queries using a similarity measure, and the resulting similarity indices serve as the new features in a vector space model representation. The proposed approach outperforms bag-of-words-based clustering. It also extracts new non-redundant features while reducing dimensionality.
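The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the formula for the information-theoretic inclusion index, so the code below substitutes a plain co-occurrence ratio (the fraction of documents containing one term that also contain another). It also collapses taxonomy extraction and partitioning into a single step by linking strongly included terms and taking connected components. The function names, the `threshold` parameter, the use of cosine similarity, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
import math


def term_postings(docs):
    """Map each term to the set of document indices that contain it."""
    postings = {}
    for i, doc in enumerate(docs):
        for term in set(doc):
            postings.setdefault(term, set()).add(i)
    return postings


def inclusion(postings, a, b):
    """Stand-in inclusion index (assumption): share of a's documents that also contain b."""
    return len(postings[a] & postings[b]) / len(postings[a])


def build_queries(docs, threshold=0.8):
    """Group correlated terms into queries via connected components (simplified partitioning)."""
    postings = term_postings(docs)
    terms = sorted(postings)
    parent = {t: t for t in terms}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(terms, 2):
        # Link two terms when either one is largely included in the other.
        if max(inclusion(postings, a, b), inclusion(postings, b, a)) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for t in terms:
        groups.setdefault(find(t), []).append(t)
    return sorted(groups.values())


def boq_features(doc, queries):
    """Cosine similarity between a document's term counts and each binary query vector."""
    counts = Counter(doc)
    doc_norm = math.sqrt(sum(c * c for c in counts.values()))
    features = []
    for query in queries:
        dot = sum(counts[t] for t in query)
        query_norm = math.sqrt(len(query))
        features.append(dot / (doc_norm * query_norm) if doc_norm else 0.0)
    return features


# Toy corpus (hypothetical): two loosely separated topics.
docs = [
    ["cat", "dog", "pet"],
    ["cat", "pet"],
    ["dog", "pet"],
    ["stock", "market"],
    ["stock", "market", "bond"],
]
queries = build_queries(docs)
features = [boq_features(doc, queries) for doc in docs]
```

On this toy corpus the six vocabulary terms collapse into two queries, so each document is described by two similarity scores rather than six term counts. This mirrors the dimensionality reduction the abstract claims, though only under the simplifications stated above.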
Date of Conference: 5-8 July 2010