In the literature, texts to be classified are generally represented in a high-dimensional bag-of-words space in which each dimension corresponds to a word or n-gram. In this study, the words are first placed in a semantic space. Computing a word's coordinates in the semantic space requires a measure of how similar the words are in meaning. Harris states that the semantic similarity of two words is related to the number of documents in which both words occur. We applied this hypothesis to Turkish words. First, we obtained a word co-occurrence matrix from a Web corpus. Then, the numerical coordinates of the words were calculated using multidimensional scaling. The coordinates of a text are obtained from the coordinates of the words that occur in it. In our experiments, Turkish news texts were classified into 5 classes. We obtained more successful results than with the traditional bag-of-words space. Our approach is not limited to Turkish words and texts; it is also applicable to other languages.
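
The pipeline described above (document co-occurrence counts, multidimensional scaling, then text coordinates derived from word coordinates) can be illustrated with a minimal sketch. The abstract does not say how co-occurrence counts are turned into distances, which MDS variant is used, or how word coordinates are combined into a text coordinate. The sketch therefore assumes a Jaccard-style distance, classical (eigendecomposition-based) MDS, and a plain average of word coordinates. All function names are hypothetical and are not taken from the paper.

```python
import numpy as np


def cooccurrence_matrix(docs, vocab):
    """Count, for each word pair, the number of documents containing both words.

    The diagonal holds each word's document frequency.
    """
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for doc in docs:
        present = sorted({index[w] for w in doc if w in index})
        for i in present:
            for j in present:
                counts[i, j] += 1
    return counts


def jaccard_distances(counts):
    """Turn co-occurrence counts into distances (assumption: 1 - Jaccard similarity)."""
    df = np.diag(counts)
    union = df[:, None] + df[None, :] - counts
    sim = np.divide(counts, union, out=np.zeros_like(counts), where=union > 0)
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)
    return dist


def classical_mds(dist, n_dims):
    """Classical MDS (assumed variant): embed points so that Euclidean
    distances approximate the given distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


def text_coordinates(tokens, vocab, word_coords):
    """Place a text at the mean of its known words' coordinates (assumption)."""
    index = {w: i for i, w in enumerate(vocab)}
    rows = [index[w] for w in tokens if w in index]
    if not rows:
        return np.zeros(word_coords.shape[1])
    return word_coords[rows].mean(axis=0)
```

The resulting low-dimensional text vectors could then be passed to any standard classifier in place of bag-of-words vectors. The number of dimensions given to `classical_mds` is a free parameter in this sketch.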
Published in:
Signal Processing and Communications Applications Conference, 2009. SIU 2009. IEEE 17th
Date of Conference: 9-11 April 2009