Comparison of different lemmatization approaches for information retrieval on Turkish text collection

2 Author(s)
Ozturkmenoglu, O. ; Dept. of Comput. Eng., Dokuz Eylul Univ., Izmir, Turkey ; Alpkocak, A.

In this paper, we compare the performance of different lemmatization approaches for information retrieval (IR) over a Turkish text collection. A lemma is simply the "dictionary form" of a word, and lemmatization is the process of determining the lemma of a given word, so that the different inflected forms of a word can be analyzed as a single item. We compared three lemmatizers and one fixed-length truncation approach. The first lemmatizer is based on a morphological analyzer for Turkish that uses finite-state language processing technology; the second is the Dictionary-based Turkish Lemmatizer (DTL), which uses a radix-trie data structure; the third is a simple dictionary-based top-down parser; and the last approach truncates words at a fixed length. We assessed the lemmatizers on the Bilkent University Milliyet collection, which contains more than 400K documents. Performance was compared within an IR system using well-known IR evaluation metrics. The results show that lemmatization improves IR performance. The best results were achieved with DTL, the radix-trie-based lemmatizer, which also used the fewest terms in the IR system.
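
To make the contrast between fixed-length truncation and a dictionary lookup concrete, here is a minimal Python sketch. It is not the authors' DTL implementation. It assumes a plain character trie rather than a compressed radix trie, assumes the lemma is the longest dictionary entry that prefixes the word, and uses a tiny made-up lemma list. The names `truncate`, `TrieNode`, and `PrefixLemmatizer` are hypothetical, as is the cut-off length of 5.

```python
from typing import Dict, Iterable, Optional


def truncate(word: str, n: int = 5) -> str:
    """Fixed-length truncation: keep at most the first n characters.

    The default n=5 is an arbitrary illustrative value, not taken from the paper.
    """
    return word[:n]


class TrieNode:
    """One node of a plain (uncompressed) character trie."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_lemma = False  # True if the path from the root spells a lemma


class PrefixLemmatizer:
    """Toy dictionary lemmatizer: longest dictionary entry that prefixes the word.

    Assumption: a simplification for illustration only. A real Turkish
    lemmatizer must also handle stem changes such as consonant mutation.
    """

    def __init__(self, lemmas: Iterable[str]) -> None:
        self.root = TrieNode()
        for lemma in lemmas:
            node = self.root
            for ch in lemma:
                node = node.children.setdefault(ch, TrieNode())
            node.is_lemma = True

    def lemmatize(self, word: str) -> str:
        node = self.root
        best: Optional[str] = None
        for i, ch in enumerate(word):
            nxt = node.children.get(ch)
            if nxt is None:
                break  # no dictionary entry continues along this path
            node = nxt
            if node.is_lemma:
                best = word[: i + 1]  # remember the longest match so far
        # Fall back to the surface form when no dictionary entry matches.
        return best if best is not None else word


# Hypothetical three-entry dictionary, used only for this example.
lemmatizer = PrefixLemmatizer(["kitap", "ev", "evli"])
print(lemmatizer.lemmatize("kitaplar"))  # kitap
print(lemmatizer.lemmatize("evler"))     # ev
print(truncate("kitaplar"))              # kitap
```

In this sketch, truncation always cuts at the same position regardless of the word, while the trie lookup only returns strings that exist in the dictionary. The sketch deliberately ignores stem alternations: "kitabı" shares no complete dictionary entry as a prefix here, so it is returned unchanged.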

Published in:

Innovations in Intelligent Systems and Applications (INISTA), 2012 International Symposium on

Date of Conference:

2-4 July 2012