Skip to Main Content
In this paper, we provide the first comprehensive comparison of methods for part-of-speech tagging and chunking for Hindi. We present an analysis of the application of three major learning algorithms (viz. Maximum entropy models  , Conditional random fields  and Support Vector Machines ) to part-of-speech tagging and chunking for Hindi Language using datasets of different sizes. The use of language independent features make this analysis more general and capable of concluding important results for similar South and South East Asian Languages. The results show that CRFs outperform SVMs and Maxent in terms of accuracy. We are able to achieve an accuracy of 92.26% for part-of-speech tagging and 93.57% for chunking using Conditional Random Fields algorithm. The corpus we have used had 138177 annotated instances for training. We report results for three learning algorithms by varying various conditions (clustering, BIEO notation vs. BIES notation, multiclass methods for SVMs etc.) and present an extensive analysis of the whole process. These results will give future researchers an insight into how to shape their research keeping in mind the comparative performance of major algorithms on datasets of various sizes and in various conditions.