Part-of-speech tagging in real-world applications is performed on text from domains that differ from the large, publicly available training data sets. The two most successful part-of-speech taggers are trained on the Wall Street Journal corpus, a corpus of millions of words. We compare their performance on a test set from a different domain, astronomy, drawn from documents available on the World Wide Web. The two taggers, the Maximum Entropy Part of Speech Tagger (MXPOST) and the Transformation-Based Learning Tagger, are well known and widely used in language research and development systems. They were tested in several modes: (1) after training on the Wall Street Journal corpus only, (2) after training on only a small body of text from our astronomy domain, (3) with and without an auxiliary lexicon derived from many astronomy-related Web documents, and (4) after incremental training, that is, after training on the Wall Street Journal corpus followed by additional training on the specific domain. One conclusion from the experiment is that different taggers exhibit different biases when trained on the same data.
Published in: Proceedings of the 1999 International Conference on Information Intelligence and Systems
Date of Conference: 1999