Using Statistical Machine Translation to Grade Training Data

2 Author(s)
Finch, A. ; Language Transition Group, Keihanna Science City, Japan ; Sumita, E.

One of the main causes of errors in statistical machine translation is the erroneous phrase pairs that find their way into the phrase table. These phrase pairs are the result of poor word-to-word alignments during the training of the translation model. Word alignment errors in turn cause errors during the phrase extraction phase, and the resulting erroneous bilingual phrase pairs are then used during decoding and appear in the output of the machine translation system. Machine translation training data is never perfect: bilingual sentence pairs are often misaligned at the sentence level, or are poor translations of each other due to human error. Even when sentence pairs in the corpus are good translations of each other, the translations may not be literal enough to admit the sort of phrase-by-phrase translation necessary to make good training data for a phrase-based statistical machine translation (SMT) system. This is because such SMT systems operate on the assumption that the source can be transformed into the target simply by translating phrase-by-phrase with re-ordering. In the real world, many perfectly correct translations are not of this form, and these sentences, even though they are correct translations, make poor training data for the translation models of a phrase-based SMT system. This paper presents a technique in which preliminary machine translation systems are built with the sole purpose of identifying those sentence pairs in the training corpus that the systems are able to generate using their models, the hypothesis being that these sentence pairs are likely to make good training data for an SMT system of the same type. These sentences are then used to bootstrap a second SMT system, and those sentences identified as good training data are given additional weight during the training process for building the translation models. Using this technique we were able to improve the performance of a Japanese-to-English SMT system by 1.2 to 1.5 BLEU points on unseen evaluation data.
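
The abstract does not say how the preliminary systems decide that a sentence pair can be generated, nor how much extra weight such pairs receive. The following is only a minimal sketch of the general idea, under stated assumptions: a toy phrase table held in a dictionary, an exact reachability check (the target must be reproducible by translating source phrases one at a time, in any order), and a hypothetical `boost` factor for reachable pairs. The names `can_generate`, `weight_corpus`, `max_len`, and `boost`, and the romanized example data, are illustrative and not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

# Toy phrase table: source phrase -> set of candidate target phrases (assumption).
PhraseTable = Dict[Tuple[str, ...], Set[Tuple[str, ...]]]


def can_generate(source: List[str], target: List[str],
                 table: PhraseTable, max_len: int = 3) -> bool:
    """Return True if `target` can be built left to right by translating
    non-overlapping source phrases (taken in any order) with `table`,
    covering every source word exactly once."""
    n = len(source)
    full = (1 << n) - 1
    seen = set()
    stack = [(0, 0)]  # (bitmask of covered source words, target words produced)
    while stack:
        covered, pos = stack.pop()
        if covered == full and pos == len(target):
            return True
        if (covered, pos) in seen:
            continue
        seen.add((covered, pos))
        for i in range(n):
            for j in range(i + 1, min(n, i + max_len) + 1):
                span = (1 << j) - (1 << i)  # bits for source words i..j-1
                if covered & span:
                    break  # overlap; longer spans starting at i overlap too
                for tgt in table.get(tuple(source[i:j]), ()):
                    end = pos + len(tgt)
                    if tuple(target[pos:end]) == tgt:
                        stack.append((covered | span, end))
    return False


def weight_corpus(corpus: List[Tuple[str, str]], table: PhraseTable,
                  boost: float = 2.0) -> List[float]:
    """Give reachable sentence pairs the hypothetical weight `boost`, others 1.0."""
    return [boost if can_generate(s.split(), t.split(), table) else 1.0
            for s, t in corpus]


# Illustrative data only.
toy_table: PhraseTable = {
    ("watashi", "wa"): {("i",)},
    ("ringo", "o"): {("an", "apple")},
    ("taberu",): {("eat",)},
}
toy_corpus = [
    ("watashi wa ringo o taberu", "i eat an apple"),           # literal, reordered
    ("watashi wa ringo o taberu", "apples are my favourite"),  # free translation
]
weights = weight_corpus(toy_corpus, toy_table)
print(weights)
```

In this toy setting the first pair is reachable by phrase-by-phrase translation with reordering and the second is not, so only the first receives the extra weight. A real system would use the phrase table and decoder of the preliminary SMT system rather than an exhaustive search like this one.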

Published in:

Universal Communication, 2008. ISUC '08. Second International Symposium on

Date of Conference:

15-16 Dec. 2008