This paper proposes a method to improve the accuracy of dependency parsing of bilingual texts (bitexts) by using an auto-generated bilingual treebank created with the help of statistical machine translation (SMT) systems. Previous bitext parsing methods use human-annotated bilingual treebanks, which are costly and difficult to obtain. In the proposed method, we instead train the parsing models on an auto-generated bilingual treebank. First, an SMT system is used to translate a monolingual treebank into the target language; then, a monolingual parser for the target language is used to parse the translated sentences. Since the auto-translated sentences and auto-parsed trees in the auto-generated bilingual treebank are far from perfect, the bilingual constraints derived from them are not sufficiently reliable. To overcome this problem, we propose a method to verify the reliability of the constraints using a large amount of unannotated target-language monolingual data and bilingual data. Finally, we design a set of effective bilingual features for the parsing models on the basis of the verified constraints. We conduct experiments on a standard test set. The experimental results show that our bitext parser significantly outperforms monolingual parsers. Moreover, our method still yields improvements when we use a larger monolingual treebank containing over 50,000 sentences. We also test the proposed method with different SMT systems, and the results show that it is robust to noise. In particular, the proposed method can be used in a purely monolingual setting with the help of SMT; that is, unlike previous methods, it does not need a human translation of the test set.
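The pipeline described in the abstract (translate a monolingual treebank, parse the translations, then keep only the constraints that pass a reliability check) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `BitextEntry`, `build_auto_bitreebank`, and `verified_target_arcs` are hypothetical, the `translate` and `parse_target` callables stand in for an SMT system and a target-language parser, and the frequency threshold over arc counts is a simple stand-in for the verification step, not the paper's actual criterion.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A dependency tree as one head index per token (1-based; 0 marks the root).
Heads = List[int]


@dataclass
class BitextEntry:
    """One sentence pair of the auto-generated bilingual treebank (hypothetical type)."""
    source_tokens: List[str]
    source_heads: Heads   # human-annotated, from the monolingual treebank
    target_tokens: List[str]  # produced by the SMT system
    target_heads: Heads   # produced by the target-language parser


def build_auto_bitreebank(
    treebank: List[Tuple[List[str], Heads]],
    translate: Callable[[List[str]], List[str]],
    parse_target: Callable[[List[str]], Heads],
) -> List[BitextEntry]:
    """Translate each source sentence, then parse the translation."""
    entries = []
    for tokens, heads in treebank:
        target_tokens = translate(tokens)
        target_heads = parse_target(target_tokens)
        entries.append(BitextEntry(tokens, heads, target_tokens, target_heads))
    return entries


def verified_target_arcs(
    entry: BitextEntry,
    arc_counts: Dict[Tuple[str, str], int],
    min_count: int,
) -> List[Tuple[str, str]]:
    """Keep (head word, dependent word) arcs seen at least min_count times.

    arc_counts is assumed to come from auto-parsed unannotated target data;
    the threshold test is an illustrative assumption, not the paper's method.
    """
    kept = []
    for dep_idx, head_idx in enumerate(entry.target_heads, start=1):
        if head_idx == 0:  # the root token has no head word
            continue
        arc = (entry.target_tokens[head_idx - 1], entry.target_tokens[dep_idx - 1])
        if arc_counts.get(arc, 0) >= min_count:
            kept.append(arc)
    return kept
```

In this sketch, only the arcs returned by `verified_target_arcs` would be passed on as bilingual features; arcs that fail the check are simply ignored rather than treated as negative evidence.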
IEEE Transactions on Audio, Speech, and Language Processing (Volume: 20, Issue: 5)
Date of Publication: July 2012