Bilingual parallel corpora provide a rich source of translation information for tasks such as cross-language information retrieval and data-driven machine translation. However, they are often scarce resources, limited in size, language coverage, and language register. Because only a few small-scale corpora are suitable, researchers struggle to transfer and adapt the available technologies. In this paper we introduce a large Chinese-English parallel corpus constructed by crawling and processing bilingual web documents from the Internet. It currently contains about 300,000 parallel sentence pairs. The tools and methodology used in this collection project are also described. In a cross-language retrieval task, we experimented with this self-constructed corpus to improve the quality of query translation, with the aim of achieving better retrieval performance.