As the demand for global information grows, multilingual corpora have become a valuable linguistic resource for cross-lingual information retrieval and natural language processing. Techniques for automatically constructing a Web-based English-Chinese bilingual parallel corpus address the shortage of English-Chinese parallel corpora. First, web pages that may be translations of one another are mined from a particular source. Candidate parallel page pairs are then extracted from these pages according to the similarity of their external characteristics, and each candidate pair is assessed with content-based methods. For the assessment of candidate pairs of parallel web pages, this paper designs the ECVS model, which evaluates bilingual text similarity based on the classic vector space model.
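The abstract does not specify how the ECVS model is built, so the following is only a minimal sketch of the classic vector space idea it is based on, not the paper's method. It assumes that both texts are already tokenized and that English terms can be projected into Chinese through a bilingual lexicon. The lexicon `TOY_LEXICON` and the function names `project_to_chinese`, `cosine_similarity`, and `bilingual_similarity` are hypothetical and introduced here purely for illustration.

```python
import math
from collections import Counter

# Hypothetical toy English-to-Chinese lexicon (an assumption for illustration);
# a real system would need a full bilingual dictionary.
TOY_LEXICON = {
    "parallel": "平行",
    "corpus": "语料库",
    "web": "网络",
    "page": "网页",
}


def project_to_chinese(english_tokens, lexicon):
    """Map English tokens to Chinese terms, dropping tokens absent from the lexicon."""
    return [lexicon[t.lower()] for t in english_tokens if t.lower() in lexicon]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the term-frequency vectors of two token lists."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def bilingual_similarity(english_tokens, chinese_tokens, lexicon=TOY_LEXICON):
    """Score an English/Chinese token pair in a shared Chinese term space."""
    projected = project_to_chinese(english_tokens, lexicon)
    return cosine_similarity(projected, chinese_tokens)
```

Under these assumptions, a candidate pair of pages would be kept as parallel when its score exceeds some chosen threshold. Raw term frequencies are used for simplicity; weighting schemes such as TF-IDF are a common alternative in vector space models.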