Parallel corpora have become an essential resource for multilingual natural language processing, and large volumes of parallel text are now available on the Internet. In this paper, we propose a simple but reliable method for constructing an English-Vietnamese parallel corpus through Web mining. Our system automatically downloads Web pages from a given domain and detects which of them are parallel, producing a corpus of completely clean text that is well aligned at the paragraph level. The proposed technique can be easily applied to other language pairs. Our experiments show promising results.