Statistical machine translation, the state-of-the-art technique, relies on sentence-level aligned parallel corpora. Progress in this technique is constrained by the scarcity of publicly available parallel corpora. The rapid growth of the World Wide Web offers a good chance of constructing large-scale parallel corpora more easily. In this paper, we summarize the current strategies for mining parallel corpora from the Web and classify them into three classes: structure-based, content-based, and hybrid. We compare these approaches and offer some ideas that may help improve the performance of the algorithms. In the discussion section, we raise some problems that should be considered in future research.