This paper proposed an LDA-based cross-language retrieval model that did not rely on word-by-word translation of the query or the document. Instead, a parallel corpus was used to estimate a cross-language LDA (Latent Dirichlet Allocation) model. Because the parallel corpus contained two languages, English and Chinese, we assumed that a topic variable Z in LDA could generate both an English token and a Chinese token. The same assumption holds for any number of languages, so the model could be readily extended to multi-language information retrieval as long as a multilingual parallel corpus was provided. The proposed LDA-based cross-language retrieval model was compared on CNKI datasets with three popular retrieval models: an LDA-based monolingual document model, a monolingual TF-IDF retrieval model, and a cross-lingual Latent Semantic Indexing retrieval model. Experimental results showed that the proposed model was effective and performed well relative to these baselines.
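The idea of a shared topic variable can be illustrated with a minimal sketch. It is not the paper's implementation: the two-topic word distributions, the vocabulary, and the function names `topic_mixture` and `query_log_likelihood` are all hypothetical, and the topic mixture of a document is approximated by averaging per-token topic posteriors rather than by proper LDA inference. The sketch scores an English query against Chinese documents through the topics alone, without translating any word.

```python
import math

# Hypothetical toy parameters (not taken from the paper).
# Topic z indexes both lists, so each topic has an English and a Chinese word distribution.
PHI_EN = [
    {"bank": 0.6, "loan": 0.3, "river": 0.1},   # topic 0: finance
    {"bank": 0.2, "loan": 0.1, "river": 0.7},   # topic 1: geography
]
PHI_ZH = [
    {"银行": 0.6, "贷款": 0.3, "河流": 0.1},      # topic 0: finance
    {"银行": 0.1, "贷款": 0.1, "河流": 0.8},      # topic 1: geography
]
FLOOR = 1e-9  # assumed probability floor for out-of-vocabulary words


def topic_mixture(doc_tokens, phi):
    """Approximate P(z | d) by averaging per-token topic posteriors.

    This is a crude stand-in for LDA inference (Gibbs sampling or
    variational Bayes), with a uniform prior over topics.
    """
    k = len(phi)
    theta = [0.0] * k
    for w in doc_tokens:
        weights = [phi[z].get(w, FLOOR) for z in range(k)]
        total = sum(weights)
        for z in range(k):
            theta[z] += weights[z] / total
    return [t / len(doc_tokens) for t in theta]


def query_log_likelihood(query_tokens, theta, phi):
    """log P(q | d) = sum over query words of log sum_z P(w | z) P(z | d)."""
    score = 0.0
    for w in query_tokens:
        p = sum(phi[z].get(w, FLOOR) * theta[z] for z in range(len(phi)))
        score += math.log(p)
    return score


# Chinese documents are mapped to topic mixtures with the Chinese distributions,
# and the English query is scored with the English distributions of the same topics.
doc_finance = ["银行", "贷款"]
doc_river = ["河流", "河流"]
query = ["loan"]

theta_finance = topic_mixture(doc_finance, PHI_ZH)
theta_river = topic_mixture(doc_river, PHI_ZH)
score_finance = query_log_likelihood(query, theta_finance, PHI_EN)
score_river = query_log_likelihood(query, theta_river, PHI_EN)
```

With these toy numbers the finance document receives the higher score for the query "loan", because both are dominated by the same topic. In a real system the word distributions would be estimated from the parallel corpus rather than written by hand.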