We present a language-independent approach to conflation that does not depend on predefined rules or prior knowledge of the target language. The proposed unsupervised method enhances the pure n-gram model and groups related words using a revised string-similarity measure. This process can also produce terms that are most likely not relevant to the query ("noisy terms"). To detect and eliminate them, we propose an approach based on mutual information scores computed from web co-occurrence statistics. We also present an evaluation of this approach.
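As a rough illustration of the two ingredients named above, the following sketch groups words by character n-gram overlap and scores term pairs by pointwise mutual information. It is not the method proposed here: the plain Dice coefficient over bigrams stands in for the enhanced n-gram model and revised similarity measure, which the abstract does not specify. The function names, the 0.5 threshold, and the count-based PMI interface are all assumptions made for the example.

```python
import math


def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word (no padding)."""
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice_similarity(a, b, n=2):
    """Plain Dice coefficient over character n-gram sets.

    Assumption: a stand-in for the revised similarity measure.
    """
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def related_terms(query, vocabulary, threshold=0.5):
    """Conflate: collect vocabulary words similar enough to the query term.

    The threshold value is hypothetical.
    """
    return [w for w in vocabulary
            if w != query and dice_similarity(query, w) >= threshold]


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information (base 2) from co-occurrence counts.

    Assumption: the counts come from some external source, such as
    web hit counts; how they are obtained is not modeled here.
    """
    if count_xy == 0:
        return float("-inf")
    return math.log2((count_xy * total) / (count_x * count_y))
```

In a pipeline of this shape, candidates returned by `related_terms` whose PMI with the query term falls below some cutoff would be treated as noisy terms and discarded. The cutoff, like the similarity threshold, would need to be tuned on real data.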