Skip to Main Content
This paper focuses on a solution to better adapt ASR systems, whose language models (LM) are usually trained on topic-independent corpora, to new topics, in particular in the case of broadcast news. We propose a new complete and fully unsupervised technique that selects keywords from each segment using information retrieval methods, to build a thematically coherent adaptation corpus from the Internet. The LM used for the initial transcription is then adapted before rescoring word lattices. Experimental results demonstrate the validity of the proposed adaptation technique with a significant reduction of the perplexity after LM adaptation. Word error rates are also improved in some cases though to a lesser extent.
Date of Conference: March 31 2008-April 4 2008