A speech recognizer operating in a mobile environment has to be robust to two distortion sources: ambient noise (additive distortion) and microphone changes (convolutive distortion). Explicitly and simultaneously modeling the two distortion sources has been a great challenge for speech recognition in adverse environments. In this paper, two log-spectral domain components are introduced into speech acoustic models to represent the additive and convolutive distortions. A method called JAC jointly compensates for both. For each utterance to be recognized, it adapts the HMM mean vectors using a noise estimate and a channel estimate. The noise estimate is calculated from the pre-utterance pause, and the channel estimate is calculated with an EM algorithm from speech utterances produced in the distortion environment. The algorithm is evaluated on a noisy speech database recorded in-vehicle with a hands-free distant microphone over several sessions covering parked, stop-and-go, and highway driving conditions. Experiments show that the method typically reduces the recognition word error rate by an order of magnitude. The method makes it possible to obtain high performance for speaker-independent recognition in changing noisy environments without collecting any noisy speech for training.
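The per-utterance mean adaptation described above can be illustrated with a minimal sketch. It assumes the log-spectral mismatch function commonly used for this kind of compensation, in which the channel adds to the clean-speech mean in the log domain and the noise adds in the linear domain; the abstract does not state the exact formula. The function names `adapt_mean` and `estimate_noise` are hypothetical, and averaging the pause frames is only one plausible way to form the noise estimate. The EM channel estimate is not sketched here because the abstract gives no detail on it.

```python
import math
from typing import List, Sequence


def adapt_mean(clean_mean: Sequence[float],
               channel: Sequence[float],
               noise: Sequence[float]) -> List[float]:
    """Adapt one clean-speech log-spectral mean vector to the environment.

    Assumed mismatch function (an assumption, not quoted from the paper):
        adapted[k] = log(exp(clean_mean[k] + channel[k]) + exp(noise[k]))
    """
    if not (len(clean_mean) == len(channel) == len(noise)):
        raise ValueError("all vectors must have the same length")
    adapted = []
    for m, h, n in zip(clean_mean, channel, noise):
        speech = m + h            # convolutive distortion is additive in the log domain
        hi = max(speech, n)
        lo = min(speech, n)
        # Stable log(exp(speech) + exp(n)): factor out the larger term.
        adapted.append(hi + math.log1p(math.exp(lo - hi)))
    return adapted


def estimate_noise(pause_frames: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical noise estimate: mean log-spectrum of the pre-utterance pause."""
    if not pause_frames:
        raise ValueError("need at least one pause frame")
    dim = len(pause_frames[0])
    count = len(pause_frames)
    return [sum(frame[k] for frame in pause_frames) / count for k in range(dim)]


if __name__ == "__main__":
    # Toy two-bin example with made-up log-energies.
    noise = estimate_noise([[1.0, 2.0], [3.0, 4.0]])
    channel = [0.5, -0.5]         # placeholder for the EM channel estimate
    print(adapt_mean([6.0, 1.0], channel, noise))
```

Under this assumed model, a bin where speech dominates keeps a mean close to the clean mean plus the channel term, while a bin where noise dominates is pulled toward the noise estimate. This matches the intent of compensating both distortions in the model without any noisy training data.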