Skip to Main Content
This paper presents an efficient method for combining models of speech and noise for robust speech recognition applications in noisy environments. This method decomposes the cepstrum domain representation of noise-corrupted speech into clean speech cepstrum and background noise cepstrum components using a minimum mean squared error-log spectral amplitude (MMSE-LSA) criterion. Speech recognition is then performed on noisy cepstrum domain observations using a model that is formed by parallel combination of cepstrum domain clean speech distributions and background noise distributions estimated using this MMSE-LSA based noise decomposition. This method is far more efficient than other parallel model combination (PMC) procedures because model combination is performed directly in the cepstrum domain rather than in the linear spectral domain. Whereas background noise model estimation is addressed as a separate issue in existing PMC procedures, this method explicitly incorporates a mechanism to continually update background noise models and signal-to-noise ratio (SNR) estimates over time. The performance of the proposed cepstrum-domain model combination method is compared with a well known implementation of PMC which uses a log-normal approximation when combining speech and background noise model means and variances on a connected digit string recognition task which is subjected to mismatched channel and environment conditions. As a result, it is shown that the proposed model combination technique gives a word error rate that is comparable to PMC when background noise information and SNR are known prior to estimation. The paper will also present the results of experiments where a combination of cepstrum-domain feature compensation and model combination are applied to this task.