By Topic

Cepstrum-Domain Model Combination Based on Decomposition of Speech and Noise Using MMSE-LSA for ASR in Noisy Environments

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

2 Author(s)
Hong Kook Kim ; Dept. of Inf. & Commun., Gwangju Inst. of Sci. & Technol., Gwangju ; Rose, R.C.

This paper presents an efficient method for combining models of speech and noise for robust speech recognition applications in noisy environments. This method decomposes the cepstrum domain representation of noise-corrupted speech into clean speech cepstrum and background noise cepstrum components using a minimum mean squared error-log spectral amplitude (MMSE-LSA) criterion. Speech recognition is then performed on noisy cepstrum domain observations using a model that is formed by parallel combination of cepstrum domain clean speech distributions and background noise distributions estimated using this MMSE-LSA based noise decomposition. This method is far more efficient than other parallel model combination (PMC) procedures because model combination is performed directly in the cepstrum domain rather than in the linear spectral domain. Whereas background noise model estimation is addressed as a separate issue in existing PMC procedures, this method explicitly incorporates a mechanism to continually update background noise models and signal-to-noise ratio (SNR) estimates over time. The performance of the proposed cepstrum-domain model combination method is compared with a well known implementation of PMC which uses a log-normal approximation when combining speech and background noise model means and variances on a connected digit string recognition task which is subjected to mismatched channel and environment conditions. As a result, it is shown that the proposed model combination technique gives a word error rate that is comparable to PMC when background noise information and SNR are known prior to estimation. The paper will also present the results of experiments where a combination of cepstrum-domain feature compensation and model combination are applied to this task.

Published in:

Audio, Speech, and Language Processing, IEEE Transactions on  (Volume:17 ,  Issue: 4 )