Skip to Main Content
This paper proposes a novel feature-space VTLN (vocal tract length normalization) method that models frequency warping as a linear interpolation of contiguous Mel filter-bank energies. The presented technique aims to reduce the distortion in the Mel filter-bank energy estimation due to the harmonic composition of voiced speech intervals and DFT (discrete Fourier transform) sampling when the central frequency of band-pass filters is shifted. This paper also proposes an analytical maximum likelihood (ML) method to estimate the optimal warping factor in the cepstral space. The presented interpolated filter-bank energy-based VTLN leads to relative reductions in WER (word error rate) as high as 11.2% and 7.6% when compared with the baseline system and standard VTLN, respectively, in a medium-vocabulary continuous speech recognition task. Also, the proposed VTLN scheme can provide significant reductions in WER when compared with state-of-the-art VTLN methods based on linear transforms in the cepstral feature-space. The warping factor estimated with the proposed VTLN approach shows more dependence on the speaker and more independence of the acoustic-phonetic content than the warping factor resulting from standard and state-of-the-art VTLN methods. Finally, the analytical ML-based optimization scheme presented here achieves almost the same reductions in WER as the ML grid search version of the technique with a computational load 20 times lower.
Audio, Speech, and Language Processing, IEEE Transactions on (Volume:21 , Issue: 1 )
Date of Publication: Jan. 2013