This paper describes an approach to optimizing the nonlinear component of a physiologically motivated feature extraction system for automatic speech recognition. Most computational models of the peripheral auditory system include a sigmoidal nonlinear function that relates the log of signal intensity to output level, which we represent by a set of frequency-dependent logistic functions. The parameters of these rate-level functions are estimated to maximize the a posteriori probability of the correct class in the training data. The performance of this approach was verified in a series of experiments conducted with the CMU Sphinx-III speech recognition system on the DARPA Resource Management and Wall Street Journal databases and on the AURORA 2 database. In general, feature extraction that incorporates the learned rate-nonlinearity, combined with a complementary loudness compensation function, yields better recognition accuracy in the presence of background noise than traditional MFCC feature extraction without the optimized nonlinearity when the system is trained on clean speech and tested in noise. We also describe the use of a lattice structure that constrains the training process, enabling training with much more complicated acoustic models.
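As a rough illustration of the frequency-dependent logistic rate-level functions mentioned above, the following minimal sketch applies one logistic curve per frequency channel to a frame of log spectral energies. The function names, the `slope`/`offset`/`max_rate` parameterization, and the example values are assumptions made for illustration only; they are not taken from the paper, and the estimation of the parameters by maximizing the a posteriori probability of the correct class is not shown.

```python
import math


def rate_level(log_energy, slope, offset, max_rate=1.0):
    """Hypothetical logistic rate-level function.

    Maps the log of signal intensity to a bounded output level.
    The (slope, offset, max_rate) parameterization is an assumption.
    """
    z = slope * log_energy + offset
    # Numerically stable logistic: avoid overflow in exp() for large |z|.
    if z >= 0:
        return max_rate / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return max_rate * ez / (1.0 + ez)


def apply_nonlinearity(log_spectrum, slopes, offsets):
    """Apply a separate logistic function to each frequency channel."""
    if not (len(log_spectrum) == len(slopes) == len(offsets)):
        raise ValueError("one slope and one offset are needed per channel")
    return [
        rate_level(x, a, b)
        for x, a, b in zip(log_spectrum, slopes, offsets)
    ]


# Example frame with three channels (values are made up for illustration).
frame = [-2.0, 0.0, 3.0]
slopes = [1.0, 0.5, 2.0]
offsets = [0.0, 0.0, -1.0]
levels = apply_nonlinearity(frame, slopes, offsets)
```

In a learned version of this sketch, `slopes` and `offsets` would be the trainable per-channel parameters, while the fixed values above only show the shape of the mapping.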
Published in: IEEE Transactions on Audio, Speech, and Language Processing (Volume: 20, Issue: 3)
Date of Publication: March 2012