
Sparse Inverse Covariance Matrices for Low Resource Speech Recognition



Authors: Weibin Zhang and Pascale Fung, Department of Electronic & Computer Engineering, Hong Kong University of Science & Technology, Hong Kong, China

We propose to use sparse inverse covariance matrices for acoustic model training when training data are insufficient. Acoustic models trained with inadequate data tend to overfit and generalize poorly to unseen test data, especially when full covariance matrices are used. We address this problem by adding an L1 regularization term to the traditional maximum likelihood objective function to penalize complex models. Under this new objective, the structure of the inverse covariance matrices is automatically sparsified. The Expectation-Maximization algorithm is used to learn the parameters of the hidden Markov model under the new objective. We show that the training procedures for all hidden Markov model parameters are the same as in maximum likelihood estimation, except for the inverse covariance matrices, whose update problem is concave and can be solved efficiently. Our experiments show that the proposed method correctly learns the underlying correlations among the components of the speech feature vector. Experimental results on the Wall Street Journal data show that the proposed model significantly outperforms the diagonal covariance model and the full covariance model, by 10.9% and 16.5% relative recognition accuracy respectively, when only about 14 hours of training data are available. On our collected low-resource Cantonese data set, the proposed model also significantly outperforms both the diagonal and full covariance models.
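The paper's EM updates are not reproduced here, but the core idea — adding an L1 penalty on the inverse covariance (precision) matrix so that weak partial correlations are driven to exactly zero — can be sketched with a simple proximal-gradient solver for the penalized Gaussian log-likelihood. This is an illustrative stand-in for the authors' concave per-state update, not their algorithm; all function names and parameters below are assumptions for the sketch:

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding, the proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_precision(S, lam=0.1, step=0.05, iters=500):
    """Estimate a sparse inverse covariance matrix P by proximal gradient
    ascent on the L1-penalized Gaussian log-likelihood
        log det(P) - tr(S @ P) - lam * sum_{i != j} |P[i, j]|,
    where S is the sample covariance. Only off-diagonal entries are
    penalized, so the variances themselves are not shrunk.
    (Hypothetical helper, not the paper's EM update.)"""
    d = S.shape[0]
    P = np.linalg.inv(S + 1e-3 * np.eye(d))      # warm start near the MLE
    for _ in range(iters):
        grad = np.linalg.inv(P) - S              # gradient of log det(P) - tr(S P)
        P = P + step * grad                      # ascent step
        T = soft_threshold(P, step * lam)        # prox step: sparsifies entries
        np.fill_diagonal(T, np.diag(P))          # leave the diagonal unpenalized
        P = 0.5 * (T + T.T)                      # keep the estimate symmetric
        w, V = np.linalg.eigh(P)                 # project back to positive definite
        P = (V * np.maximum(w, 1e-6)) @ V.T
    return P

# Demo: data from a chain-structured Gaussian, whose true precision matrix
# is tridiagonal even though its covariance is dense.
rng = np.random.default_rng(0)
d = 4
P_true = 2.0 * np.eye(d)
for i in range(d - 1):
    P_true[i, i + 1] = P_true[i + 1, i] = -0.8
X = rng.multivariate_normal(np.zeros(d), np.linalg.inv(P_true), size=200)
S = np.cov(X, rowvar=False)

P_hat = sparse_precision(S, lam=0.2)
```

With limited data, the unpenalized inverse sample covariance fills every cell with noise, while the penalized estimate shrinks the spurious off-diagonal entries toward zero — the same intuition as the paper's regularized full-covariance training.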

Published in: IEEE Transactions on Audio, Speech, and Language Processing (Volume: 21, Issue: 3)