By Topic

A speaker-independent continuous speech recognition system using continuous mixture Gaussian density HMM of phoneme-sized units

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

1 Author(s)
Zhao, Y. ; Panasonic Technol. Inc., Santa Barbara, CA, USA

The author describes a large vocabulary, speaker-independent, continuous speech recognition system which is based on hidden Markov modeling (HMM) of phoneme-sized acoustic units using continuous mixture Gaussian densities. A bottom-up merging algorithm is developed for estimating the parameters of the mixture Gaussian densities, where the resultant number of mixture components is proportional to both the sample size and dispersion of training data. A compression procedure is developed to construct a word transcription dictionary from the acoustic-phonetic labels of sentence utterances. A modified word-pair grammar using context-sensitive grammatical parts is incorporated to constrain task difficulty. The Viterbi beam search is used for decoding. The segmental K-means algorithm is implemented as a baseline for evaluating the bottom-up merging technique. The system has been evaluated on the TIMIT database (1990) for a vocabulary size of 853. For test set perplexities of 24, 104, and 853, the decoding word accuracies are 90.9%, 86.0%, and 62.9%, respectively. For the perplexity of 104, the decoding accuracy achieved by using the merging algorithm is 4.1% higher than that using the segmental K-means (22.8% error reduction), and the decoding accuracy using the compressed dictionary is 3.0% higher than that using a standard dictionary (18.1% error reduction)

Published in:

Speech and Audio Processing, IEEE Transactions on  (Volume:1 ,  Issue: 3 )