By Topic

Unsupervised training of acoustic models for large vocabulary continuous speech recognition

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$33 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

2 Author(s)
F. Wessel ; Lehrstuhl fur Informatik, Univ. of Technol., Aachen, Germany ; H. Ney

For large vocabulary continuous speech recognition systems, the amount of acoustic training data is of crucial importance. In the past, large amounts of speech were thus recorded from various sources and had to be transcribed manually. It is thus desirable to train a recognizer with as little manually transcribed acoustic data as possible. Since untranscribed speech is available in various forms nowadays, the unsupervised training of a speech recognizer on recognized transcriptions is studied in this paper. A low-cost recognizer trained with between one and six h of manually transcribed speech is used to recognize 72 h of untranscribed acoustic data. These transcriptions are then used in combination with a confidence measure to train an improved recognizer. The effect of the confidence measure which is used to detect possible recognition errors is studied systematically. Finally, the unsupervised training is applied iteratively. Starting with only one h of transcribed acoustic data, a recognition system is trained fully automatically. With this iterative training procedure, the word error rates are reduced from 71.3% to 38.3% on the Broadcast News'96 evaluation test set and from 65.6% to 29.3% on the Broadcast News'98 evaluation test set. In comparison with an optimized system trained with the manually generated transcriptions of the complete 72 h training corpus, the word error rates increase by 14.3% relative and 18.6% relative, respectively.

Published in:

IEEE Transactions on Speech and Audio Processing  (Volume:13 ,  Issue: 1 )