Skip to Main Content
For large vocabulary continuous speech recognition systems, the amount of acoustic training data is of crucial importance. In the past, large amounts of speech were thus recorded from various sources and had to be transcribed manually. It is thus desirable to train a recognizer with as little manually transcribed acoustic data as possible. Since untranscribed speech is available in various forms nowadays, the unsupervised training of a speech recognizer on recognized transcriptions is studied in this paper. A low-cost recognizer trained with between one and six h of manually transcribed speech is used to recognize 72 h of untranscribed acoustic data. These transcriptions are then used in combination with a confidence measure to train an improved recognizer. The effect of the confidence measure which is used to detect possible recognition errors is studied systematically. Finally, the unsupervised training is applied iteratively. Starting with only one h of transcribed acoustic data, a recognition system is trained fully automatically. With this iterative training procedure, the word error rates are reduced from 71.3% to 38.3% on the Broadcast News'96 evaluation test set and from 65.6% to 29.3% on the Broadcast News'98 evaluation test set. In comparison with an optimized system trained with the manually generated transcriptions of the complete 72 h training corpus, the word error rates increase by 14.3% relative and 18.6% relative, respectively.