Skip to Main Content
Modern automatic speech recognition systems handle large vocabularies of words, making it infeasible to collect enough repetitions of each word to train individual word models. Instead, large-vocabulary recognizers represent each word in terms of subword units. Typically the subword unit is the phone, a basic speech sound such as a single consonant or vowel. Each word is then represented as a sequence, or several alternative sequences, of phones specified in a pronunciation dictionary. Other choices of subword units have been studied as well. The choice of subword units, and the way in which the recognizer represents words in terms of combinations of those units, is the problem of subword modeling. Different subword models may be preferable in different settings, such as high-variability conversational speech, high-noise conditions, low-resource settings, or multilingual speech recognition. This article reviews past, present, and emerging approaches to subword modeling. To make clean comparisons between many approaches, the review uses the unifying language of graphical models.