Most statistical speech recognition systems make use of segment-level features, derived mainly from spectral envelope characteristics of the signal, but ignore supra-segmental cues that carry additional information likely to be useful for speech recognition. These cues, which constitute the prosody of the utterance and occur at the syllable, word and utterance levels, are closely related to the lexical and syntactic organization of the utterance. In this paper, we explore the use of acoustic and lexical correlates of a subset of these cues to improve recognition performance on a read-speech corpus, using word error rate (WER) as the metric. Using the features and methods described in this paper, we obtained a relative WER improvement of 1.3% over a baseline ASR system on the Boston University Radio News Corpus.
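
For readers unfamiliar with the metric, the following minimal sketch shows how WER and a relative WER improvement are commonly computed. It assumes the standard definition of WER as word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The function names `word_error_rate` and `relative_improvement` and all example values are illustrative assumptions; they are not taken from the paper or its scoring setup.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumes whitespace tokenization, which is a simplification; real
    scoring tools usually also normalize case and punctuation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical sentences, not drawn from any corpus.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Hypothetical WER values, chosen only to show the arithmetic.
    print(relative_improvement(0.20, 0.19))
```

A relative improvement is measured against the baseline WER rather than in absolute percentage points, so the same relative figure corresponds to a smaller absolute change when the baseline error rate is low.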