I. Introduction
Prosody refers to the patterns of rhythm, intonation, and lexical stress that characterize human speech. These patterns are conveyed through modulation of the fundamental frequency (F0) contour, selective emphasis on certain words or syllables, and durational cues such as vowel lengthening, pauses, and hesitations. The acoustic correlates of prosody are referred to as suprasegmental cues, since they are usually associated with linguistic units larger than phonemes, such as syllables, words, phrases, or even entire utterances. Traditional speech processing and automatic speech recognition (ASR) systems, by contrast, typically operate at the segmental level and ignore suprasegmental information. Although the prosody of an utterance conveys information complementary to the segment-level spectral features used in many ASR systems, it is often difficult to exploit precisely because of its suprasegmental nature. Moreover, the acoustic correlates of prosody vary widely with a number of factors, including context and the speaker's emotional state. The mapping between these correlates and linguistic elements (typically words) is language specific and, for American English, tenuous at best. As a result, prosody is difficult to integrate into spoken language systems except in ad hoc ways for very specific applications.