Toward Spontaneous Speech Synthesis: Utilizing Language Model Information in TTS

Authors: Werner, S. (Tech. Univ. Dresden, Germany); Eichner, Matthias; Wolff, Matthias; Hoffmann, R.

State-of-the-art speech synthesis systems achieve a high overall quality. However, synthesized speech still lacks naturalness. To produce more natural and colloquial synthetic speech, our research focuses on integrating effects present in spontaneous speech. Conventional speech synthesis systems do not consider the probability of a word in its context. Recent investigations on corpora of natural speech showed that words that are very likely to occur in a given context are pronounced less accurately and faster than improbable ones. In this paper, three approaches are introduced to model this effect found in spontaneous speech. The first algorithm changes the speaking rate directly by shortening or lengthening the syllables of a word depending on the language model probability of that word. Since probable words are pronounced not only faster but also less accurately, this approach was extended by selecting appropriate pronunciation variants of a word according to the language model probability. This second algorithm changes the local speaking rate indirectly by controlling the grapheme-phoneme conversion. In a third stage, a pronunciation sequence model was used to select the appropriate variants according to their sequence probability. In listening experiments, test participants were asked to rate the synthesized speech in the categories "colloquial impression" and "naturalness". Our approaches achieved a significant improvement in the category "colloquial impression". However, no significantly higher naturalness could be observed. The observed effects are discussed in detail.
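
The first two approaches in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the add-one smoothed bigram model, the logarithmic mapping from probability to a duration multiplier, and every name and parameter value (`pivot`, `strength`, `lo`, `hi`, `threshold`) are assumptions chosen only to show the idea that probable words are shortened and may receive a reduced pronunciation variant.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences (hypothetical toy model)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def bigram_prob(word, prev, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def duration_scale(prob, pivot=0.1, strength=0.1, lo=0.8, hi=1.2):
    """Map a word probability to a syllable duration multiplier.

    Assumed mapping: probabilities above `pivot` give a factor below 1.0
    (faster speech), probabilities below it give a factor above 1.0.
    The result is clamped to [lo, hi].
    """
    scale = 1.0 - strength * math.log10(prob / pivot)
    return max(lo, min(hi, scale))


def choose_variant(prob, canonical, reduced, threshold=0.1):
    """Pick the reduced pronunciation variant for sufficiently probable words."""
    return reduced if prob >= threshold else canonical


corpus = [
    ["i", "do", "not", "know"],
    ["i", "do", "not", "care"],
    ["you", "do", "not", "know"],
]
unigrams, bigrams = train_bigram(corpus)

p_likely = bigram_prob("not", "do", unigrams, bigrams)     # seen in every sentence
p_unlikely = bigram_prob("care", "do", unigrams, bigrams)  # never follows "do"

print(duration_scale(p_likely), choose_variant(p_likely, "n O t", "n t"))
print(duration_scale(p_unlikely), choose_variant(p_unlikely, "k E: r", "k r"))
```

In a real system the multiplier would be applied to the syllable durations produced by the prosody module, and the variant inventory would come from a pronunciation lexicon; the phoneme strings used here are placeholders.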

Published in:

IEEE Transactions on Speech and Audio Processing (Volume: 12, Issue: 4)