Hierarchical Prosody Conversion Using Regression-Based Clustering for Emotional Speech Synthesis

Authors: Chung-Hsien Wu (Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.); Chi-Chun Hsia; Chung-Han Lee; Mai-Chun Lin

This paper presents an approach to hierarchical prosody conversion for emotional speech synthesis. The pitch contour of the source speech is decomposed into a hierarchical prosodic structure consisting of sentence, prosodic word, and subsyllable levels. At each higher level, the pitch contour is encoded by discrete Legendre polynomial coefficients. The residual, that is, the difference between the source pitch contour and the pitch contour decoded from the discrete Legendre polynomial coefficients, is then used for pitch modeling at the next lower level. For prosody conversion, Gaussian mixture models (GMMs) are used at the sentence and prosodic word levels. At the subsyllable level, the pitch feature vectors are clustered via a proposed regression-based clustering method to generate a set of candidate prosody conversion functions. Linguistic and symbolic prosody features of the source speech are then used by a classification and regression tree to select the most suitable conversion function. Three small-sized emotional parallel speech databases, one each for happy, angry, and sad emotions, were designed and collected for training and evaluation. In objective and subjective evaluations, the hierarchical prosodic structure combined with the proposed regression-based clustering method achieved improved performance compared with the GMM-based method for prosody conversion.
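The encode-then-model-the-residual step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a least-squares fit of ordinary Legendre polynomials sampled on [-1, 1] (via NumPy) in place of the paper's discrete Legendre polynomials, it shows only two levels rather than three, and the function names, the polynomial order, and the segment boundaries are all hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre as leg


def encode_contour(f0, order=3):
    """Fit Legendre coefficients to a pitch contour (least squares).

    Assumption: frames are mapped uniformly onto [-1, 1]; the paper's
    discrete Legendre basis may differ in normalization.
    """
    x = np.linspace(-1.0, 1.0, len(f0))
    return leg.legfit(x, f0, order)


def decode_contour(coeffs, n_frames):
    """Reconstruct a contour of n_frames from Legendre coefficients."""
    x = np.linspace(-1.0, 1.0, n_frames)
    return leg.legval(x, coeffs)


def hierarchical_encode(f0, segments, order=3):
    """Encode the whole contour, then encode the residual per segment.

    `segments` is a hypothetical list of (start, end) frame indices
    standing in for lower-level units such as prosodic words.
    """
    f0 = np.asarray(f0, dtype=float)
    top_coeffs = encode_contour(f0, order)
    # Residual = source contour minus the contour decoded at the higher level.
    residual = f0 - decode_contour(top_coeffs, len(f0))
    seg_coeffs = [encode_contour(residual[s:e], order) for s, e in segments]
    return top_coeffs, residual, seg_coeffs


if __name__ == "__main__":
    # Synthetic contour in Hz: a slow declination plus a local ripple.
    x = np.linspace(-1.0, 1.0, 120)
    f0 = 200.0 - 25.0 * x + 8.0 * np.sin(6.0 * np.pi * x)
    top, res, segs = hierarchical_encode(f0, [(0, 40), (40, 80), (80, 120)])
    print("sentence-level coefficients:", np.round(top, 2))
    print("residual RMS (Hz):", round(float(np.sqrt(np.mean(res ** 2))), 2))
```

In this sketch the higher level captures the slow trend of the contour, and the lower-level coefficients describe only what the higher level failed to explain. Repeating the same step on each segment's residual would give a third level.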

Published in:

IEEE Transactions on Audio, Speech, and Language Processing (Volume: 18, Issue: 6)