This work presents a fusion scheme for phone duration models (PDMs). Specifically, a support vector regression (SVR)-fusion model is fed with the predictions of a group of independent PDMs operating in parallel. The American-English KED TIMIT and the Greek WCL-1 databases are used to evaluate the PDMs and the fusion scheme. The fusion scheme improves accuracy over the best individual model, achieving relative reductions of the mean absolute error (MAE) and the root mean square error (RMSE) of 1.9% and 2.0%, respectively, on KED TIMIT, and of 2.6% and 1.8%, respectively, on WCL-1. Moreover, to evaluate the impact of this accuracy improvement on synthetic speech, a perceptual evaluation test was performed. This test showed that the accuracy improvement achieved by the SVR-fusion would contribute to improving the naturalness of synthetic speech.
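The relative error reductions quoted above can be read as the percentage drop in MAE or RMSE when moving from the best individual PDM to the fused prediction. The following is a minimal sketch of that arithmetic, assuming the standard definitions of MAE and RMSE; the function names and the duration values are hypothetical and chosen only for illustration, not taken from the paper or its databases.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between reference and predicted durations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted durations."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def relative_reduction(baseline_error, new_error):
    """Relative error reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical phone durations in milliseconds (illustrative values only,
# NOT data from KED TIMIT or WCL-1).
reference = [80.0, 120.0, 60.0, 100.0]
best_single = [90.0, 110.0, 70.0, 90.0]   # assumed best individual PDM
fused = [85.0, 115.0, 65.0, 95.0]         # assumed fusion-model output

mae_gain = relative_reduction(mae(reference, best_single), mae(reference, fused))
rmse_gain = relative_reduction(rmse(reference, best_single), rmse(reference, fused))
print(f"Relative MAE reduction:  {mae_gain:.1f}%")
print(f"Relative RMSE reduction: {rmse_gain:.1f}%")
```

The toy values produce a far larger reduction than the figures reported in the abstract; they only demonstrate how the percentages are formed.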
Published in:
Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on
Date of Conference: 22-27 May 2011