In HMM-based speech synthesis, complex context-dependent models are typically used to characterize prosodically and linguistically rich speech units, so it is difficult to prepare training data that covers every combination of contexts. A common way to cope with this data sparsity is to cluster the models with a decision tree built under the minimum description length (MDL) criterion. However, an MDL-based tree still tends to predict unseen data inadequately. In this paper, we adopt the cross-validation principle to build the decision tree so as to minimize the generation error on unseen contexts. An efficient training algorithm is implemented by exploiting sufficient statistics. Experimental results show that the proposed method achieves better speech synthesis results, both objectively and subjectively, than the baseline MDL-based decision tree.
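The role of sufficient statistics in a cross-validated split criterion can be illustrated with a minimal sketch. It is not the algorithm of the paper: it assumes scalar observations, one Gaussian per tree node, K-fold cross-validation, and a held-out log-likelihood gain as the split score. All names (`Stats`, `cv_split_gain`, `VAR_FLOOR`) and the toy data are hypothetical. The point is that once each fold's count, sum, and sum of squares are accumulated, the statistics of the other K-1 folds follow by subtraction, so no fold needs to be revisited.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    """Sufficient statistics of scalar data: count, sum, sum of squares."""
    n: float = 0.0
    s1: float = 0.0
    s2: float = 0.0

    def __add__(self, other):
        return Stats(self.n + other.n, self.s1 + other.s1, self.s2 + other.s2)

    def __sub__(self, other):
        return Stats(self.n - other.n, self.s1 - other.s1, self.s2 - other.s2)


VAR_FLOOR = 1e-4  # assumed variance floor, not a value from the paper


def accumulate(values):
    """Collect the sufficient statistics of a list of scalar observations."""
    return Stats(len(values), sum(values), sum(v * v for v in values))


def heldout_loglik(train, test):
    """Log-likelihood of the `test` statistics under a Gaussian fit to `train`."""
    if train.n <= 0:
        return float("-inf") if test.n > 0 else 0.0
    mean = train.s1 / train.n
    var = max(train.s2 / train.n - mean * mean, VAR_FLOOR)
    # Sum of (x - mean)^2 over the held-out data, from its statistics alone.
    sq_err = test.s2 - 2.0 * mean * test.s1 + test.n * mean * mean
    return -0.5 * (test.n * math.log(2.0 * math.pi * var) + sq_err / var)


def cv_loglik(fold_stats):
    """Each fold is scored by a model trained on all remaining folds."""
    total = Stats()
    for st in fold_stats:
        total = total + st
    return sum(heldout_loglik(total - st, st) for st in fold_stats)


def cv_split_gain(folds, question):
    """Held-out log-likelihood gain of splitting a node by a yes/no question.

    folds: list of folds, each a list of (context, value) pairs.
    question: function mapping a context to True or False.
    """
    parent = [accumulate([v for _, v in f]) for f in folds]
    yes = [accumulate([v for c, v in f if question(c)]) for f in folds]
    no = [accumulate([v for c, v in f if not question(c)]) for f in folds]
    return cv_loglik(yes) + cv_loglik(no) - cv_loglik(parent)


# Toy data: the letter of the context determines the value, the digit does not.
folds = [
    [("a1", 0.0), ("a2", 1.0), ("b1", 10.0), ("b2", 11.0)],
    [("a2", 0.5), ("a1", 1.5), ("b2", 10.5), ("b1", 11.5)],
]
useful_gain = cv_split_gain(folds, lambda c: c.startswith("a"))
useless_gain = cv_split_gain(folds, lambda c: c.endswith("1"))
```

On this toy data the informative question yields a positive gain and the uninformative one a negative gain. Under the assumptions above, a node would be split only while the best question has positive gain, so the stopping rule comes from held-out data rather than from a description-length penalty.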
Published in: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 14-19 March 2010