Skip to Main Content
The level of quality that can be attained in concatenative text-to-speech (TTS) synthesis is primarily governed by the inventory of units used in unit selection. This has led to the collection of ever larger corpora in the quest for ever more natural synthetic speech. As operational considerations limit the size of the unit inventory, however, pruning is critical to removing any instances that prove either spurious or superfluous. This paper proposes a novel pruning strategy based on a data-driven feature extraction framework separately optimized for each unit type in the inventory. A single distinctiveness/redundancy measure can then address, in a consistent manner, the two different problems of outliers and redundant units. Detailed analysis of an illustrative case study exemplifies the typical behavior of the resulting unit pruning procedure, and listening evidence suggests that both moderate and aggressive inventory pruning can be achieved with minimal degradation in perceived TTS quality. These experiments underscore the benefits of unit-centric feature mapping for database optimization in concatenative synthesis.
Audio, Speech, and Language Processing, IEEE Transactions on (Volume:16 , Issue: 1 )
Date of Publication: Jan. 2008