In this paper, we propose a novel approach to automatic speech segmentation for unit-selection-based text-to-speech systems. Instead of relying on a single automatic segmentation machine (ASM), we use multiple independent ASMs to produce a final boundary time-mark. Specifically, given multiple boundary time-marks provided by separate ASMs, we first compensate for the potential ASM-specific, context-dependent systematic error (or bias) of each time-mark and then compute the weighted sum of the bias-removed time-marks to yield the final time-mark. The bias and weight parameters required by the proposed method are obtained beforehand for each phonetic context (e.g., /p/-/a/) through a training procedure in which manual segmentations serve as references. For training, we first define a cost function that quantifies the discrepancy between the automatic and manual segmentations (i.e., the error) and then minimize the sum of costs with respect to the bias and weight parameters. When a squared error is used as the cost, the bias parameters are obtained simply by averaging the errors within each phonetic context; then, with the bias parameters fixed, the weight parameters are jointly optimized by a gradient projection method, which handles the constraints imposed on the weight parameter space. A decision tree that clusters all phonetic contexts is used to handle unseen phonetic contexts. Our experimental results indicate that the proposed method improves the percentage of boundaries deviating less than 20 ms from the reference boundary from 95.06% with an HMM-based procedure and 96.85% with a previous multiple-model-based procedure to 97.07%.
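The fusion scheme described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: it assumes, for a single phonetic context, that per-ASM biases are estimated as the mean signed error against manual references, that the weights are constrained to be nonnegative and sum to one, and that they are fitted by projected gradient descent on the squared error (a standard Euclidean projection onto the probability simplex stands in for the paper's gradient projection method). All function names and data are hypothetical.

```python
import numpy as np

def estimate_biases(auto_marks, manual_marks):
    # auto_marks: (N, M) time-marks from M ASMs for N boundaries of ONE
    # phonetic context; manual_marks: (N,) reference time-marks.
    # Bias of each ASM = mean signed error for this context.
    return (auto_marks - manual_marks[:, None]).mean(axis=0)

def project_to_simplex(w):
    # Euclidean projection onto {w : w >= 0, sum(w) = 1}.
    u = np.sort(w)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(w)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(w + theta, 0.0)

def train_weights(auto_marks, manual_marks, biases, lr=0.5, iters=2000):
    # Minimize the mean squared error of the fused time-mark with respect
    # to the weights, keeping them on the simplex via projected gradient.
    debiased = auto_marks - biases                      # (N, M)
    m = auto_marks.shape[1]
    w = np.full(m, 1.0 / m)                             # uniform start
    for _ in range(iters):
        err = debiased @ w - manual_marks               # (N,)
        grad = debiased.T @ err / len(err)
        w = project_to_simplex(w - lr * grad)
    return w

def fuse(marks, biases, weights):
    # Final boundary time-mark: weighted sum of bias-removed ASM outputs.
    return float((marks - biases) @ weights)
```

In practice the paper estimates one bias/weight set per phonetic context and backs off to decision-tree clusters for contexts unseen in training; the sketch above shows only the per-context estimation and fusion step.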