On Using Multiple Models for Automatic Speech Segmentation

2 Author(s)
Seung Seop Park and Nam Soo Kim (School of Electrical Engineering & INMC, Seoul National University, Seoul)

In this paper, we propose a novel approach to automatic speech segmentation for unit-selection based text-to-speech systems. Instead of using a single automatic segmentation machine (ASM), we make use of multiple independent ASMs to produce a final boundary time-mark. Specifically, given multiple boundary time-marks provided by separate ASMs, we first compensate for the potential ASM-specific context-dependent systematic error (or bias) of each time-mark and then compute the weighted sum of the bias-removed time-marks, yielding the final time-mark. The bias and weight parameters required for the proposed method are obtained beforehand for each phonetic context (e.g., /p/-/a/) through a training procedure in which manual segmentations serve as the references. For the training procedure, we first define a cost function that quantifies the discrepancy between the automatic and manual segmentations (the error) and then minimize the sum of costs with respect to the bias and weight parameters. When a squared error is used for the cost, the bias parameters are easily obtained by averaging the errors of each phonetic context; then, with the bias parameters fixed, the weight parameters are simultaneously optimized through a gradient projection method, which is adopted to handle the constraints imposed on the weight parameter space. A decision tree that clusters all the phonetic contexts is utilized to deal with unseen phonetic contexts. Our experimental results indicate that the proposed method improves the percentage of boundaries that deviate by less than 20 ms from the reference boundary, from 95.06% with an HMM-based procedure and 96.85% with a previous multiple-model based procedure to 97.07%.
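The fusion and training steps described above can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' implementation: the function names are invented, a squared-error cost is assumed, and the simplex projection used here (weights nonnegative and summing to one) is one common choice of constraint set that the paper's gradient projection step may or may not match exactly.

```python
import numpy as np

def combine_boundaries(marks, biases, weights):
    """Fuse boundary time-marks from K independent ASMs:
    subtract each ASM's context-dependent bias, then take a
    weighted sum of the bias-removed time-marks."""
    corrected = np.asarray(marks) - np.asarray(biases)
    return float(np.dot(weights, corrected))

def project_to_simplex(v):
    """Euclidean projection onto the probability simplex
    (illustrative constraint set: w >= 0, sum(w) = 1)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def train_parameters(auto_marks, manual_marks, steps=200, lr=0.05):
    """Estimate biases and weights for one phonetic context
    under a squared-error cost.

    auto_marks:   (N, K) boundaries from K ASMs on N training boundaries
    manual_marks: (N,)   reference boundaries from manual segmentation
    """
    errors = auto_marks - manual_marks[:, None]   # (N, K) per-ASM errors
    biases = errors.mean(axis=0)                  # mean error = bias per ASM
    X = auto_marks - biases                       # bias-removed time-marks
    K = X.shape[1]
    w = np.full(K, 1.0 / K)                       # start from uniform weights
    for _ in range(steps):
        resid = X @ w - manual_marks              # fused-minus-manual residual
        grad = X.T @ resid / len(manual_marks)    # gradient of mean squared cost
        w = project_to_simplex(w - lr * grad)     # gradient step, then project
    return biases, w
```

In a full system these parameters would be estimated per phonetic context (with a decision tree backing off for unseen contexts, as the abstract describes); the sketch covers a single context for clarity.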

Published in:

IEEE Transactions on Audio, Speech, and Language Processing (Volume: 15, Issue: 8)