Issue 1 • Jan. 2001
Why speech synthesis? (in memory of Prof. Jonathan Allen, 1934-2000) [Special issue introduction] Page(s): 1 - 2
This paper describes the application of the harmonic plus noise model (HNM) to concatenative text-to-speech (TTS) synthesis. In the context of HNM, speech signals are represented as a time-varying harmonic component plus a modulated noise component. The decomposition of a speech signal into these two components allows for more natural-sounding modifications of the signal (e.g., by using different and better adapted schemes to modify each component). The parametric representation of speech using HNM provides a straightforward way of smoothing discontinuities of acoustic units around concatenation points. Formal listening tests have shown that HNM provides high-quality speech synthesis while outperforming other models for synthesis (e.g., TD-PSOLA) in intelligibility, naturalness, and pleasantness.
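The HNM decomposition described above can be illustrated with a minimal synthesis sketch: a voiced frame is rebuilt as a sum of sinusoids at integer multiples of F0 plus a shaped noise component. All names and parameter values here are illustrative assumptions, not the paper's actual analysis/synthesis procedure.

```python
import numpy as np

def hnm_synthesize(f0, amps, noise_gain, fs=16000, duration=0.02):
    """Sketch of harmonic-plus-noise synthesis for one frame.

    f0         : fundamental frequency in Hz (assumed constant over the frame)
    amps       : amplitudes of harmonics 1..K (hypothetical analysis output)
    noise_gain : gain of the noise component (stands in for the paper's
                 modulated noise model)
    """
    t = np.arange(int(fs * duration)) / fs
    # Harmonic component: sum of sinusoids at integer multiples of f0.
    harmonic = sum(a * np.cos(2 * np.pi * (k + 1) * f0 * t)
                   for k, a in enumerate(amps))
    # Noise component: white noise with a fixed gain (a real HNM shapes and
    # time-modulates this component).
    rng = np.random.default_rng(0)
    noise = noise_gain * rng.standard_normal(t.size)
    return harmonic + noise

frame = hnm_synthesize(f0=120.0, amps=[1.0, 0.5, 0.25], noise_gain=0.05)
```

Because the two components are synthesized separately, each can be modified with a scheme adapted to it (e.g., pitch-shifting only the harmonic part), which is the property the abstract highlights.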
Current speech synthesis methods based on the concatenation of waveform units can produce highly intelligible speech capturing the identity of a particular speaker. However, the quality of concatenated speech often suffers from discontinuities between the acoustic units, due to contextual differences and variations in speaking style across the database. In this paper, we present methods to spectrally modify speech units in a concatenative synthesizer to correspond more closely to the acoustic transitions observed in natural speech. First, a technique called “unit fusion” is proposed to reduce spectral mismatch between units. In addition to concatenation units, a second, independent tier of units is selected that defines the desired spectral dynamics at concatenation points. Both unit tiers are “fused” to obtain natural transitions throughout the synthesized utterance. The unit fusion method is further extended to control the perceived degree of articulation of concatenated units. A signal processing technique based on sinusoidal modeling is also presented that enables high-quality resynthesis of units with a modified spectral shape.
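The smoothing problem this abstract addresses can be sketched in its simplest form: interpolating between the spectral envelopes of the two units across the frames around a join. Note this plain interpolation only stands in for the paper's unit fusion, which instead selects a second tier of units to define the target transition; the function and its parameters are assumptions for illustration.

```python
import numpy as np

def fuse_envelopes(env_left, env_right, n_frames):
    """Linearly interpolate between two spectral envelopes over n_frames,
    smoothing a concatenation point.  A placeholder for the paper's unit
    fusion, where the transition shape comes from a separate unit tier."""
    w = np.linspace(0.0, 1.0, n_frames)[:, None]   # per-frame blend weights
    return (1 - w) * env_left[None, :] + w * env_right[None, :]

# Toy envelopes with 4 spectral bins, blended over 3 frames.
env = fuse_envelopes(np.zeros(4), np.ones(4), n_frames=3)
```

The resulting frame-by-frame envelopes would then drive a sinusoidal resynthesis of the units with the modified spectral shape, as the abstract describes.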
A Japanese TTS system based on multiform units and a speech modification algorithm with harmonics reconstruction Page(s): 3 - 10
This paper proposes a new text-to-speech (TTS) system that utilizes large numbers of speech segments to produce very natural and intelligible synthetic speech. There are two innovations: new multiform synthesis units and a new speech modification algorithm based on a vocoder that offers harmonics reconstruction. The multiform units make it possible to reduce acoustic discontinuities and unnatural sound at concatenation points by preparing synthesis units with various lengths and various F0 contours. The new speech modification algorithm, on the other hand, improves the quality of prosody-modified speech. This algorithm is extremely effective in synthesizing speech whose prosodic parameters are quite different from those of the synthesis units. Listening tests confirm that the new synthesis units yield speech with high intelligibility and naturalness, and that the new speech modification algorithm is superior to conventional vocoders and waveform-domain algorithms, including TD-PSOLA, especially when modifying F0 upward.
The increasing availability of carefully designed and collected speech corpora opens up new possibilities for the statistical estimation of formal multivariate prosodic models. At Apple Computer, statistical prosodic modeling exploits the Victoria corpus, created to broadly support ongoing speech synthesis research and development. This corpus is composed of five constituent parts, each designed to cover a specific aspect of speech synthesis: polyphones, prosodic contexts, reiterant speech, function word sequences, and continuous speech. This paper focuses on the use of the Victoria corpus in the statistical estimation of duration and pitch models for Apple's next-generation text-to-speech system in Macintosh OS X. Duration modeling relies primarily on the subcorpus of prosodic contexts, which is instrumental in uncovering empirical evidence in favor of a piecewise linear transformation in the well-known sums-of-products approach. Pitch modeling relies primarily on the subcorpus of reiterant speech, which makes possible the optimization of superpositional pitch models with more accurate underlying smooth contours. Experimental results illustrate the improved prosodic representation resulting from these new duration and pitch models.
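The sums-of-products duration model mentioned above predicts a segment's duration as a sum of terms, each term a product of per-factor scale parameters. The sketch below shows that structure with made-up factors and parameter values; it omits the piecewise linear transformation that the paper argues for.

```python
import math

def sop_duration(features, terms):
    """Sums-of-products duration model (sketch):
    duration = sum over terms; each term multiplies the scale parameters
    of its factors, looked up from the segment's feature values."""
    return sum(math.prod(scales[features[factor]]
                         for factor, scales in term.items())
               for term in terms)

# Hypothetical parameters: a base duration per phone, scaled by stress.
terms = [{"phone": {"a": 80.0, "t": 40.0},
          "stress": {0: 1.0, 1: 1.3}}]
dur = sop_duration({"phone": "a", "stress": 1}, terms)   # 80.0 * 1.3
```

Fitting such a model amounts to estimating the per-factor scale tables from a corpus such as the prosodic-contexts subcorpus described in the abstract.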
One of the most successful approaches to synthesizing speech, concatenative synthesis, combines recorded speech units to build full utterances. However, the prosody of the stored units is often not consistent with that of the target utterance and must be altered. Furthermore, several types of mismatch can occur at unit boundaries and must be smoothed. Thus, both pitch and time-scale modification techniques as well as smoothing algorithms play a crucial role in such concatenation-based systems. In this paper, we describe novel approaches to each of these issues. First, we present a conceptually simple technique for pitch and time-scale modification of speech. Our method is based upon a harmonic coding of each speech frame, and operates entirely within the original sinusoidal model. Crucially, it makes no use of “pitch pulse onset times.” Instead, phase coherence, and thus shape invariance, is ensured by exploiting the harmonic relation existing between the sine waves used to code each analysis frame, so that their phases at each synthesis frame boundary are consistent with those derived during analysis. Second, a smoothing algorithm, aimed specifically at correcting phase mismatches at unit boundaries, is described. Results are presented showing our prosodic modification techniques to be highly suitable for use within a concatenative speech synthesizer.
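The phase-coherence idea in this abstract can be sketched directly: because harmonic k sits at frequency k·F0, its phase advances by 2πk·F0·Δt over a synthesis hop, so all harmonics can be kept mutually consistent without estimating pitch-pulse onset times. The function below is an illustrative assumption, not the paper's algorithm, and it assumes F0 is constant over the hop.

```python
import numpy as np

def advance_phases(phases, f0, hop_seconds):
    """Advance the phase of harmonic k (k = 1..K) by 2*pi*k*f0*hop across a
    synthesis frame boundary.  The harmonic relation between the sine waves
    keeps them phase-coherent (shape-invariant) frame to frame."""
    k = np.arange(1, len(phases) + 1)
    return np.mod(phases + 2 * np.pi * k * f0 * hop_seconds, 2 * np.pi)

# Three harmonics starting at phase 0; advance by a 2.5 ms hop at f0 = 100 Hz.
new_phases = advance_phases(np.zeros(3), f0=100.0, hop_seconds=0.0025)
```

Under time-scale or pitch modification, the same rule is applied with the modified hop or F0, which is what preserves coherence without onset-time estimation.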
A common problem in diphone synthesis is discussed, viz., the occurrence of audible discontinuities at diphone boundaries. Informal observations show that spectral mismatch is the most likely cause of this phenomenon. We first set out to find an objective spectral measure for discontinuity. To this end, several spectral distance measures are related to the results of a listening experiment. Then, we studied the feasibility of extending the diphone database with context-sensitive diphones to reduce the occurrence of audible discontinuities. The number of additional diphones is limited by clustering consonant contexts that have a similar effect on the surrounding vowels, on the basis of the best-performing distance measure. A listening experiment has shown that the addition of these context-sensitive diphones significantly reduces the number of audible discontinuities.
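An objective spectral distance of the kind this abstract evaluates can be sketched as a distance between the cepstra of the two frames meeting at a diphone boundary. This is one plausible candidate measure, not necessarily the one the paper found best; the function and its parameters are assumptions for illustration.

```python
import numpy as np

def cepstral_distance(frame_a, frame_b, n_ceps=13):
    """One candidate objective measure of concatenation discontinuity:
    Euclidean distance between low-order real cepstra of the two frames
    adjacent to a diphone boundary."""
    def cepstrum(x):
        spec = np.abs(np.fft.rfft(x)) + 1e-12   # floor to avoid log(0)
        return np.fft.irfft(np.log(spec))[:n_ceps]
    return float(np.linalg.norm(cepstrum(frame_a) - cepstrum(frame_b)))

t = np.arange(256)
d_same = cepstral_distance(np.sin(0.1 * t), np.sin(0.1 * t))  # identical frames
d_diff = cepstral_distance(np.sin(0.1 * t), np.sin(0.3 * t))  # mismatched frames
```

Relating such distances to listener judgments, as the paper does, is what selects the measure used to cluster consonant contexts.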
Aims & Scope
Covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language.
This Transactions ceased publication in 2005. The current retitled publication is IEEE/ACM Transactions on Audio, Speech, and Language Processing.