It is technically challenging to make a machine talk as naturally as a human so as to facilitate “frictionless” interactions between machine and human. We propose a trajectory tiling-based approach to high-quality speech rendering, where speech parameter trajectories, extracted from natural, processed, or synthesized speech, are used to guide the search for the best sequence of waveform “tiles” stored in a pre-recorded speech database. We test the proposed unified algorithm in both Text-To-Speech (TTS) synthesis and cross-lingual voice transformation applications. Experimental results show that the proposed trajectory tiling approach can render speech which is both natural and highly intelligible. The perceived high quality of rendered speech is also confirmed in both objective and subjective evaluations.
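The search for the best sequence of tiles can be illustrated with a small dynamic-programming sketch. This is not the implementation evaluated in the paper: the function name `tile_trajectory`, the squared-distance costs, the `concat_weight` parameter, and the simplification of one parameter vector per tile are all assumptions made for illustration. It only shows the general idea of letting a guiding trajectory select waveform segments from a database while penalizing mismatches between neighboring segments.

```python
import numpy as np


def tile_trajectory(guide, candidates, concat_weight=1.0):
    """Choose one database tile per segment with a Viterbi-style search.

    guide:       (T, D) array, guiding parameter vector for each segment.
    candidates:  list of T arrays; candidates[t] has shape (K_t, D) and holds
                 the parameter vectors of the tiles eligible for segment t.
    Returns a list of T indices, one chosen candidate per segment.

    Assumed costs (hypothetical, for illustration only):
      target cost = squared distance between a tile and the guide,
      concat cost = squared distance between consecutive tiles.
    """
    guide = np.asarray(guide, dtype=float)
    cands = [np.asarray(c, dtype=float) for c in candidates]
    T = len(cands)

    # Target cost of every candidate against the guiding trajectory.
    target = [np.sum((c - guide[t]) ** 2, axis=1) for t, c in enumerate(cands)]

    best = [target[0]]  # best[t][k]: cheapest path cost ending in tile k
    back = []           # back[t-1][k]: best predecessor of tile k at segment t
    for t in range(1, T):
        prev, cur = cands[t - 1], cands[t]
        # Pairwise concatenation cost, shape (K_prev, K_cur).
        concat = np.sum((prev[:, None, :] - cur[None, :, :]) ** 2, axis=2)
        total = best[-1][:, None] + concat_weight * concat
        back.append(np.argmin(total, axis=0))
        best.append(np.min(total, axis=0) + target[t])

    # Backtrack from the cheapest final tile.
    path = [int(np.argmin(best[-1]))]
    for t in range(T - 2, -1, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]


guide = [[0.0], [1.0], [2.0]]
candidates = [[[0.0], [5.0]], [[1.0], [5.0]], [[2.0], [5.0]]]
chosen = tile_trajectory(guide, candidates)
```

In this toy setup a larger `concat_weight` favors tiles that join smoothly with their neighbors, while a weight of zero reduces the search to picking, per segment, the tile closest to the guide.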
IEEE Transactions on Audio, Speech, and Language Processing (Volume: 21, Issue: 2)
Date of Publication: Feb. 2013