PPHiFi-TTS: Phonetic Preserved High-Fidelity Text-to-Speech for Long-Term Speech Dependencies

Abstract:

In recent years, text-to-speech (TTS) models have significantly improved the quality of synthesized speech, making it more natural-sounding and intelligible, particularly with the introduction of neural network-based models. Further, neural vocoders based on Generative Adversarial Networks (GANs) have shown the potential to generate raw speech waveforms for unseen speakers in a natural style. We introduce a novel architecture, PPHiFiGAN, which combines the TTS model with the HiFi-GAN phoneme vocoder; its Generator (G) and Discriminator (D) aim to enhance synthesis quality and capture in-depth phonetic nuances from the dictionary. This approach preserves fine gradient details and captures long-term speech characteristics. Our proposed method attained a Mean Opinion Score (MOS) of 4.23 with the LJSpeech recipe and 4.05 with the VCTK recipe, demonstrating the effectiveness of the model in generating high-quality synthesized speech relative to existing TTS architectures.
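The abstract reports results as Mean Opinion Scores but does not describe the rating protocol. As a rough illustration of how a MOS is commonly computed, the sketch below averages listener ratings on a 1 to 5 scale and adds a normal-approximation 95% confidence interval. The function name `mean_opinion_score` and the ratings are hypothetical and are not taken from the paper.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, ci_half_width) for a list of 1-5 listener ratings.

    Assumption: a normal-approximation 95% interval (z = 1.96); the
    paper's actual evaluation procedure is not stated in the abstract.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.fmean(ratings)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    std_err = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, 1.96 * std_err


# Hypothetical ratings for one synthesized utterance set (not real data).
hypothetical_ratings = [5, 4, 4, 5, 3, 4, 5, 4]
mos, ci = mean_opinion_score(hypothetical_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean helps when comparing scores as close as those quoted for the two recipes, since a small listener panel can produce wide intervals.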

Published in: 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)

Date of Conference: 03-06 December 2024

Date Added to IEEE Xplore: 27 January 2025
