Abstract:
In recent years, text-to-speech (TTS) models have significantly improved the quality of synthesized speech, making it more natural-sounding and intelligible, particularly with the integration of neural network-based models. Further, neural vocoders based on Generative Adversarial Networks (GANs) have shown the potential to generate raw speech waveforms for unseen speakers in a natural style. We introduce a novel architecture, PPHiFiGAN, which combines a TTS model with the HiFi-GAN phoneme vocoder; its Generator (G) and Discriminator (D) aim to enhance synthesis quality and capture in-depth phonetic nuances from the dictionary. This approach preserves fine gradient details and captures long-term speech characteristics. Our proposed method attained a Mean Opinion Score (MOS) of 4.23 with the LJSpeech recipe and 4.05 with the VCTK recipe, demonstrating the effectiveness of the model in generating high-quality synthesized speech relative to existing TTS architectures.
Published in: 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Date of Conference: 03-06 December 2024
Date Added to IEEE Xplore: 27 January 2025
- IEEE Keywords
- Index Terms
- Long-term Speech
- Generative Adversarial Networks
- Speech Synthesis
- Mean Opinion Score
- Convolutional Layers
- Fast Fourier Transform
- Fundamental Frequency
- Long Short-term Memory
- Attention Mechanism
- Dynamic Stress
- Submodule
- Tokenized
- Natural Speech
- Input Text
- Voice Changes
- Human Speech
- Dynamic Time Warping
- Speech Quality
- Speech Output
- Phonetic Transcription