
PPHiFi-TTS: Phonetic Preserved High-Fidelity Text-to-Speech for Long-Term Speech Dependencies


Abstract:

In recent years, TTS models have significantly improved the quality of synthesized speech, making it more natural-sounding and intelligible, particularly with the introduction of neural network-based models. Further, neural vocoders based on Generative Adversarial Networks (GANs) have shown the potential to generate raw speech waveforms for unseen speakers in a natural style. We introduce a novel architecture, PPHiFiGAN, which combines a TTS model with the HiFi-GAN phoneme vocoder, where the Generator (G) and Discriminator (D) aim to enhance synthesis quality and capture in-depth phonetic nuances from the dictionary. This approach preserves fine gradient details and captures long-term speech characteristics. Our proposed method attained a Mean Opinion Score (MOS) of 4.23 with the LJSpeech recipe and 4.05 with the VCTK recipe, demonstrating the effectiveness of the model in generating high-quality synthesized speech relative to existing TTS architectures.
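The abstract describes an adversarial Generator/Discriminator pair built on HiFi-GAN. HiFi-GAN trains its G and D with least-squares GAN objectives; the sketch below shows only those two terms, assuming simple scalar discriminator scores. The paper's full objective likely adds further terms (e.g., mel-spectrogram and feature-matching losses), which are omitted here.

```python
import numpy as np

def lsgan_losses(d_real, d_fake):
    """Least-squares GAN objectives of the kind HiFi-GAN uses.

    d_real: discriminator scores on ground-truth waveforms
    d_fake: discriminator scores on generated waveforms
    D is trained to push real scores toward 1 and fake scores toward 0;
    G is trained to push the scores of its outputs toward 1.
    """
    d_loss = np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)
    g_loss = np.mean((d_fake - 1.0) ** 2)
    return d_loss, g_loss

# Illustrative scores (hypothetical values, not from the paper):
d_loss, g_loss = lsgan_losses(np.array([0.9, 1.1]), np.array([0.1, 0.2]))
print(round(d_loss, 3), round(g_loss, 3))  # 0.035 0.725
```

As generated audio improves, `d_fake` drifts toward 1, shrinking `g_loss` while forcing the discriminator to find finer cues, which is the adversarial pressure credited with capturing fine detail.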
Date of Conference: 03-06 December 2024
Date Added to IEEE Xplore: 27 January 2025
Conference Location: Macau, Macao

I. Introduction

Text-to-Speech (TTS) technology has undergone remarkable research advances over the last several years. TTS converts text into natural spoken speech, facilitating communication and giving individuals better accessibility, expressiveness, and a more human interaction experience with digital devices. In recent years, TTS has been implemented autoregressively, modelling the data distribution conditioned on the input, e.g., WaveNet [1] and Tacotron [2]. However, autoregressive models are inefficient at inference, which limits their practicality despite their synthesis quality. Later, non-autoregressive TTS adopted teacher-student training to increase training and inference speed and to minimise the latency issues of autoregressive TTS; e.g., FastSpeech 2 [3] generates speech frames in parallel rather than sequentially, and transformer-based TTS [4] replaces the recurrent components of Tacotron 2 [5] with a multi-head attention mechanism, addressing the long-range dependency problem by directly connecting any two inputs at different time steps through self-attention. Owing to this self-attention mechanism, it demonstrated a remarkable capability to generate high-quality speech synthesis. Leveraging this ongoing research, TTS methods have improved the naturalness and fluency of synthesized speech. However, due to the trade-off between speed and quality, non-autoregressive models frequently compromise synthesis quality in favor of faster processing. Moreover, maintaining coherence and fidelity under fully parallel generation tends to degrade speech variability, such as different speaking rates, accents, or emotional expressions.
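The long-range dependency argument above rests on self-attention connecting any two time steps in a single operation. A minimal single-head sketch, assuming identity Q/K/V projections for brevity (a real transformer TTS model learns separate projection matrices per head):

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention.

    x: (T, d) sequence of T frames with d features. Every output frame
    is a weighted mix of ALL input frames, so positions arbitrarily far
    apart are connected in one step -- the property transformer-based
    TTS exploits for long-range dependencies. Q, K, V are identity
    projections here (a simplifying assumption, not a learned model).
    """
    T, d = x.shape
    q, k, v = x, x, x
    scores = q @ k.T / np.sqrt(d)          # (T, T) pairwise affinities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)  # softmax over key positions
    return w @ v                           # each row mixes every time step

x = np.random.default_rng(0).normal(size=(5, 8))
y = self_attention(x)
print(y.shape)  # (5, 8): same length as the input, computed in parallel
```

Because every output row is produced from the full `(T, T)` weight matrix at once, all frames can be computed in parallel, which is also why the speed-versus-quality trade-off discussed above arises: nothing in the parallel pass enforces frame-to-frame coherence the way autoregressive conditioning does.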

