A Comparative Study of Text-to-Speech (TTS) Models and Vocoder Combinations for High-Quality Synthesized Speech | IEEE Conference Publication | IEEE Xplore

Abstract:

Deep neural networks and other machine learning methods have significantly improved the quality and naturalness of text-to-speech (TTS) synthesis, which converts written text into spoken audio. This work presents a comprehensive comparison of multiple TTS models and vocoders. The primary objective is to identify the most effective TTS model-vocoder combination and to understand the advantages and disadvantages of each pairing. To this end, we rigorously evaluate various TTS model-vocoder pairings on the LJSpeech English dataset. We assess the naturalness of the synthesized speech with subjective Mean Opinion Score (MOS) ratings from 40 listeners. Experimental results show that the FastSpeech2 and MB-MelGAN combination outperforms all other configurations, producing high-quality audio with an MOS of 4.3595.
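
As a rough illustration of the MOS metric mentioned in the abstract: an MOS is conventionally the arithmetic mean of listener ratings on a 1-5 scale, often reported with a confidence interval. The sketch below assumes that convention and a normal-approximation 95% interval. The function name `mean_opinion_score` and the ratings are hypothetical and are not taken from the paper, whose exact evaluation protocol is not described here.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for a list of 1-5 listener ratings.

    Assumption: the interval uses a normal approximation (z = 1.96),
    which is a common simplification, not necessarily the paper's method.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings for one synthesized utterance (illustrative only).
ratings = [5, 4, 4, 5, 3, 4, 5, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

In a study like this one, such a score would typically be computed per model-vocoder pairing and the pairings then ranked by their mean.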
Date of Conference: 22-24 November 2023
Date Added to IEEE Xplore: 09 February 2024
Conference Location: Coimbatore, India
