Abstract:
Deep neural networks and machine learning have significantly improved the quality and naturalness of text-to-speech (TTS) synthesis, which converts written text into speech. This work presents a comprehensive comparison of multiple TTS models and vocoders. The primary objective is to identify the most effective TTS model-vocoder combination and to understand the advantages and disadvantages of each. To this end, we evaluate various TTS model-vocoder pairings on the Lj-Speech-en dataset. We assess the naturalness of the synthesized speech using subjective Mean Opinion Score (MOS) ratings from 40 listeners. Experimental results show that the FastSpeech2 and MB-MelGAN combination outperforms all other configurations, achieving an MOS of 4.3595.
Published in: 2023 7th International Conference on Electronics, Communication and Aerospace Technology (ICECA)
Date of Conference: 22-24 November 2023
Date Added to IEEE Xplore: 09 February 2024