Abstract:
This paper proposes a multi-emotion, multi-lingual, and multi-speaker text-to-speech (MELS-TTS) system that employs disentangled style tokens for effective emotion transfer. In speech encompassing various attributes, such as emotional state, speaker identity, and linguistic style, disentangling these elements is crucial for an efficient multi-emotion, multi-lingual, and multi-speaker TTS system. To this end, inspired by global style tokens (GSTs), we propose using separate style tokens to disentangle emotion, language, speaker, and residual information. Through an attention mechanism, each style token learns its respective speech attribute from the target speech. Our proposed approach yields improved performance in both objective and subjective evaluations, demonstrating the ability to generate cross-lingual speech with diverse emotions, even from a neutral source speaker, while preserving the speaker’s identity.
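To make the token-attention idea concrete, the following is a minimal sketch of GST-style attention over separate token banks, one per attribute (emotion, language, speaker, residual), whose outputs are summed into a single style embedding. All names, dimensions, and the single-head formulation are illustrative assumptions, not the authors' actual MELS-TTS implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def style_token_attention(ref_embedding, tokens):
    """Attend over a learnable token bank given a reference encoding.

    ref_embedding: (d,) encoding of the target speech (query)
    tokens: (n_tokens, d) learnable style tokens (keys and values)
    Returns the attention-weighted style vector and the weights.
    """
    scores = tokens @ ref_embedding / np.sqrt(tokens.shape[1])
    weights = softmax(scores)          # (n_tokens,), sums to 1
    return weights @ tokens, weights   # (d,) style vector

# Hypothetical setup: one small token bank per disentangled attribute.
rng = np.random.default_rng(0)
d = 8
banks = {name: rng.standard_normal((4, d)) * 0.1
         for name in ("emotion", "language", "speaker", "residual")}

ref = rng.standard_normal(d)  # stand-in for a reference encoder output
# Per-attribute style vectors are summed into the final style embedding.
style = sum(style_token_attention(ref, bank)[0] for bank in banks.values())
```

In training, the token banks and reference encoder would be learned end-to-end with the TTS model, so each bank specializes in its attribute; at inference, tokens can be selected or weighted directly to transfer, say, an emotion across languages.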
Published in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 14-19 April 2024
Date Added to IEEE Xplore: 18 March 2024