Multi-stage attention for fine-grained expressivity transfer in multispeaker text-to-speech system

Abstract:

The main goal of this work is to provide fine-grained transfer of expressivity to the voices of speakers for whom no expressive speech data is available. Our approach conditions a multispeaker Tacotron 2 system on latent embeddings extracted from the phoneme sequence, the speaker identity, and a reference expressive Mel spectrogram. The proposed system uses attention modules to discover local and global expressivity attributes. Additionally, location-sensitive attention is applied in the decoder to learn the alignment between each phoneme sequence and its Mel spectrogram. In addition to conventional objective metrics for speech synthesis, we used cosine similarity and character error rate (CER) to evaluate the transfer of expressivity and intelligibility. The results demonstrate that the presented cosine similarity metric for speaker and expressivity is consistent with the subjective evaluation. Thus, using multiple evaluation measures provides a way to estimate the strength of emotions and of the speaker's voice when expressivity is transferred to the target speaker's voice. The results also show that the presented fine-grained TTS systems performed better than the Tacotron 2 based baseline systems.
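The two evaluation measures named in the abstract can be illustrated with a minimal sketch. It uses the standard textbook definitions: cosine similarity between two embedding vectors, and CER as the Levenshtein edit distance divided by the reference length. The abstract does not say how the embeddings are extracted or which speech recognizer produces the transcripts, so the function names and the plain-list inputs below are assumptions for illustration only, not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors.

    Assumption: embeddings are plain sequences of floats; how the paper
    extracts speaker or expressivity embeddings is not stated in the abstract.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def character_error_rate(reference, hypothesis):
    """Levenshtein distance between the strings, divided by len(reference).

    Assumption: `hypothesis` is a transcript of the synthesized speech from
    some recognizer; the abstract does not name one.
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(reference)
```

Under these definitions, a cosine similarity closer to 1 means the synthesized speech embedding lies closer in direction to the reference embedding, and a lower CER means the synthesized speech was transcribed with fewer character errors.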

Published in: 2022 30th European Signal Processing Conference (EUSIPCO)

Date of Conference: 29 August 2022 - 02 September 2022

Date Added to IEEE Xplore: 18 October 2022
