Abstract:
The main goal of this work is to provide fine-grained transfer of expressivity to the voices of various speakers for which no expressive speech data is available. Our approach conditions a multispeaker Tacotron 2 system on latent embeddings extracted from the phoneme sequence, the speaker identity, and a reference expressive Mel spectrogram. The proposed system utilizes attention modules for discovering local and global expressivity attributes. Additionally, location-sensitive attention is applied in the decoder to learn the alignment between the phoneme sequence and the Mel spectrogram. In addition to conventional objective metrics for speech synthesis, we used cosine similarity and character error rate (CER) measures to evaluate the transfer of expressivity and intelligibility. The obtained results demonstrate that the presented cosine similarity metric for speaker and expressivity is consistent with the subjective evaluation. Thus, the use of multiple evaluation measures provides a way to estimate the strength of emotions and of the speaker's voice for transferred expressivity in the target speaker's voice. The obtained results show that the presented fine-grained TTS systems performed better than the Tacotron 2-based baseline systems.
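The two additional measures named in the abstract, cosine similarity and character error rate (CER), can be illustrated with a minimal sketch. This is not the authors' evaluation code: the abstract does not say how the speaker or expressivity embeddings are extracted or which recognizer produces the transcripts, so the vectors and strings below are placeholders, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def edit_distance(ref, hyp):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Placeholder inputs (assumptions, not data from the paper).
reference_embedding = [0.2, 0.7, 0.1]
synthesized_embedding = [0.25, 0.65, 0.05]
print(cosine_similarity(reference_embedding, synthesized_embedding))
print(character_error_rate("hello world", "hallo world"))
```

Under this reading, a cosine similarity closer to 1 between the embedding of the synthesized speech and that of the reference suggests closer speaker or expressivity characteristics, and a lower CER on transcripts of the synthesized speech suggests higher intelligibility.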
Date of Conference: 29 August 2022 - 02 September 2022
Date Added to IEEE Xplore: 18 October 2022
- IEEE Keywords
- Index Terms
- Speech Recognition
- Subjective Evaluation
- Speaker Recognition
- Objective Metrics
- Speech Synthesis
- Evaluation Of Transfer
- Latent Variables
- Similarity Score
- Recurrent Network
- Attention Mechanism
- Objective Evaluation
- Multiple Stages
- Latent Space
- Latent Representation
- Variational Autoencoder
- Attention Weights
- Speech Samples
- Encoder Output
- Emotional Prosody
- Acoustic Model
- Mean Opinion Score
- Speech Corpus
- Speech Utterances
- Text Encoder
- Vector Quantization
- Self-attention Layer