Effective Emotion Transplantation in an End-to-End Text-to-Speech System

In this paper, we propose an effective technique to transplant a source speaker’s emotional expression to a new target speaker’s voice within an end-to-end text-to-speech (TTS) framework. We modify an expressive TTS model pre-trained using a source speaker’s emotional speech database to reflect the voice characteristics of a target speaker for which only a neutral speech database is available. We set two adaptation criteria to achieve this. One criterion is to minimize the reconstruction loss between the target speaker’s recorded and synthesized speech, such that the synthesized speech has the target speaker’s voice characteristics. The other criterion is to minimize the emotion loss between the emotion embedding vectors extracted from the reference expressive speech and the target speaker’s synthesized expressive speech, which is essential to preserve expressiveness. Since the two criteria are applied alternately in the adaptation process, we are able to avoid the kind of bias issues frequently encountered in similar tasks. The proposed adaptation technique demonstrates more effective performance compared to conventional approaches in both quantitative and qualitative evaluations.


I. INTRODUCTION
The task of generating natural speech from input text, i.e., text-to-speech (TTS), is becoming increasingly important, as it is a key module in building human-computer interaction systems. Thanks to the powerful modeling capabilities of deep learning technologies, the sound quality and naturalness of synthesized speech have substantially improved in recent years [1]-[6]. In particular, end-to-end framework-based TTS models that infer acoustic features directly from input character sequences without laborious feature-engineering tasks have shown great success [6]-[8].
Because of the success of end-to-end text-to-speech (E2E-TTS) models, researchers have been trying to expand this framework to synthesize more expressive speech [9]-[13]. Unlike emotionally neutral speech (narrative speech), which has monotonic prosody, expressive speech has many variations in prosody. Thus, a key challenge in synthesizing expressive speech lies in determining distinctive characteristics of different expressions and representing them using condition vectors to control the expressive TTS model.
The associate editor coordinating the review of this manuscript and approving it for publication was Inês Domingues.
Condition vectors can either be handcrafted or learned in the TTS model's training stage. An E2E-TTS framework mainly uses learned vectors, so-called embedding vectors or latent variables. These are jointly trained with the weights of the TTS model using backpropagation [14]. For example, [10] and [11] trained embedding vectors in a supervised manner using emotion labels. Recently, several studies have adopted an unsupervised method in which embedding vectors are trained in a deep learning framework, but without annotated labels [12], [13], [15]. This method is useful when it is difficult to obtain labeled data or when the speech data contains ambiguous styles that are difficult to classify. In [12] and [13], networks were trained to directly extract embedding vectors from a reference speech waveform during the overall training process, and the style of synthesized speech was controlled using the embedding vectors.
Although E2E-TTS models with condition vectors are very effective, it is often difficult to deploy them in real-world applications because of database issues. In such applications, high-quality expressive speech databases with multiple voice identities are required. This means that expressive speech databases recorded by many professional voice actors and actresses are required. However, it is expensive and time-consuming to construct these kinds of expressive speech databases. In addition, it is difficult to utter expressive speech while maintaining consistent expressiveness. An effective way to solve this problem is to use a technique like speaker adaptation [16]-[20], in which a baseline model is trained using a large database, then adjusted to a target speaker using only a small amount of data. This approach can similarly be applied to expressiveness tasks through emotion transplantation, i.e., training an expressive TTS model using another speaker's available expressive speech database and adjusting the pre-trained model to the target speaker's voice [21]-[24]. Even when there is only a small amount of expressive speech data from the target speaker, a target speaker's expressive TTS model can be obtained fairly easily by adapting the pre-trained model to minimize the reconstruction loss, which is the error between the recorded and synthesized expressive speech of the target speaker.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this paper, we deal with the case in which the target speaker has only neutral speech data. We experimentally found that the style of synthesized speech becomes ambiguous as the model adaptation progresses; eventually, the output synthesized speech does not faithfully present the expressiveness style. Because the model is adapted to reconstruct the target speaker's voice using neutral speech, the model's capability for generating expressive speech is impaired. To deal with this, we propose an effective emotion transplantation technique that guides the pre-trained model to preserve expressiveness characteristics during the adaptation process. The model update procedure has two alternating steps: (1) modifying the voice characteristics of the pre-trained model to match those for the target speaker and (2) preserving its capability for generating expressive speech. More specifically, when adapting the TTS model to minimize the reconstruction loss for the target speaker's neutral-style speech, the proposed technique synthesizes expressive speech with the target speaker's voice from the adapted TTS model and updates the model to minimize a metric we call the emotion loss. The emotion loss is the distance between the expressive condition vector extracted from the target speaker's synthesized expressive-speech and the input expressive condition vector used to synthesize speech expressively. That is, the model is updated so that the emotional style of synthesized expressive-speech matches the emotional style included in the input expressiveness condition vector. Here, the condition vector is extracted from the source speaker's expressive speech because the target speaker does not have expressive speech data. For the same reason, the emotion loss function compares the expressiveness condition vectors instead of the target speaker's expressive speech data. 
The condition vectors extracted from the synthesized target speaker's expressive speech should have an emotional style identical to that of the input condition vector. These two steps are repeated alternately until the model converges.
The remainder of this paper is organized as follows. Section II describes the end-to-end expressive TTS model architecture on which the proposed approach is built. Section III explains the proposed emotion transplantation approach. Section IV provides objective and subjective experimental results, and Section V summarizes and concludes the paper.

II. MODEL ARCHITECTURE
The end-to-end expressive TTS model used in this paper consists of two components: (1) an emotion encoder which outputs an expressiveness condition vector based on a reference expressive speech input, and (2) an E2E-TTS model which synthesizes expressive speech using text input and expressiveness condition vectors.

A. EMOTION ENCODER
To obtain the expressiveness condition vector, we adopt the global style token (GST) approach [13], which derives the expressiveness condition vector on the fly from a reference speech input. Figure 1 illustrates the GST architecture. The GST, which consists of a reference encoder [12] and a style token layer [13], is trained jointly with the TTS model as follows. The reference encoder outputs a prosody embedding vector based on a reference speech input, typically given as a mel-scale spectrogram (mel-spectrogram). The prosody embedding vector is used as the input to the style token layer, which comprises an attention module and a set of token embeddings [13]. The style of the given reference speech is then represented as the weighted sum of the tokens, where the weights are the contributions of each token to the prosody embedding vector. Style can refer to various attributes, such as voice characteristics, speaking style, or emotions. For this study, we focus on emotions and use the term emotion embedding vector rather than expressiveness condition vector for the remainder of this paper.
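The weighted-sum step of the style token layer can be illustrated with a minimal sketch. It assumes a single-head scaled dot-product attention and a prosody embedding with the same dimension as the tokens; the GST layer in [13] uses learned projections and multi-head attention, so the function names and the toy token values here are purely illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def style_embedding(prosody, tokens):
    """Return (attention weights, weighted sum of token embeddings).

    Assumption: plain scaled dot-product attention with the prosody
    embedding as the query and the tokens as both keys and values.
    """
    scale = math.sqrt(len(prosody))
    scores = [sum(p * t for p, t in zip(prosody, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    emb = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
    return weights, emb

# Hypothetical 2-D tokens and prosody embedding, chosen only for illustration.
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
prosody = [2.0, 0.0]
weights, emotion_embedding = style_embedding(prosody, tokens)
```

Because the weights come from a softmax, they sum to one, and the resulting embedding always lies in the convex hull of the token embeddings.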

B. END-TO-END TTS MODEL
Among various E2E-TTS frameworks, we use the deep convolutional TTS (DCTTS) framework because of its fast training speed and stable alignment [8]. The DCTTS framework consists of a text-to-mel-spectrogram network (Text2Mel) and a spectrogram super-resolution network (SSRN). The Text2Mel module predicts a coarse mel-spectrogram, that is, a mel-spectrogram down-sampled along the time axis, based on the input text and the mel-spectrogram predicted at the previous time step, in an autoregressive manner. The SSRN module predicts a linear-scale spectrogram (spectrogram), which is up-sampled along both the time and frequency axes, from the coarse mel-spectrogram.
More specifically, text embedding and audio embedding sequences are extracted from the text and audio encoders, respectively. Here, the mel-spectrogram predicted at the previous time step (i.e., the audio decoder output) is fed to the audio encoder. Attention weights computed between the two sequences are applied to the text embedding vectors, after which the attended text embedding vectors are concatenated with the audio embedding vectors. In this paper, the emotion embedding vector inferred by the emotion encoder is also concatenated. The audio decoder then autoregressively infers the mel-spectrogram from these combined embeddings. Finally, the spectrogram predicted by the SSRN module is converted into a time-domain speech waveform using either the Griffin-Lim algorithm [25] or a generative model such as WaveNet [5], [26].
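The attention and concatenation steps described above can be sketched as follows. This is a simplified illustration under stated assumptions: the text embeddings serve as both keys and values, the attention is a plain scaled dot product, and the single emotion embedding vector is repeated on every audio frame. The function names and toy dimensions are hypothetical and are not taken from [8].

```python
import math

def attend(audio_emb, text_emb):
    """For each audio frame, return the attention-weighted sum of text embeddings."""
    scale = math.sqrt(len(text_emb[0]))
    attended = []
    for q in audio_emb:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in text_emb]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        w = [e / z for e in exps]
        dim = len(text_emb[0])
        attended.append([sum(wi * v[d] for wi, v in zip(w, text_emb)) for d in range(dim)])
    return attended

def decoder_input(audio_emb, text_emb, emotion_emb):
    """Concatenate attended text, audio, and emotion embeddings frame by frame."""
    attended = attend(audio_emb, text_emb)
    return [r + list(q) + list(emotion_emb) for r, q in zip(attended, audio_emb)]

# Toy sizes: 3 text positions, 2 audio frames, 2-D embeddings, 3-D emotion vector.
text_emb = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
audio_emb = [[1.0, 0.0], [0.0, 1.0]]
emotion_emb = [0.2, 0.3, 0.5]
combined = decoder_input(audio_emb, text_emb, emotion_emb)
```

Each combined frame has dimension 2 + 2 + 3 = 7 in this toy setting, which is why the downstream layer widths have to be adjusted when the emotion vector is added.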

III. PROPOSED EMOTION TRANSPLANTATION APPROACH
In this section, we describe an effective adaptation method that successfully transplants the emotional expressiveness of a pre-trained model to the target speaker's voice.
Assume that we have an expressive TTS model pre-trained with a source speaker's expressive speech database, but only a neutral speech database is available for the target speaker. A simple idea is to adapt the pre-trained source speaker's expressive TTS model with the target speaker's neutral speech database. However, in a preliminary experiment, we found that this simple adaptation approach could not maintain the TTS model's ability to express appropriate emotions in the synthesized speech. Therefore, we designed the proposed approach by considering two aspects: generating the target speaker's voice characteristics and expressing appropriate emotions. During the process of adapting the TTS model based on a target speaker's neutral speech, we synthesize expressive speech with the target speaker's voice by inputting an emotion embedding vector extracted from the source speaker's expressive speech to the adapted TTS model. We then update the model by minimizing the emotion loss between the input emotion embedding vector and the emotion embedding vector extracted from the synthesized expressive speech. The detailed procedure is described in Algorithm 1. Figure 2 and Phase 1 of Algorithm 1 describe the procedure for training a baseline expressive TTS model using a source speaker's expressive speech database.

A. BUILDING A SOURCE SPEAKER'S EXPRESSIVE TTS MODEL
Training the expressive TTS model requires input text $X$, output expressive speech $S^{src,emo}$, and input reference expressive speech $S^{src,emo}_{ref}$ from a source speaker's expressive speech database $D^{src,emo}$. In this work, we chose the reference speech to be non-parallel with the input text so that the emotion encoder is robust to the content of the reference speech signal. The emotion encoder extracts an emotion embedding vector $e$ from $S^{src,emo}_{ref}$; then, $e$ is passed as input to $M_{TTS}$ together with $X$. The training process is identical to that of conventional TTS models in that it minimizes a reconstruction loss between recorded and synthesized expressive speech, namely $S^{src,emo}$ and $\hat{S}^{src,emo}$. We define the loss function of $M_{TTS}$, $L_{TTS}$, as the sum of the L1 loss $L_1$ and a binary divergence function $D_{bd}$ following the DCTTS system [8], namely,
$$L_{TTS}(S, \hat{S}) = L_1(S, \hat{S}) + D_{bd}(S, \hat{S}), \tag{1}$$
where $S$ and $\hat{S}$ are recorded and synthesized speech, respectively, in mel-spectrogram format. In this study, we do not use the guided attention loss introduced in [8]. This guided attention loss prompts the attention matrix to be nearly diagonal, but it is not effective when there are many variations in the attention pattern caused by speaking speeds that differ across emotion classes.

B. ADAPTING VOICE CHARACTERISTICS WHILE MAINTAINING EXPRESSIVENESS CHARACTERISTICS
The adaptation alternates between two steps. In the first step, $M_{TTS}$ is adapted to the target speaker's neutral speech by minimizing the voice loss; the loss function for voice, $L_{voice}$, follows Eq. (1). In the second step, the adapted $M_{TTS}$ is updated by optimizing the emotion loss function,
$$L_{emo}(e, \hat{e}) = \mathrm{Dist}(e, \hat{e}), \tag{2}$$
where $e$ and $\hat{e}$ are emotion embedding vectors extracted from recorded and synthesized expressive speech, respectively. We use $L_1$ as the distance metric $\mathrm{Dist}$. In our preliminary experiments, other distance metrics such as cosine distance showed similar results. In this work, $\hat{e}$ is extracted from synthesized expressive speech in the target speaker's voice, $\hat{S}^{tgt,emo}$. The vector $e$, which is the input to $M_{TTS}$ for synthesizing speech expressively, is extracted from $S^{src,emo}_{ref}$, since the target speaker's speech database does not contain expressive speech. By comparing the emotion embedding vectors, the TTS model can be updated even when there is no expressive speech from the target speaker.
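The two loss terms and the alternating schedule can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the binary divergence is written as the per-element Bernoulli KL divergence, which is an assumed reading of the definition in [8]; spectrograms are flattened lists with values in [0, 1]; and the granularity of the alternation (one voice update followed by one emotion update) is an assumption, since the text only states that the two steps alternate.

```python
import math

def l1_loss(s, s_hat):
    """Mean absolute error between two equally sized sequences."""
    return sum(abs(a - b) for a, b in zip(s, s_hat)) / len(s)

def _xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0.0 else x * math.log(y)

def binary_divergence(s, s_hat, eps=1e-7):
    """Assumed form: mean Bernoulli KL divergence between target and prediction."""
    total = 0.0
    for t, y in zip(s, s_hat):
        y = min(max(y, eps), 1.0 - eps)  # keep log arguments finite
        cross_entropy = -_xlogy(t, y) - _xlogy(1.0 - t, 1.0 - y)
        entropy = -_xlogy(t, t) - _xlogy(1.0 - t, 1.0 - t)
        total += cross_entropy - entropy
    return total / len(s)

def tts_loss(s, s_hat):
    """Reconstruction loss: L1 plus binary divergence, as in Eq. (1)."""
    return l1_loss(s, s_hat) + binary_divergence(s, s_hat)

def emotion_loss(e, e_hat):
    """Emotion loss with L1 as the distance metric."""
    return l1_loss(e, e_hat)

def alternate_adaptation(num_rounds, voice_step, emotion_step):
    """Call the two update steps in turn and log the loss each one returns.

    voice_step and emotion_step are hypothetical callables that each perform
    one model update and return the corresponding loss value.
    """
    log = []
    for _ in range(num_rounds):
        log.append(("voice", voice_step()))
        log.append(("emotion", emotion_step()))
    return log

# Toy values standing in for normalized mel-spectrogram bins and embeddings.
S, S_hat = [0.2, 0.8, 0.5], [0.3, 0.7, 0.5]
voice_loss = tts_loss(S, S_hat)
emo_loss = emotion_loss([0.1, 0.9], [0.2, 0.7])
schedule = alternate_adaptation(2, lambda: voice_loss, lambda: emo_loss)
```

In an actual training setup the two callables would wrap optimizer steps on the TTS model parameters; keeping them as separate steps mirrors the alternation described in the text instead of summing the two losses into a single objective.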

IV. EXPERIMENTS AND ANALYSIS

A. EXPERIMENTAL SETUP
We prepared two Korean speech databases: one of expressive speech and one of neutral-style speech. The expressive speech database for the source speaker consists of four emotion classes, namely neutral (NEU), joyful (JOY), angry (ANG), and sad (SAD), recorded by a single professional voice actress. The total amount of speech is about 11 hours. The scripts differed across emotion classes. The target speaker's neutral speech database was recorded by another professional actress and consisted of approximately 1 hour of speech. Both databases were recorded at a 16 kHz sampling rate. The training, validation, and test sets comprised 90%, 5%, and 5% of the data, respectively. To enhance trainability, we excluded speech data longer than 10 seconds and trimmed the silence regions at the beginning and end of each sentence.
The network architectures and hyperparameters of the E2E-TTS module and the emotion encoder module were set to follow the original papers [8], [13], except that the emotion embedding vector was concatenated with the text embedding vectors. Consequently, the dimension of the audio embedding vectors was also adjusted to match that of the concatenated embedding vector. 80-dimensional mel-spectrograms were extracted at 12.5 ms frame intervals with 50 ms frame lengths from speech segments. All networks were trained using the Adam optimization algorithm [27] with a learning rate of 0.001 when training the baseline expressive TTS model, and 0.0001 when performing adaptations to obtain the target speaker's expressive TTS model.
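The framing parameters above translate directly into sample counts at the 16 kHz sampling rate. The short sketch below works through that arithmetic; the frame-count formula assumes no padding at the utterance edges, which is a convention chosen for illustration and is not stated in the text.

```python
def ms_to_samples(duration_ms, sample_rate):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(sample_rate * duration_ms / 1000.0))

def num_frames(num_samples, win_length, hop_length):
    """Number of full analysis frames, assuming no edge padding."""
    if num_samples < win_length:
        return 0
    return 1 + (num_samples - win_length) // hop_length

SAMPLE_RATE = 16000                              # Hz, from the experimental setup
hop_length = ms_to_samples(12.5, SAMPLE_RATE)    # 12.5 ms frame interval
win_length = ms_to_samples(50.0, SAMPLE_RATE)    # 50 ms frame length
max_samples = 10 * SAMPLE_RATE                   # longest utterance kept (10 s)
max_frames = num_frames(max_samples, win_length, hop_length)
mel_shape = (max_frames, 80)                     # frames x mel bins
```

With these settings the hop is 200 samples and the window is 800 samples, so consecutive frames overlap by 75%.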

B. EXPERIMENTAL RESULTS
To quantitatively verify the effectiveness of the proposed approach, we measured the equal error rate (EER) for speaker verification and the emotion classification accuracy for the target speaker's synthesized expressive speech samples: a lower EER indicates higher speaker similarity, and a higher classification accuracy indicates that the synthesized speech faithfully expresses the emotion corresponding to the expressiveness style of the input emotion embedding vector. For the speaker verification task, we utilized the Kaldi toolkit [28]. We built a universal background model with data from 100 Korean speakers and enrolled an additional 11 speakers including the target speaker. For the emotion classification task, we used the naïve Bayes classification method [29]. The model parameters, such as the mean and variance of each class, were obtained from emotion embedding vectors extracted from the source speaker's recorded expressive speech. The emotion classification accuracy for the expressive speech synthesized by the source speaker's expressive TTS model was 93.5% (baseline). Figure 4 shows evaluation results for the conventional and proposed methods as the adaptation progresses, where emoTgtConv and emoTgtProp are the target speaker's expressive speech synthesized from the TTS model adapted by the conventional (w/o emotion loss) and proposed (w/ emotion loss) approaches, respectively. In both approaches, the synthesized speech became more similar to the target speaker's voice until the 40th epoch, and emotion classification accuracy decreased. However, emoTgtProp maintained expressiveness characteristics with much higher accuracy than emoTgtConv, even though its voice characteristics changed more slowly. In Figure 5, emotion embedding vectors extracted from emoTgtConv and emoTgtProp are plotted using the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm [30].
Emotion embedding vectors of the same class were clustered more tightly for emoTgtProp, which indicates that more distinct expressive speech was synthesized. On the other hand, the emotion embedding vectors for emoTgtConv showed an ambiguous boundary between emotions, especially NEU and ANG. This means that the synthesized expressive speech did not faithfully express the emotion corresponding to the given input emotion embedding vector.
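The emotion classification measure can be illustrated with a small Gaussian naïve Bayes sketch: per-class means and variances are estimated from embeddings of recorded speech, and accuracy is then computed on embeddings of synthesized speech. The 2-D vectors below are invented toy data, and the variance floor and function names are assumptions made for this sketch; the source states only that class means and variances were estimated from the source speaker's recorded expressive speech.

```python
import math

def fit_gaussian_nb(embeddings, labels, var_floor=1e-6):
    """Estimate log prior, per-dimension mean, and variance for each class."""
    by_class = {}
    for vec, lab in zip(embeddings, labels):
        by_class.setdefault(lab, []).append(vec)
    model = {}
    for lab, vecs in by_class.items():
        n, dim = len(vecs), len(vecs[0])
        mean = [sum(v[d] for v in vecs) / n for d in range(dim)]
        var = [max(sum((v[d] - mean[d]) ** 2 for v in vecs) / n, var_floor)
               for d in range(dim)]
        model[lab] = (math.log(n / len(embeddings)), mean, var)
    return model

def predict(model, vec):
    """Return the class with the highest log posterior for one embedding."""
    def log_posterior(params):
        log_prior, mean, var = params
        log_lik = sum(-0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
                      for x, m, v in zip(vec, mean, var))
        return log_prior + log_lik
    return max(model, key=lambda lab: log_posterior(model[lab]))

def accuracy(model, embeddings, labels):
    """Fraction of embeddings whose predicted class matches the label."""
    hits = sum(predict(model, v) == lab for v, lab in zip(embeddings, labels))
    return hits / len(labels)

# Toy "recorded" embeddings used to fit the classifier (hypothetical values).
train_x = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
train_y = ["JOY", "JOY", "SAD", "SAD"]
nb_model = fit_gaussian_nb(train_x, train_y)

# Toy "synthesized" embeddings; the last one drifts toward the wrong class.
test_x = [[0.85, 0.1], [0.1, 0.7], [0.6, 0.4]]
test_y = ["JOY", "SAD", "SAD"]
acc = accuracy(nb_model, test_x, test_y)
```

The third test vector mimics the ambiguity discussed above: an embedding that has drifted away from its intended class is assigned to the neighboring class, which lowers the accuracy.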

C. SUBJECTIVE LISTENING TESTS
We conducted mean opinion score (MOS) tests on expressiveness, speaker similarity, sound quality, and naturalness for the synthesized speech to evaluate the performance of the proposed approach. Twenty native Korean speakers participated in the tests. A total of 20 sentences were randomly selected from the test set, and speech samples were generated using each method identified above. Considering the variation of $L_{TTS}$ and $L_{emo}$, we chose the TTS model trained up to the 50th epoch. For the MOS tests, speech samples were synthesized using a neural vocoder, WaveGlow [31], [32], trained using both the source and target speakers' speech databases.
To evaluate expressiveness, we asked participants to rate the degree of expressiveness of a speech signal given the emotion label, using the following five responses: 1 = Absolutely different from the annotated emotion label; 2 = Ambiguous with respect to the annotated emotion label; 3 = Slightly expressive; 4 = Very expressive; 5 = Extremely expressive. The source speaker's recorded expressive speech (emoSrcRec) was also rated to provide an upper bound. The top of Figure 6 shows that emoTgtProp scored slightly worse than emoSrcRec but was much more expressive than emoTgtConv, confirming the superiority of the proposed approach over the conventional one. To evaluate speaker similarity, participants were asked to rate the voice similarity of the synthesized speech to the target speaker's voice, using a scale from 1 to 5: 1 = Dissimilar; 2 = Slightly dissimilar; 3 = Similar; 4 = Very similar; 5 = Absolutely the same speaker. Each synthesized speech sample was compared to recorded speech samples randomly selected from the target speaker's neutral speech database. We asked listeners to focus only on the degree of expressiveness and voice similarity, excluding speech content or emotional state differences. The bottom of Figure 6 shows that there were no significant differences between emoTgtProp and emoTgtConv except for ANG. In the case of ANG, because angry speech generated by emoTgtConv sounded like neutral speech, listeners felt that it was similar to the target speaker's voice. Both approaches received low scores for SAD. Sad speech has a low pitch and trembling characteristics, which obscure speaker characteristics. Although we asked listeners to evaluate only voice similarity, excluding emotional state and content differences between speech samples, they were still somewhat influenced by the features of emotion.
Participants also rated the naturalness and sound quality of the synthesized speech samples. Table 1 shows that there were no significant differences between the two approaches.
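As a reminder of how a MOS is aggregated from individual ratings, the sketch below averages 1-to-5 ratings and attaches a normal-approximation 95% confidence interval. The confidence interval is a common reporting convention assumed here, not something the text says was computed, and the ratings are invented.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Return (mean rating, half-width of the normal-approximation interval)."""
    mean = statistics.mean(ratings)
    if len(ratings) < 2:
        return mean, 0.0
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

# Hypothetical ratings from five listeners for one system.
ratings = [4, 5, 3, 4, 4]
mos, ci_half_width = mos_with_ci(ratings)
```

Two systems whose intervals overlap substantially would typically be described as showing no significant difference, although a paired significance test over listeners is the more rigorous check.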

V. CONCLUSION
This paper proposed an effective emotion transplantation approach within an E2E-TTS framework for generating expressive speech for a target speaker whose speech database includes only neutral style speech. By alternately updating a pre-trained expressive TTS model in two directions, we not only generated the target speaker's voice characteristics but also successfully maintained the expressiveness characteristics of the pre-trained model. Thus, the proposed method successfully transplanted the ability to express appropriate emotions in the target speaker's voice. We verified the superior performance of the proposed approach through various quantitative and qualitative evaluations.