Speech Synthesis with Mixed Emotions

Emotional speech synthesis aims to synthesize human voices with various emotional effects. Current studies mostly focus on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference between the speech samples of different emotions. We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework. During training, the framework not only explicitly characterizes emotion styles but also explores the ordinal nature of emotions by quantifying the differences with other emotions. At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector. The objective and subjective evaluations have validated the effectiveness of the proposed framework. To the best of our knowledge, this research is the first study on modelling, synthesizing, and evaluating mixed emotions in speech.


INTRODUCTION
Humans can feel multiple emotional states at the same time [1]. Consider some bittersweet moments, such as remembering a lost love with warmth or leaving home for college for the first time: it is possible to experience the co-occurrence of different types of emotions, even two oppositely valenced emotions (e.g., happy and sad) [2], [3]. Emotional speech synthesis aims to add emotional effects to a synthesized voice [4]. Synthesizing mixed emotions will mark a milestone for achieving human-like emotions in speech synthesis, thus enabling a higher level of emotional intelligence in human-computer interaction [5], [6], [7].
Synthesizing a mixed emotional effect is a challenging task. One of the reasons is the subtle nature of human emotions [21]. Therefore, it is not straightforward to precisely characterize speech emotion. Besides, speech emotion is inherently supra-segmental and complex, with multiple acoustic cues such as timbre, pitch and rhythm [22], [23]. Both spectral and prosodic variants need to be studied when modelling speech emotion. The early studies on emotional speech synthesis rely on statistical modelling of different speech parameters with hidden Markov models (HMM) [24], [25] and Gaussian mixture models (GMM) [26], [27]. Deep neural networks (DNN) [28], [29] and deep bi-directional long short-term memory networks (DBLSTM) [30], [31] represent the recent advances. The end-to-end neural architecture [32], [33] has become popular because of its superior performance. We note that there are generally two types of methods in the literature to learn emotion information: one uses auxiliary emotion labels as the condition of the framework [34], [35], and the other imitates the emotion style of the reference speech [36], [37]. However, these methods learn the global temporal structure of speech emotion, resulting in a monotonous expressiveness in synthesized speech. As a result, these frameworks can only synthesize the emotion types exhibited in the database. These disadvantages limit the flexibility and controllability of the above frameworks. For example, it is hard to synthesize mixed emotional effects with existing emotional speech synthesis frameworks.
For the first time, we study the modelling of mixed emotions in speech synthesis. In psychology, there have been studies [38], [39] to understand the paradigms and measures of mixed emotions. However, the study of mixed emotions in speech synthesis has not yet received attention, and two main research problems remain: (1) how to characterize and quantify the mixture of speech emotions, and (2) how to evaluate the synthesized speech. In this article, we will address these two challenges.
The main contributions of this article are listed as follows:
• For the first time, we study the modelling of mixed emotions for speech synthesis, which brings us a step closer to achieving emotional intelligence;
• We introduce a novel scheme to measure the relative difference between emotion categories, with which the emotional text-to-speech framework learns to quantify the differences between the emotion styles of speech samples during the training. At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector;
• We carefully devise objective and subjective evaluations to confirm the effectiveness of the proposed framework and the emotional expressiveness of the speech.
This paper is organized as follows: In Section 2, we motivate our study by introducing the background and related work. In Section 3, we present the details of our proposed framework, and we introduce our experiments in Section 4. We provide further investigations in Section 5. The study is concluded in Section 6.

BACKGROUND AND RELATED WORK
This work is built on several previous studies on the characterization of emotions, sequence-to-sequence emotion modelling for speech synthesis and controllable emotional speech synthesis. We briefly introduce the related studies to set the stage for our research and rationalize the novelty of our contributions.

Characterization of Emotions
Understanding human emotions (e.g., their nature and functions) has been gaining considerable attention in psychology [40], [41], [42]. This study is inspired by several previous studies, including the theory of the emotion wheel and the ordinal nature of emotions.

Theory of the Emotion Wheel
Humans can experience around 34,000 different emotions [43]. While it is hard to understand all these distinct emotions, Plutchik proposed 8 primary emotions: anger, fear, sadness, disgust, surprise, anticipation, trust and joy, and arranged them in an emotion wheel [44] as shown in Figure 1. All other emotions can be regarded as mixed or derivative states of these primary emotions [44]. According to the theory of the emotion wheel, changes in intensity could produce the diverse range of emotions we can feel. Besides, adding up primary emotions could produce new emotion types. For example, delight can be produced by combining joy and surprise [45].
Despite these efforts in psychology, there is almost no attempt to model the mixed emotions in the literature of speech synthesis. Inspired by the theory of the emotion wheel, we believe it is possible to combine different primary emotions and synthesize mixed emotions in speech. This technique will also allow us to create new emotion types that are hard to collect in real life, which could help us better mimic human emotions and further enhance the engagement in human-robot interaction.
Fig. 1: An illustration of the theory of the emotion wheel [44], where all emotions occur as the mixed or derivative states of eight primary emotions.

The Ordinal Nature of Emotions
Emotions are intrinsically relative, and their annotations and analysis should follow the ordinal path [46], [47]. Instead of assigning an absolute score or an emotion category, ordinal methods characterize emotions through comparative assessments (e.g., is sentence one happier than sentence two?).
The key idea of ordinal methods is to learn a ranking according to the given criterion. An example is preference learning [51], where the task is to establish preferences between samples. Once the preferences are established, ranking samples [52], [53], [54] is straightforward. Other rank-based methods [55], [56], [57] also show the effectiveness of modelling the affect for speech emotion recognition. As for emotional speech synthesis, researchers also explore the ordinal nature of emotions to model the emotion intensity [58], [59], [60], [61], where the intensity of an emotion is treated as the relative difference between neutral and emotional samples. Inspired by the previous studies, we aim to study rank-based methods to quantify the relative differences between the speech samples from different emotion categories, which we discuss later.

Sequence-to-Sequence Emotion Modelling for Speech Synthesis
The sequence-to-sequence model with attention mechanism was first studied in machine translation [62] and later on found effective in speech synthesis [12], [63]. We consider that sequence-to-sequence models are suitable for modelling speech emotion. Sequence-to-sequence models are more effective in modelling the long-term dependencies at different temporal levels such as word, phrase and utterance [64]. By learning attention alignment, sequence-to-sequence models can capture the dynamic prosodic variants within an utterance [65]. They also allow for the prediction of the speech duration at run-time, which is a critical prosodic factor of the speech emotion [66].
There are generally two types of methods in the literature to model speech emotions: 1) explicit label-based and 2) reference-based approaches. Next, we will briefly introduce these two approaches in sequence-to-sequence modelling.

Learn to Associate with Explicit Labels
The most straightforward way to characterize emotion is to use explicit emotion labels [34], [35], where the model learns to associate labels with emotion styles. In [34], an emotion label vector is taken by the attention-based decoder to produce the desired emotion. In [35], a low-resource emotional text-to-speech system is built using model adaptation with a few emotion labels. In addition to the explicit labels of discrete emotion categories, there are attempts to condition the decoder with continuous variables [67].

Learn to Imitate a Reference
Another approach is to use a style encoder to imitate and transplant the reference style [32]. Global style token (GST) [36] is an example that learns style embeddings from the reference audio in an unsupervised manner. Some studies incorporate additional emotion recognition loss [33], [68], perceptual loss [60], [69] or adversarial training [70] to help with the emotion rendering. Other studies [71], [72], [73], [74] replace the global style embedding with phoneme or segmental level prosody embedding to capture multiscale emotion variants. Similar approaches have also been applied to emotional voice conversion research. In [75], the style encoder further acts as the emotion encoder to learn actual emotion information through a two-stage training. In [76], a speaker encoder is further introduced to preserve the speaker information.
These successful attempts motivate us to leverage the sequence-to-sequence mechanism to enable emotion modelling for speech synthesis.

Controllable Emotional Speech Synthesis
Speech emotion is often manifested in various prosody aspects [77]. Emotion rendering can be controlled by modifying different prosodic cues. Current studies [78], [79] mainly focus on designing the prosody embedding as a control vector that is derived from a representation learning framework. For example, style tokens [36] are designed to represent high-level styles such as speaker style, pitch range and speaking rate. Emotion rendering can be controlled by choosing specific tokens. Recent attempts [80], [81] study a way to include a hierarchical, fine-grained prosody representation into the style token-based diagram [36]. Some other studies also use variational autoencoders (VAE) [82] to control the speech style by learning, scaling or combining disentangled representations [83], [84].
Recently, emotion intensity control has attracted much attention in emotional speech synthesis. Emotion intensity is considered to be correlated with all the acoustic cues that contribute to speech emotion [85], which makes it even more subjective and challenging to model. Some studies use auxiliary features such as a state of voiced, unvoiced and silence (VUS) [86], attention weights or a saliency map [87] to control the emotion intensity. Other studies manipulate the internal emotion representations through interpolation [88], scaling [76] or distance-based quantization [89]. In [58], [59], [60], [61], relative attributes are introduced to learn a more interpretable representation of emotion intensity. However, none of these frameworks studied the correlation and interplay between different emotions. This contribution aims to fill this research gap.
Fig. 2: Block diagram of our proposed relative scheme applied to emotional text-to-speech at run-time.

Summary of Research Gap
We briefly summarize the gaps in the current literature on speech synthesis that we aim to address in this study:
• The synthesis of mixed emotions has not been studied in speech synthesis, which limits the capability of current systems to imitate human emotions;
• Despite much progress in psychology, it is still challenging to characterize and quantify the mixture of emotions in speech;
• Current evaluation methods are inadequate to assess mixed emotional effects. The rethinking of the current evaluation for mixed emotions is needed.
This study is a departure from the current studies on emotional speech synthesis. We seek to display the possibilities to synthesize mixed emotions that are subtle but do exist in our real life.

MIXED EMOTION MODELLING AND SYNTHESIS
We propose a novel relative scheme that allows for manually manipulating the synthesized emotion, i.e., mixing multiple different emotion styles. As shown in Figure 2, the proposed scheme allows for flexible control of the extent of each contributing emotion in the speech. At run-time, the framework transfers the reference emotion into a new utterance with the text input, also known as emotional text-to-speech.
We first describe our method of characterizing mixed emotions in speech and highlight our contributions to designing a novel relative scheme. Then, we present the details of the sequence-to-sequence emotion training with the proposed relative scheme. Lastly, we show the flexible control of the proposed framework for synthesizing mixed emotions.

Characterization of Mixed Emotions in Speech
Emotion can be characterized with either categorical [90], [91] or dimensional representations [92], [93]. With designated emotion labels, the emotion category approach is the most straightforward way to represent emotions. However, such representation ignores the subtle variations of emotions. Another approach seeks to model the physical properties of speech emotion with dimensional representations. An example is Russell's circumplex model [92], where emotions are distributed in a two-dimensional circular space, containing arousal and valence dimensions.
One of the most straightforward ways to characterize mixed emotions is to inject different emotion styles into a continuous space. Mixed emotions could be synthesized by adjusting each dimension carefully. However, only a few emotional speech databases [94], [95] provide such annotations. These dimensional annotations are subjective and expensive to collect. Therefore, we only utilize discrete emotion labels available in most databases. We first make an assumption based on the theory of the emotion wheel [44]: mixed emotions are characterized by combinations, mixtures, or compounds of primary emotions. While it is not straightforward to add up emotions, we explore the ordinal nature of emotions instead.
We propose a rank-based relative scheme to quantify the relative difference between speech recordings with different emotion types. Mixed emotions can be characterized by adjusting the relative difference with other emotion types. The relative difference value can also quantify the level of engagement of each emotion. We introduce our design of a novel relative scheme next.

Design of a Novel Relative Scheme
One of the challenges of synthesizing mixed emotions is quantifying the association or the interplay between different emotions. Inspired by the ordinal nature of emotions, we propose a novel relative scheme to address this challenge. We first make two assumptions according to the theory of the emotion wheel: (1) all emotions are related to some extent; (2) each emotion has stereotypical styles. In our proposal, we not only characterize the identifiable styles of each emotion but also seek to quantify the similarity between different emotion styles.
Fig. 4: The training diagram of the proposed framework. The pre-trained relative scheme learns to generate an emotion attribute vector that measures the relative difference between the input emotion style ('Happy') and other primary emotion styles ('Angry', 'Sad', 'Surprise' and 'Neutral').
We study a rank-based method to measure the relative difference between emotion categories, which can offer more informative descriptions and thus be closer to human supervision [96]. In computer vision, the relative attribute [96] represents an effective way to model the relative difference between two categories of data. Inspired by the success in various computer vision tasks [97], [98], [99], we believe relative attributes bridge between the low-level features and high-level semantic meanings, which allows us to model the relative difference between emotions only with discrete emotion labels. In this way, we regard the identifiable emotion style as an attribute of speech data, which can be represented with a rich set of emotion-related acoustic features. The relative difference of the emotion styles can be modelled as a relative attribute, which is called "emotion attribute" in this article. The emotion attribute can be learned through a max-margin optimization problem as explained below. Given a training set $T = \{x_n\}$, where $x_n$ denotes the acoustic features of the $n$-th training sample, and $T = A \cup B$, where $A$ and $B$ are two different emotion sets, we aim to learn a ranking function given as below:
where $W$ is a weighting matrix indicating the difference in emotion styles. According to hypotheses (1) and (2), we propose the following constraints:
The weighting matrix $W$ is estimated by solving the following problem similar to that of a support vector machine [100]:
where $C$ is the trade-off between the margin and the size of slack variables $\xi_{i,j}$ and $\gamma_{i,j}$. Through Eqs. (4)-(7), we learn a wide-margin ranking function that enforces the ordering on each training point. As shown in Figure 3(a), we train a relative ranking function $f(x)$ between each emotion pair. At the inference phase, the trained function can estimate an emotion attribute of unseen data, as shown in Figure 3(b). In practice, each emotion attribute value is normalized to [0, 1], where a smaller value indicates a similar emotional style. All the normalized emotion attributes form an emotion attribute vector. The emotion attribute vector bridges the discrete primary emotion labels and is further incorporated in sequence-to-sequence emotion training.
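For reference, the ranking function, constraints and max-margin problem described in this subsection can be sketched in the standard relative-attribute form of [96]. The pair sets $O$ (ordered pairs, with $x_i$ and $x_j$ drawn from different emotion sets) and $S$ (similar pairs, drawn from the same emotion set) are notation assumed here for illustration, and the squared slack penalties follow the usual formulation of [96] rather than anything stated in the surrounding text:

```latex
\begin{align}
f(x_n) &= W x_n, \\
\forall (i,j) \in O &: \; W x_i > W x_j, \\
\forall (i,j) \in S &: \; W x_i = W x_j, \\
\min_{W} \quad & \tfrac{1}{2}\lVert W \rVert_2^2
  + C \Big( \sum \xi_{i,j}^2 + \sum \gamma_{i,j}^2 \Big) \\
\text{s.t.} \quad & W (x_i - x_j) \ge 1 - \xi_{i,j}, \quad \forall (i,j) \in O, \\
& \lvert W (x_i - x_j) \rvert \le \gamma_{i,j}, \quad \forall (i,j) \in S, \\
& \xi_{i,j} \ge 0, \quad \gamma_{i,j} \ge 0.
\end{align}
```

Under this reading, the last four lines correspond to the objective and constraints referred to as Eqs. (4)-(7): the slack variables $\xi_{i,j}$ relax the ordering constraints and $\gamma_{i,j}$ relax the similarity constraints.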

Training Strategy
We adopt an emotional text-to-speech framework with the joint training of voice conversion as in [75]. As both text-to-speech and voice conversion share a common goal of generating realistic speech from the internal representations, the joint training has been shown to be effective [101], [102], [103], [104]. The text-to-speech task could benefit from the phone-embedding vectors [105], [106], or the prosody style introduced by a reference encoder [32]. A shared decoder between text-to-speech and voice conversion contributes to a robust decoding process [107], [108], [109].
The overall emotional text-to-speech framework is an encoder-decoder model that is trained as a sequence-to-sequence system, as shown in Figure 4, where the text encoder and linguistic encoder generate an embedding sequence for the input, while the emotion encoder generates one embedding that encapsulates the whole reference speech sample.
Fig. 5: The run-time diagram of the proposed emotional text-to-speech framework. The emotion rendering can be manually controlled via the relative scheme. By assigning the appropriate percentage to the attribute vector, we produce a target emotion mixture.
Given the text or speech as input, the text and the linguistic encoder learn to predict the linguistic embedding from the text or speech, respectively. The decoder takes the linguistic embedding from the text or speech in an alternating manner, depending on whether the epoch number is odd or even. Similar to [102], a contrastive loss is used to ensure the similarity between these two types of linguistic embeddings. The adversarial training strategy with an emotion classifier is employed on the acoustic linguistic embedding to eliminate the residual emotion information.
An emotion encoder is used to extract an emotion embedding vector from the input speech under the supervision of an emotion label. Meanwhile, an emotion attribute vector is generated by the pre-trained relative scheme described in Section 3.2, and then passed through a fully connected (FC) layer, resulting in a relative embedding. The emotion embedding describes the emotion styles of the input speech, while the emotion attribute vector indicates the difference between the input emotion style and other emotion styles. Finally, the decoder learns to reconstruct the input emotion style from a combination of emotion and relative embeddings.
The whole training procedure can be viewed as a recognition-synthesis process at the sequence level. Our proposed framework learns not only the abundant emotion variance that is exhibited in a database but also the correlation or association across different emotion categories. It allows us to explicitly adjust the difference level at run-time and further enables mixed emotion synthesis and the flexible control of emotion rendering at the same time, which will be discussed next.
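The way the two embeddings reach the decoder can be illustrated with a small NumPy sketch. It is not the authors' implementation: the attribute vector length (4), the relative embedding width (64), the random FC weights and the use of concatenation as the "combination" are all assumptions made for illustration; only the 64-dimensional encoder output is taken from the network configuration, assuming the style encoder listed there is the emotion encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

EMOTION_DIM = 64    # encoder FC output size reported in the network configuration
ATTRIBUTE_DIM = 4   # assumed: one attribute per "other" emotion
RELATIVE_DIM = 64   # assumed width of the relative embedding

# Hypothetical FC layer parameters, randomly initialised for illustration only.
fc_weight = rng.standard_normal((RELATIVE_DIM, ATTRIBUTE_DIM)) * 0.1
fc_bias = np.zeros(RELATIVE_DIM)


def relative_embedding(attribute_vector):
    """Project an emotion attribute vector in [0, 1] through one FC layer."""
    a = np.asarray(attribute_vector, dtype=float)
    if a.shape != (ATTRIBUTE_DIM,):
        raise ValueError(f"expected {ATTRIBUTE_DIM} attribute values")
    if np.any(a < 0.0) or np.any(a > 1.0):
        raise ValueError("attribute values must lie in [0, 1]")
    return fc_weight @ a + fc_bias


def decoder_condition(emotion_embedding, attribute_vector):
    """Concatenate emotion and relative embeddings (concatenation is assumed)."""
    e = np.asarray(emotion_embedding, dtype=float)
    return np.concatenate([e, relative_embedding(attribute_vector)])


emotion_embedding = rng.standard_normal(EMOTION_DIM)
cond = decoder_condition(emotion_embedding, [0.9, 0.1, 0.2, 0.3])
print(cond.shape)  # (128,)
```

In a trained system the FC weights would be learned jointly with the decoder; the sketch only shows the data flow from attribute vector to conditioning vector.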

Control of Emotion Rendering
We illustrate our proposed emotional text-to-speech framework in Figure 5, which renders controllable emotional speech at run-time. The framework consists of three main modules: the content encoder, the emotion controller, and the decoder.
The text encoder projects the linguistic information from the input text into an internal representation. The emotion encoder captures the emotion style in an embedding from the reference speech, while the relative scheme further introduces the characteristics of other emotion types with a manually assigned attribute vector. By varying the percentage for each primary emotion in the attribute vector, we can easily synthesize the desired emotional effects and control the emotion rendering in synthesized speech.
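As a minimal sketch of the manual control described above, the helper below turns per-emotion percentages into an attribute vector. The function name, the five-entry layout, the emotion ordering and the linear mapping from percentage to a value in [0, 1] are assumptions for illustration, not the interface of the released code.

```python
PRIMARY_EMOTIONS = ("Neutral", "Angry", "Happy", "Sad", "Surprise")


def make_attribute_vector(percentages, emotions=PRIMARY_EMOTIONS):
    """Map {emotion: percentage in 0..100} to a list of values in [0, 1].

    Emotions that are not mentioned default to 0%.
    """
    unknown = set(percentages) - set(emotions)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    vector = []
    for name in emotions:
        pct = percentages.get(name, 0)
        if not 0 <= pct <= 100:
            raise ValueError(f"{name}: percentage must be in [0, 100]")
        vector.append(pct / 100.0)
    return vector


# Surprise is kept at 100% while Angry is swept over the four settings
# used in the experiments.
for angry in (0, 30, 60, 90):
    print(angry, make_attribute_vector({"Surprise": 100, "Angry": angry}))
```

Keeping the vector construction separate from the model makes it easy to sweep one emotion while holding the others fixed, which is the pattern the evaluations rely on.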

EXPERIMENTS AND EVALUATIONS
In this section, we report our experimental settings and results. As shown in Table 1, for all the experiments, we synthesize mixed emotional effects by mixing a primary emotion (Surprise) with three reference emotions (Happy, Angry and Sad) respectively. We expect to synthesize mixed emotional effects similar to the secondary emotions such as Delight, Outrage and Disappointment, respectively. We choose these three combinations because they are thought to be easier to perceive for the listeners and have been studied in psychology [1], [44].
Since this contribution serves as a pioneer in related fields, there is no literature or reference method before this study, to the best of our knowledge. Therefore, we could not include any baselines in our experiments. Instead, we adopt objective and subjective metrics widely used in previous literature and carefully design evaluation methods to show the effectiveness of our proposal. We have made the source code and speech demos available to the public. We encourage readers to listen to the speech samples on our demo website to best understand this work.

Experimental Setup
We use acoustic features and phoneme sequences as inputs to the proposed framework during the training. The acoustic features are 80-dimensional log Mel-spectrograms extracted every 12.5 ms with a frame size of 50 ms for short-time Fourier transform (STFT). We convert text to phonemes with the Festival [110] G2P tool to serve as the input to the text encoder. At run-time, we synthesize emotional speech from the text input.
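The feature settings above (80 Mel bands, 12.5 ms shift, 50 ms frame) can be made concrete with a NumPy-only sketch. The sampling rate (16 kHz), the FFT size (1024), the Hann window, the HTK-style Mel formula and the log floor are assumptions, since the text does not state them; the authors' extraction pipeline may differ.

```python
import numpy as np

SAMPLE_RATE = 16000                  # assumption: not stated in the text
HOP = round(0.0125 * SAMPLE_RATE)    # 12.5 ms frame shift -> 200 samples
WIN = round(0.050 * SAMPLE_RATE)     # 50 ms frame size -> 800 samples
N_FFT = 1024                         # assumption: power of two >= WIN
N_MELS = 80


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sr=SAMPLE_RATE):
    """Triangular Mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fb = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        left, centre, right = hz_points[m:m + 3]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_spectrogram(wave):
    """Return an array of shape (n_frames, 80) for a 1-D waveform."""
    wave = np.asarray(wave, dtype=float)
    if wave.size < WIN:
        wave = np.pad(wave, (0, WIN - wave.size))
    n_frames = 1 + (wave.size - WIN) // HOP
    window = np.hanning(WIN)
    frames = np.stack(
        [wave[i * HOP:i * HOP + WIN] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1))
    mel = magnitude @ mel_filterbank().T
    return np.log(np.maximum(mel, 1e-5))  # floor avoids log(0)


one_second = np.sin(2 * np.pi * 220.0 * np.arange(SAMPLE_RATE) / SAMPLE_RATE)
print(log_mel_spectrogram(one_second).shape)  # (77, 80)
```

With these assumed settings, one second of audio yields 77 frames because no padding is applied at the edges; libraries that centre-pad the signal would return a few more.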

Network Configuration
Our proposed framework can be regarded as a sequence-level recognition-synthesis structure similar to that of [102], [111]. Both the linguistic encoder and the decoder have a sequence-to-sequence encoder-decoder structure. The linguistic encoder consists of an encoder, which is a 2-layer 256-cell BLSTM, and a decoder, which is a 1-layer 512-cell BLSTM with an attention layer followed by a fully connected (FC) layer with an output channel number of 512. The decoder has the same model architecture as that of Tacotron [12]. The text encoder is a 3-layer 1D CNN with a kernel size of 5 and a channel number of 512. The text encoder is followed by a 1-layer 256-cell BLSTM and an FC layer with an output channel number of 512. The style encoder is a 2-layer 128-cell BLSTM followed by an FC layer with an output channel number of 64. The classifier is a 4-layer FC with channel numbers of {512, 512, 512, 5}.

Training Pipeline
We first pre-train a relative ranking function between each emotion pair using an emotional speech dataset. We implement the relative ranking function following an open-source repository. We use a standardized set of 384 acoustic features extracted with openSMILE [112] as the input features. These features include zero-crossing rate, frame energy, pitch frequency, and Mel-frequency cepstral coefficient (MFCC) used in the Interspeech Emotion Challenge [113]. The trained ranking functions reported a classification accuracy of 97% on the test set.
We then conduct a two-stage training strategy to train our text-to-speech framework, which consists of (1) multi-speaker text-to-speech training with the VCTK Corpus [114] and (2) emotion adaptation for text-to-speech with a single speaker from the ESD dataset [115], [116]. The proposed text-to-speech framework learns abundant speaker styles with a multi-speaker corpus and then learns the actual emotion information with a small amount of emotional speech data. The training strategy we used is similar to that of [75]. During the training, we use the Adam optimizer [117] and set the batch size to 64 and 4 for multi-speaker text-to-speech training and emotion adaptation, respectively. We set the learning rate to 0.001 and the weight decay to 0.0001 for multi-speaker text-to-speech training. We halve the learning rate every seven epochs during the emotion adaptation.
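The pre-training of one pairwise ranking function can be illustrated on synthetic data. This is a simplified stand-in, not the open-source implementation the authors follow: it uses only ordered pairs, a plain hinge loss optimised by subgradient descent, and random 384-dimensional vectors in place of openSMILE features. All function names and hyperparameters are hypothetical.

```python
import numpy as np


def train_ranker(x_high, x_low, c=1.0, lr=0.01, epochs=200):
    """Learn w so that w @ a > w @ b for every a in x_high and b in x_low."""
    dim = x_high.shape[1]
    # All ordered-pair differences, shape (n_high * n_low, dim).
    diffs = (x_high[:, None, :] - x_low[None, :, :]).reshape(-1, dim)
    w = np.zeros(dim)
    for _ in range(epochs):
        violated = diffs @ w < 1.0  # pairs inside the margin
        # Subgradient of 0.5 * ||w||^2 + c * mean(max(0, 1 - w @ diff)).
        grad = w - c * diffs[violated].sum(axis=0) / len(diffs)
        w -= lr * grad
    return w


def normalise(scores):
    """Min-max scale ranking scores to [0, 1]."""
    lo, hi = scores.min(), scores.max()
    if hi <= lo:
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)


rng = np.random.default_rng(0)
dim = 384  # same size as the feature set in the text; the data is synthetic
direction = rng.standard_normal(dim)
direction /= np.linalg.norm(direction)
set_a = 0.1 * rng.standard_normal((40, dim)) + direction  # e.g. "Happy"
set_b = 0.1 * rng.standard_normal((40, dim)) - direction  # e.g. "Neutral"

w = train_ranker(set_a, set_b)
scores = normalise(np.concatenate([set_a, set_b]) @ w)
print(scores[:40].mean(), scores[40:].mean())
```

The normalised scores play the role of one entry of the emotion attribute vector; a full scheme would train one such function per emotion pair.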
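The halving schedule used during emotion adaptation can be written as a one-line step decay. The starting learning rate for the adaptation stage is not stated in the text, so the default of 0.001 below (the value given for multi-speaker training) and the zero-based epoch counting are assumptions.

```python
def halved_learning_rate(epoch, base_lr=0.001, step=7):
    """Learning rate at a zero-based epoch when halving every `step` epochs."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return base_lr * 0.5 ** (epoch // step)


for epoch in (0, 6, 7, 14, 21):
    print(epoch, halved_learning_rate(epoch))
```

The same behaviour is available in common deep learning frameworks as a step scheduler with a decay factor of 0.5 and a step size of 7.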

Data Preparation
We select the VCTK Corpus [114] to perform multi-speaker text-to-speech training, where we use 99 speakers and the total duration of training speech data is about 30 hours. We select the ESD dataset [115], [116] to perform emotion adaptation and relative ranking training. We choose one English male ('0013') and one English female ('0019') speaker from the ESD. We consider five emotions: Neutral, Angry, Happy, Sad and Surprise, and for each emotion, we follow the data partition given in the ESD. For each speaker and each emotion, we use 300, 30 and 20 utterances for training, testing, and evaluation, respectively. The total duration of emotional speech training data is around 50 minutes.

Objective Evaluation
We first perform objective evaluations to validate the proposed mixed emotion synthesis. We demonstrate the effectiveness of our proposals and provide analysis with a pre-trained speech emotion recognition (SER) model. We calculate Mel-cepstral distortion (MCD) and Pearson correlation coefficient (PCC) as objective evaluation metrics.

Analysis with Speech Emotion Recognition
We train a speech emotion recognition model on the ESD dataset [115] with the same data partition described in Section 4.1.3. To improve the robustness of SER, data augmentation is performed by adding white Gaussian noise during the SER training [118], [119], [120], [121].
The SER architecture is the same as that in [122], which includes: 1) a three-dimensional (3-D) CNN layer; 2) a BLSTM; 3) an attention layer; and 4) a fully connected (FC) layer. We evaluate our synthesized mixed emotions with the pre-trained SER. We use the classification probabilities derived from the softmax layer of the SER to analyze the effects of mixed emotions. As a high-level feature, the classification probabilities summarize the useful emotion information from the previous layers for final decision-making. The classification probabilities offer us an effective tool to assess how well each emotional component can be recognized by the SER from the emotion mixture.
We first report the classification probabilities for a male speaker ('0013') in Figure 6. We evaluate four different combinations where we gradually increase the percentage (0%, 30%, 60%, 90%) of Angry, Happy or Sad while keeping that of Surprise fixed at 100%. As shown in Figure 6(a), we observe that the probability of Angry increases as we increase the percentage of Angry from 0% to 90%. Meanwhile, the probability of Surprise decreases but remains higher than that of the other emotions. The probability of Angry reaches 0.25 when the percentage of Angry reaches 90%. We also note similar observations for Happy and Sad as shown in Figure 6(b) and (c).
We then report the classification probabilities for a female speaker ('0019') in Figure 7. Similar to that of the male speaker, we report four different percentages (0%, 30%, 60%, 90%) of Angry, Happy or Sad while keeping that of Surprise at 100%. For Happy, we observe that the probability of Happy considerably increases as we increase the percentage of Happy in mixed emotions, as shown in Figure 7(b). For Angry and Sad, we find similar observations as in Figure 7(a) and (c). These observations indicate that the mixed emotions can be recognized by a pre-trained SER.
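A short sketch of how softmax probabilities might be summarized for one mixing condition is given below. The class ordering, the logit values and the choice to average probabilities over utterances are assumptions; the text does not specify how per-utterance outputs are aggregated in the figures.

```python
import numpy as np

EMOTIONS = ("Neutral", "Angry", "Happy", "Sad", "Surprise")  # assumed order


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def mean_class_probabilities(logits):
    """Average softmax probabilities over utterances.

    logits: array-like of shape (n_utterances, n_emotions).
    """
    probs = softmax(np.asarray(logits, dtype=float))
    return dict(zip(EMOTIONS, probs.mean(axis=0)))


# Hypothetical SER logits for three utterances of one mixture setting.
example_logits = [
    [0.1, 1.2, 0.3, 0.0, 2.5],
    [0.0, 1.0, 0.2, 0.1, 2.2],
    [0.2, 1.5, 0.1, 0.0, 2.0],
]
for name, p in mean_class_probabilities(example_logits).items():
    print(f"{name}: {p:.3f}")
```

Repeating this for each percentage setting gives one probability per emotion per setting, which is the kind of curve the figures in this subsection describe.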

Pearson Correlation Coefficient
Pitch is considered a major prosodic factor contributing to speech emotion, closely correlated to the activity level [125], [126]. In practice, the pitch is often represented by the fundamental frequency (F0), which can be estimated with the Harvest algorithm [127]. We calculate the Pearson Correlation Coefficient (PCC) of F0 to measure the linear dependency between two F0 sequences, which has been used in previous studies [128], [129], [130]. The PCC between two F0 sequences is given as:
$$\rho(F_0^s, F_0^t) = \frac{\mathrm{cov}(F_0^s, F_0^t)}{\sigma_{F_0^s}\,\sigma_{F_0^t}},$$
where $\mathrm{cov}(\cdot)$ represents the covariance function, and $\sigma_{F_0^s}$ and $\sigma_{F_0^t}$ are the standard deviations of the synthesized F0 sequence ($F_0^s$) and the target F0 sequence ($F_0^t$), respectively.
A higher PCC value represents a higher degree of similarity in prosody.
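The PCC defined above can be computed directly from two F0 contours. The sketch below assumes the two sequences have already been time-aligned to the same length (for example by dynamic time warping) and that unvoiced frames are marked with an F0 of zero; neither detail is specified in the text, and F0 extraction itself is not shown.

```python
import numpy as np


def f0_pcc(f0_synth, f0_target):
    """Pearson correlation between two aligned F0 sequences (voiced frames only)."""
    s = np.asarray(f0_synth, dtype=float)
    t = np.asarray(f0_target, dtype=float)
    if s.shape != t.shape or s.ndim != 1:
        raise ValueError("sequences must be 1-D and of equal length")
    voiced = (s > 0) & (t > 0)  # assumption: unvoiced frames have F0 == 0
    s, t = s[voiced], t[voiced]
    if s.size < 2:
        raise ValueError("need at least two frames voiced in both sequences")
    cov = np.mean((s - s.mean()) * (t - t.mean()))
    denom = s.std() * t.std()
    if denom == 0.0:
        raise ValueError("constant F0 sequence: correlation is undefined")
    return float(cov / denom)


synth = [100.0, 110.0, 0.0, 120.0, 130.0]
target = [200.0, 220.0, 0.0, 240.0, 260.0]
print(f0_pcc(synth, target))
```

Because the coefficient is invariant to scaling and offset, it compares the shape of the pitch contours rather than their absolute register.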

Discussion of the MCD and PCC Results
To show the effectiveness of synthesizing mixed emotions, we calculate MCD and PCC between the synthesized results and the reference emotions (Angry, Happy and Sad). We choose one male ('0013') and one female speaker ('0019') from the ESD dataset [115]. For each speaker, we use 20 utterances for evaluation. We report four different percentages of Angry, Happy and Sad: 0%, 30%, 60% and 90%. Again, we keep Surprise as the primary emotion, whose percentage is always 100%. We first compare spectrum similarity as shown in Figure 8. For all three different combinations, we observe that the MCD values decrease as the percentage of reference emotions (Angry, Happy and Sad) increases, as shown in Figure 8(a), (b) and (c). These results show that the synthesized emotion becomes more similar to the reference emotions in the spectrum as we increase the percentage of the reference emotions.
We have similar observations for prosody similarity as shown in Figure 9. As the percentage of reference emotions (Angry, Happy and Sad) increases, we observe that the PCC value consistently increases. This indicates that the synthesized mixed emotions have a stronger correlation with the reference emotions (Angry, Happy and Sad) in terms of the prosody variance. These results show that we can effectively synthesize and further control the rendering of mixed emotions in terms of the spectrum and prosody.
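The MCD values discussed above follow a widely used definition, sketched here for reference. The exact variant used in the experiments is not given in this text, so the exclusion of the 0th (energy) coefficient, the frame-wise averaging and the assumption of pre-aligned sequences are conventions adopted for illustration.

```python
import numpy as np


def mel_cepstral_distortion(mcep_a, mcep_b):
    """Mean MCD in dB between two aligned Mel-cepstral sequences.

    Both inputs have shape (n_frames, n_coefficients).
    """
    a = np.asarray(mcep_a, dtype=float)
    b = np.asarray(mcep_b, dtype=float)
    if a.shape != b.shape or a.ndim != 2:
        raise ValueError("inputs must be 2-D arrays of identical shape")
    # Skip the 0th coefficient (energy), a common convention.
    diff = a[:, 1:] - b[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())


reference = np.zeros((3, 25))
shifted = reference.copy()
shifted[:, 1] = 1.0  # unit difference in a single coefficient
print(mel_cepstral_distortion(reference, shifted))
```

A lower value means the two spectral envelopes are closer, which is why a decreasing MCD is read as growing similarity to the reference emotion.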

Subjective Evaluation
We conduct subjective evaluations with human listeners, whom we ask to focus on two aspects: (1) Speech Quality and (2) Emotion Perception.

Speech Quality
We first conduct the Mean Opinion Score (MOS) test to evaluate speech quality, covering the speech's naturalness, intelligibility and listening effort. All participants are asked to listen to the reference speech ("Ground truth") and the synthesized speech with mixed emotions and score the "quality" of each speech sample on a 5-point scale ('5' for excellent, '4' for good, '3' for fair, '2' for poor, and '1' for bad). Twenty subjects listened to 80 speech samples in total (80 = 5 x 4 (# of percentages) x 3 (Angry, Happy and Sad) + 20 (# of Ground truth)). The actual speech samples can be found on our demo website. We report the MOS results in Table 2, which show that our synthesized mixed emotions retain the speech quality between fair and good.
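A MOS value is the mean of the listeners' 1-to-5 ratings, often reported with a confidence interval. The sketch below uses a normal approximation for a 95% interval; whether Table 2 reports intervals, and how they were computed, is not stated here, so this part is an assumption, and the ratings are made up for illustration.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Sample count from the test design: 5 utterances x 4 percentages x 3 emotions
# plus 20 ground-truth samples.
assert 5 * 4 * 3 + 20 == 80

hypothetical_ratings = [3, 4, 5, 4]
mean, half_width = mos_with_ci(hypothetical_ratings)
print(f"MOS = {mean:.2f} +/- {half_width:.2f}")
```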

Emotion Perception
We then conduct the best-worst scaling (BWS) test to evaluate the emotion perception of synthesized mixed emotions. All participants are asked to listen to the speech samples and choose the best and the worst one according to their perception of a specific emotion type. 20 subjects listened to 168 speech samples in total (168 = 7 x 4 (# of percentages) x 6 (Angry, Happy, Sad, Outrage, Delight and Disappointment)).
The actual speech samples can be found on our demo website.
We first evaluate the perception of the reference emotions (Angry, Happy and Sad) that are mixed with Surprise. As shown in Table 3a, 3b and 3c, the mixed emotion with 90% of the reference emotion consistently achieves the highest percentage of the "Best" score; moreover, the "Best" score increases as the percentage of the reference emotion increases.
Similarly, the highest "Worst" score is observed when the reference emotion is added at the lowest percentage (0%). These results confirm the effectiveness of controlling the rendering of mixed emotions. We also observe a slight rise in the "Worst" score when the percentage of Happy and Sad exceeds 60% in Table 3b and 3c. We attribute this observation to unnatural emotional expressions that may arise at high percentages and influence listeners' preferences.
We then take one step further to evaluate the perception of Outrage, Delight and Disappointment in synthesized speech. In psychology, there is evidence that those feelings could be produced by combining several emotions. We observe that participants can perceive such feelings, and most of them choose the samples with 90% of the reference emotions as the "Best", as shown in Table 4a, 4b and 4c. As for the "Worst" score, we have similar observations to those in Table 3. These results show that we can synthesize new emotion types that are subtle and hard to collect in real life, which will significantly benefit the research community.
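The "Best" and "Worst" percentages in a BWS test can be tallied in a few lines. The sketch below is illustrative only: it assumes each trial yields one (best, worst) pair of options, where an option is identified by the percentage of the reference emotion, and the judgments shown are hypothetical rather than taken from Tables 3 and 4.

```python
from collections import Counter


def bws_percentages(judgments, options):
    """Map each option to its (best %, worst %) over all trials.

    Assumption: `judgments` is a list of (best_option, worst_option) pairs,
    one pair per listener per trial.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    best = Counter(b for b, _ in judgments)
    worst = Counter(w for _, w in judgments)
    n = len(judgments)
    return {opt: (100.0 * best[opt] / n, 100.0 * worst[opt] / n) for opt in options}


# Hypothetical judgments; options are percentages of the reference emotion.
judgments = [(90, 0), (90, 0), (60, 0), (90, 30)]
for option, (best_pct, worst_pct) in bws_percentages(judgments, [0, 30, 60, 90]).items():
    print(f"{option}%: best {best_pct:.0f}%, worst {worst_pct:.0f}%")
```

Each column sums to 100% across the options, so a rising "Best" share for one percentage necessarily comes at the expense of the others.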

Ablation Study
We further conduct ablation studies to validate the contribution of the proposed relative scheme to emotional expression. We compare the proposed framework with and without the relative scheme through several XAB preference tests, where the participants are asked to listen to the reference emotional speech first and then choose the sample that is closer to the reference in terms of emotional expression. 20 subjects listened to 60 speech samples in total (60 = 5 x 2 (# of frameworks) x 4 (# of emotions) + 20 (# of ground truth)).
We report the XAB results in Figure 10, where we observe that "Proposed w/ Relative Scheme" consistently and considerably outperforms "Proposed w/o Relative Scheme" for all emotions (Angry, Happy, Sad and Surprise). Besides, the p-values calculated between the two systems ("Proposed w/ Relative Scheme" and "Proposed w/o Relative Scheme") are always lower than 0.05, indicating that the improvement is unlikely to have occurred by chance. These results demonstrate that our relative scheme can improve emotional intelligibility in synthesized emotional speech.
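One simple way to check whether a preference split in an XAB test differs from chance is an exact two-sided binomial (sign) test. The sketch below is an assumption made for illustration: the statistical test behind the reported p-values is not named in this section, and the counts are hypothetical.

```python
import math


def binomial_two_sided_p(k, n):
    """Exact two-sided binomial p-value under H0: both systems equally preferred.

    Sums the probabilities of all outcomes that are no more likely than the
    observed count `k` out of `n` choices.
    """
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("require n > 0 and 0 <= k <= n")
    probs = [math.comb(n, i) * 0.5 ** n for i in range(n + 1)]
    observed = probs[k]
    # The small tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))


# Hypothetical counts: 85 of 100 choices favour one system.
p_value = binomial_two_sided_p(85, 100)
print(f"p = {p_value:.3g}, significant at 0.05: {p_value < 0.05}")
```

This treats every choice as an independent trial; when the same listeners rate many samples, a paired test over per-listener preference rates may be more appropriate.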

FURTHER INVESTIGATIONS AND DISCUSSION
In this section, we expand our experiments and show the ability of our proposed methods on other interesting topics. We first investigate the mixed emotional effects of Happy and Sad, which are two oppositely valenced emotions. We then build an emotion transition system with our proposed method. We do not seek to conduct comprehensive evaluations but to provide some interesting insights into mixed emotion synthesis and its applications. All the speech samples are provided on the demo page.

Oppositely Valenced Emotions: Happy and Sad
In our experiments, we mostly focus on mixing Surprise with other emotions (Angry, Happy and Sad), which is thought to be easier for human listeners to perceive. Here, we move one step further to study a more challenging task, which is to synthesize mixed effects of Happy and Sad. In Russell's valence-arousal model [92], Happy and Sad are two conflicting emotions with opposite valence (Pleasant and Unpleasant). There is some debate on this point, with studies supporting the co-existence of conflicting emotions [131], [132]. In real life, different cultures also have terms to describe such feelings, for example, "Bittersweet" in English. Professional actors are thought to be able to deliver such

An Emotion Transition System
One potential application of mixed emotion synthesis is building an emotion transition system [133]. Emotion transition aims to gradually transition the emotion state from one to another. One similar study is emotional voice conversion [116], which aims to convert the emotional state. Compared with emotional voice conversion, the key challenge of emotion transition is to synthesize internal states between different emotion types. With our proposed methods, we are able to model these internal states by mixing different emotions. To achieve this, the percentages of the emotions need to sum to 100% (e.g., 80% Surprise with 20% Angry; 40% Happy with 60% Sad). Then, we can synthesize various internal emotion states by adjusting the percentages.
Compared with traditional methods such as interpolation, our proposed system is data-driven, and the synthesized emotions are more natural.
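The constraint that the percentages sum to 100% can be made concrete with a small sketch that generates a sequence of percentage pairs for a transition between two emotions. The function name and the evenly spaced steps are assumptions for illustration; how each pair is mapped onto the emotion attribute vector of the model is not shown here.

```python
def transition_schedule(source, target, steps):
    """Return percentage pairs moving from 100% source to 100% target.

    Every entry maps the two emotion names to percentages summing to 100.
    """
    if steps < 2:
        raise ValueError("at least two steps are required")
    if source == target:
        raise ValueError("source and target emotions must differ")
    schedule = []
    for i in range(steps):
        target_pct = round(100 * i / (steps - 1))
        schedule.append({source: 100 - target_pct, target: target_pct})
    return schedule


# Six steps from Surprise to Angry; the second step is 80% Surprise, 20% Angry.
for step in transition_schedule("Surprise", "Angry", 6):
    print(step)
```

The schedule only sets the control values; the acoustic realization of each internal state still comes from the trained model rather than from interpolating acoustic features.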

Discussion
This study serves as the first attempt to model and synthesize mixed emotions for speech synthesis. Although we have shown the effectiveness of our methods, the related problems have not been completely solved. We provide a discussion to address the concerns, show our findings, and inspire future studies.

Category vs. Dimensional Emotion Models
Our assumptions, formulation, and evaluation of mixed emotions are all based on categorical emotion studies. We note that mixed emotions can also be modelled with dimensional representations such as arousal, valence, and dominance. A dimensional model can capture a wide range of emotional concepts, which offers a means of measuring the similarity of different emotional states [134]. However, several problems need to be adequately dealt with when modelling mixed emotions with a dimensional model. As mentioned in Section 3.1, a significant challenge in using dimensional representations comes from the lack of labels. Besides, humans are more efficient at discriminating among options than giving an absolute score [135], which adds challenges to the evaluation process. Furthermore, dimensional models are restricted to modelling the co-occurrence of like-valenced discrete emotions [136]. For these reasons, we refrain from applying dimensional emotions to the current framework.

Remaining Challenges
There are a few remaining challenges that need attention from the community. As mentioned in Section 4.3.2, increasing the percentage of the added emotions may result in unnatural emotional expressions. If the synthesized emotion sounds unnatural or is difficult to understand, it may not be effective in achieving the desired outcome. Additionally, the human voice is a complex and highly variable instrument, and different people can produce the same emotional state in very different ways. This can make it difficult to accurately capture and reproduce a desired mix of emotions. Lastly, because of the lack of "ground truth" emotions, human raters are asked to evaluate the mixed emotions entirely based on their personal experiences. People from different cultures may have different experiences and backgrounds that can influence their emotional responses, and having a diverse group of evaluators can provide a more well-rounded perspective on the synthesized emotions.

Potential Improvements
We discuss several potential improvements to inspire future studies on mixed emotion synthesis: 1) Selection of ranking functions: adopt deep learning-based ranking methods [137] to improve the performance of ranking; 2) Multi-speaker studies: add training data from multiple speakers; 3) Non-autoregressive backbone frameworks: use a non-autoregressive TTS framework as the backbone to avoid the misalignment of attention and improve the naturalness of synthesized speech.

CONCLUSION
This contribution fills the gap on mixed emotion synthesis in the literature on speech synthesis. We proposed an emotional speech synthesis framework that is based on a sequence-to-sequence model. For the first time, with the proposed framework, we are able to synthesize mixed emotions and further control the rendering of mixed emotions at run-time. The key highlights are as follows: 1) We proposed a novel relative scheme to measure the difference between each emotion pair. We demonstrated that our proposed relative scheme enables the effective synthesis and control of the rendering of mixed emotions. Through ablation studies, we also showed that the proposed relative scheme improves emotional intelligibility in synthesized speech; 2) We presented a comprehensive study to evaluate mixed emotions for the first time. Through both objective and subjective evaluations, we validated our idea and showed the effectiveness of our proposed framework in terms of synthesizing mixed emotions; 3) We presented further investigations on synthesizing a bittersweet feeling and an emotion triangle. These investigations serve as an additional contribution to the article, which could broaden the scope of the study. In this article, we only focused on studying mixed emotions for emotional text-to-speech. We believe that our proposed relative scheme could enable mixed emotion synthesis in most existing emotional speech synthesis frameworks, including but not limited to emotional text-to-speech. We will expand our experiments to include emotional voice conversion in our future studies.
Future work includes: 1) a comparison with other ranking methods such as metric learning [138] and Siamese neural networks [137]; 2) conducting experiments for more emotion combinations, speakers, and other languages. Our future directions also include the study of cross-lingual emotion style modeling and transfer. Besides, a closer look at linguistic prosody for emotional speech synthesis is foreseen; for example, different semantic meanings can affect the way of expressing an emotion.

Fig. 3: The illustration of the proposed relative scheme at (a) training and (b) run-time phase. A relative ranking function is trained between each emotion pair and automatically predicts an emotion attribute at run-time. A smaller emotion attribute value represents a similar emotional style between the pairs. All the emotion attributes form an emotion attribute vector.

Fig. 7: Classification probabilities derived from the pre-trained SER model for a female speaker ('0019') from the ESD dataset. Each point represents an averaged probability value of 20 utterances with mixed emotions.

TABLE 1: Our experimental settings of one primary emotion (A), three reference emotions (B) and the expected mixed emotional effects (A+B).

TABLE 2: Mean Opinion Score (MOS) with 95% confidence interval to evaluate the speech quality of synthesized mixed emotions.

TABLE 3: Best-worst scaling (BWS) test results to evaluate the perception of the reference emotions (Angry, Happy, and Sad) in synthesized mixed emotions.

TABLE 4: Best-worst scaling (BWS) test results to evaluate the perception of mixed emotional effects (Outrage, Delight, and Disappointment) in synthesized mixed emotions.
feelings to the audience through both actions and speech. With our proposed methods, we are able to synthesize such mixed feelings of oppositely valenced emotions such as Happy and Sad. Readers are referred to the demo page.