Emotion Intensity and its Control for Emotional Voice Conversion

Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity. In EVC, emotions are usually treated as discrete categories, overlooking the fact that speech also conveys emotions with various intensity levels that the listener can perceive. In this paper, we aim to explicitly characterize and control the intensity of emotion. We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding. We further learn the actual emotion encoder from an emotion-labelled database and study the use of relative attributes to represent fine-grained emotion intensity. To ensure emotional intelligibility, we incorporate emotion classification loss and emotion embedding similarity loss into the training of the EVC network. As desired, the proposed network controls the fine-grained emotion intensity in the output speech. Through both objective and subjective evaluations, we validate the effectiveness of the proposed network for emotional expressiveness and emotion intensity control.


INTRODUCTION
Emotional Voice Conversion (EVC) is a technique that seeks to manipulate the emotional state of an utterance while keeping other vocal states unchanged [1]. It allows for the projection of the desired emotion into the synthesized voice. Emotional voice conversion holds tremendous potential for human-computer interaction, such as enabling emotional intelligence in a dialogue system [2], [3], [4].
Voice conversion aims to convert the speaker-dependent vocal attributes such as the speaker identity while preserving the linguistic information [5]. Since the speaker information is characterized by the physical structure of the vocal tract and manifested in the spectrum [6], spectral mapping has been the main focus of voice conversion [7]. However, speech also conveys emotions with various intensity levels that can be perceived by the listener [8]. For example, happy can be perceived as happy or elation [9], while angry can be divided into a 'mild' angry and the 'full-blown' angry [10]. In particular, intensity of emotion is described as the magnitude of factor to attain the goal of the emotion [11]. Therefore, emotion intensity is not just the loudness of a voice, but correlates to all the acoustic cues that contribute to achieving an emotion [12]. Moreover, speech emotion is hierarchical and supra-segmental in nature, varying from syllables to utterances [13], [14], [15], [16]. Thus, it is insufficient to only focus on frame-wise spectral mapping for emotional voice conversion. Both intensity variations and prosodic dynamics need to be considered for speech emotion modelling.
Synthesizing various intensities of an emotion is a challenging task for emotional voice conversion studies. One of the reasons is the lack of explicit intensity labels in most emotional speech datasets. Besides, emotion intensity is even more subjective and complex than discrete emotion categories, which makes it challenging to model [12]. There are generally two types of methods in the literature for emotion intensity control. One uses auxiliary features such as the voiced, unvoiced, and silence (VUS) state [17], attention weights, or a saliency map [18]. Another manipulates the internal emotion representations through interpolation [19] or scaling [20]. Despite these methods, emotion intensity control is still an under-explored topic in emotional voice conversion.
Previous emotional voice conversion studies mainly focus on learning a feature mapping between different emotion types. Most of them model the mappings of spectral and prosody parameters with a Gaussian mixture model (GMM) [21], [22], sparse representation [23], or hidden Markov model (HMM) [24]. Recent deep learning methods such as deep neural networks (DNN) [25], [26] and deep bi-directional long-short-term memory network (DBLSTM) [27] have advanced the state-of-the-art. New techniques using generative adversarial network (GAN)-based [28], [29], [30] or auto-encoder-based models [31], [32], [33] make non-parallel training possible. We note that these frameworks convert the emotion on a frame basis, so speech duration cannot be modified. Moreover, since the spectrum and prosody are not independent of each other, a separate study of them may cause a mismatch during the conversion [34], [35]. It would be advantageous to have a model to transfer the correlated vocal factors end-to-end, producing more realistic emotions in synthetic speech.
Recently, sequence-to-sequence (Seq2Seq) models have attracted much interest in speech synthesis [36], [37] and voice conversion [38], [39], [40], [41]. With the attention mechanism, Seq2Seq frameworks jointly learn the feature mapping and alignment and automatically predict the speech duration at run-time. Inspired by these successful attempts, researchers introduce Seq2Seq modelling into emotional voice conversion. For example, a Seq2Seq model to jointly model pitch and duration is proposed in [42]. In [43], multi-task learning for both emotional voice conversion and emotional text-to-speech is studied. We note two limitations of these studies. First, they learn an averaged emotional pattern during the training, while emotional expressive speech presents abundant variations of emotion intensity in real life. Second, these frameworks require an enormous amount of emotional speech data to train. In practice, such a large emotional speech database is not widely available, which limits the scope of applications.
In this article, we aim to address the above challenges. The main contributions of this paper are listed as follows.

• We introduce Emovox, a Seq2Seq emotional voice conversion framework, which jointly transfers the spectrum and duration in an end-to-end way for emotional voice conversion;
• Emovox automatically learns the abundant variations of intensity that are exhibited in an emotional speech dataset, without the need for any explicit intensity labels, and enables effective control of the emotion intensity in the converted emotional speech at run-time;
• Emovox eliminates the need for a large amount of emotional speech data for Seq2Seq EVC training and still achieves remarkable performance under limited data conditions;
• We present a comprehensive evaluation to show the effectiveness of Emovox for emotional expressiveness and emotion intensity control.
This paper is organized as follows: In Section 2, we motivate our study by introducing the background and related work. In Section 3, we present the details of our proposed Emovox framework, and we introduce our experiments in Section 4. In Section 5, we report the experimental results, and we conclude in Section 6.

BACKGROUND AND RELATED WORK
This work is built on several previous studies spanning emotion intensity, expressive speech synthesis, and emotional voice conversion. We briefly introduce the related studies to set the stage for our research and summarize the gaps in current literature to place our novel contributions.

Emotion Intensity in Vocal Expression
The most straightforward way to characterize emotion is to categorize it into several different groups [44], [45]; however, the choice of emotion labels is mostly intuitive and inconsistent in the literature. One key reason is that emotion intensity can affect our perception of emotions [46]. For example, happy can be perceived as happy or elation, which are similar in voice quality but different in intensity [9]. Thus, correlating the emotion intensity to the loudness of the voice is rather an oversimplification. Emotion intensity can be observed in various acoustic cues, not only in speech energy but also in speech rate and fundamental frequency [12]. The differences in these cue levels could be larger between different intensities of the same emotion than between different emotions [46].

Fig. 1: An example of a speech emotion recognizer (SER) [47], where the deep features are obtained before the last fully-connected (FC) layer to describe the emotion styles [48], [49].

Sequence-to-Sequence Conversion Models
The sequence-to-sequence model with attention mechanism was first studied in machine translation [50] and then found effective in speech synthesis [36], [37]. In text-to-speech, sequence-to-sequence modelling achieves remarkable performance by learning an attention alignment between the text and acoustic sequence, such as Tacotron [37]. Similar to text-to-speech, voice conversion aims to generate realistic speech from internal representations; therefore, sequence-to-sequence models are applied to various voice conversion and emotional voice conversion studies.

Sequence-to-Sequence Voice Conversion
Sequence-to-sequence voice conversion frameworks such as SCENT [38], AttS2S-VC [39], and ConvS2S-VC [51] jointly convert the duration and prosody components, and achieve higher naturalness and similarity than conventional frame-based methods. To address conversion issues such as the deletion and repetition caused by misalignment, various approaches are proposed, such as a monotonic attention mechanism [40], non-autoregressive training [52], [53], and the use of pre-training models [54] or text supervision [55], [56], [57]. These successful attempts further motivate the study of sequence-to-sequence modelling for emotional voice conversion.

Sequence-to-Sequence Emotional Voice Conversion
Compared with conventional frame-based models, sequence-to-sequence models are more suitable for emotional voice conversion. First, sequence-to-sequence models allow for the prediction of speech duration at run-time, which is an important aspect of the speech rhythm and strongly affects the emotional prosody [58]. Besides, a joint transfer of spectrum and prosody in sequence-to-sequence models addresses the mismatch issues in conventional analysis-synthesis-based emotional voice conversion systems [28], [33], [34]. Also, emotional prosody is supra-segmental and can be associated with only a few words [47]. Learning an attention alignment makes it possible to focus on emotion-relevant regions during the conversion. Hence, sequence-to-sequence modelling for emotional voice conversion will be our primary focus in this paper.
There are only a few studies on sequence-to-sequence emotional voice conversion [20], [42], [43], [59]. In [42], the authors jointly model pitch and duration with parallel data, where the model is conditioned on the syllable position in the phrase. In [43], a multi-task learning framework of emotional voice conversion and emotional text-to-speech is built with a large-scale emotional speech database. In [20], the authors introduce an emotion encoder and a speaker encoder into the sequence-to-sequence training for emotional voice conversion. We note that these frameworks require tens of hours of parallel emotional speech data, which is hard to collect. A recent work [59] proposes a 2-stage training strategy for sequence-to-sequence emotional voice conversion leveraging text-to-speech to eliminate the need for a large emotional speech database. However, none of these frameworks study emotion intensity variations, and the converted emotional utterances lack the controllability of emotion intensity. Only [20] attempts to scale the emotion embedding by multiplying it with a factor to control the emotion intensity at run-time. However, the authors do not explicitly model emotion intensity variations during the training, and their intensity control method lacks interpretability.
This work aims to bridge this gap in the current literature and study emotion intensity modelling for emotional voice conversion. We aim to build a sequence-to-sequence emotional voice conversion framework with effective emotion intensity control using a limited amount of emotional speech data.

Expressive Speech Synthesis with Prosody Style Control
Speech emotion is highly related to speech prosody and influenced by several prosodic cues embedded in acoustic speech such as intonation, rhythm, and energy [60], [61]. The most straightforward way to model and control prosody style is to use explicit annotations or labels [62], [63], [64]. Besides explicit labelling, researchers use a reference encoder to imitate and transplant the reference style in an unsupervised way [65]. Global style token (GST) [66] is an example that learns interpretable style embeddings from the reference audio. By choosing specific tokens, the model can control the style of synthesized speech. Other studies [67], [68], [69], [70] mainly replace the global style embedding with a fine-grained prosody embedding. Some other studies based on Variational Autoencoders (VAE) [71] show the effectiveness of controlling the speech style by learning, scaling, or combining disentangled representations [72], [73].
Emotional expressive speech is even more complex, with subtle dynamic variations associated with multiple prosodic attributes [74], [75], [76]. Inspired by the successful attempts in prosody style control, several studies control the emotion intensity for emotional speech synthesis. For example, in [19], an inter-to-intra distance ratio algorithm is applied to the learnt style tokens for emotional speech synthesis, where an interpolation technique is used to control emotion intensity. In [18], the authors show that a speech emotion recognizer is capable of generating a meaningful intensity representation via attention or saliency. In [77], [78], a relative attribute scheme is introduced to learn the emotion intensity for emotional speech synthesis. None of these frameworks explicitly models prosody style; rather, they encode the association between input text and its emotional prosody style end-to-end.
This contribution studies explicit modelling of emotion intensity variations with a relative attribute method for emotional voice conversion. We believe that the relative attributes scheme provides a straightforward way to model intensity variants, which will be discussed later.

Emotional Prosody Modelling with a Speech Emotion Recognizer
Emotional prosody is prominently exhibited in emotional expressive speech [79], [80], [81], which can be characterized by either categorical [45] or dimensional representations [82]. Recent studies [83] in speech emotion recognition provide valuable insights into emotional prosody modelling. Instead of categorical or dimensional attributes, they characterize the emotion styles with the latent representations learnt by the deep neural network. Compared with human-crafted features, deep features learnt by a speech emotion recognizer (SER) are data-driven and less dependent on human knowledge [84], [85], which we believe makes them more suitable for emotion style transfer.
Some studies leverage a speech emotion recognizer to improve the prosody modelling for expressive speech synthesis. In [86], an emotion recognizer is used to extract the style embedding for style transfer. In [49], a speech emotion recognizer is further used as the style descriptor to evaluate the style reconstruction performance. In [48], researchers use the deep emotional features from a pre-trained speech emotion recognizer to transfer both seen and unseen emotion styles. These studies show the capability of a speech emotion recognizer to describe emotion styles with their latent representations.
A speech emotion recognizer also shows potential to supervise the emotional speech synthesis system to generate speech with desirable emotion styles [87]. In [88], a reinforcement learning paradigm for emotional speech synthesis is proposed, where the classification accuracy of the speech emotion recognizer is used as the reward function to the system. In [89], the authors use emotion classifiers to enhance the emotion-discrimination of the emotion embedding and the predicted Mel-spectrum. In [90], an emotional speech synthesis system is built on an expressive TTS corpus with the assistance of a cross-domain emotion recognizer. These studies show remarkable performance by incorporating the supervision from the pre-trained emotion recognizer into the emotional speech synthesis systems, which motivates our study. We further study the use of perceptual losses in EVC training to improve the intelligibility of the converted emotion.

Research Gap (Summary)
Below, we summarise the gaps in the literature of emotional voice conversion that we aim to address in this paper.
1) There are very few studies on emotion intensity control, which is crucial to achieving emotional intelligence.
2) Despite the tremendous potential, emotion intensity control is still not a well-explored research direction for emotional voice conversion.
3) There is a lack of focus on modelling prosody style to achieve improved emotion intensity control.
4) The feasibility of using a pre-trained speech emotion recognizer as an emotion supervisor for EVC training poses tremendous potential but is not well understood.

Emovox: EMOTIONAL VOICE CONVERSION WITH EMOTION INTENSITY CONTROL
The proposed emotional voice conversion framework, Emovox, consists of four modules: 1) a recognition encoder, which derives the linguistic embedding from the source speech; 2) an emotion encoder, which encodes the reference emotion style into an emotion embedding; 3) an intensity encoder, which encodes a fine-grained intensity input into an intensity embedding; and 4) a Seq2Seq decoder, which generates the converted speech from a combination of linguistic, emotion, and intensity embeddings. At run-time, Emovox preserves the source linguistic content ("linguistic transplant") while transferring the reference emotion to a source utterance ("emotion transfer"), as illustrated in Figure 2. Emovox also allows users to control the emotion intensity of the output speech ("intensity control").
To train Emovox, we propose a Seq2Seq framework that disentangles the speech elements from input acoustic features, and reconstructs the acoustic features from the speech elements. To reduce the amount of training data for Emovox, we introduce two pre-training strategies, i.e., 1) style pre-training with a large TTS corpus, and 2) emotion supervision training with an SER.

Seq2Seq Emotional Voice Conversion
Human speech can be viewed as a combination of speech style and linguistic content [75], [91]. If the speech style that represents the emotion can be disentangled from the linguistic content, emotion conversion can be achieved by manipulating the speech style at run-time while keeping the linguistic content and speaker identity unchanged [33], [92].
There are various ways to disentangle the speech elements. In [56], text information and adversarial learning are used in a sequence-level autoencoder. This framework achieves strong disentanglement between linguistic and speaker representations and enables duration modelling for voice conversion. We adopt this framework in Emovox to model emotion styles and intensity, as shown in Figure 3. To overcome issues such as deletion and repetition with the Seq2Seq approach, we include a text input as the supervision signal to augment the linguistic embedding, which has been shown to be effective in recent studies [56], [57], [59]. Emovox aims to transfer the reference emotion to the source speech ("emotion transfer") while controlling its emotion intensity ("intensity control") and preserving the source linguistic information ("linguistic transplant").
Given the phoneme sequences and acoustic features as the input, the text encoder and the recognition encoder learn to predict the linguistic embedding from the text ($H_{text}$) and the audio input ($H_{audio}$), respectively. The emotion encoder learns the emotion representations from the speech, while the emotion classifier further eliminates the residual emotion information in the linguistic embedding $H_{audio}$. The Seq2Seq decoder $Dec$ learns to reconstruct the acoustic features $\hat{A}$ from the combination of the emotion embedding $h_{emo}$, the intensity embedding $h_{inten}$, and the linguistic embedding from either the text encoder ($H_{text}$) or the recognition encoder ($H_{audio}$), as shown in (1):
$$\hat{A} = Dec(h_{emo}, h_{inten}, H) \tag{1}$$
where
$$H \in \{H_{text}, H_{audio}\} \tag{2}$$
During the training, $H_{text}$ and $H_{audio}$ are taken by the decoder alternately, depending on whether the epoch number is odd or even. A contrastive loss is employed to ensure the similarity between $H_{text}$ and $H_{audio}$ as in [56]. We believe the proposed Emovox learns an effective disentanglement between linguistic and emotional elements and provides a straightforward way to model and control both emotion and its intensity, which will be discussed next.
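As a rough illustration of how the decoder input might be assembled during training, the sketch below alternates between the two linguistic embeddings by epoch parity and concatenates the utterance-level emotion and intensity embeddings to every step of the linguistic sequence. Several details are assumptions rather than facts from the text: which parity selects which embedding, concatenation as the way of combining, the intensity embedding size, and the function names `select_linguistic_embedding` and `build_decoder_input`, which are hypothetical.

```python
import numpy as np

def select_linguistic_embedding(epoch, h_text, h_audio):
    """Alternate between H_text and H_audio by epoch parity.

    Assumption: odd epochs use H_text and even epochs use H_audio; the
    text only says the choice depends on whether the epoch is odd or even.
    """
    return h_text if epoch % 2 == 1 else h_audio

def build_decoder_input(h_emo, h_inten, h_ling):
    """Tile the utterance-level embeddings over the sequence and
    concatenate them with the linguistic embedding (assumed combination)."""
    num_steps = h_ling.shape[0]
    h_emo_tiled = np.tile(h_emo, (num_steps, 1))      # (steps, emo_dim)
    h_inten_tiled = np.tile(h_inten, (num_steps, 1))  # (steps, inten_dim)
    return np.concatenate([h_ling, h_emo_tiled, h_inten_tiled], axis=1)

rng = np.random.default_rng(0)
h_text = rng.normal(size=(20, 512))   # linguistic embedding from the text encoder
h_audio = rng.normal(size=(20, 512))  # linguistic embedding from the recognition encoder
h_emo = rng.normal(size=128)          # emotion embedding
h_inten = rng.normal(size=128)        # intensity embedding (size is an assumption)

h_ling = select_linguistic_embedding(1, h_text, h_audio)
dec_in = build_decoder_input(h_emo, h_inten, h_ling)
```

With these toy shapes, `dec_in` has one row per linguistic step and 768 columns; the contrastive loss between the two linguistic embeddings is not shown.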

Modelling Emotion and its Intensity
One of the difficulties in modelling emotion intensity is the lack of annotated intensity labels. Inspired by the idea of attributes [93] in computer vision, we regard emotion intensity as an attribute of the emotional speech. Combining the emotion representations with the intensity information allows the framework to jointly learn abundant emotion styles and intensity levels from any emotional speech database.

Formulation of Emotion Intensity using Relative attributes
In computer vision, there are various ways [94], [95] to model the relative difference between different data categories. Instead of predicting the presence of a specific attribute, relative attributes [96] offer more informative descriptions of unseen data and are thus closer to detailed human supervision. Motivated by their success in various computer vision tasks [97], [98], [99], we believe that relative attributes bridge the low-level features and high-level semantic meanings, which is appropriate for emotion intensity modelling.
Emotion intensity can be viewed as how well the emotion can be perceived in its type. Since the neutral speech does not contain any emotional variance, the emotion intensity of a neutral utterance should be zero. Therefore, we regard the emotion intensity as a relative difference between neutral speech and emotional speech. Emotion intensity can be represented by relative attributes learnt with a rich set of emotion-related acoustic features from each emotion pair. The learning process of relative attributes can be formulated as a max-margin optimization problem, as explained below. Given a training set $T = \{x_t\}$, where $x_t$ denotes the acoustic features of the $t$-th training sample, and $T = N \cup E$, where $N$ and $E$ are the neutral and emotional sets respectively, we aim to learn a ranking function:
$$f(x_t) = W x_t \tag{3}$$
where $W$ is a weighting matrix indicating the emotion intensity. To learn the ranking function, we have to satisfy the following constraints:
$$\forall (a, b) \in O: \; W x_a > W x_b \tag{4}$$
$$\forall (a, b) \in S: \; W x_a = W x_b \tag{5}$$
where $O$ and $S$ are the ordered and similar sets respectively. We pair an emotional sample from $E$ with a neutral sample from $N$ to form the ordered set $O$, where the emotion intensity of the emotional sample is higher than that of the neutral sample. We then randomly create pairs of neutral-neutral and emotional-emotional samples in the similar set $S$, where the emotion intensity of the pair is similar. The weighting matrix $W$ is estimated by solving the following problem, similar to that of a support vector machine [100]:
$$\min_{W} \left( \frac{1}{2} \| W \|_2^2 + C \left( \sum \xi_{a,b}^2 + \sum \gamma_{a,b}^2 \right) \right) \tag{6}$$
$$\text{s.t.} \quad W (x_a - x_b) \geq 1 - \xi_{a,b}, \quad \forall (a, b) \in O \tag{7}$$
$$| W (x_a - x_b) | \leq \gamma_{a,b}, \quad \forall (a, b) \in S \tag{8}$$
$$\xi_{a,b} \geq 0, \quad \gamma_{a,b} \geq 0 \tag{9}$$
where $C$ is the trade-off between the margin and the size of the slack variables $\xi_{a,b}$ and $\gamma_{a,b}$. Through Eq. (6)-(9), we learn a wide-margin ranking function that enforces the desired ordering on each training point. Once it is learnt, the relative ranking function can estimate the order of unseen data.
In practice, we learn a ranking function for each emotion category. As shown in Figure 3, the learnt ranking function predicts a relative attribute normalized to [0, 1] for each sample in the training set. A larger value of relative attribute represents a stronger intensity of an emotion.
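The following is a minimal sketch of how such a ranking function could be learnt for one emotion category, assuming a linear ranker and the usual relative-attribute objective with squared slack penalties. It is not the authors' implementation: the constraints are folded into a penalty and minimized with plain gradient descent, the pair sums are replaced by means (which only rescales `C`), the data are synthetic toy features rather than real acoustic features, and the names `learn_ranking_weights` and `relative_attribute` are hypothetical.

```python
import numpy as np

def learn_ranking_weights(x_neu, x_emo, C=1.0, lr=0.01, steps=2000, seed=0):
    """Fit a linear ranking vector w so that emotional samples score
    higher than neutral ones and same-set pairs score similarly."""
    rng = np.random.default_rng(seed)
    # Ordered set O: every (emotional, neutral) pair, emotional ranked higher.
    d_ord = (x_emo[:, None, :] - x_neu[None, :, :]).reshape(-1, x_neu.shape[1])
    # Similar set S: random neutral-neutral and emotional-emotional pairs.
    d_sim = np.concatenate([
        x_neu - x_neu[rng.permutation(len(x_neu))],
        x_emo - x_emo[rng.permutation(len(x_emo))],
    ])
    w = np.zeros(x_neu.shape[1])
    for _ in range(steps):
        slack = np.maximum(0.0, 1.0 - d_ord @ w)  # margin violations on O
        gap = d_sim @ w                           # score differences on S
        grad = (w
                - 2.0 * C * (slack @ d_ord) / len(d_ord)
                + 2.0 * C * (gap @ d_sim) / len(d_sim))
        w -= lr * grad
    return w

def relative_attribute(w, x):
    """Score samples with the learnt ranker and min-max normalize to [0, 1]."""
    scores = x @ w
    return (scores - scores.min()) / (scores.max() - scores.min())

# Toy features: the emotional set is shifted along the first dimension.
rng = np.random.default_rng(1)
x_neu = rng.normal(scale=0.5, size=(30, 4))
x_emo = rng.normal(scale=0.5, size=(30, 4)) + np.array([3.0, 0.0, 0.0, 0.0])

w = learn_ranking_weights(x_neu, x_emo)
intensity = relative_attribute(w, np.concatenate([x_neu, x_emo]))
```

On this toy data, the normalized scores of the emotional samples sit above those of the neutral samples, which is the ordering the ranking function is meant to enforce.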

Modelling Emotion Styles and its Intensity
As shown in Figure 3, we obtain the relative attribute from the learnt ranking function, which passes through a fully connected layer to derive an intensity embedding. The emotion encoder learns to generate the emotion embedding from the input speech features. The Seq2Seq decoder combines a linguistic embedding sequence, an emotion embedding, and an intensity embedding to reconstruct the acoustic features of the emotional speech.
During the training process, Emovox jointly learns the emotion style and its intensity from the speech samples; this process is referred to as emotion training hereafter. With the explicit intensity modelling, we are able to manipulate the level of intensity at run-time for intensity control. The intended emotion intensity can be predicted from the reference or given manually at run-time. In theory, Emovox may perform both emotional text-to-speech and emotional voice conversion. In this paper, the text encoder is not used at run-time since we are only interested in voice conversion.
As shown in Figure 4, we first use the emotion encoder to generate the emotion embeddings from a set of reference utterances belonging to the same emotional category. Next, we use the averaged reference emotion embedding to represent an emotion category. Finally, the recognition encoder derives a linguistic embedding sequence from the source speech utterance at run-time. By assigning an intended emotion category and a level of emotion intensity, the Seq2Seq decoder generates the emotional speech of the same content as the source but with the target emotion style at an appropriate intensity.
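The run-time conditioning described above can be sketched as two small steps: averaging the reference emotion embeddings of one category, and mapping a manually chosen intensity in [0, 1] to an intensity embedding through a single fully connected layer. The layer weights below are random placeholders, the absence of an activation function and the embedding sizes are assumptions, and the function names are hypothetical.

```python
import numpy as np

def average_emotion_embedding(reference_embeddings):
    """Represent an emotion category by the mean of its reference embeddings."""
    return np.asarray(reference_embeddings).mean(axis=0)

def intensity_to_embedding(intensity, fc_weight, fc_bias):
    """Map a scalar relative attribute to an intensity embedding with one
    fully connected layer (assumed to have no activation)."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    return intensity * fc_weight + fc_bias

rng = np.random.default_rng(0)
angry_refs = rng.normal(size=(10, 128))  # stand-in for emotion encoder outputs
fc_weight = rng.normal(size=16)          # placeholder for trained FC weights
fc_bias = np.zeros(16)

h_emo = average_emotion_embedding(angry_refs)
h_weak = intensity_to_embedding(0.1, fc_weight, fc_bias)
h_strong = intensity_to_embedding(0.9, fc_weight, fc_bias)
```

In this sketch, changing the intensity value only changes the intensity embedding; the averaged emotion embedding and the source linguistic embedding stay fixed.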

Model Pre-training
During training, a large amount of emotional speech is usually required to achieve robust attention alignment and deliver high emotional intelligibility in a Seq2Seq model [101]. To reduce the reliance on emotional speech, we propose two pre-training strategies: 1) style pre-training with a large TTS corpus, and 2) emotion supervision training with an SER.

Style Pre-training with a Multi-Speaker TTS Corpus
It is known that speech style contains speaker-dependent elements related to speaker characteristics, called speaker style. Speaker style is exhibited in most TTS corpora containing multi-speaker speech data. Unlike emotional speech databases, there are abundant speech databases for TTS [102], [103], [104] with a neutral tone, which allows us to build a multi-speaker Seq2Seq TTS framework, and train a network to disentangle speaker style from the linguistic content. We call this stage "style pre-training".
During the style pre-training, the style encoder learns abundant speaker styles through a multi-speaker TTS corpus while excluding the linguistic information from the acoustic features. As a result, even though the style encoder does not learn to encode any specific emotion style during training, it learns to discriminate different emotion styles during emotion training, as shown in Figure 5(a). We, therefore, use the style encoder trained on a TTS corpus as the pre-trained model for an emotion encoder.

Modelling Emotion with Perceptual Loss
We would like the converted emotional speech to be perceived with the intended emotion category. However, this is not easily achieved, especially with limited emotional training data, for several reasons: 1) the pre-trained emotion decoder in Figure 3 is not explicitly trained for characterisation of emotions, and 2) frame-level style reconstruction loss is not always consistent with human perception because it does not capture the prosodic and temporal patterns of speech.

Fig. 4: An illustration of the run-time conversion phase. By combining a source linguistic embedding sequence, an averaged reference emotion embedding, and an intensity embedding, the Seq2Seq decoder generates the acoustic features with the reference emotion type and the manually defined intended intensity.

Following the success of perceptual loss in speech synthesis [105], we introduce a perceptual loss as the emotion supervision in the training process, as shown in Figure 3. We first use a pre-trained SER to predict the emotion category from the reconstructed acoustic features. We then calculate two perceptual loss functions: 1) emotion classification loss $L_{cls}$, and 2) emotion embedding similarity loss $L_{sim}$. We incorporate these two loss functions into the training and update all the trainable modules. For details of SER pre-training, readers are referred to [106].
The emotion classification loss $L_{cls}$ is introduced to ensure the perceptual similarity between the reconstructed acoustic features and the intended emotion category at the utterance level:
$$L_{cls} = CE(l, p)$$
where $l$ is the target one-hot emotion label, $p$ is the predicted emotion probabilities at the utterance level, and $CE(\cdot)$ denotes the cross-entropy loss function.
The pre-trained SER is considered text-independent. To ensure that the emotion encoder characterizes emotions independent of linguistic content, we introduce an emotion style descriptor, derived from the pre-trained SER [48], [49], as a learning objective for the emotion encoder with a loss function $L_{sim}$ between the emotion encoder output, i.e., the emotion embedding, and the emotion style descriptor, as illustrated in Figure 3.
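To make the two perceptual losses concrete, here is a small sketch in plain Python. The classification loss is the cross-entropy between a one-hot target and the SER's predicted probabilities, as stated above. The exact distance used for the similarity loss is not given in this section, so the mean absolute difference below is purely an assumption, and the function names are hypothetical. The default weights of 1 follow the experimental setup reported later in the paper.

```python
import math

def emotion_classification_loss(target_one_hot, predicted_probs, eps=1e-12):
    """Cross-entropy CE(l, p) between a one-hot label l and probabilities p."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(target_one_hot, predicted_probs))

def embedding_similarity_loss(emotion_embedding, style_descriptor):
    """Distance between the emotion embedding and the SER style descriptor.

    Assumption: mean absolute difference; the source does not specify
    which distance is used.
    """
    diffs = [abs(a - b) for a, b in zip(emotion_embedding, style_descriptor)]
    return sum(diffs) / len(diffs)

def perceptual_loss(target_one_hot, predicted_probs,
                    emotion_embedding, style_descriptor,
                    w_cls=1.0, w_sim=1.0):
    """Weighted sum of the two perceptual terms."""
    return (w_cls * emotion_classification_loss(target_one_hot, predicted_probs)
            + w_sim * embedding_similarity_loss(emotion_embedding, style_descriptor))

# Toy example: four emotion classes, target is the second class.
l_cls = emotion_classification_loss([0, 1, 0, 0], [0.1, 0.7, 0.1, 0.1])
l_sim = embedding_similarity_loss([0.2, -0.4], [0.0, 0.0])
total = perceptual_loss([0, 1, 0, 0], [0.1, 0.7, 0.1, 0.1], [0.2, -0.4], [0.0, 0.0])
```

In a real training loop these terms would be computed on batches with an automatic differentiation framework so that gradients reach all trainable modules.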

Effect of Perceptual Loss
To validate the effectiveness of the two perceptual loss functions, we evaluate the emotion-discriminative ability of the emotion encoder with an ablation study. We believe that the emotion encoder demonstrates a better performance by producing more discriminative emotional representations. We use the t-SNE algorithm [107] to visualize the emotion embeddings in a two-dimensional plane, as shown in Figure 5(e). It is observed that emotion embeddings form different emotion clusters in terms of the feature distribution. To get a more intuitive understanding of the clustering performance, we perform a clustering evaluation to assess the discriminability of the emotion embeddings.
The typical objective function of clustering formalizes the goal of attaining high intra-cluster similarity and low inter-cluster similarity [108], [109]. Various measurements of clustering quality have been studied [110], [111], [112], [113]. Our study considers a simplified and effective solution for clustering evaluation. We first compute a centroid $c_i$, $i \in [1, K]$, for each of the $K$ emotion classes by taking the average of all $N_i$ embeddings $e$ in class $i$ as follows [114]:
$$c_i = \frac{1}{N_i} \sum_{e \in E_i} e$$
where $E_i$ is the set of embeddings in class $i$. We then calculate the inter-class distance $dist_{inter}$ from the Euclidean distance between each embedding $e \in E_i$ and the other embedding centres $c_j$, $j \neq i$, and the intra-class distance $dist_{intra}$ from the Euclidean distance between each embedding $e \in E_i$ and its own centre $c_i$. A clustering ratio $r$ is calculated as the ratio of the intra-class distance $dist_{intra}$ to the inter-class distance $dist_{inter}$:
$$r = \frac{dist_{intra}}{dist_{inter}}$$
A lower value of the ratio $r$ represents a better clustering effect of emotion embeddings.
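A minimal sketch of this clustering ratio is shown below. It assumes that both distances are plain averages of Euclidean distances over all embeddings (and, for the inter-class term, over all other class centres); the normalization used in the paper may differ, and the function name `clustering_ratio` is hypothetical.

```python
import numpy as np

def clustering_ratio(embeddings, labels):
    """Ratio of mean intra-class to mean inter-class centroid distance.

    Assumption: distances are averaged over all embeddings, and the
    inter-class term is averaged over all other class centres.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Centroid c_i: mean of all embeddings in class i.
    centroids = {c: embeddings[labels == c].mean(axis=0) for c in classes}
    intra, inter = [], []
    for c in classes:
        members = embeddings[labels == c]
        # Distance of each member to its own class centre.
        intra.append(np.linalg.norm(members - centroids[c], axis=1))
        # Distance of each member to every other class centre.
        for other in classes:
            if other != c:
                inter.append(np.linalg.norm(members - centroids[other], axis=1))
    dist_intra = np.concatenate(intra).mean()
    dist_inter = np.concatenate(inter).mean()
    return dist_intra / dist_inter

# Toy example: two tight, well-separated clusters.
toy_embeddings = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
toy_labels = [0, 0, 1, 1]
r = clustering_ratio(toy_embeddings, toy_labels)
```

For the toy data, every point is at distance 1 from its own centre and at distance sqrt(101) from the other centre, so `r` is small, which corresponds to well-separated clusters.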
We perform an ablation experiment on the ESD evaluation dataset [1]. We visualize the distribution of emotion embeddings and report the clustering ratios in Figure 5. As the style encoder is pre-trained without the emotion intensity mechanism, we report the results of Emovox without intensity control for a fair comparison, which is denoted as Emovox w/o intensity. We first observe that Emovox w/o intensity always achieves a better clustering performance than the style encoder in Figure 5. From Figure 5(b)-(d), it is observed that both loss functions $L_{cls}$ and $L_{sim}$ contribute to a lower $r$, which suggests a better clustering performance. From Figure 5(e), we further observe a more distinct separation between the emotions with different energy (such as neutral, sad vs angry, or happy). With both $L_{cls}$ and $L_{sim}$, we obtain the lowest clustering ratio of 0.514. It shows that these two losses can help the emotion encoder to generate more discriminative emotional representations.

EXPERIMENTS
In this section, we report our experimental settings. For all the experiments, we conduct emotion conversion from neutral to angry, neutral to happy, and neutral to sad, which we denote as Neu-Ang, Neu-Hap, and Neu-Sad, respectively. We have made the source codes and speech samples available to the public. We encourage readers to listen to the speech samples to understand this work.
Note that emotion intensity control is only available with Emovox.For a fair comparison among the methods, we obtain an intensity value for Emovox, by passing a reference set of speech data through the learnt ranking function, as shown in Figure 4 ("Emotion Intensity Transfer") .Besides, none of these frameworks require any parallel training data or frame alignment procedures.
For a contrastive study, we also replace the intensity control module in Emovox with two competing intensity control methods, Scaling Factor and Attention Weights, and compare the following three configurations:
• Emovox w/ Scaling Factor: where the emotion embedding is multiplied by a scaling factor [20];
• Emovox w/ Attention Weights: where the attention weight vector obtained from a pre-trained SER is used to represent the intensity [18];
• Emovox w/ Relative Attributes (proposed): our proposed method with relative attributes as described in Section 3.
To summarize, we perform emotion intensity transfer to compare Emovox with the baselines (i.e., CycleGAN-EVC, StarGAN-EVC, and Seq2Seq-EVC), and emotion intensity control to compare it with other emotion intensity control methods (i.e., Emovox w/ scaling factor and Emovox w/ attention weights).

Experimental Setup
We extract 80-dimensional Mel-spectrograms every 12.5 ms with a frame size of 50 ms for short-time Fourier transform (STFT). We then take the logarithm of the Mel-spectrograms to serve as the acoustic features. We convert text to phonemes with the Festival [117] G2P tool to serve as the input to the text encoder.
We use the Adam optimizer [118] and set the batch size to 64 and 16 for style pre-training and emotion training, respectively. We set the learning rate to 0.001 for style pre-training and halve it every seven epochs during the emotion training. We set the weight decay to 0.0001, and the weighting factors of the emotion classification loss $L_{cls}$ and the emotion similarity loss $L_{sim}$ to 1.
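The step-wise learning rate decay used in emotion training can be written as a one-line schedule. The sketch below is illustrative only: the function name is hypothetical, and it assumes that emotion training starts from the same learning rate of 0.001 as style pre-training and that epochs are counted from zero.

```python
def emotion_training_lr(epoch, base_lr=1e-3, halve_every=7):
    """Learning rate at a zero-based epoch index.

    Assumptions: the schedule starts from `base_lr` (taken here to be the
    pre-training rate) and is halved once every `halve_every` epochs.
    """
    return base_lr * 0.5 ** (epoch // halve_every)
```

Under these assumptions, epochs 0 to 6 use 0.001, epochs 7 to 13 use 0.0005, and so on.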

Recognition-Synthesis Structure
Emovox has a recognition-synthesis structure similar to that of [56], [119]. The Seq2Seq recognition encoder consists of an encoder, which is a 2-layer 256-cell BLSTM, and a decoder, which is a 1-layer 512-cell LSTM with an attention layer followed by an FC layer with an output channel number of 512. Our text encoder is a 3-layer 1D CNN with a kernel size of 5 and a channel number of 512, followed by a 1-layer 256-cell BLSTM and an FC layer with an output channel number of 512. The Seq2Seq decoder has the same model architecture as that of Tacotron [37]. The style encoder is a 2-layer 128-cell BLSTM followed by an FC layer with an output channel number of 128, which has been used in previous studies on voice conversion [56] and emotional voice conversion [59]. The classifier is a 4-layer FC network with channel numbers of {512, 512, 512, 99}.

Relative Emotion Intensity Ranking
We follow an open-source implementation 2 to train the relative ranking function for emotion intensity. We extract 384-dimensional acoustic features with openSMILE [120], including zero-crossing rate, frame energy, pitch frequency, and Mel-frequency cepstral coefficients (MFCCs). These acoustic features are used in the Interspeech Emotion Challenge [121]. We anticipate that these acoustic features can capture the subtle emotion intensity variations in speech. For each emotion category, we train a relative ranking function using neutral and emotional utterances.

Speech Emotion Recognizer
We train a speech emotion recognizer following a publicly available implementation [106]. The SER includes: 1) three TimeDistributed two-dimensional (2-D) convolutional neural network (CNN) layers, 2) a DBLSTM layer, 3) an attention layer, and 4) a linear projection layer. The TimeDistributed 2-D CNN layers and the DBLSTM layer summarize the temporal information into a fixed-length latent representation. The attention layer further preserves the effective emotional information while reducing the influence of emotion-irrelevant factors and producing discriminative utterance-level features for emotion prediction. The linear projection layer predicts the emotion class probability from the utterance-level emotional features. We perform data augmentation by adding white Gaussian noise to improve the robustness of the SER [122], [123], [124], [125].
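The noise-based augmentation can be sketched as follows. This is a minimal illustration rather than the augmentation code used for the SER: the function name and the use of a target signal-to-noise ratio (SNR) in dB are assumptions, since the noise level is not specified above.

```python
import math
import random


def add_white_noise(signal, snr_db, rng=None):
    """Return a copy of `signal` with additive white Gaussian noise.

    Assumption: the noise level is set by a target SNR in dB relative to
    the average power of the clean signal.
    """
    rng = rng or random.Random()
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)  # standard deviation of the noise
    return [x + rng.gauss(0.0, sigma) for x in signal]
```

Passing a seeded `random.Random` instance makes the augmentation reproducible across runs.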

Data Preparation and Emotion Training
We first perform style pre-training on the VCTK Corpus [104], where we use 99 speakers for pre-training. The total duration of pre-training speech data is about 30 hours. For SER training and emotion training, we randomly choose one male speaker from the ESD database 3 to conduct all the experiments in the same way as in [59].
We follow the data partition protocol given in the ESD database. For each emotion, we use 300 utterances for emotion training and 20 utterances as the evaluation set. In the emotion training, we first initialize all the modules with the weights learnt from style pre-training, where the style encoder and style classifier act as the emotion encoder and emotion classifier, respectively. We then randomly initialize the last projection layer of the emotion encoder and emotion classifier. The output channel numbers of the emotion encoder and the emotion classifier are set to 64 and 4, respectively. A learnt ranking function predicts a relative attribute, which is then passed through an FC layer with an output channel size of 64 to obtain the intensity embedding. We then concatenate the emotion and intensity embeddings and feed them into the Seq2Seq decoder. The waveform is reconstructed from the converted Mel-spectrograms using Parallel WaveGAN. We use a public version of Parallel WaveGAN 4 and train it with the ESD database.
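To make the dimensions concrete, the following sketch shows how a scalar relative attribute could be projected to a 64-dimensional intensity embedding and concatenated with a 64-dimensional emotion embedding. It is a shape-level illustration only: the helper names, the random initialization, and the absence of a non-linearity are assumptions, not details of the Emovox implementation.

```python
import random

EMB_DIM = 64  # output size of both the emotion encoder and the intensity FC layer


def make_fc(in_dim, out_dim, rng):
    """Create a hypothetical fully connected layer as (weights, bias)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def fc(x, layer):
    """Apply y = Wx + b to a list of floats."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def decoder_condition(emotion_embedding, intensity, intensity_fc):
    """Concatenate the emotion embedding with the projected intensity value."""
    intensity_embedding = fc([intensity], intensity_fc)
    return emotion_embedding + intensity_embedding  # list concatenation
```

With these sizes, the conditioning vector passed to the decoder has 128 dimensions, and changing only the scalar intensity changes only its second half.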

Objective Evaluation
We first conduct an objective evaluation to assess the system performance using Mel-cepstral Distortion (MCD) and Differences of Duration (DDUR) as the evaluation metrics.

Mel-cepstral Distortion (MCD)
MCD [126] is calculated between the converted and the target Mel-cepstral coefficients (MCEPs), i.e., $\hat{y} = \{\hat{y}_m\}$ and $y = \{y_m\}$:

$$\text{MCD [dB]} = \frac{10\sqrt{2}}{\ln 10} \sqrt{\sum_{m=1}^{M} (y_m - \hat{y}_m)^2}$$

where $M$ represents the dimension of the MCEPs. A lower value of MCD indicates a smaller spectral distortion, and thus a better performance. Note that, in the Seq2Seq-EVC and Emovox models, we adopt Mel-spectrograms as the acoustic features. Therefore, we calculate MCEPs separately from the speech waveform.
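For reference, a frame-level MCD computation can be sketched as below, using the commonly adopted definition in dB. The function names are hypothetical, and the sketch assumes that the converted and target MCEP sequences have already been time-aligned (for example with dynamic time warping) so that frames can be compared one to one.

```python
import math

# Constant of the standard MCD definition in dB: 10 * sqrt(2) / ln(10)
MCD_SCALE = 10.0 * math.sqrt(2.0) / math.log(10.0)


def mcd_db(target, converted):
    """MCD in dB between two MCEP vectors of the same dimension M."""
    squared_error = sum((y - y_hat) ** 2 for y, y_hat in zip(target, converted))
    return MCD_SCALE * math.sqrt(squared_error)


def mean_mcd_db(target_frames, converted_frames):
    """Average MCD over time-aligned frame sequences (assumed pre-aligned)."""
    if len(target_frames) != len(converted_frames):
        raise ValueError("frame sequences must be aligned to the same length")
    values = [mcd_db(y, y_hat) for y, y_hat in zip(target_frames, converted_frames)]
    return sum(values) / len(values)
```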

Differences of Duration (DDUR)
To evaluate the distortion in terms of duration, we compute the average difference between the duration of the converted and the target utterances over the voiced parts (DDUR), which is widely used in voice conversion studies [38], [56], [59]:

$$\text{DDUR} = |Z - \hat{Z}|$$

where $Z$ and $\hat{Z}$ represent the duration of the reference utterance and the converted utterance, respectively. A lower value of DDUR represents a better performance in terms of duration conversion.
4. https://github.com/kan-bayashi/ParallelWaveGAN

Subjective Evaluation
We adopt two subjective metrics: 1) a mean opinion score (MOS) test for emotion similarity evaluation, and 2) a best-worst scaling (BWS) test to evaluate speech quality, emotion intensity, and emotion similarity. 18 subjects participated in all the listening tests. These 18 subjects (12 male and 6 female) are native Chinese speakers and proficient in English. Their ages range from 20 to 30. All the subjects are required to listen with headphones and replay each sample 2-3 times. A detailed introduction to the judging criteria is given before the tests.

Mean Opinion Score (MOS) Test
We conduct a mean opinion score (MOS) [127] test to evaluate the emotion similarity. All participants are asked to listen to the reference target speech first and then score the speech samples for emotion similarity to the reference target speech. A higher score represents a higher similarity with the target emotion, and indicates a better emotion conversion performance. We randomly select 10 utterances from the evaluation set. Each subject listens to 120 converted utterances in total (120 = 10 × 4 (# of frameworks) × 3 (# of emotion pairs)).
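MOS results of this kind are usually summarized by a mean score and a 95 % confidence interval. The sketch below shows one common way to compute them. It is not the analysis script used in this study: the function name is hypothetical, and the interval uses a normal approximation (z = 1.96), which is an assumption.

```python
import math
import statistics


def mos_with_ci(scores, z=1.96):
    """Return (mean, half-width of the confidence interval) for MOS ratings.

    Assumption: a normal approximation is used, so the half-width is
    z * sample standard deviation / sqrt(number of ratings).
    """
    mean = statistics.fmean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width
```

A narrower interval for one system than for another indicates that listeners rated it more consistently.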

Best-Worst Scaling (BWS) Test
We also conduct a best-worst scaling (BWS) [128] test to evaluate: 1) Speech Quality: where all the listeners are asked to choose the best and the worst sample in terms of the speech quality, which covers two aspects: a) how well the linguistic content and speaker identity are preserved, and b) the naturalness of the speech; 2) Emotion Intensity: where all the listeners are asked to choose the most and the least expressive one in terms of the emotion expression; 3) Emotion Similarity: where all the listeners are asked to choose the best and the worst one in terms of the emotion similarity with the reference.
We randomly select 5 utterances from the evaluation set to perform the BWS tests. We first evaluate the performance of different intensity control methods in terms of speech quality and intensity control. Each subject listens to 135 converted utterances (135 = 5 × 3 (# of frameworks) × 3 (# of intensities) × 3 (# of emotion pairs)). We further conduct an ablation study with Emovox, where each subject listens to 60 converted utterances in total to evaluate the emotion similarity with the reference (60 = 5 × 4 (# of frameworks) × 3 (# of emotion pairs)).

RESULTS
In this section, we report our experimental results. We first compare the performance of Emovox with that of the baselines using objective and subjective evaluations in Section 5.1. We then evaluate the proposed emotion intensity control method through a comparison with other control methods in Section 5.2. In Section 5.3, we study the impact of the training data size in comparison with the baseline. Lastly, we study the contributions of the training strategies using ablation experiments in Section 5.4.

Emovox versus Baselines
In this subsection, we include CycleGAN-EVC, StarGAN-EVC, and Seq2Seq-EVC as baselines. Note that these baselines do not have an intensity control module. For a fair comparison, we conduct emotion intensity transfer for Emovox in both objective and subjective evaluations. As shown in Table 1, all systems achieve better MCD values than the Zero Effort case, which directly compares the source and target utterances without any conversion. We also observe that Emovox consistently outperforms CycleGAN-EVC and StarGAN-EVC. It also outperforms Seq2Seq-EVC for Neu-Ang and Neu-Hap and achieves comparable results for Neu-Sad. This suggests that Emovox is superior to the others in terms of spectrum conversion.

Fig. 7: Visualization of Mel-spectrograms from source neutral, reference sad, and converted sad utterances at three intensity values, i.e., 0.1, 0.5, 0.9, with the same speaking content ("At the end of four"). A greater intensity value represents a more emotional expression.

Differences of Duration (DDUR)
CycleGAN-EVC and StarGAN-EVC do not convert the speech duration. Thus, the DDUR results of these two frameworks are not reported. As shown in Table 1, compared with Seq2Seq-EVC, Emovox achieves better results for both Neu-Ang and Neu-Hap for duration modelling, and achieves comparable results in Neu-Sad conversion. These results further confirm the effectiveness of Emovox in terms of duration conversion.

Subjective Evaluation
We report Mean Opinion Score (MOS) test results for emotion similarity with the reference for our proposed Emovox and all the baselines. From Figure 6, we observe that our proposed Emovox consistently outperforms the baselines for all the emotion pairs. This observation is consistent with that in the objective evaluation. As for statistical significance, Emovox achieves the narrowest confidence interval for Neu-Ang and Neu-Hap, which suggests a high level of consistency [129]. Furthermore, we report the p-values of the t-tests on the MOS between Emovox and the others. We observe that almost all pairs achieve a p-value below 0.05, confirming that the differences are significant [130]. For Neu-Ang, the p-value between Emovox and Seq2Seq-EVC is about 0.0558, which is above 0.05 but below 0.1 and still supports our claim.

Note: Emovox w/ scaling factor, Emovox w/ attention weights, and Emovox w/ relative attributes are denoted as Scaling, Attention, and Relative, respectively. "B" denotes "Best", and "W" denotes "Worst".

Emotion Intensity Control
To evaluate the emotion intensity control in Emovox, we choose three different intensity values: 0.1, 0.5, and 0.9, corresponding to weak, medium, and strong. To understand the interplay between emotion intensity and different prosodic attributes, we first visualize several related prosodic cues of the converted emotional utterances with the same speaking content but different emotion intensities.
We then compare our intensity control methods with other state-of-the-art methods.

Visual Comparisons
We visualize the prosodic attributes related to the emotion intensity, such as speech duration, pitch, and energy, to gain an intuitive understanding of emotion intensity in speech. We would also like to show that, in our proposed framework, emotion intensity control is manifested in the changes of these prosodic features.
(1) Duration: Speech duration is considered a distinguishing factor between active and passive emotions [131]. To show that the emotion intensity is related to the speaking rate, we compare the Mel-spectrogram of a reference Sad utterance, which is characterized by a slower speaking rate and a more resonant timbre [132], with that of its Neutral counterpart in Figure 7(a). We also illustrate the Mel-spectrograms of Emovox-converted utterances with different intensities in Figure 7(c), (d), and (e). We observe that the converted Sad utterance with the highest intensity value has the slowest speaking rate among all three intensities (as shown in Figure 7(e)). As the intensity value increases, the speaking rate decreases.
(2) Pitch Envelope: The pitch envelope (i.e., the level, range, and shape of the pitch contour) is considered a major factor that contributes to speech emotion and is closely correlated with the activity level [132], [133]. We represent pitch information with the F0 contour, which is estimated with the Harvest algorithm [134] and aligned with dynamic time warping [135]. In Figure 8, we visualize the pitch contour of converted Angry, Happy, and Sad utterances with three different intensities. From Figure 8(a) and (b), we observe that the converted Angry and Happy utterances with higher intensity values tend to have higher F0 values with larger fluctuations over time. This coincides with the fact that utterances with higher intensity values are more vibrant and sharper in expressing emotions such as Angry and Happy.
For Sad, there is no big difference in F0 range for different intensities, as shown in Figure 8(c).This observation intuitively suggests that the intensity of expressing Sad emotion may be more related to the speaking rate than the vocal pitch.
(3) Speech Energy: Speech energy measures the volume or the loudness of a voice [136], [137]. Speech energy is often regarded as a prominent characteristic of emotion intensity in the literature [46], [138]. To show the effect of intensity control, we visualize and compare the energy contours of different intensities in Figure 9. To represent the speech energy, we use 26 Mel-filterbanks and multiply each of them with the power spectrum. Then, we measure the speech energy by adding up the coefficients. As shown in Figure 9(a) and (b), we observe that the converted Angry and Happy utterances with higher intensity have larger energy values, which is consistent with our observations on the F0 contour. As for Sad, we similarly observe that a higher intensity results in slightly higher energy values, as shown in Figure 9(c). These observations show that our proposed Emovox can effectively control the emotion intensity manifested in multiple prosodic factors in speech.
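The energy measure described in item (3) can be sketched for a single frame as follows. This is an illustrative sketch rather than the code used to produce Figure 9: the function names, the triangular filter shape, the mel-scale formula, and the frequency range from 0 Hz to half the sampling rate are all assumptions.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular mel filters over `n_bins` bins spanning 0..sample_rate/2."""
    nyquist = sample_rate / 2.0
    max_mel = hz_to_mel(nyquist)
    # n_filters + 2 edge frequencies, equally spaced on the mel scale
    edges = [mel_to_hz(max_mel * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [nyquist * k / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for i in range(1, n_filters + 1):
        lo, mid, hi = edges[i - 1], edges[i], edges[i + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def frame_energy(power_spectrum, bank):
    """Sum of all mel filterbank outputs for one power-spectrum frame."""
    return sum(sum(w * p for w, p in zip(row, power_spectrum)) for row in bank)
```

Applying `frame_energy` to every frame of an utterance, with a bank built by `mel_filterbank(26, n_bins, sample_rate)`, yields an energy contour that can be compared across intensity values.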

Comparison with State-of-the-Art Control Methods
As a comparative study, we implement three intensity control methods (i.e., Emovox w/ scaling factor, Emovox w/ attention weights, and Emovox w/ relative attributes) as described in Section 4.1. We evaluate the performance of these three methods in terms of speech quality and intensity control.
(1) Speech Quality: We first report the BWS listening test on Emovox for speech quality evaluation in Table 2. At each intensity value, the subjects are asked to evaluate the speech quality of the converted emotional speech with 3 different emotion intensity control methods.
From Table 2, we observe that Emovox w/ relative attributes always achieves the best results for Neu-Ang and Neu-Hap, and comparable results with Emovox w/ attention weights for Neu-Sad. These results show that Emovox w/ relative attributes can achieve better speech quality while controlling the output emotion intensity than other control methods.
(2) Intensity Control: We then report another BWS test to evaluate the performance of emotion intensity control. For each framework, listeners are asked to assess the emotional expressiveness among three different intensities. We expect the speech samples with an intensity value of 0.9 to sound more expressive than the others, and those with an intensity value of 0.1 to sound more neutral. We report the preference percentage scores (%) of the most and the least expressive samples for each control method in Figure 11 and Figure 10, respectively. As illustrated in Figure 11(c), Emovox w/ relative attributes achieves the best preference results on intensity control, where most listeners choose the samples with an intensity value of 0.9 as the most expressive ones. We also note that Emovox w/ scaling factor and Emovox w/ attention weights work well for converted angry and happy, as shown in Figure 11(a) and (b). However, their performance for converted sad is not satisfactory. We further observe that Emovox w/ relative attributes also works better than the others, where most listeners choose the samples with an intensity value of 0.1 as the least expressive ones, as shown in Figure 10(c). This observation is consistent with the previous one, which further validates the superior performance of relative attributes on emotion intensity control.
In summary, our proposed Emovox w/ relative attributes shows better performance on emotion intensity control while achieving better speech quality than other control methods.
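The preference percentage scores reported for the intensity control test can be tallied from raw listener choices as in the sketch below. The function name is hypothetical, and any input passed to it in an example would be illustrative data, not results from the listening tests.

```python
from collections import Counter


def preference_percentages(choices, options=(0.1, 0.5, 0.9)):
    """Percentage of listener choices that fall on each intensity value.

    `choices` holds, for each trial, the intensity value a listener picked
    (as the most expressive or as the least expressive sample).
    """
    counts = Counter(choices)
    total = len(choices)
    return {option: 100.0 * counts[option] / total for option in options}
```

For a well-behaved control method, the tally for the most expressive sample should concentrate on 0.9, and the tally for the least expressive sample on 0.1.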

Impact of Training Data Size
To evaluate the effect of training data on the final performance, we gradually reduce the number of utterances used at the emotion training stage. We use 300, 150, 100, and 50 training utterances for each emotion, and use 20 utterances for evaluation. In Figure 12, we report the MCD and DDUR results of Emovox and the baseline Seq2Seq-EVC.
We observe that Emovox consistently achieves better MCD results than Seq2Seq-EVC. We further observe that the MCD scores for Emovox with 150 down to 50 training utterances are comparable and not significantly poorer than those obtained with 300 utterances. This indicates Emovox's robustness to limited training data.
For DDUR, we first observe that both Emovox and Seq2Seq-EVC have much higher DDUR values with 50 training utterances. This suggests that neither framework can predict the speech duration well if the training set is too small. With 300 down to 100 training utterances, the performance of Emovox is comparable, which again attests to Emovox's robustness to limited training data.

Ablation Studies
We conduct ablation studies to validate the contributions of 1) style pre-training, and 2) perceptual losses from a pre-trained SER in emotion training.

Style Pre-training
We first compare the Mel-cepstral distortion (MCD) results of Emovox and Emovox (w/o style pre-training), where the latter is trained directly with a limited amount of emotional speech data and without any pre-training. As shown in Table 1, Emovox (w/o style pre-training) provides the worst results for all emotion pairs. We further compare Emovox and Emovox (w/o style pre-training) in terms of DDUR, as shown in Table 1. We observe that Emovox (w/o style pre-training) has the worst DDUR results. These results validate the effectiveness of style pre-training.

Perceptual Loss Functions
As discussed in Section 3.3.3, we expect that the emotion embedding similarity loss $L_{sim}$ and the emotion classification loss $L_{cls}$ help generate more discriminative embeddings (see Figure 5). To further validate the effectiveness of these two loss functions on the final performance, we conduct a best-worst scaling listening test where we evaluate the emotion similarity with the reference emotion. To be consistent with Section 3.3.3, we only conduct an ablation study with the Emovox w/o intensity configuration. The results are reported in Table 3.
From Table 3, we observe that most listeners choose "Emovox w/o intensity, w/ $L_{sim}$ and $L_{cls}$" as the best in terms of emotion similarity, while most of them choose "Emovox w/o intensity, w/o $L_{sim}$ or $L_{cls}$" as the worst for all the emotion pairs. This suggests that these two loss functions improve emotional expressiveness, which validates the idea of incorporating SER losses for emotion supervision.

CONCLUSION
This contribution filled the research gap of emotion intensity control in the current emotional voice conversion literature. We proposed a novel emotional voice conversion framework, Emovox, that is based on a sequence-to-sequence model. The proposed Emovox framework provides fine-grained, effective emotion intensity control for the first time in emotional voice conversion. The key highlights are as follows:
1) We formulated an emotion intensity modeling technique and proposed an emotion intensity control mechanism based on relative attributes. We showed that our proposed mechanism outperformed other competing control methods in speech quality and emotion intensity control.
2) Instead of simply correlating emotion intensity with the loudness of a voice, we presented a comprehensive analysis for the first time to understand the interplay between emotion intensity and various prosodic attributes such as speech duration, pitch envelope, and speech energy. We showed that our emotion intensity control could be manifested in various prosodic aspects.
3) We proposed style pre-training and perceptual losses from a pre-trained SER to improve the emotion intelligibility in converted emotional speech. We showed that Emovox outperformed state-of-the-art emotional voice conversion frameworks. With style pre-training and perceptual losses from a pre-trained SER, Emovox could perform well with a limited amount of emotional speech data.
Our future directions include the study of cross-lingual emotional voice conversion and emotion style modelling with self-supervised learning. In addition, a closer coupling of conversion and speech emotion recognition is foreseen: conversion can help augment training data for recognition, while recognition can serve as objective conversion training guidance.

Fig. 2: Block diagram of Emovox during the conversion stage. Emovox aims to transfer the reference emotion to the source speech ("emotion transfer") while controlling its emotion intensity ("intensity control") and preserving the source linguistic information ("linguistic transplant").

Fig. 3: Overall training diagram of Emovox, where emotion and its intensity are separately modelled. Two utterance-level perceptual losses from a pre-trained SER, 1) the emotion similarity loss $L_{sim}$ and 2) the emotion classification loss $L_{cls}$, are introduced to improve the emotional intelligibility at the utterance level. The red boxes represent the models that are involved in the training, while the green boxes are not.

Fig. 5: The distributions of emotion embeddings resulting from encoders of different training schemes: (a) the pre-trained style encoder, (b) the emotion encoder without $L_{sim}$ or $L_{cls}$, (c) the emotion encoder only with $L_{cls}$, (d) the emotion encoder only with $L_{sim}$, and (e) the emotion encoder with both $L_{sim}$ and $L_{cls}$. A smaller value of ratio $r$ indicates a better clustering performance.

Fig. 6: Mean Opinion Score (MOS) test with 95 % confidence interval to evaluate the emotion similarity with the reference, where listeners are asked to score each sample on a scale from -2 to +2 (-2: absolutely different; -1: different; 0: cannot tell; +1: similar; +2: absolutely similar). Marker * indicates p < 0.05 for paired t-test scores (pairs between Emovox and the others).
Fig. 10: A comparison of the preference percentage scores (%) of Emovox with 3 different emotion intensity control methods. Listeners are asked to listen to the converted speech samples of 3 different emotion intensities (0.1, 0.5, and 0.9), and choose the least expressive one.
Fig. 11: A comparison of preference percentage scores (%) of Emovox with 3 different emotion intensity control methods, and for 3 converted emotions.The subjects are asked to listen to the converted speech samples of 3 different emotion intensities (0.1, 0.5, and 0.9), and choose the most expressive one.

TABLE 1: A comparison of the MCD and the DDUR results of different methods for three emotion conversion pairs.

TABLE 2: A comparison of best-worst scaling (BWS) test results for speech quality of three different emotion intensity control methods with Emovox.
Fig. 12: A comparison of MCD and DDUR results of our proposed Emovox and Seq2Seq-EVC for all three emotion pairs with different amounts of training utterances, that are 300, 150, 100, and 50 utterances, respectively.

TABLE 3: The effect of the perceptual loss functions on the emotion similarity in a best-worst scaling (BWS) test for four variants of the Emovox w/o intensity framework.