Estimating the Uncertainty in Emotion Class Labels with Utterance-Specific Dirichlet Priors

Emotion recognition is a key attribute for artificial intelligence systems that need to interact naturally with humans. However, the task definition is still an open problem due to the inherent ambiguity of emotions. In this paper, a novel Bayesian training loss based on per-utterance Dirichlet prior distributions is proposed for verbal emotion recognition, which models the uncertainty in one-hot labels created when human annotators assign the same utterance to different emotion classes. An additional metric is used to evaluate the performance by detecting test utterances with high labelling uncertainty. This removes a major limitation of emotion classification systems that only consider utterances with labels where the majority of annotators agree on the emotion class. Furthermore, a frequentist approach is studied to leverage the continuous-valued "soft" labels obtained by averaging the one-hot labels. We propose a two-branch model structure for emotion classification on a per-utterance basis, which achieves state-of-the-art classification results on the widely used IEMOCAP dataset. Based on this, uncertainty estimation experiments were performed. The best performance in terms of the area under the precision-recall curve when detecting utterances with high uncertainty was achieved by interpolating the Bayesian training loss with the Kullback-Leibler divergence training loss for the soft labels. The generality of the proposed approach was verified using the MSP-Podcast dataset, which yielded the same pattern of results.


INTRODUCTION
Automatic emotion recognition (AER), as a key component of affective computing, has attracted much attention due to its wide range of potential applications in conversational AI, driver monitoring, and mental health analysis. Despite significant progress in recent years [1], [2], [3], AER is still a challenging problem and does not even have a single widely accepted task definition.
A straightforward definition of AER is to classify each utterance (i.e. a data sample of emotions with verbal evidence) into one of a set of pre-defined discrete emotional states (referred to as emotion classes in this paper, such as happy, sad, and frustrated) [4], [5], [6], [7], [8], [9], [10], [11]. However, since emotion is inherently ambiguous, complex, and highly personal, disagreements exist among the human annotators who perceive the emotions and then label the data. Based on subjective judgements, different one-hot labels can be assigned to the same utterance by different annotators, and even by the same annotator [7], which leads to uncertainty in emotion labelling. Statistics show that some emotion classes are positively correlated (e.g. sad & frustrated) or inversely correlated (e.g. sad & happy) [7], [9], [12]. These issues make the labels created by single annotators less reliable. As a solution, multiple annotators are often employed to label each utterance, and majority voting is used to remove the uncertainty arising from differences among the assigned labels. However, this strategy not only causes training utterances without majority agreed labels to be unused, which is not ideal since emotional data are highly valuable, but also effectively assumes that such utterances would either not be encountered or not be evaluated at test-time. In fact, mixtures of emotions are commonly observed in human interaction [13].
In this paper, AER with one-hot labels is studied and the classification-based task definition revisited. Instead of removing the uncertainty in labels with majority voting, we propose modelling such uncertainty with a novel Bayesian training loss based on utterance-specific Dirichlet prior distributions. In this approach, the one-hot labels provided by the different annotators of each utterance are considered as categorical distributions sampled from an utterance-specific Dirichlet prior distribution. A separate prior distribution is estimated for each utterance, which is achieved by an improved Dirichlet prior network (DPN) [14]. The DPN is trained by minimising the negative log likelihood of sampling the original one-hot labels from their relevant utterance-specific Dirichlet priors. Alternatively, from a frequentist perspective, "soft" labels can be obtained by averaging the categorical distributions relevant to the one-hot labels, which can be viewed as the maximum likelihood estimate (MLE) of the label for each utterance. An AER model can be trained by minimising the Kullback-Leibler (KL) divergence between the soft labels and its output distributions. The DPN and KL divergence training losses can be combined by simple linear interpolation.
To evaluate the proposed approach, a state-of-the-art neural network model architecture proposed in [15] is adopted for emotion classification, which consists of a time synchronous branch (TSB) that focuses on modelling the temporal correlations of multimodal features and a time asynchronous branch (TAB) that takes sentence embeddings as input to facilitate modelling the meanings embedded in the text transcriptions. Experimental results on the widely used IEMOCAP dataset [7] show that the TSB-TAB structure achieves state-of-the-art classification results in 4-way classification (happy, sad, angry & neutral) when evaluated with all of the commonly used speaker-independent test setups. The 4-way classification is then extended to 5-way classification by including an extra emotion class "others", which represents all the other types of emotion labelled in IEMOCAP but ignored in the 4-way setup. Next, we redefine the task from classification to distribution modelling by representing the emotions using a 5-dimensional distribution rather than a single hard label. This allows utterances without majority agreed labels to also be considered. Uncertainty estimation is performed by training the model with soft labels and the DPN training loss. Classification accuracy is no longer an appropriate evaluation metric when considering uncertainty in emotion labelling. Instead, we propose evaluating the model performance in uncertainty estimation in terms of the area under the precision-recall curve (AUPR) when detecting utterances without majority unique labels at test-time. This also provides a more general two-step test procedure for AER, which can detect utterances with high uncertainty in emotion for further processing and classify the remainder into one of the emotion classes. Further experiments on the MSP-Podcast dataset, a larger speech corpus with natural emotions, demonstrate that our proposed uncertainty estimation approach generalises to handling realistic emotion data.
The rest of the paper is organised as follows. Section 2 discusses work on objective emotion quantification and related methods that address the inconsistency of emotion perception among human annotators. Section 3 presents an analysis of the IEMOCAP database and revisits the definition of the emotion classification task with IEMOCAP. Section 4 introduces the soft-label-based and DPN-based approaches to emotion distribution modelling and uncertainty analysis. The model structure and experimental setup are described in Section 5. Section 6 presents the results and analysis for emotion classification and distribution modelling on IEMOCAP. The proposed approach is further verified with experiments on the MSP-Podcast dataset in Section 7, followed by conclusions.

RELATED WORK
The inherent ambiguity of emotion, resulting from mixed emotions and personal variations in emotion expression, makes it still an open question how to define emotion for easier quantification and objective analysis. Discrete emotion theory classifies emotion into several basic categories (e.g. happy, sad, fear, and anger) [16], [17], [18], while psychologists have also observed that these distinct emotion categories overlap and have blurred boundaries [19], [20]. Alternative methods were developed to characterise emotional states by several fundamental continuous-valued or multi-valued bipolar dimensions that are more suitable to be evaluated independently [21], [22].
The subjectivity of emotional perception further complicates the problem of designing AER datasets. Despite the efforts of psychologists to de-correlate emotion dimensions, creating intensity labels with continuous values or multiple discrete values can still be highly subjective and also lead to a high degree of uncertainty in the data. In response to this problem, most datasets were created using the strategy of having multiple human annotators provide labels for each utterance. The "ground truth" is then commonly defined as the majority vote for discrete labels [7], [9], [12], [26] or the mean for dimensional labels [7], [9], [27]. When using the mean dimensional labels, the discrepancies between annotators are ignored. Several approaches have been proposed to characterise the subjective property of emotion perception by modelling the inter-annotator disagreement level as the standard deviation of the dimensional labels, such as including a separate task to predict the standard deviation in a multi-task framework [28], [29], or predicting such values using Gaussian mixture regression models [30], [31]. Recently, alternative methods including Gaussian processes [32], generative variational auto-encoders [33], and Monte-Carlo dropout [34] have been applied to the problem without using the standard deviation of dimensional emotion labels as additional training labels.
When a majority vote is used to obtain the ground truth for discrete class labelling, the data without a ground truth due to annotator disagreement are usually discarded in classification-based AER [35], [36], [37]. AER researchers have proposed various methods to address the uncertainty in emotion labelling caused by the inconsistency of emotion perception among human annotators. Nediyanchath et al. [38] used multitask learning with gender or speaker classification to model the variations in personal aspects of emotional expression. Lotfian et al. [39] proposed a multitask learning framework to recognise the primary emotion class by leveraging extra information provided in the evaluations about secondary emotions. Ando et al. [40] estimated the existence of multi-label emotions as an auxiliary task to improve the recognition of the dominant emotions. Another commonly used method is to train AER models with soft labels, which are derived as the mean of the hard labels and can be interpreted as the intensities of the emotion classes [41]. Fayek et al. [42] incorporated inter-annotator variability by training a separate model based on the hard labels produced by each annotator, and showed that an ensemble of such models performed similarly to a single model trained using the soft labels. Although these approaches improved training with soft labels, at test-time the evaluations were only based on emotion classification accuracy, which results in a major inconsistency between training and evaluation [43].
This paper focuses on classification-based AER. Rather than trying to remove the uncertainty in the emotion representation or to make emotion classes more separable, we acknowledge the ambiguity in emotion expression and the subjectivity in emotion perception and evaluation, and model the resulting uncertainty in emotion class labels using a novel Bayesian training loss.

EMOTION CLASSIFICATION WITH IEMOCAP
The IEMOCAP [7] corpus is the primary corpus used in this paper. It is one of the most widely used datasets for verbal emotion classification and is designed with a typical data annotation procedure. It consists of 5 dyadic conversational sessions performed by 10 professional actors, and the data includes three modalities: the spoken audio, the text transcriptions, and the facial movements. There are in total 10,039 utterances and approximately 12 hours of data, with an average duration of 4.5 seconds per utterance. Each utterance was annotated by three human annotators with categorical labels (neutral, happiness, sadness, anger, etc.). Each annotator was allowed to tag more than one emotion category for a sentence if they perceived a mixture of emotions. The ground truth labels are determined by majority voting. However, since only 7,532 utterances have majority unique hard labels¹, the remaining 25% of the utterances are normally discarded. This issue and its solutions will be discussed in detail in Sections 3.3 and 4. Although our analysis is primarily performed on IEMOCAP, the same issues exist in many other commonly used AER datasets, such as MSP-Podcast [10], which is analysed in Section 7.

4-way classification
To be consistent and comparable with previous studies [1], [35], [36], [37], [44], [45], [46], only utterances with majority unique labels belonging to "angry", "happy", "excited", "sad", and "neutral" were used for our 4-way classification experiments. The "excited" class was merged with "happy" to better balance the size of each emotion class, which results in a total of 5,531 utterances: 1,636 happy, 1,103 angry, 1,084 sad, and 1,708 neutral. This approach is widely used, but it discards 44% of the data.

5-way classification
As shown in Table 1, although 7,532 utterances in IEMOCAP have majority unique labels, only 5,531 of them are used in the 4-way classification setup given in Section 3.1. This excludes all the other classes of emotions, including "frustration", which is the largest emotion class and accounts for 25% of the dataset, and therefore this partition only tackles part of the AER problem. To resolve this issue, we investigated the use of an alternative 5-way classification setup. An extra target of "others" was included as the 5th class to represent all of the other emotions that exist in IEMOCAP, including utterances labelled as "frustration", "fear", "surprise", "disgust", and "other". All 7,532 utterances with majority unique labels were used for training and test in 5-way classification.
As in most previous studies, our 4-way and 5-way classification systems can be evaluated based on classification accuracy with the utterances with majority unique labels.
1. The emotion category with the highest number of votes was unique (note that the evaluators were allowed to tag more than one emotion category).

However, for the utterances without majority unique labels, which comprise 25% of IEMOCAP, classification accuracy against a single reference cannot be used². In fact, the utterances without majority unique labels often include complex and meaningful emotions, and are potentially important for training AER systems. Furthermore, similar data will be encountered at test time and the system should also be evaluated on this type of data. In order to understand the problem better, the following section performs a data analysis.

Data analysis
Recall that in IEMOCAP, each utterance was labelled by three annotators and each annotator was allowed to give multiple different labels to each utterance. Table 2 shows examples of typical situations for the hard labels provided by the annotators, and Table 3 summarises some of the statistics of IEMOCAP. There are 1,272 evaluations that have more than one label, indicating that the annotators were uncertain about the emotions when evaluating the utterance. When labels from different annotators are considered, all annotators agreed on the same emotion class label (e.g. 'AAA') for only 24% (2,383 out of 10,039) of the utterances, which we denote as 3/3 agreement or Ω 3/3 utterances. The rest of the utterances consist of those with agreement from two annotators, denoted as 2/3 agreement or Ω 2/3 utterances (e.g. 'AAB'), and those without any agreement, referred to as no agreement or Ω ≤1/3 utterances (e.g. 'ABC'). Note the case shown in the last row of Table 2: although the labels have majority 'AB', the majority is not unique, and thus the sentence belongs to Ω ≤1/3.
When majority voting is applied, both 3/3 agreement and 2/3 agreement utterances result in the same majority unique ground truth label (e.g. 'A'), despite the fact that an annotator assigned different labels to the 2/3 agreement utterances, which causes a loss of the complexity and uncertainty in emotion annotation. The problem is even more severe when there is no majority label, since such utterances are ignored completely in both training and test. Considering the fact that 51% of the utterances are 2/3 agreement and 25% have no majority unique label, the strategy of using majority voting on labels significantly changes the true emotion distribution. To resolve these problems, in the next section, we propose representing and modelling the emotion using a distribution rather than a single hard label.

2. Multiple reference annotations could be counted as correct for scoring purposes, but this is clearly unsatisfactory when the number of annotators increases while using a small number of emotion classes.

UNCERTAINTY ESTIMATION IN EMOTIONS
To model the uncertainty caused by subjective emotion perception and evaluation, we propose revising the training target for each utterance from a single one-hot hard label (or categorical or 1-of-K label) to a continuous-valued categorical distribution over the emotion classes. For Ω ≤1/3 utterances, this allows the consideration of scenarios where majority unique labels do not exist (e.g. the 'ABC' and 'AABBC' cases in Table 2). For Ω 2/3 utterances, it avoids the problem that not all original hard labels are represented by majority voting (e.g. the 'AAB' and 'AABC' cases in Table 2). To learn such a categorical distribution, a frequentist approach is first used to obtain a "soft" label for training using the MLE of the distribution for each utterance. Next, an alternative Bayesian approach is proposed that estimates a separate Dirichlet prior for each utterance. These two approaches can be combined by interpolating the KL loss with the DPN loss. Finally, a method to evaluate the performance of the 5-way AER systems for uncertainty estimation is proposed.

Soft labels
Denote by µ = [p(ω_1|µ), . . ., p(ω_K|µ)]^T a categorical distribution representing the emotion distribution of an utterance x, where K is the number of emotion classes. Let {x_n, µ_n^(1), . . ., µ_n^(M_n)}_{n=1}^N be a dataset with N utterances, where x_n is the input features of the n-th utterance and µ_n^(1), . . ., µ_n^(M_n) are its M_n labels provided by all annotators. In this paper, µ_n^(m) is a one-hot vector since hard labels are annotated as in IEMOCAP. Such hard labels can be considered as samples drawn from the underlying true emotion distribution p_tr(µ|x). For brevity, the subscript n is dropped and the following analysis is based on a single utterance x and its associated M labels {µ^(1), . . ., µ^(M)}.
One way to obtain the target emotion distribution for each utterance is to use the MLE μ. The k-th element of μ, μ_k = p(ω_k|μ), is obtained as the relative frequency of the ω_k hard labels:

μ_k = N_k / M,    (1)

where N_k is the number of occurrences of ω_k in {µ^(1), . . ., µ^(M)}. Such an MLE-based distribution is referred to as a soft label, which consists of the proportion of each emotion class. For instance, if the three 1-of-3 hard labels of an 'AAB' utterance are [1,0,0], [1,0,0], and [0,1,0], the soft label is [0.67, 0.33, 0], whereas the majority unique label obtained by majority voting is [1,0,0]. This comparison shows that a soft label can preserve some uncertainty information derived from the original hard labels. The 5-way classification system introduced in Section 3.2 can be trained using soft labels instead of majority unique labels by minimising the KL divergence between the soft labels and the predictions, which is abbreviated as a "soft" system in this paper. Denoting the softmax output of the neural network model as y, where Λ is the collection of model parameters and y is the predicted emotion distribution, the soft system training loss is the KL divergence between μ and y:

L_kl = KL(μ ∥ y) = Σ_{k=1}^K μ_k ln(μ_k / y_k).    (2)
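As a worked example of the soft-label construction and the KL training loss above, the following sketch (illustrative Python, not the paper's code; all function names are made up) computes the soft label of an 'AAB' utterance and its KL divergence from a hypothetical model output:

```python
import math

# Three annotators give one-hot hard labels for an 'AAB' utterance
# (class indices 0, 0, 1); K = 3 classes in this toy example.
hard_labels = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]

def soft_label(labels):
    """MLE of the per-utterance emotion distribution: the relative
    frequency N_k / M of each class among the M hard labels (Eqn. (1))."""
    m, k = len(labels), len(labels[0])
    return [sum(l[i] for l in labels) / m for i in range(k)]

def kl_divergence(mu, y, eps=1e-12):
    """KL(mu || y) between the soft label mu and the model output y
    (Eqn. (2)); terms with mu_k = 0 contribute nothing."""
    return sum(p * math.log(p / max(q, eps)) for p, q in zip(mu, y) if p > 0)

mu = soft_label(hard_labels)          # [0.667, 0.333, 0.0]
loss = kl_divergence(mu, [0.6, 0.3, 0.1])
```

Note how the soft label [0.67, 0.33, 0] keeps the minority 'B' vote that majority voting would discard.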

Dirichlet prior network
In Section 4.1, the average of the observed labels of an utterance is used as the approximation to the true target emotion distribution.The MLE converges to the true target distributions when there is an extremely large number of labels available.However, this condition cannot be satisfied in real-world AER since often only a small number of annotators (i.e. three for IEMOCAP) can be employed for emotion data labelling due to the task complexity and cost.This issue can be alleviated from the Bayesian perspective by introducing a prior distribution for the categorical distribution.
In this section, we introduce the Dirichlet prior network (DPN) [14], [47], a neural network model which models p(µ|x, Λ) by predicting the parameters of its Dirichlet prior distribution.
The Dirichlet distribution, as the conjugate prior of the categorical distribution, is parameterised by its concentration parameters α = [α_1, . . ., α_K]^T. The Dirichlet distribution Dir(µ|α) is defined as

Dir(µ|α) = (Γ(α_0) / Π_{k=1}^K Γ(α_k)) Π_{k=1}^K µ_k^{α_k − 1},  α_0 = Σ_{k=1}^K α_k,    (3)

where Γ(·) is the gamma function defined as

Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt.    (4)

Hence, as shown in Fig. 1, given the concentration parameters α, the categorical distribution µ is a sample drawn from Dir(µ|α), and a 1-of-K hard label relevant to the emotion class ω_k is a sample drawn from µ. Here µ models the distribution over the K emotion classes, while Dir(µ|α) models the distribution of the emotion distribution µ. For AER, utterance-specific priors derived for each utterance separately are more suitable than a "global" prior shared by all utterances, since emotions produced by different speakers in different contexts should not have the same prior. A DPN predicts the concentration parameters as the output of the network, α = f_Λ(x). Here the prior distributions are "utterance-specific" as they are predicted separately for each utterance.
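The two-stage generative hierarchy of Fig. 1 can be simulated in a few lines. This is an illustrative sketch (the concentration values and function names are made up, and the Gamma-normalisation sampler is a standard construction, not something specified in the paper): a categorical distribution µ is drawn from Dir(µ|α), and a 1-of-K hard label is then drawn from µ.

```python
import random

random.seed(0)
alpha = [8.0, 3.0, 0.5, 0.5]   # illustrative sharp prior favouring class 0

def sample_dirichlet(alpha):
    """Draw mu ~ Dir(alpha) by normalising independent Gamma samples."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def sample_hard_label(mu):
    """Draw a 1-of-K hard label from the categorical distribution mu."""
    k = random.choices(range(len(mu)), weights=mu)[0]
    return [1 if i == k else 0 for i in range(len(mu))]

mu = sample_dirichlet(alpha)     # one annotator's "opinion" distribution
label = sample_hard_label(mu)    # the one-hot label that annotator assigns
```

A sharp prior (large α_0) makes the sampled µ, and hence the annotators' labels, agree more often; a flat prior models an utterance whose labels are expected to disagree.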

Training
Labels provided by the different annotators of an utterance can be regarded as categorical distribution samples drawn from the utterance-specific prior. Given a training utterance x and M categorical emotion distributions {µ^(1), . . ., µ^(M)} provided by the annotators, a DPN is trained to maximise the likelihood p(µ|x, Λ) = Dir(µ|α), which is equivalent to minimising the negative log likelihood loss function:

L_dpn = −(1/M) Σ_{m=1}^M ln Dir(µ^(m)|f_Λ(x)),    (5)

where Dir(µ^(m)|f_Λ(x)) is defined in Eqn. (3) and µ^(m) is a one-hot hard label. L_dpn is referred to as the DPN loss in the rest of the paper. When using the DPN loss, label smoothing [48] that converts µ^(m) into a "softer" label μ^(m), whose k-th element is 1 − (K − 1)ε_1 for the labelled class and ε_1 otherwise, was found necessary to stabilise training, where ε_1 > 0 is a small constant [14]. It was also observed that it is important to increase each α_k predicted by the model by another small constant ε_2 > 0 when calculating Dir(μ^(m)|f_Λ(x)) based on Eqn. (3) [14]. Comparing Eqn. (5) to Eqn. (2), each label µ^(m) is taken into account separately when training a DPN, while only the averaged label μ is considered when training a soft label system. For example, the two sets of annotations 'A','B','C' and 'ABC','ABC','ABC' yield the same soft label loss but different DPN losses. The latter case shows that all three annotators are uncertain about the emotion, indicating that the emotion of the utterance might have a high degree of inherent uncertainty. DPN training preserves the number of occurrences of each emotion class and allows an estimate of the confidence of the uncertainty.
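The DPN loss for a single utterance can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the label smoothing described above (labelled class gets 1 − (K − 1)ε_1, the rest get ε_1), and the α values are made up rather than predicted by a network.

```python
import math

EPS1, EPS2 = 1e-2, 1e-8   # smoothing constants, as quoted in the text

def log_dirichlet(mu, alpha):
    """ln Dir(mu | alpha) following Eqn. (3)."""
    a0 = sum(alpha)
    return (math.lgamma(a0) - sum(math.lgamma(a) for a in alpha)
            + sum((a - 1.0) * math.log(m) for a, m in zip(alpha, mu)))

def smooth(one_hot, eps1=EPS1):
    """Soften a one-hot label so every element is strictly positive."""
    k = len(one_hot)
    return [1.0 - (k - 1) * eps1 if v == 1 else eps1 for v in one_hot]

def dpn_loss(hard_labels, alpha, eps2=EPS2):
    """Negative log likelihood of the M smoothed labels under Dir(alpha),
    i.e. Eqn. (5) with the two stabilising constants applied."""
    alpha = [a + eps2 for a in alpha]
    return -sum(log_dirichlet(smooth(l), alpha)
                for l in hard_labels) / len(hard_labels)

# 'AAB' utterance: each annotator's vote enters the loss separately.
loss = dpn_loss([[1, 0, 0], [1, 0, 0], [0, 1, 0]], alpha=[4.0, 2.0, 1.0])
```

Unlike the soft-label KL loss, permutation-inequivalent label sets such as 'AAB' and 'ABC' give different values here, since each µ^(m) contributes its own likelihood term.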

Inference
The predictive distribution of the DPN for an input x is given by marginalising over all possible categorical distributions, which is equal to the expected categorical distribution under the conditional Dirichlet prior:

p(ω_k|x, Λ) = E_{Dir(µ|α)}[p(ω_k|µ)] = α_k / α_0 = exp(z_k) / Σ_{j=1}^K exp(z_j),    (6)

where α_k = exp(z_k) and z_k is the logit of the k-th output unit of the neural network model. This makes the expected posterior probability of an emotion class ω_k be the value of the k-th output of the softmax. In this way, a standard DNN classifier with a softmax output function can be viewed as predicting the expected categorical distribution under a Dirichlet prior [14], while the mean is insensitive to arbitrary scaling of α_k. This means that the precision α_0, which controls the sharpness of the Dirichlet prior distribution, is degenerate if the classifier is trained with the KL divergence loss instead of the DPN loss.
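The identity in Eqn. (6) is easy to verify numerically. The sketch below (illustrative logit values, not from the paper) checks that with α_k = exp(z_k), the Dirichlet mean α_k/α_0 coincides with the softmax of the logits:

```python
import math

z = [2.0, 0.5, -1.0]                  # made-up logits for K = 3 classes
alpha = [math.exp(v) for v in z]      # alpha_k = exp(z_k)
a0 = sum(alpha)                       # precision of the Dirichlet prior

expected = [a / a0 for a in alpha]    # E[mu_k] = alpha_k / alpha_0
softmax = [math.exp(v) / sum(math.exp(u) for u in z) for v in z]
# Scaling every alpha_k by a constant changes a0 (the sharpness of the
# prior) but cancels in the mean, which is why a KL-trained softmax
# classifier leaves the precision degenerate.
```

This is why only the DPN loss, which evaluates the full Dirichlet density rather than its mean, can constrain α_0.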

Combining DPN with soft labels
In this section, we propose to combine the DPN loss L_dpn with the KL loss L_kl for soft labels:

L_dpn-kl = L_dpn + λ L_kl,    (7)

where λ is a scalar coefficient. Compared to the gradients produced by L_kl, it was observed empirically that those produced by L_dpn are highly sparse and have a much larger dynamic range. Interpolating L_dpn with L_kl reduces the sparsity of the gradient values, and the value of λ (e.g. 20) is set manually to ensure that the dynamic ranges of the gradients of the L_dpn and L_kl terms are similar. In practice, it was found in our experiments that using L_dpn-kl instead of L_dpn not only stabilised DPN training by removing the need for the two smoothing constants ε_1 and ε_2 defined in Section 4.2.1, but also improved the AER system performance with all evaluation criteria. Another motivation for using L_dpn-kl lies in the connection between L_dpn and L_kl when an unlimited number of labels is available for each utterance. Let y = [p(ω_1|x, Λ), . . ., p(ω_K|x, Λ)]^T be the distribution of x belonging to each of the K emotion classes estimated by the neural network model. When M → ∞, the DPN loss can be rewritten as

L_dpn^∞ = −E_{p_tr(µ|x)}[ln Dir(µ|f_Λ(x))] = −Σ_{k=1}^K E_{p_tr(µ|x)}[I(µ ∈ ω_k)] ln Dir(µ^(k)|f_Λ(x)),    (8)

where I(µ ∈ ω_k) is an indicator function that equals one when µ is a hard label of ω_k, and µ^(k) denotes the hard label of class ω_k. This expectation decomposes into L_kl^∞ plus the entropy H[p_tr(µ|x)], which is a constant term in the loss; here p_tr(µ|x) is the underlying true emotion distribution of x and L_kl^∞ = KL[p_tr(µ|x) ∥ y] is the KL loss when M → ∞. Thus both L_dpn^∞ and L_kl^∞ reach the same optimum. With a finite number of labels for each utterance, L_dpn and L_kl approximate L_dpn^∞ and L_kl^∞ separately from different perspectives: L_dpn approximates the expectation with respect to the true distribution by the empirical average of the likelihood, while L_kl approximates the true distribution by the sample average. When only a finite number of hard labels is available, L_dpn-kl can achieve a better approximation by leveraging the complementarity of L_dpn and L_kl. It is worth mentioning that our proposed combined loss could also be applicable to other perception and understanding tasks which rely on subjective evaluations and uncertain labels from human annotators.
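The interpolation L_dpn-kl = L_dpn + λ L_kl can be sketched for a single utterance as below. This is an illustrative, self-contained re-implementation (α values are made up, and a small smoothing constant is kept so the Dirichlet density is evaluated away from the simplex boundary, whereas the experiments in this paper set ε_1 = ε_2 = 0 for L_dpn-kl inside a neural network training loop):

```python
import math

LAM = 20.0   # lambda = 20, the setting quoted in the text

def log_dir(mu, alpha):
    """ln Dir(mu | alpha), as in Eqn. (3)."""
    return (math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
            + sum((a - 1.0) * math.log(m) for a, m in zip(alpha, mu)))

def dpn_kl_loss(labels, alpha, lam=LAM, eps=1e-2):
    """L_dpn + lam * L_kl for one utterance's hard labels."""
    m, k = len(labels), len(labels[0])
    # L_dpn term: one likelihood term per (smoothed) annotator label.
    smoothed = [[1.0 - (k - 1) * eps if v else eps for v in l] for l in labels]
    l_dpn = -sum(log_dir(s, alpha) for s in smoothed) / m
    # L_kl term: KL between the soft label and the Dirichlet mean.
    mu = [sum(l[i] for l in labels) / m for i in range(k)]
    y = [a / sum(alpha) for a in alpha]
    l_kl = sum(p * math.log(p / q) for p, q in zip(mu, y) if p > 0)
    return l_dpn + lam * l_kl

loss = dpn_kl_loss([[1, 0, 0], [1, 0, 0], [0, 1, 0]], alpha=[4.0, 2.0, 1.0])
```

The two terms pull in complementary directions: the KL term provides dense gradients toward the soft label, while the DPN term keeps the per-annotator vote counts in the objective.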

Evaluation of uncertainty estimation
In the previous sections, a Bayesian approach has been proposed to model the uncertainty in the training labels.However, the most appropriate way to handle the uncertainty in the test data remains an issue.From Table 3, at least 25% of the utterances in IEMOCAP do not have majority unique labels, which makes it impossible to evaluate by classification with a single reference class.This indicates classification accuracy is no longer a suitable evaluation metric when considering more general AER applications, which can encounter test utterances with ambiguous emotional content.Therefore, in this section, we propose using the area under the precision-recall curve (AUPR) with different measures as alternative metrics for AER.
The AUPR is the average of precision across all recall values, computed as the area under the precision-recall (PR) curve. To compute the AUPR, a binary task is first defined which, in this paper, is detecting utterances without majority agreed labels (i.e. Ω ≤1/3 utterances in IEMOCAP). An uncertainty measure is then defined to quantify the uncertainty of each predicted distribution. Two measures are used in this paper:

• The probability of the predicted class or max probability (Max.P), which measures the confidence in the prediction [49], [50]:

Max.P = max_k p(ω_k|x, Λ).

• The entropy of the predictive distribution (Ent.), which has been used in [49], [51]. It behaves similarly to Max.P, but represents the confidence encapsulated in the entire output distribution:

Ent. = −Σ_{k=1}^K p(ω_k|x, Λ) ln p(ω_k|x, Λ).

A decision threshold is set based on the uncertainty measure, which determines whether a test sample belongs to the positive or negative class. For example, utterances with Max.P lower than the threshold (or Ent. higher than the threshold) are predicted as Ω ≤1/3. A PR curve is obtained by calculating the precision and recall for different decision thresholds, where the x-axis of a PR curve is the recall, the y-axis is the precision, and the decision thresholds are implicit and not shown as a separate axis. The area under the PR curve is computed as the AUPR. Compared to classification accuracy, the AUPR can not only be applied to any test utterance but also quantifies the model's ability to estimate uncertainty. In Section 6.3, experiments that assess uncertainty estimation ability by detecting utterances without majority agreed labels are reported. This detects whether a test utterance belongs to Ω ≤1/3 based on the model output distributions with either Max.P or Ent. In the detection experiments, Ω ≤1/3 is chosen as the negative class, while Ω 2/3 and Ω 3/3 are chosen as the positive classes. Detecting Ω ≤1/3 was selected as the binary task for AUPR since it simulates a real application case: if an Ω ≤1/3 utterance is detected, the utterance may include ambiguous emotions that should be evaluated by further models or humans; otherwise the utterance belongs to Ω 2/3 or Ω 3/3, where a majority unique label exists and emotion classification can be applied.
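The detection step above can be sketched end-to-end on toy data. This is an illustrative example only (the predicted distributions and targets are made up, and the AUPR is computed here as average precision over a ranked list, one common way of approximating the area under the PR curve):

```python
import math

def entropy(y):
    """Ent. of a predictive distribution; zero terms are skipped."""
    return -sum(p * math.log(p) for p in y if p > 0)

def aupr(scores, is_target):
    """Average precision: utterances are ranked by decreasing uncertainty
    score, and precision is averaged at each true positive."""
    ranked = sorted(zip(scores, is_target), reverse=True)
    tp, ap, n_pos = 0, 0.0, sum(is_target)
    for i, (_, pos) in enumerate(ranked, start=1):
        if pos:
            tp += 1
            ap += tp / i
    return ap / n_pos

# Toy predictive distributions for three utterances; 1 marks an utterance
# without a majority unique label that we want the measure to flag.
preds = [[0.9, 0.05, 0.05], [0.4, 0.35, 0.25], [0.34, 0.33, 0.33]]
no_majority = [0, 1, 1]
score = [entropy(y) for y in preds]   # higher entropy = more uncertain
value = aupr(score, no_majority)
```

In this toy case the two flat distributions rank above the sharp one, so the detector is perfect and the AUPR is 1.0; a model with poor uncertainty estimates would rank some confident predictions among the ambiguous ones and score lower.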

EXPERIMENTAL SETUP
Our experimental setup is given in this section, including feature representations and model structures.

Feature representations
The audio representation used for speech-based AER often includes log-mel filterbank features (FBKs) [52]. In this paper, 40-dimensional (-d) FBKs with a 10 ms frame shift and 25 ms frame length are used, denoted FBK 25. An additional type of long-term FBK feature is also used, which is extracted in the same way as FBK 25 apart from a longer 250 ms frame length, and is denoted FBK 250. FBK features contain information about the short-term spectrum but do not explicitly contain pitch information, which can be important in describing emotional speech [53] and is often complementary to FBK features [54], [55]. Following [56], log pitch frequency features with probability-of-voicing-weighted mean subtraction over a 1.5 second window are used along with the FBK features.
Text features are also included in our models.Pre-trained 50-d GloVe embeddings are used to encode word-level transcriptions [57], while the pre-trained BERT-based model without fine-tuning is used to encode the transcription of each single utterance into a 768-d vector [58].Following prior work [35], [36], [37], the reference transcriptions from IEMOCAP were used for the text modality.

Model structure
The proposed model structure is shown in Fig. 2. It consists of a time synchronous branch (TSB) that fuses the audio features with the corresponding text information at each time step, and a time asynchronous branch (TAB) that captures the text information embedded across the transcriptions of a number of consecutive utterances. In the TSB, the audio features and the corresponding GloVe-based word embeddings are combined at each time step with a simple concatenation operation. The TSB structure is similar to that often used for speaker embedding extraction [59]. It uses a five-head self-attentive layer [60] to pool the frame-level vectors across time in the input window, and a time delay neural network with residual connections [61] is used as the encoder to derive the frame-level vectors.
While the TSB includes modelling the temporal correlations between different modalities, the TAB focuses on capturing text information including meaning from the speech transcriptions.The BERT-derived sentence embeddings of the utterance transcriptions are used as the input vectors to the TAB.The embeddings for a number of consecutive utterances were used as the TAB input since the emotion of each utterance is often strongly related to its context in a spoken dialogue [62].A shared fully-connected (FC) layer is used to reduce the dimension of each input BERT embedding, and the resulting vectors are then integrated by another five-head self-attentive layer.Finally, output vectors from both branches are fused using an FC layer for emotion classification.The hidden and output activation functions are ReLU and softmax respectively, and a large-margin softmax loss function is used for better regularization [63].

EXPERIMENTS ON IEMOCAP
Since the test sets are slightly imbalanced between different emotion categories, both the weighted accuracy (WA) and unweighted accuracy (UA) are reported. WA corresponds to the overall accuracy while UA corresponds to the average class-wise accuracy. Models were implemented using HTK [64] in combination with PyTorch. The newbob learning rate scheduler with an initial learning rate of 5 × 10⁻⁵ was used throughout training.
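The two accuracy metrics can be computed as follows; the function name and the example labels are illustrative, not from the paper:

```python
import numpy as np

def weighted_unweighted_accuracy(y_true, y_pred):
    """WA is the overall fraction of correct predictions; UA averages the
    per-class recalls, so minority emotion classes count equally."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    wa = float(np.mean(y_true == y_pred))
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    ua = float(np.mean(per_class))
    return wa, ua

# Class 1 is rare: WA barely notices its errors, UA weights it equally.
wa, ua = weighted_unweighted_accuracy([0, 0, 0, 1], [0, 0, 1, 1])
```

For this toy example WA is 0.75 while UA is (2/3 + 1)/2 ≈ 0.833, which shows why both are reported for imbalanced test sets.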

4-way classification and cross comparisons
To compare to previously published results on IEMOCAP, the system was evaluated with all of the commonly used training and test setups on IEMOCAP: training on Sessions 1-4 and testing on Session 5; 5-fold cross validation (CV), which leaves one of the 5 sessions out of training and uses it for testing at each fold; and 10-fold CV, which leaves one of the 10 speakers out at each fold. These test setups show whether the model is able to learn reliable and speaker-independent features with the limited amount of training and test data in IEMOCAP. The results and modalities used in previous related work are summarised and compared in Table 4, which shows that our 4-way classification system achieved state-of-the-art results on IEMOCAP when evaluated with all three test settings. More detailed experiments and results can be found in [15].

5-way classification
As shown in Table 5, the classification accuracy of the 5-way system using 5-fold CV on the previous four emotions (happy, sad, neutral, angry) was 72.47% WA and 74.29% UA, a 4.65% decrease compared with the results of the 4-way system. On the other hand, since the 4-way system cannot correctly classify examples from the "others" class, the overall classification accuracy of the 4-way system drops dramatically to 57.02% WA and 62.72% UA when tested on the 5-way data.

Uncertainty estimation experiments
Four systems were tested:
• "hard": the 5-way emotion classification system described in Section 3.2;
• "soft": a soft label system trained by minimising L_kl between the soft labels and the prediction, as in Section 4.1;
• "dpn": a standard DPN system described in Section 4.2 and trained with L_dpn with ε1 = 1 × 10⁻² and ε2 = 1 × 10⁻⁸;
• "dpn-kl": a system trained on L_dpn-kl described in Section 4.3, where ε1 and ε2 are set to 0 and the weight λ scaling the L_kl term is set to 20.0.
The "hard" system was trained on 5-way classification using the training data belonging to Ω 2/3 and Ω 3/3, while the other systems were all trained using all the utterances in the training set. All systems were first evaluated using the 5-way classification accuracy on Ω 2/3 and Ω 3/3 test utterances, and then evaluated on all test utterances using the average KL divergence, entropy, AUPR (Max.P) and AUPR (Ent.). The averages of the 5-fold CV results on IEMOCAP are shown in Table 6. Compared to the "hard" system, which is trained for better emotion classification accuracy, the "soft", "dpn", and "dpn-kl" systems were all trained to better model the uncertainty among the different emotion classes. It is therefore expected that the "hard" system has the best UA and WA among all systems. However, as discussed in Section 4.4, classification accuracy is not suitable here as it cannot be applied to the Ω ≤1/3 test utterances. It is also expected that the "hard" system has the lowest entropy (sharper output distributions) as it is trained to learn 0-1 distributions. It is widely known [65], [66] that such hard-label deep classification systems are often "over-confident", meaning that the model can have poor uncertainty estimation ability, as indicated by the AUPR metrics. Similarly, it is reasonable that the "soft" system has the best KL divergence as it is trained to minimise that objective. However, the KL divergence is also not the most suitable metric here as it does not distinguish between the 'AB' and 'AAABBB' cases discussed in Section 4.2.1. Comparing "dpn" to "soft", "dpn" ranks better on AUPR (Max.P) while "soft" is better on AUPR (Ent.). "dpn-kl", however, outperforms "dpn" on all evaluation metrics and produces the highest AUPR among all systems. It also achieves a balance between the "hard" and "soft" systems, yielding higher UA and WA than "soft" and a smaller KL divergence than "hard". The standard deviations across folds are shown in Fig. 3. Although the error bars of the AUPR values overlap among the systems, the "dpn-kl" system consistently outperforms the others in all folds.
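The interpolated loss can be sketched as below. This is one plausible reading, assuming L_dpn is the negative log-likelihood of the annotators' hard labels under the utterance-specific Dirichlet (a Dirichlet-multinomial form) and L_kl is the KL divergence from the soft label to the predicted mean; the exact form of the loss in the paper, and the role of ε1 and ε2, may differ.

```python
import math

def dpn_kl_loss(alpha, counts, soft_label, lam=20.0, eps=1e-8):
    """alpha: predicted Dirichlet concentrations for one utterance;
    counts: per-class annotator label counts; soft_label: averaged
    one-hot labels. Hedged reconstruction of L_dpn + lam * L_kl."""
    a0, n = sum(alpha), sum(counts)
    # log-likelihood of the ordered label draws under Dir(alpha)
    ll = math.lgamma(a0) - math.lgamma(a0 + n)
    ll += sum(math.lgamma(a + c) - math.lgamma(a)
              for a, c in zip(alpha, counts))
    mean = [a / a0 for a in alpha]
    kl = sum(p * math.log((p + eps) / (m + eps))
             for p, m in zip(soft_label, mean) if p > 0)
    return -ll + lam * kl

# 'AB' vs 'AAABBB': the soft labels (hence L_kl) coincide, but the
# Dirichlet likelihood term distinguishes the two cases.
loss_ab = dpn_kl_loss([2.0, 2.0, 1.0], [1, 1, 0], [0.5, 0.5, 0.0], lam=0.0)
loss_long = dpn_kl_loss([2.0, 2.0, 1.0], [3, 3, 0], [0.5, 0.5, 0.0], lam=0.0)
```

This makes the motivation for the interpolation concrete: the Dirichlet term sees the number of annotations, while the KL term only sees their proportions.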

Further experiments for analysis
For the convenience of visualisation, we took one fold (trained on Sessions 1-4 and tested on Session 5) as an example and performed further analysis. The number of sentences in each data group is listed in Table 7. This section first presents the effect of replacing single categorical hard labels with emotion distributions on the uncertainty of emotion prediction for different data groups, by comparing the "hard" system to the "soft" system. Then the performance of all four systems on detecting the test utterances with high labelling uncertainty is presented.

Comparisons between "hard" and "soft" systems
The performance of the "hard" and "soft" systems on different data groups is given in Table 8 and illustrated in Fig. 4. As utterances in Ω 1/3 do not have "ground truth" hard labels, only the KL divergence and entropy are reported for that group.
As shown in Fig. 4(b), "soft" has a higher entropy than "hard" in all cases, as the soft labels retain some of the uncertainty in the annotations and the system is trained to produce flatter distributions. The uncertainty in the estimated emotion distribution increases when fewer annotators reach an agreement: the label distribution becomes flatter and the entropy of the distribution predicted by "soft" increases. Furthermore, as shown in Fig. 4(a), for Ω (5) 3/3, the group of utterances for which all three annotators agree on the same emotion label, it is easier for "hard" to learn the target 0-1 distributions and it has the smaller KL divergence. For the groups that have more uncertainty in the labels, "soft" improves the matching between the distributions and has considerably smaller KL divergence values.
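The construction of the soft labels and the entropy behaviour discussed above can be sketched as below; the class names and example annotations are illustrative.

```python
import math
from collections import Counter

def soft_label(annotations, classes):
    """Average the annotators' one-hot labels into a 'soft' distribution."""
    counts = Counter(annotations)
    n = len(annotations)
    return [counts[c] / n for c in classes]

def entropy(p, eps=1e-12):
    """Shannon entropy of a categorical distribution (natural log)."""
    return -sum(q * math.log(q + eps) for q in p if q > 0)

classes = ["hap", "sad", "neu", "ang", "oth"]
full_agree = soft_label(["hap", "hap", "hap"], classes)  # an Omega_3/3 case
partial = soft_label(["hap", "hap", "sad"], classes)     # an Omega_2/3 case
assert entropy(partial) > entropy(full_agree)
```

With fewer annotators in agreement the averaged label is flatter, so a system trained towards it is pushed to predict higher-entropy distributions, matching the trend in Fig. 4(b).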

Evaluating the uncertainty estimation in emotion prediction -Ω 1/3 detection
As discussed in Section 4.4, Ω 1/3 detection experiments were conducted to assess the models' ability to estimate uncertainty. Both Max.P and Ent. were used as thresholds for the AUPR measurement. The precision-recall (PR) curves for all four systems are shown in Fig. 5. From the graphs, using Max.P and Ent. as thresholds yields similar trends. The "dpn-kl" system consistently outperforms all the other systems on both measures of uncertainty in Ω 1/3 detection performance. The average values of Max.P and Ent. for the different data groups are reported in Table 9 and illustrated in Fig. 6. In general, as the emotion in the utterance becomes more complex (with fewer annotators agreeing on the same emotion label), the average Max.P decreases and the average Ent. increases, showing that the systems predict higher uncertainty in the emotion distribution. The prediction from "hard" has the least uncertainty as it is trained with one-hot labels. The high Max.P and low Ent. values on Ω 1/3 produced by "hard" indicate that this system can give incorrect predictions with high confidence when it encounters test utterances with complex emotions. The standard "dpn" system exhibits the most uncertainty, producing the lowest Max.P and the highest Ent., indicating that the Dirichlet prior predicted from only about 3 hard labels introduces a large amount of uncertainty into the estimation. However, such uncertainty plays a key role in the AER problem, as it is difficult and expensive to obtain many reference labels for each utterance, and a small number of labels is often insufficient to reflect the true underlying emotion distribution. The uncertainty is reduced by smoothing the Dirichlet samples with the MLE, i.e. by incorporating an additional L kl term in the loss function.
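The two uncertainty measures and the AUPR computation can be sketched as follows; this is an illustrative numpy implementation with a step-wise PR integration, not the paper's exact evaluation code.

```python
import numpy as np

def max_p(probs):
    """Confidence score: probability of the top class per utterance."""
    return probs.max(axis=1)

def ent(probs, eps=1e-12):
    """Predictive entropy per utterance as an uncertainty score."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

def aupr(scores, is_uncertain):
    """Area under the PR curve for detecting high-uncertainty (e.g.
    Omega_1/3) utterances; higher scores should flag positives."""
    order = np.argsort(-np.asarray(scores))
    y = np.asarray(is_uncertain)[order]
    tp = np.cumsum(y)
    precision = tp / np.arange(1, len(y) + 1)
    recall = tp / y.sum()
    # step-wise integration of the PR curve
    return float(np.sum(precision * np.diff(np.concatenate(([0.0], recall)))))
```

To use Max.P with this detector, the score is negated (or 1 − Max.P is used) so that higher values mean more uncertain, mirroring how Ent. behaves.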

Further analysis on emotion data uncertainty
To understand the influence of the uncertainty in labels, a modified dpn-kl system (referred to as "dpn-kl2") was trained using a modified label setting. The labels of each Ω 2/3 and Ω 3/3 utterance were replaced by the same number of copies of the corresponding majority hard label, while the hard labels of Ω 1/3 utterances were kept unchanged (referred to as vote-and-replace). An example of this modified label setting is shown in Table 10, and the results are given in Table 11.
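The vote-and-replace operation can be sketched as follows, using the Table 10 example; the function name is illustrative.

```python
from collections import Counter

def vote_and_replace(labels):
    """If a strict plurality label exists (an Omega_2/3 or Omega_3/3
    case), replace every annotation with that majority label, keeping
    the count; otherwise (Omega_1/3) leave the annotations unchanged."""
    counts = Counter(labels).most_common()
    if len(counts) == 1 or counts[0][1] > counts[1][1]:
        return [counts[0][0]] * len(labels)
    return list(labels)

assert vote_and_replace(["A", "A", "A", "B", "C"]) == ["A"] * 5  # Table 10, row 1
assert vote_and_replace(["A", "B", "C"]) == ["A", "B", "C"]      # Table 10, row 2
```

Note that the number of annotations is preserved, so "dpn-kl2" still sees five labels for the first example, unlike the single label kept by "hard".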
With vote-and-replace, the UA and WA of "dpn-kl2" increase and are closer to those of "hard". Compared to "dpn-kl", the entropy after voting decreases significantly. The entropy of "dpn-kl2" is even smaller than that of "hard", possibly because the number of majority-agreed labels remains unchanged in "dpn-kl2" while only one label is kept by "hard". Taking the first example in Table 10, the label for "dpn-kl2" is 'AAAAA' while the label for "hard" is 'A'. Using the same label multiple times to represent the level of confidence is an advantage of the DPN loss. Vote-and-replace removes the uncertainty from the Ω 2/3 and Ω 3/3 utterances. This validates our motivation that the majority voting strategy considerably changes the uncertainty properties of the resulting model and should be avoided when constructing AER systems aimed at more general settings.

TABLE 10
Example of the vote-and-replace operation on the labels used for the dpn-kl2 system. "A", "B", "C" denote different emotion categories.

Original label for dpn-kl: A A A B C | Majority: A | Modified label for dpn-kl2: A A A A A
Original label for dpn-kl: A B C | Majority: -- | Modified label for dpn-kl2: A B C

TABLE 11
Performance of the dpn-kl2 system using labels modified by the vote-and-replace operation. The tests were performed on Session 5 of IEMOCAP. "↑" denotes the higher the better, "↓" denotes the lower the better.

The distribution of emotion classes in MSP-Podcast is shown in Fig. 7. Nearly 20% of the data does not have majority-agreed labels, and the distribution is imbalanced among the emotion classes, with neutral being the largest class. These are common characteristics of datasets with natural emotions. To counteract the effect of class imbalance, the data was up-sampled by varying the overlap between the input windows. The standard splits for training, validation and testing were used in the experiments.

System
The TSB-TAB structure was modified to be suitable for MSP-Podcast. Since reference transcriptions are not provided with MSP-Podcast, automatic speech recognition (ASR) results were used instead. We used an open public wav2vec 2.0 large model 3 to generate the speech transcriptions for MSP-Podcast, which has a word error rate (WER) of 2.2% on the clean test set of LibriSpeech [67] (regular speech) and 39.0% on IEMOCAP (emotional speech) 4. GloVe embeddings were not used here since the end-to-end ASR model does not directly generate word alignments. Furthermore, BERT-derived sentence embeddings were not used for the context utterances, since each emotional utterance in MSP-Podcast is presented separately without its surrounding context. The TAB was therefore simplified to an FC layer with ReLU activation. This setup shows that the proposed method can be used when reference transcriptions and context utterances are not available.
Emotion classification results are compared in Table 12 5 for both our setups and some from the literature. In the 4-way setups, only utterances with a majority-agreed label belonging to "happy", "sad", "angry", or "neutral" were used. The emotion class "disgust" was included in the 5-way setups.
5. Note that the results are not directly comparable as different versions of the MSP-Podcast dataset were used: papers [55], [69], [70] used release 1.4, paper [68] used release 1.7, and paper [71] used release 1.9. All utterances with a majority label were used in the 9-way setup, while "other" was excluded in the 8-way setup [71].
Although our results are not directly comparable to those in the literature, as different versions of the MSP-Podcast dataset were used, our model produced competitive performance.
The degradation of classification results as the number of classes increases indicates the difficulty of fine-grained AER. The 9-way setup, which uses all of the emotion data, was used in the uncertainty estimation experiments. Label grouping was not performed here in order to retain the original labels without modification 6. Emotion was then represented by a 9-dimensional categorical distribution. The experiments in Table 6 were repeated on MSP-Podcast, and the results are shown in Table 13. AUPR was computed by detecting utterances without majority-agreed labels. Since each utterance in MSP-Podcast can be labelled by a varying number of annotators, this demonstrates the application of the AUPR metric in a more general situation. Table 13 shows a similar trend to Table 6, with the "hard" system producing the highest classification accuracy and the "soft" system giving the smallest KL divergence. The "dpn-kl" system again produces the highest AUPR among all systems, showing its superior ability in emotion uncertainty estimation. This validates the generality of the proposed methods for handling challenging realistic emotion data.

CONCLUSION
The paper proposes resolving the problem of disagreement in annotated hard labels for emotion classification from the perspective of Bayesian statistics. Instead of using majority voting to obtain a single majority hard label from which to build a classifier, the Dirichlet prior network training loss is applied to the task to better model the distribution of emotions. It preserves label uncertainty by maximising the likelihood of sampling all hard labels with inconsistent emotion classes from an utterance-specific Dirichlet distribution, which is predicted separately for each utterance with a neural network model. Given that a large proportion of emotion data (e.g. in the IEMOCAP dataset) has significant inter-annotator disagreement, the proposed Bayesian framework also allows the detection of test utterances without majority unique labels based on two uncertainty estimation metrics, which is a more general setup than simply ignoring such data as in the traditional emotion classification framework. A novel combined loss function that interpolates the DPN loss with the Kullback-Leibler loss has also been proposed, which not only has more stable training behaviour but also results in improved uncertainty estimates. The findings were further validated using a larger real-life emotion dataset. Beyond emotion recognition, label uncertainty is a common issue in many human perception and understanding tasks, since golden references are often not well defined due to the subjective evaluation of annotators. The proposed method could be applicable to other such tasks to handle the uncertainty in labels.

6. Grouping sentences does not affect the labelling much in IEMOCAP, as its "others" class is dominated by "frustration".
Wen Wu
Wen Wu received the B.E. degree from Fudan University and the MPhil degree from the University of Cambridge. She is currently a PhD student at the University of Cambridge supervised by Prof. Phil Woodland. Her research interests include audio-visual emotion recognition and Bayesian uncertainty estimation.
Dr. Chao Zhang
Chao Zhang received his BE and MSc degrees in 2009 and 2012, both from the Department of Computer Science and Technology, Tsinghua University, and his PhD degree in 2017 from the Cambridge University Engineering Department (CUED). He is currently an Assistant Professor at the Department of Electronic Engineering, Tsinghua University. Before that, he was a Senior Research Scientist at Google, a Research Associate at CUED, and an advisor and speech team co-leader at JD.com. His research interests include spoken language processing, machine learning, and cognitive neuroscience. He has published 70 peer-reviewed speech and language processing papers and received multiple paper awards. He is also a Visiting Fellow at CUED, and an Associate Member of the IEEE Speech and Language Processing Technical Committee.

Fig. 1 .
Fig. 1. Illustration of the DPN process. µ is a categorical distribution over K emotion classes, sampled from the Dirichlet prior Dir(µ|α).

Fig. 3 .
Fig. 3. Error bars showing the standard deviation across the 5 folds for the uncertainty estimation experiments on IEMOCAP. The last two figures show the AUPR value for each fold, where "Fold1" denotes the 1st fold, trained on Sessions 2-5 and tested on Session 1, etc.

Fig. 4 .
Fig. 4. Comparison of the "hard" system and the "soft" system in terms of KL divergence (a) and entropy (b) of three data groups in Session 5 of IEMOCAP.

Fig. 5 .
Fig. 5. PR curves for the four systems using (a) Max.P and (b) Ent. as the uncertainty measures.The tests were performed on Session 5 of IEMOCAP.

Fig. 6 .
Fig. 6. Comparison of average Max.P (a) and Ent. (b) of different data groups in Session 5 of IEMOCAP produced by different systems.

Fig. 7 .
Fig. 7.The distribution of emotion classes based on majority agreed labels in MSP-Podcast.
Prof. Philip C. Woodland
Philip C. Woodland is a Professor of Information Engineering in the Engineering Department, University of Cambridge, Cambridge, U.K., where he is the Head of the Machine Intelligence Laboratory and a Professorial Fellow of Peterhouse. After working at British Telecom Research Labs for three years, he returned to a Lectureship at Cambridge in 1989 and became a (Full) Professor in 2002. He has published more than 250 papers in the area of speech and language technology, with a major focus on speech recognition systems. He has received a number of best paper awards, including for work on speaker adaptation and discriminative training. He is one of the original co-authors of the HTK toolkit and has continued to play a major role in its development. He was a member of the editorial board of Computer Speech and Language (1994-2009) and is currently a member of the editorial board of Speech Communication. He was a member of the Speech Technical Committee of the IEEE Signal Processing Society from 1999 to 2003. He is a Fellow of the IEEE, the International Speech Communication Association and the Royal Academy of Engineering.

TABLE 3
Statistics of IEMOCAP. An annotator performs an "evaluation" for an utterance. An evaluation can include more than one label if the annotator is uncertain about the emotions.

TABLE 4
Summary of 4-way classification results on IEMOCAP in the literature.
"A", "T", and "V" refer to the audio, text, and video modalities respectively.

TABLE 6
The average of the 5-fold CV AER results for the uncertainty estimation experiments on IEMOCAP. The WA and UA classification accuracy results were obtained using Ω ≥2/3 data in the test sets. All other results were computed on the whole test sets. "↑" denotes the higher the better, "↓" denotes the lower the better.

TABLE 7
Number of utterances in different data groups of Session 5 of IEMOCAP.

TABLE 8
Comparison of the "hard" system and the "soft" system, tested on different data groups in Session 5 of IEMOCAP.

TABLE 9
Averaged value of Max.P and Ent. on different test data groups.The tests were performed on Session 5 of IEMOCAP.
This section presents our experiments on MSP-Podcast [10], a larger dataset with natural emotional speech data, to validate the generalisation ability of our proposed method and the reliability of our findings. MSP-Podcast contains natural English speech from podcast recordings. Release 1.8 was used in this paper, which contains 73,042 utterances from 1,285 speakers, amounting to more than 110 hours of speech. The corpus was annotated using crowd-sourcing. Each utterance was labelled by at least 5 human annotators, with an average of 6.7 annotations per utterance.

TABLE 12
AER results for classification experiments on MSP-Podcast.

TABLE 13
AER results for uncertainty estimation experiments on MSP-Podcast. "↑" denotes the higher the better, "↓" denotes the lower the better.
Xixin Wu
Xixin Wu (Member, IEEE) received his B.S. degree from Beihang University, Beijing, China, his M.S. degree from Tsinghua University, Beijing, China, and his Ph.D. degree from The Chinese University of Hong Kong, Hong Kong. He is currently a Research Assistant Professor with the Stanley Ho Big Data Decision Analytics Research Centre, The Chinese University of Hong Kong. Before this, he worked as a Research Associate with the Machine Intelligence Laboratory, Cambridge University Engineering Department. His research interests include speech synthesis and recognition, speaker verification, and neural network uncertainty.