Uncertain Facial Expression Recognition via Multi-task Assisted Correction

Deep models for facial expression recognition achieve high performance by training on large-scale labeled data. However, publicly available datasets contain uncertain facial expressions caused by ambiguous annotations or confusing emotions, which can severely degrade model robustness. Previous studies usually follow the bias elimination methods of general tasks without considering the uncertainty problem from the perspective of its different sources. In this paper, we propose a novel multi-task assisted correction method, called MTAC, for uncertain facial expression recognition. Specifically, a confidence estimation block and a weighted regularization module are applied to highlight solid samples and suppress uncertain samples in every batch. In addition, two auxiliary tasks, i.e., action unit detection and valence-arousal measurement, are introduced to learn semantic distributions from a data-driven AU graph and mitigate category imbalance based on latent dependencies between discrete and continuous emotions, respectively. Moreover, a re-labeling strategy guided by a feature-level similarity constraint further generates new labels for identified uncertain samples to promote model learning. The proposed method can be flexibly combined with existing frameworks in a fully-supervised or weakly-supervised manner. Experiments on the RAF-DB, AffectNet, and AffWild2 datasets demonstrate that MTAC obtains substantial improvements over baselines when facing synthetic and real uncertainties and outperforms the state-of-the-art methods.


I. INTRODUCTION
FACIAL expressions carry essential information for perceiving human emotions and attitudes in daily communications. Automatic facial expression recognition (FER) from visual signals of images and videos is a vital technology for realizing human-computer systems such as remote health care, virtual reality, and social robots. Due to sufficient labeled data and high-speed computation resources, deep learning models have achieved excellent performance and dominated FER research in recent years [1], [2], [3].
High-quality annotated images are essential when developing a FER method. Early facial expression datasets (e.g., CK+ [4] and Oulu-CASIA [5]) usually recruit a small number of subjects and collect their facial expressions in a lab-controlled environment. Because the number of samples and the capture conditions are limited, experts can annotate the data carefully and precisely. To meet the requirement of massive labeled samples for training a deep FER model, recently released datasets (e.g., RAF-DB [6], AffectNet [7], and AffWild2 [8]) gather images from the Internet. For these real-world datasets, annotation consistency is difficult to maintain at scale. As a result, many labels are uncertain or even incorrect, which may have two negative impacts on model learning. First, overfitting will arise due to the considerable proportion of ambiguous samples in the training set. Second, the uncertain labels will mislead the model when learning features of specific facial expressions and decrease recognition performance.
The uncertainty in the FER task can be divided into two categories according to its source. The first one is subjective annotation. Labels for existing facial expression datasets are voted on by annotators recruited on crowdsourcing platforms [6]. These annotators usually do not have the expertise and will assign different labels to the same image based on their backgrounds, especially for facial expressions in in-the-wild scenes. In Fig. 1, we show several examples from the RAF-DB and AffectNet datasets to illustrate the prevalence of uncertain samples. For samples in the left column, annotators can easily agree on a label, whereas for the right column it is evident that multiple annotators might have various perspectives on the same sample. The second source is intrinsic confusion. Existing FER methods usually predict basic emotions (e.g., anger, disgust, fear, happiness, neutral, sadness, and surprise). However, expression behaviors in daily life are spontaneous and diverse due to different induction, posture, and context [9], [10]. Many facial expressions in in-the-wild scenarios consist of compound or non-basic emotions, which are difficult to describe with discrete labels. This phenomenon becomes even worse when the dataset also suffers from class imbalance, as shown in the distribution visualization of training labels in the AffectNet dataset (see Fig. 2).
To this end, a few studies have proposed solutions to alleviate the uncertainty problem. Most of them focus on adapting methods for handling data noise from general tasks [11], [12]. Generally, a specific block for uncertainty estimation is introduced to weight or relabel every sample during model training [13]. Recently, considering characteristics of the FER task in terms of the variety of annotations and the inter-connectivity of sub-tasks or similar tasks, the idea of using the relationship among multiple labels such as action units (AUs) and valence-arousal (VA) has been explored [14], [15]. However, these methods are still plagued by uncertain samples for the following reasons: 1) different types of uncertainty are lumped together without targeted treatment; 2) although additional multi-label knowledge is applied, semantic information is considered at the label level rather than the feature level; 3) the relabeling strategy without constraints is usually coarse, which could decrease the reliability of the generated labels.
In this paper, we develop a new framework to perform uncertain FER via Multi-Task Assisted Correction, called MTAC. It consists of three parts: a target branch and two auxiliary branches. Facial features are first extracted through a backbone network for every batch of training data. In the target branch, a weighted regularization module learns a confidence for each sample and encourages the model to focus on images with reliable labels. Based on a parameter-sharing backbone, the VA estimation task is introduced to jointly supervise the feature learning while taking category imbalance into account, and the AU detection task is conducted by adding a graph convolution block and extracting the semantic representation of each sample. For those samples that are identified as extremely uncertain, we compare their semantic representation with memory templates and relabel them under the constraint of feature-level similarity.
This paper is an extended work of our preliminary study published at ICPR 2022 [16]. Compared to the previous version, we have enhanced this paper in four aspects: 1) we involve a new auxiliary task of VA estimation for collaborative model training, which can address the uncertainty caused by the biases of discrete labels in describing in-the-wild facial expressions; 2) we revise the original loss functions and design a new weighted loss for handling data imbalance, which can jointly optimize the feature extractor; 3) we construct a new memory template of weighted semantic centers and improve the relabeling strategy, which can adaptively generate pseudo labels for uncertain samples; 4) we employ additional backbones and datasets for more comprehensive experiments to evaluate the effectiveness of our method. Overall, the main contributions of this paper are summarized as follows:
• MTAC quantifies the sample confidence and suppresses the effects of uncertain discrete labels during model training.
• MTAC mitigates the category imbalance and facilitates feature learning on ambiguous facial expressions with continuous labels in the auxiliary VA estimation task.
• MTAC explores the semantic representation from the auxiliary AU detection task and conducts uncertain label correction under a feature-level constraint.
• Extensive experiments on three large-scale datasets demonstrate that MTAC can effectively address the uncertainty problem and achieves superior performance against state-of-the-art methods.
The rest of this paper is organized as follows: Section II reviews several related studies, Section III elaborates on the proposed MTAC method, Section IV reports the experimental results and discussions, and Section V concludes this work.

II. RELATED WORK
In this section, we briefly summarize the recent progress of the FER research in terms of multi-task facial expression analysis, graph-based affective representation, and deep learning with uncertainty.

A. Multi-task facial expression analysis
Automatically predicting basic emotions is the main task in traditional FER studies. Benefiting from psychological findings, advanced emotional description models, such as AUs [17] and VA [18], have been utilized to annotate a broader range of facial expressions. Therefore, recent studies have explored combining multiple tasks to obtain a generalized feature extractor of facial expressions. Chen and Joo [19] incorporated the triplet loss into the objective function to embed the dependency between AUs and expression categories. Zhang et al. [20] designed a unified adversarial learning framework to link emotion prediction and the joint distribution of dimensional labels. Similarly, Antoniadis et al. [21] captured the dependencies between categorical and dimensional emotions through a graph convolutional network (GCN). Cui et al. [14] extracted the dependency between object-level labels and property-level labels, which could be used to revise and generate labels for new datasets. Besides emotion-related tasks, other closely related facial tasks like facial landmark detection have been proven to provide additional information for facial expression analysis. Chen et al. [15] introduced landmark detection as a neighbor task and leveraged the distribution of cluster samples to handle the label inconsistency. Toisoul et al. [22] integrated facial landmarks with discrete and continuous emotions into a single network, in which features around fiducial points were used to build an attention mechanism. Unlike previous methods, we exploit AU detection and VA estimation as two auxiliary tasks to assist uncertainty correction in this paper. Each auxiliary task can be independently integrated during model training without causing extra burden in the testing stage.

B. Graph-based affective representation
Effective facial representations are vital for FER methods. Recently, graph-based methods have been proposed because they can simultaneously represent facial anatomy and semantic relationships among facial areas, which are considered crucial clues of human facial perception [23]. Liu et al. [24] designed a graph representation of facial expressions that consisted of reasonable facial landmarks and semantic connections, which modeled critical appearance and geometric facial changes. Zhao et al. [25] constructed a geometric graph description of facial components that was more robust to appearance variations like texture noise and light changes. Besides facial landmarks, many studies generate graph representations based on local facial regions. Jin et al. [26] cropped 20 local facial areas as graph nodes and linked edges according to a trainable weighted adjacency matrix to exploit intra- and inter-regional relationships. Xie et al. [27] correlated a cross-domain graph for global-local feature adaptation to learn invariant representations of facial expressions. Alternatively, graphs constructed from the perspective of AUs have also been explored. Luo et al. [28] learned a unique graph that described the relationship between each pair of AUs, including its activation status and its association with other AUs. Song et al. [29] transferred hybrid messages among AUs and inferred possible graph structures to provide complementary information for higher performance. In this work, we focus on AU graphs where the extracted semantic representation is used to constrain the re-labeling strategy. Compared to existing methods, our AU graph is built in a data-driven way, in either a fully supervised or a weakly supervised manner.

C. Deep learning with label uncertainty
Label uncertainty is a common and significant problem in FER, and plagues deep models for many general tasks [30]. Machine learning researchers usually regard uncertainty as a noisy label issue and rely on modified loss functions to penalize it. Zhong et al. [31] propagated the uncertain signals across a confidence graph based on feature similarity and temporal consistency, which was used to train a label noise cleaner. Li et al. [12] regularized the low-dimensional subspace of embedded images with a consistency loss and a prototypical loss, thereby alleviating the effect of uncertain samples through a neighboring constraint. Analogously, in the FER field, Wang et al. [32] proposed a self-cure network to learn the importance weight of each facial image and suppress uncertain samples by identifying and modifying untruthful labels. She et al. [11] exploited auxiliary multi-branch distribution learning and pairwise uncertainty estimation to solve the ambiguity in both the label space and the instance space. Zhang et al. [33] formulated a noise modeling network based on a weakly supervised strategy that learned the mapping from feature space to the residuals between clean and noisy labels. Gu et al. [13] suppressed the label and feature noise by leveraging a multivariate normal distribution and preserving the inter-class correlations. As mentioned before, the FER task suffers from various uncertainties, which are not yet fully considered by existing methods. To this end, we combine multi-task learning and distribution learning to address both subjective annotation and intrinsic confusion problems in this paper. The proposed MTAC can adaptively suppress or correct uncertain samples during model training.

III. PROPOSED METHOD
As mentioned above, the uncertainty in large-scale FER datasets comes from two aspects, i.e., subjective annotation and intrinsic confusion. To this end, we need to know which samples are uncertain, both to reduce their impact on model training and to correct them so that the existing data can be fully used. Inspired by previous work [15], [20], we introduce the idea of multi-task learning and distribution learning to address uncertain FER. This work rests on two premises: 1) features of one sample extracted on similar tasks are correlated, and 2) similar samples should have an underlying dependency in both label space and feature representation. This section presents an overview of the MTAC and then elaborates on its crucial modules.

A. Overview of MTAC
An overview of the MTAC is illustrated in Fig. 3. The MTAC contains three branches: 1) a target branch that takes facial features extracted by a pre-trained backbone network and computes the annotation confidence using a self-attention layer; these confidence weights determine the importance of each sample when calculating the classification loss; 2) an auxiliary VA estimation branch that jointly supervises the feature learning with a class-oriented loss to simultaneously deal with the uncertainty of intrinsic confusion and the category imbalance in the current batch; and 3) an auxiliary AU detection branch that constructs data-driven AU graphs, generates a memory template of semantic centers for every emotion category, and then relabels suspicious samples based on the rank regularization and the similarity preserving constraint. The VA branch supports the feature learning of the target branch by using continuous emotion labels and considering category imbalance, while the AU branch relabels extremely uncertain samples based on the semantic similarity constraint. Samples are re-annotated if they appear semantically closer to another category center than to the original one in feature space. Either auxiliary branch can be enabled or disabled, and neither participates in the testing process.
The whole MTAC is an end-to-end framework, and the two auxiliary branches can work individually or collaboratively and will not participate in the testing process.

B. Target Branch with Uncertainty Measurement
Before handling the uncertainty, we want the model to provide a confidence for each input while making the prediction. As illustrated in Fig. 4, our target branch follows a general pipeline with a feature extractor and a classifier for the FER task. For a batch of $N$ images, $F = [f_1, f_2, \ldots, f_N] \in \mathbb{R}^{D \times N}$ denotes the facial features extracted by the pre-trained backbone network, and $D$ indicates the dimension of each facial feature. To identify the ambiguous samples and measure their uncertainties, inspired by [32], [34], a self-attention block is employed that consists of a fully connected (FC) layer and the sigmoid function. Formally, the confidence score of the $i$-th sample can be calculated as:
$$\alpha_i = \sigma(W_a^{\top} f_i),$$
where $W_a$ denotes the parameters of the self-attention layer and $\sigma$ is the sigmoid function.
During model training, it is expected that samples with lower confidence should have less impact, and samples with higher confidence should receive more attention in the current batch. Therefore, we apply a weighted Cross-Entropy (CE) loss similar to [32], [11]. Specifically, the loss function for the facial expression classifier is formulated as:
$$L_{wce} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\alpha_i W_{y_i}^{\top} f_i}}{\sum_{j=1}^{C} e^{\alpha_i W_j^{\top} f_i}},$$
where $W_j$ denotes the parameters of the $j$-th classifier, $f_i$ indicates the facial feature, and $C$ and $y_i$ are the number of classes and the original discrete label, respectively. According to [35], $L_{wce}$ and $\alpha$ are positively correlated.
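As a hedged illustration of the confidence weighting described in this subsection, the sketch below assumes the logit-weighted form used by the self-cure network [32], in which every class logit of a sample is scaled by that sample's confidence before the softmax. The function names (`confidence`, `weighted_ce`) and the plain-list data layout are illustrative choices and are not taken from the original implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def confidence(feature, w_a):
    """Confidence alpha_i: one FC layer (dot product with w_a) plus a sigmoid."""
    return sigmoid(sum(w * f for w, f in zip(w_a, feature)))


def weighted_ce(features, labels, class_weights, w_a):
    """Logit-weighted cross-entropy averaged over a batch (assumed form).

    class_weights[j] plays the role of W_j; each logit W_j . f_i is scaled
    by the sample confidence alpha_i before the softmax.
    """
    total = 0.0
    for f, y in zip(features, labels):
        alpha = confidence(f, w_a)
        logits = [alpha * sum(w * x for w, x in zip(w_j, f))
                  for w_j in class_weights]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[y]
    return total / len(features)
```

With an all-zero feature every logit is zero, so the loss reduces to the logarithm of the number of classes regardless of the confidence; this is a convenient sanity check when adapting the sketch to a tensor library.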

C. Auxiliary VA Estimation Branch with Category Balancing
We exploit the VA estimation task as an auxiliary branch to mitigate the uncertainty of intrinsic confusion and complement the biases of discrete emotion labels. As shown in Fig. 5, the VA estimation branch shares the same backbone network as the target branch. However, it removes the final classifier in order to output continuous predictions of valence and arousal. Specifically, we choose the Concordance Correlation Coefficient (CCC) [36] as our metric here because it reflects both the trend and the error between the dimensional label and the regressed value. It can be computed as:
$$\rho = \frac{2\sigma_{y\hat{y}}}{\sigma_y^2 + \sigma_{\hat{y}}^2 + (\mu_y - \mu_{\hat{y}})^2},$$
where $y$ and $\hat{y}$ denote the continuous label and the prediction, respectively, $\mu$ and $\sigma^2$ indicate the corresponding mean and variance, and $\sigma_{y\hat{y}}$ is the covariance of $y$ and $\hat{y}$.
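The CCC can be sketched in a few lines following the standard definition in [36]. The function name `ccc`, the use of population (divide-by-n) statistics, and the convention of returning 0 for two identical constant sequences are assumptions made for this illustration.

```python
def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient between two equal-length sequences."""
    n = len(y_true)
    mu_t = sum(y_true) / n
    mu_p = sum(y_pred) / n
    var_t = sum((a - mu_t) ** 2 for a in y_true) / n
    var_p = sum((b - mu_p) ** 2 for b in y_pred) / n
    cov = sum((a - mu_t) * (b - mu_p) for a, b in zip(y_true, y_pred)) / n
    denom = var_t + var_p + (mu_t - mu_p) ** 2
    # Assumed convention: two identical constant sequences give 0.
    return 2.0 * cov / denom if denom > 0 else 0.0
```

Unlike the Pearson correlation, the CCC penalizes a constant offset: predictions that follow the labels perfectly but are shifted by one unit, such as `ccc([0, 1], [1, 2])`, score 1/3 rather than 1. This is why it is described as reflecting both the trend and the error.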
In addition, considering the heavy category imbalance in existing FER datasets, we introduce the class-oriented weight that is designed as: where $N_j$ is the number of images belonging to class $j$, and major classes faster than minor classes. Therefore, we propose a weighted CCC loss function for the VA estimation task as: where $\rho_v^j$ and $\rho_a^j$ denote the valence CCC and the arousal CCC of the $j$-th category, respectively. We put this category balancing on the VA branch rather than the target branch for two reasons: 1) feature learning in dimensional emotion estimation is not influenced by imbalanced discrete labels; 2) small categories have higher uncertainties of the subjective annotation and larger intra-class distances, as shown in Fig. 2. Alternatively, $\gamma$ can likewise be added to Eq. 2 to prevent uncertainty from category imbalance when the VA branch is disabled.

D. Auxiliary AU Detection Branch with Graph Reasoning
Although the uncertainty is significantly alleviated with the help of the above two branches, low-confidence samples such as those incorrectly labeled will still degrade the model performance.To this end, the AU detection task is further employed as the other auxiliary branch because the Facial Action Coding System is proven to have latent mappings with emotion categories [17], [37], [23].As illustrated in Fig. 6, we can obtain a set of AU features of each image through the backbone network, M denote the feature dimension and the AU amount, respectively.
Considering that the consistency of predefined mappings between emotion categories and AUs in large-scale datasets is difficult to guarantee [24], [38], we construct a data-driven AU graph that takes individual AU features as graph nodes and the co-occurring AU dependency as graph edges. Specifically, our AU graph is based on the conditional probability of co-occurring AU dependencies obtained from the training set, which can be calculated as:
$$P(AU_p \mid AU_q) = \frac{OCC_{p \cap q}}{OCC_q},$$
where $OCC_{p \cap q}$ denotes the number of co-occurrences of $AU_p$ and $AU_q$, and $OCC_q$ is the total number of occurrences of $AU_q$. Since the AU co-occurring relationship is practically asymmetric, $P(AU_p \mid AU_q) \neq P(AU_q \mid AU_p)$ in general.
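The conditional co-occurrence statistics can be gathered in one pass over the binary AU labels of the training set. The following is a minimal sketch; the function name `cooccurrence_matrix`, the list-of-lists layout, and the choice to return 0 for an AU that never occurs are assumptions made for illustration.

```python
def cooccurrence_matrix(au_labels):
    """Return A with A[p][q] = P(AU_p | AU_q) = OCC(p and q) / OCC(q).

    au_labels holds one binary AU vector per training image.
    """
    m = len(au_labels[0])
    occ = [0] * m
    joint = [[0] * m for _ in range(m)]
    for z in au_labels:
        active = [k for k in range(m) if z[k]]
        for q in active:
            occ[q] += 1
            for p in active:
                joint[p][q] += 1
    # An AU that never occurs gets a zero column (assumed convention).
    return [[joint[p][q] / occ[q] if occ[q] else 0.0 for q in range(m)]
            for p in range(m)]
```

For the four toy images `[[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1]]`, the second AU always appears together with the first, so `A[0][1]` is 1, whereas the first AU appears with the second in only two of its three occurrences, so `A[1][0]` is 2/3. The matrix is therefore asymmetric, and its diagonal is 1 for every AU that occurs at least once.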
Then, the AU graph is fed into a two-layer GCN to extract the semantic representation. Formally, each graph convolution layer is represented as: where $\bar{A}$ denotes the normalized $A$ whose rows each sum to one, and $W_g$ is the weight matrix to be learned in the current layer.
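One graph convolution step with a row-normalized adjacency matrix can be sketched as follows. The propagation rule used here, ReLU applied to the product of the normalized adjacency, the node features, and the weight matrix, is the common GCN form; the ReLU activation in particular is an assumption, as are the helper names `row_normalize`, `matmul`, and `gcn_layer`.

```python
def row_normalize(a):
    """Scale every row of the adjacency matrix so that it sums to one."""
    out = []
    for row in a:
        s = sum(row)
        out.append([v / s for v in row] if s else list(row))
    return out


def matmul(x, y):
    """Plain dense matrix product of two list-of-lists matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))]
            for i in range(len(x))]


def gcn_layer(a_norm, h, w_g):
    """One layer: ReLU(A_norm @ H @ W_g); the activation is assumed."""
    z = matmul(matmul(a_norm, h), w_g)
    return [[max(0.0, v) for v in row] for row in z]
```

Here `h` holds one feature row per AU node and `w_g` maps the input dimension to the layer width. A two-layer GCN would call `gcn_layer` twice with separate weight matrices.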
All the node features output by the GCN are fed into an FC layer with sigmoid functions to predict multiple AUs. Similar to $L_{wce}$, we improve the binary CE loss with the confidence score to train every AU classifier, and the total weighted group loss for the AU branch is formulated as: where $\alpha$ is the confidence weight, and $z_m$ and $p_m$ are the original/pseudo label and the prediction of the $m$-th AU, respectively. The logits $s_i \in \mathbb{R}^{1 \times M}$ before the AU classifiers are treated as the semantic representation of the sample.
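A confidence-weighted binary cross-entropy over the AU predictions might look like the sketch below. It assumes that the per-sample BCE is averaged over the AUs, multiplied by the sample confidence, and then averaged over the batch; the exact normalization in the original loss may differ, and the name `weighted_au_loss` and the clipping constant `eps` are illustrative.

```python
import math


def weighted_au_loss(probs, targets, alphas, eps=1e-7):
    """Confidence-weighted binary CE over AUs (assumed normalization).

    probs[i][m]   : predicted probability of the m-th AU for sample i
    targets[i][m] : original or pseudo AU label (0 or 1)
    alphas[i]     : confidence weight of sample i
    """
    total = 0.0
    for p_i, z_i, a_i in zip(probs, targets, alphas):
        bce = 0.0
        for p, z in zip(p_i, z_i):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            bce -= z * math.log(p) + (1 - z) * math.log(1.0 - p)
        total += a_i * bce / len(p_i)
    return total / len(probs)
```

Under these assumptions, a sample with half the confidence contributes half the AU loss, so unreliable samples pull the AU classifiers less.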

E. Relabeling with Semantic Similarity Constraint
To determine which labels need to be corrected and which new classes should be assigned, we design a semantic similarity constrained relabeling strategy (see Fig. 7). For every training batch, a center set for all emotion categories $U = [u_1, u_2, \ldots, u_C] \in \mathbb{R}^{M \times C}$ is generated based on the semantic representations and the confidence weights, which can be calculated as: where $N_j$ is the number of samples with the $j$-th label in the current batch. Then, a memory template $T \in \mathbb{R}^{M \times C}$ is initialized and dynamically updated throughout the whole training process as follows: where $h$ denotes the batch index, and $\tau$ is a control factor of the updating rate. Eventually, the memory template will gradually stabilize as the model converges [39]. After that, the cosine distance between each semantic representation $s_i$ and each semantic center $t_j$ in the memory template $T$ is computed as: Next, for every sample in the current batch, we rank all its semantic distances to the memory template $T$. Benefiting from the other two branches, the uncertain samples are expected to be suppressed and to have a large distance from their original category center. Thus, for those samples with extreme uncertainty, we relabel them following the semantic similarity constraint, which can be defined as: where $y_i$ denotes the corrected label, $org$ indicates the original discrete category, and $j \neq org$. Note that this relabeling strategy will only take effect if the semantic distance to the original category center is not the shortest in the ranking. In such cases, the template class with the highest semantic similarity will be assigned to this sample as a new label.
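The relabeling workflow of this subsection can be sketched end to end under several stated assumptions: the batch centers are taken as confidence-weighted means, the memory template is updated as an exponential moving average that keeps a fraction $\tau$ of the old template, the cosine distance is one minus the cosine similarity, and a hypothetical threshold `alpha_threshold` stands in for the criterion that selects extremely uncertain samples. None of these names or choices is confirmed by the original implementation.

```python
import math


def weighted_centers(reps, labels, alphas, num_classes):
    """Confidence-weighted mean of semantic representations per class (assumed form)."""
    dim = len(reps[0])
    centers = []
    for j in range(num_classes):
        idx = [i for i, y in enumerate(labels) if y == j]
        weight = sum(alphas[i] for i in idx)
        if weight > 0:
            centers.append([sum(alphas[i] * reps[i][d] for i in idx) / weight
                            for d in range(dim)])
        else:
            centers.append(None)  # class absent from this batch
    return centers


def update_template(template, centers, tau=0.9):
    """Assumed EMA update: keep tau of the old template, blend in new centers."""
    updated = []
    for t_j, u_j in zip(template, centers):
        if u_j is None:
            updated.append(list(t_j))  # leave absent classes unchanged
        else:
            updated.append([tau * t + (1.0 - tau) * u for t, u in zip(t_j, u_j)])
    return updated


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 1.0


def relabel(rep, original, alpha, template, alpha_threshold=0.5):
    """Return the (possibly corrected) label for one sample.

    Only samples whose confidence falls below the hypothetical threshold
    are candidates; the label changes only if another class center is
    strictly closer than any earlier one in cosine distance.
    """
    if alpha >= alpha_threshold:
        return original
    dists = [cosine_distance(rep, t_j) for t_j in template]
    return min(range(len(template)), key=lambda j: dists[j])
```

Because the cosine distance ignores vector length, the relabeling decision for a single batch does not depend on whether a center is normalized by the sample count or by the summed confidences; that normalization only affects how batches are blended in the template update.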

F. Model Training
Finally, the total loss function of the whole framework can be formulated as: where $\lambda_1$ and $\lambda_2$ are the weighted ramp functions that change with the epoch [40], which can be computed as follows: where $\beta$ denotes the current epoch index, and $H$ is a constant that controls the participation of different branches. The weighted ramp functions allow MTAC to pay more attention to the AU branch in the initial training stage. Since the number of samples accumulated at the beginning is insufficient, it cannot generate effective semantic representations and solid memory templates. After a certain number of training rounds, the model will focus more on the target and VA branches to extract distinguishing features for final predictions. Moreover, our MTAC can work independently with the target branch, while the two auxiliary branches can be flexibly combined into the framework without additional inference burden.

IV. EXPERIMENTS
In this section, we conduct extensive experiments to demonstrate the performance of MTAC in terms of ablation study, tackling synthetic and real uncertainty, and multi-task comparison with the state-of-the-art.

A. Datasets
RAF-DB is a large-scale dataset with 15,339 in-the-wild images annotated with six basic emotions and neutral. In our experiments, 12,271 and 3,068 samples are used for training and testing, respectively. Since no continuous emotion labels are provided, the VA branch is disabled in the experiments on RAF-DB.
AffectNet contains close to one million expression images. To ensure a fair comparison, we select samples manually labeled as six basic emotions and neutral for evaluation. The numbers of images in the training and test sets are 283,901 and 3,500, respectively. In addition, automatically labeled samples in AffectNet are used as a set of real noisy data, denoted as AffectNet Auto, to verify the ability of MTAC to handle uncertain expressions.
AffWild2 is the first audiovisual dataset with annotations for all main behavior tasks, including FER, VA estimation, and AU detection. It contains 558 videos with around 2.8 million images of facial expressions. In this work, we use the 'MTL Challenge' subset, which provides seven discrete labels, VA labels, and AU labels simultaneously and consists of 39,614 training and 10,839 testing samples. The CCC score is used as the metric for the evaluation of the regression task.
Since AU annotation requires specially trained experts and is time-consuming, no AU labels are provided in RAF-DB and AffectNet. To account for this issue, we apply OpenFace 2.0 [41] to automatically generate pseudo AU labels, similar to [15], [19]. For AffWild2, the original AU labels are used to generate the AU graph. In other words, the AU branch and the relabeling can work in either a fully or a weakly supervised manner and are compatible with various datasets. Moreover, the AU branch does not participate in the parameter update of the backbone network. Our MTAC utilizes a feature-level semantic similarity constraint to correct the extremely uncertain sample adaptively instead of directly replacing its label with the prediction, which can reduce the negative impact of incorrect pseudo AU labels.

B. Implementation Details
The MTAC is implemented with the PyTorch platform and trained using two Nvidia Volta V100 GPUs. The cropped facial regions are resized to 224 × 224 pixels and used as inputs. For the whole framework, we choose ResNet-18 [42] and DenseNet [43] as two backbone networks, which are pre-trained on the MS-Celeb-1M [44] dataset as in previous methods [11], [32], [45]. For the target branch and the auxiliary VA branch, the initial learning rate of the Adam optimizer is 0.01, which is then updated to $10^{-3}$ and $10^{-4}$ at the 10-th and 20-th epoch, respectively. For the auxiliary AU branch, each GCN layer has 64 channels, and the control factor $\tau$ and the decayed learning rate are set as 0.9 and 0.005, respectively. We choose a batch size of 512 to ensure that every template can be effectively updated during the whole training process, while $H$ defaults to 5. The relabeling starts after 10 epochs to ensure a stable memory template of the semantic representation.

C. Ablation Study
A few ablation studies are performed to verify the contribution of every branch in MTAC and the key hyper-parameter proposed in this paper.
1) Components evaluation: MTAC deals with the effects of uncertain samples based on three branches, i.e., target branch, auxiliary VA branch, and auxiliary AU branch.The target branch suppresses suspicious samples and highlights valid inputs through confidence measurement and weighted loss function.The VA branch optimizes the parameter-sharing network with continuous annotations and considers the category imbalance.The AU branch corrects extremely uncertain labels with the data-driven AU graph and semantic memory templates.All three branches can be flexibly combined with various network architectures.In this experiment, we design five different settings for effectiveness verification.Note that the class-oriented weights are assigned in the target branch on RAF-DB due to the lack of continuous labels.When no branch is active, it is equivalent to a standard ResNet-18.
As shown in Tab. I, the independent target branch significantly enhances the FER performance on three datasets, and a greater improvement can be achieved by further using either of the two auxiliary branches. In particular, the VA branch performs slightly better than the AU branch on AffectNet and AffWild2 because of the additional knowledge from the continuous label space and the handling of the huge category imbalance. The best performance comes from the complete MTAC framework with all three branches, which considers the uncertainty from both subjective annotation and intrinsic confusion.
2) Evaluation of the class-oriented weight: Most large-scale facial expression datasets have severe category imbalances. In MTAC, the proposed class-oriented weight $\gamma$ is compatible with various loss functions. In this experiment, we design three different settings, i.e., MTAC without $\gamma$, $\gamma$ in the target branch (as in our preliminary work at ICPR 2022 [16]), and $\gamma$ in the auxiliary VA branch. As presented in Tab. II, the category balancing significantly contributes to the model training. It shows a better performance in the VA branch, which supports our statement in Sec. III-C.
3) Evaluation of the data-driven graph: Relabeling under semantic similarity constraints is another essential module of MTAC for uncertainty mitigation. Its semantic information of AU co-occurring dependencies is encoded with a data-driven AU graph. To study the established edges, we randomly initialize $A$ with element values from 0 to 1 to shield edge attributes in this experiment. We also design a fully-connected $A$ in which every element is fixed as 1. As shown in Tab. III, the random edges introduce additional uncertainty and lead to performance decreases, while the fixed edges cannot reflect the AU co-occurrence or approximate the actual distribution. On the contrary, our data-driven AU graph helps the GCN to generate better semantic representations and further boosts the memory templates and the relabeling.

D. Evaluation of Handling Uncertainty
To test our MTAC in handling uncertain samples, we set up comparative experiments under synthetic uncertainty and real uncertainty, respectively.Two baseline methods, i.e., ResNet-18 and DenseNet, and two state-of-the-art methods with uncertainty consideration, i.e., SCN [32], and DMUE [11], are selected for comparison.
1) Synthetic uncertain samples: We randomized 10%, 20%, and 30% of the original labels of the training sets of RAF-DB and AffectNet, respectively. As shown in Tab. IV, the proposed MTAC outperforms baselines on two datasets, illustrating that uncertain samples hamper network training. Moreover, as the proportion of uncertainty increases, the performance degradation of MTAC compared to the corresponding baselines is also smaller, which further proves the effectiveness of our feature-level semantic similarity constraint. The DMUE achieves the best results on AffectNet by multi-branch distribution learning when facing 10% and 20% uncertainty. Benefiting from multi-task correction, our MTAC obtains competitive performance in the above two settings and performs the best in the experiment with 30% uncertainty.
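Synthetic label noise of this kind can be injected with a few lines of code. The sketch below assumes that each selected sample is reassigned uniformly to one of the other classes and that the selection is seeded for reproducibility; the text above only states that a given fraction of labels is randomized, so both choices, like the name `inject_label_noise`, are assumptions.

```python
import random


def inject_label_noise(labels, ratio, num_classes, seed=0):
    """Return a copy of labels with a `ratio` fraction flipped to another class."""
    rng = random.Random(seed)
    noisy = list(labels)
    k = round(ratio * len(labels))
    for i in rng.sample(range(len(labels)), k):
        # Assumed protocol: pick uniformly among the classes other than the original.
        choices = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(choices)
    return noisy
```

Calling `inject_label_noise(train_labels, 0.3, 7)` would produce the 30% setting for a seven-class dataset, where `train_labels` is a hypothetical list of integer class indices.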
2) Real uncertain samples: Apart from synthetic uncertainty, we also select AffectNet Auto as a training set with naturally uncertain samples of wrong annotations and confusing emotions for cross-dataset validation, which is rarely considered by previous studies. The automatic labeling algorithm published in the official document has an accuracy of 65% [7]. As shown in Tab. V, MTAC achieves the best results when encountering real uncertain samples, and the performance gain exceeds that in the synthetic uncertainty experiment. One possible explanation is that we additionally account for intrinsic confusion, which is a more general form of uncertainty in real data. In the proposed MTAC, the class-oriented weight in the auxiliary VA branch mitigates ambiguities from imbalanced categories, and the semantic memory template with updated category centers in the auxiliary AU branch conducts effective label correction.

E. Visualization
To present the specific manipulation effect of MTAC on uncertain samples, we visualize the intermediate results in terms of passive uncertainty suppression with uncertainty measurement and active uncertainty correction with relabeling.
1) Uncertainty measurement: Fig. 8 depicts the visualization of the uncertainty measurement in the target branch on examples in RAF-DB, AffectNet, and AffWild2. Generally, the proposed MTAC successfully identifies the uncertain samples based on the confidence score and adaptively updates the value after the relabeling execution. In particular, in the second case of AffectNet, our MTAC accurately identifies the synthetic uncertainty and corrects the original annotation.
2) Relabeling: To exhibit the semantic similarity constrained relabeling workflow, we illustrate the prediction distribution in the target branch and the semantic distance in the auxiliary AU branch on examples in RAF-DB, AffectNet, and AffWild2. In addition, subjective annotations from twelve volunteers are counted for comparison with our relabeling strategy. As shown in Fig. 9, the generated memory template of semantic representation centers can increase the inter-class distance. The predicted emotion categories are similar in distribution to the manual annotations. This reveals that our MTAC can effectively handle uncertain samples to facilitate model training and improve the final FER performance.
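The relabeling idea discussed above (a sample is re-annotated when it lies semantically closer to another category center than to its original one) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the category centers are plain per-class feature means, the distance is Euclidean, and only samples whose confidence score falls below a hypothetical `threshold` are considered for relabeling. The names `class_centers` and `relabel` are introduced here for illustration only.

```python
import math


def class_centers(features, labels, num_classes):
    """Per-class mean feature vector (assumed stand-in for the memory template)."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for f, y in zip(features, labels):
        counts[y] += 1
        for d in range(dim):
            sums[y][d] += f[d]
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(num_classes)
    ]


def relabel(features, labels, confidences, centers, threshold=0.5):
    """Relabel low-confidence samples that sit closer to another class center.

    `threshold` is a hypothetical cut-off on the confidence score.
    """
    new_labels = list(labels)
    for i, (f, y, conf) in enumerate(zip(features, labels, confidences)):
        if conf >= threshold:
            continue  # confident samples keep their original label
        dists = {
            c: math.dist(f, ctr)
            for c, ctr in enumerate(centers)
            if ctr is not None
        }
        nearest = min(dists, key=dists.get)
        if dists[nearest] < dists[y]:
            new_labels[i] = nearest
    return new_labels


# Toy 2-D features: the fourth sample carries label 0 but lies in cluster 1.
features = [(0.0, 0.0), (0.2, -0.1), (-0.1, 0.1), (4.8, 5.1),
            (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = [0, 0, 0, 0, 1, 1, 1]
confidences = [0.9, 0.9, 0.8, 0.2, 0.9, 0.9, 0.8]

centers = class_centers(features, labels, num_classes=2)
new_labels = relabel(features, labels, confidences, centers)
print(new_labels)
```

In this toy case only the low-confidence fourth sample changes its label; the centers could then be recomputed from the new labels, mirroring the updated memory template described in the text.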

F. Comparison with the State-of-the-art
Since the proposed MTAC is designed for FER on large-scale datasets and utilizes multiple labels, we compare it with the state-of-the-art methods for single-task and multi-task performance evaluation.
1) Evaluation of single FER task: Tab. VI shows the performance comparison and Fig. 10 presents the confusion matrices of MTAC in the single FER task. To summarize, our method achieves the best result on RAF-DB and a top-2 result on AffectNet. Although LDL-ALSG [46], SEIIL [47], and Face2Exp [48] introduce extra knowledge to support the network training, they only consider the label-level distribution and cannot repair the uncertain samples, leading to performance degradation. In addition, IPA2LT [46], SCN [32], WSND [33], and FENN explicitly deal with uncertain labels and thus achieve good results. However, intrinsic uncertainty can still limit their feature learning without information in the side space. Benefiting from the uncertainty measurement, the data-driven AU graph, and the feature-level constrained relabeling, our MTAC outperforms NMA [20] and achieves competitive results against DMUE [11], which apply uncertainty mitigation and auxiliary tasks simultaneously; this reveals the effectiveness of the proposed modules in this work.
2) Multi-task evaluation: To demonstrate the capability of MTAC in executing multi-task prediction, we introduce two advanced approaches re-implemented by ourselves for further evaluation, i.e., Emotion-GCN [21] and EmoFAN [22]. As shown in Tab. VII, our MTAC performs the best in discrete emotion classification and obtains competitive CCC scores in continuous emotion regression on both benchmarks, making it more robust than the other two multi-task methods. One possible reason is that the uncertainty correction of discrete labels optimizes model parameter updates for more discriminative facial features and finally improves generalization performance on the VA estimation task.
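For reference, the CCC (concordance correlation coefficient) used to score the continuous valence-arousal predictions can be computed as in the sketch below. It uses the standard definition with population (biased) variance and covariance; whether the evaluation in this work uses exactly this variant is an assumption, and the function name `ccc` is introduced here for illustration.

```python
def ccc(pred, target):
    """Concordance correlation coefficient between two equal-length sequences.

    CCC = 2 * cov(pred, target) / (var(pred) + var(target) + (mean_p - mean_t)^2),
    using population statistics (division by n).
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    return 2 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)


# Toy valence values in [-1, 1].
target = [0.1, 0.4, -0.3, 0.8]
perfect = ccc(target, target)
shifted = ccc([t + 0.5 for t in target], target)
print(perfect, shifted)
```

Unlike the Pearson correlation, CCC penalizes a constant offset between predictions and labels, which is why the shifted predictions score below 1 even though they are perfectly correlated with the targets.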

Fig. 1. Examples of valid and uncertain images in RAF-DB, AffectNet, and AffWild2 datasets. Top texts indicate their original labels. Uncertainty commonly exists in different facial expression datasets.

Fig. 2. Distribution of discrete labels and continuous labels in AffectNet's training set. Categories with few samples are more likely to be confused with other classes. Many samples are far away from the center of the original category but close to other clusters, indicating a great deal of uncertainty.

Fig. 3. The framework of MTAC. It consists of a target branch for FER, one auxiliary branch for VA estimation, and one auxiliary branch for AU detection. The VA branch supports the feature learning of the target branch by using continuous emotion labels and considering category imbalance, while the AU branch relabels extremely uncertain samples based on the semantic similarity constraint. Samples are re-annotated if they appear semantically closer to another category center than to the original one in feature space. Both auxiliary branches can be freely enabled or disabled and do not participate in the testing process.

Fig. 4. The pipeline of the target branch. Given a training batch, the confidence score of every sample is calculated by applying a self-attention block and is then used to suppress the uncertainty in the loss function. These confidence scores are further passed to the auxiliary AU branch for semantic representation learning and memory template establishment.

Fig. 5. The pipeline of the auxiliary VA branch. Continuous emotion labels are utilized to jointly train the parameter-sharing backbone network for better facial feature learning. In addition, the class-oriented weight is computed to mitigate the category imbalance problem of discrete labels.

Fig. 6. The pipeline of the auxiliary AU branch. The underlying relationship among AUs is encoded by a data-driven graph from datasets and exploited to generate the semantic representation of each sample.

Fig. 7. The pipeline of the relabeling strategy. A memory template is built and updated based on the average category centers and then constrains the relabeling strategy with the similarity distance. The new label then participates in network optimization in the target branch.

Fig. 8. Visualization of joint feature learning. Two examples of each dataset in RAF-DB, AffectNet, and AffWild2 are shown. The top left block denotes the confidence score α, and the bottom block presents the label of the current sample. From left to right of every three columns are the original sample, synthetic sample, and corrected sample, respectively.

Fig. 9. Visualization of relabeling. Two examples of each dataset in RAF-DB, AffectNet, and AffWild2 are shown. The light color block at the top denotes the synthetic uncertain label, the dark color block at the bottom left indicates the new label after relabeling, and the dark block at the bottom right presents the confidence score α after correction. From left to right of every four columns are the original sample, prediction distribution, semantic similarity distance, and voluntary annotation statistic, respectively. DI, HA, SA, and SU are disgust, happiness, sadness, and surprise, respectively.
In this paper, we proposed the MTAC framework to alleviate the uncertainty in facial expression images. The target FER branch measured uncertainty to calculate the confidence score and strengthen valid samples during model training. The auxiliary VA branch executed category balancing and joint feature learning with the support of continuous emotion labels. The auxiliary AU branch constructed the data-driven AU graph to generate semantic representations. The relabeling strategy corrected extremely uncertain samples under the feature-level similarity constraint based on the updated memory templates. Our MTAC has a modular design that allows adding and removing branches as needed during training and inference. Extensive experiments on three large-scale datasets showed that MTAC is robust to uncertain samples and achieves superior results in the FER task. In the future, other auxiliary tasks such as landmark detection and face recognition can be considered, and MTAC can be extended to generate annotations for unlabeled data, pre-train universal encoders of facial expressions, and address the uncertainty problem in other data modalities.

TABLE I. EVALUATION OF DIFFERENT BRANCHES. 'Target B.(ranch)' APPLIES THE UNCERTAINTY MEASUREMENT, 'Auxiliary VA B.(ranch)' EXECUTES THE JOINT FEATURE LEARNING AND THE CATEGORY BALANCING, AND 'Auxiliary AU B.(ranch)' EXPLOITS THE DATA-DRIVEN AU GRAPH AND THE SEMANTIC SIMILARITY CONSTRAINED RELABELING. BOLD DENOTES THE BEST RESULT, AND ITALICS INDICATES THE SECOND BEST RESULT.

TABLE II. EVALUATION OF THE CLASS-ORIENTED WEIGHT. BOLD DENOTES THE BEST RESULT, AND ITALICS INDICATES THE SECOND BEST RESULT.

TABLE III. EVALUATION OF THE DATA-DRIVEN GRAPH. BOLD DENOTES THE BEST RESULT, AND ITALICS INDICATES THE SECOND BEST RESULT.

TABLE IV. EVALUATION OF ENCOUNTERING SYNTHETIC UNCERTAIN SAMPLES. BOLD DENOTES THE BEST RESULT, AND ITALICS INDICATES THE SECOND BEST RESULT. * MEANS PERFORMING 8-CATEGORY CLASSIFICATION.

TABLE V. PERFORMANCE OF MTAC ON DATASETS WITH REAL UNCERTAIN SAMPLES. BOLD DENOTES THE BEST RESULT, AND ITALICS INDICATES THE SECOND BEST RESULT. † MEANS RE-IMPLEMENTING RESULTS.

TABLE VI. COMPARISONS WITH THE STATE-OF-THE-ART METHODS. * MEANS PERFORMING 8-CATEGORY CLASSIFICATION. † INDICATES THE UNCERTAINTY HANDLING IS INTRODUCED. ‡ DENOTES EXTRA KNOWLEDGE OF AUXILIARY TASKS IS CONSIDERED. BOLD DENOTES THE BEST RESULT, AND ITALICS INDICATES THE SECOND BEST RESULT.

TABLE VII. MULTI-TASK PERFORMANCE COMPARISON. * DENOTES USING CCC METRIC. † MEANS RE-IMPLEMENTING RESULTS. BOLD DENOTES THE BEST RESULT, AND ITALICS INDICATES THE SECOND BEST RESULT.