Toward Unbiased Facial Expression Recognition in the Wild via Cross-Dataset Adaptation

Despite various successes in computer vision with facial images (e.g., face detection, recognition, and generation), facial expression recognition remains a challenging, unsolved problem. This is because of simple but fundamental bottlenecks: 1) no global agreement on different facial expressions, 2) significant dataset biases that prevent cross-dataset analysis for a large-scale study, and 3) high class imbalance in in-the-wild datasets, which causes machine learning algorithms to predict expressions inconsistently. To tackle these issues, we propose a novel Deep Learning approach based on an adaptive cross-dataset scheme. We combine multiple in-the-wild datasets to secure sufficient training samples while minimizing dataset bias using the idea of gradient reversal to retain generality. In addition, we introduce a flexible objective function that can control for skewed label distributions in the dataset. Incorporating these ideas, with ResNet as the backbone, we carried out extensive experiments on three independent in-the-wild facial expression datasets, which first confirmed the bias between datasets and then yielded improved facial expression recognition performance using the multi-site dataset.


I. INTRODUCTION
The face is a primary means of conveying information not only among humans but also between humans and machines. In this regard, analyses of facial images have been adopted for fundamental research in various areas such as neuroscience [1], psychology [2], and human-computer interaction [3]. Computer vision methods are indisputably the driving force of such research; there is a rich history of work in vision on face detection [4], [5], face recognition [6], [7], 3D face construction [8], [9], and facial image generation [10], [11], as well as substantial extensions in security such as personal identification [12] and face spoofing and anti-spoofing [13], [14]. Recent works with Deep Learning (DL) demonstrate remarkable advances in these applications by combining flexible neural networks that can learn a generalized model with large-scale training samples that recently became available, e.g., [15], [16]. The prediction performance of these algorithms reaches human-level precision [16], and these methods are deployed on many commercial devices with cameras.
The associate editor coordinating the review of this manuscript and approving it for publication was Paolo Remagnino.
The visual information from a face is mainly delivered by facial expressions. However, despite a rich body of successful work with facial images in machine learning and computer vision, facial expression recognition remains a challenging task that is yet to be solved. First of all, we face a lack of data with reliable annotations to train a data-driven machine learning model. This problem stems from two issues: 1) personal facial expressions are driven by subjective emotions to which it is difficult to assign objective labels [17], and 2) there exists substantial ambiguity between emotion classes [18]. That is, expressions are highly diverse because of personal and cultural differences, which leads to large intra-class variation and small inter-class variation. For example, the same emotional expression shown in an image can be categorized differently in different datasets, and vice versa for different emotions (see Fig. 1). Also, mixtures of emotions, which often arise in real facial expressions, make the expression recognition problem even more challenging. These issues clearly distinguish facial expression datasets from recent popular datasets such as ImageNet [19] and MS-Celeb-1M [20], whose annotation and identity information are reliable.
Notably, there have been several attempts to tackle the facial expression recognition problem in a data-driven way. However, individual datasets acquired at different sites often include substantial dataset bias caused by local data collection protocols and the aforementioned diversity in expressions. These biases include selection bias introduced at the image search stage (through a limited set of keywords) or at the participant recruiting stage, capturing bias caused by the data collection environment, and annotation bias introduced at the image labeling stage [21]. In the end, regardless of the classification algorithm, these (potentially large) biases make it difficult for algorithms to agree on the same expression coming from different images. Datasets collected in the past two decades are mostly in-the-lab datasets, where participants either acted out specific expressions or were induced to make spontaneous expressions, on psychological grounds, in restricted lab environments, which eventually leads to exaggerated expressions [22]. Some recent datasets broke away from the restrictions of the lab and collected a large number of online facial expression images based on a set of keywords. These crawled images were then annotated to constitute in-the-wild datasets [22]-[24].
One approach to reducing annotation bias is to have a diverse group of annotators soft-label the data with probabilities [25]; however, this requires substantial resources and effort. The simplest way would be to perform cross-dataset generalization, i.e., to combine data from multiple sites to secure a sufficient number of samples [26], [27]. This is conceptually easy, but reducing annotation bias across datasets remains a challenging task. There have been several attempts [28]-[31], but these works often lack an explanation of dataset invariance, suffer from complex architectures with label combination, overfit to a single dataset, and cannot control for selection and annotation biases. Moreover, even with enough data, many facial expression datasets have a class imbalance problem that affects downstream classification, as is common in vision datasets [19]. Class imbalance occurs more frequently in in-the-wild datasets since the frequency of each expression depends on the emotion type. For example, facial expressions are typically categorized into 6 basic emotions (i.e., Happiness, Surprise, Disgust, Sadness, Fear, and Anger) plus Neutral, and some of these emotions are expressed more often than others. This natural phenomenon skews the distribution of data labels in the wild; ∼70% of class labels in many in-the-wild datasets belong to either Happiness or Neutral.
In this regard, we tackle the issues addressed above by proposing an adaptive cross-dataset framework that lets us combine multiple datasets to constitute a larger-scale multi-site dataset with minimal dataset bias. The framework is a variant of domain adaptation methods; unlike other attempts that reduce differences between source and target domains [32], our key idea is instead to minimize bias between datasets to boost facial expression recognition performance with a global multi-site dataset. We deal with two of the biases introduced above: 1) to reduce selection bias, we train our model such that it minimizes differences in the distributions of image features extracted from different datasets, and 2) to reduce annotation bias, we add a label extractor to our pipeline that generates pseudo emotion labels, which are used as image feature/label pairs to train our model such that it does not distinguish different datasets. Finally, in order to tackle the class imbalance issue that typically arises in many in-the-wild datasets, we introduce an adaptive cross entropy as an objective function that assigns different weights to the class-wise cross entropy according to a statistical property of each class and the training precision during optimization.
VOLUME 8, 2020
Our work in this article makes the following contributions: 1) we successfully combine multiple in-the-wild facial expression datasets to provide a sufficient number of samples for training a machine learning (ML) model by minimizing biases between datasets, 2) we introduce an adaptive cross entropy scheme to work around the skewed class distribution of facial expressions in the dataset, and 3) we present extensive empirical results that identify dataset bias and validate the performance of our framework on facial expression recognition using the aggregated data from multiple in-the-wild datasets. The results show that our framework is able to accurately classify various facial expressions and performs better than other state-of-the-art baseline methods.

II. PROPOSED METHOD
We tackle two major issues in facial expression recognition with multi-site datasets of in-the-wild images: 1) dataset bias, and 2) class imbalance. We propose a cross-dataset adaptation (CDA) scheme that addresses selection and annotation biases from different datasets using a separate feature extractor and pseudo-label extractor, and that adaptively controls the classification error from the cross entropy to mitigate the class imbalance problem. The details are given below.

A. CROSS-DATASET ADAPTATION
Combining multiple datasets from multiple sites is a common approach to securing a sufficient number of meaningful samples. However, the bias introduced by different datasets (i.e., cross-dataset bias) must be minimized to improve the generalization capability of an ML algorithm trained on the multi-site dataset. For facial expression datasets, one cause of such bias may be region- or culture-specific annotation, which results in inconsistent labels between datasets. This means that simply gathering facial images and their labels and training on them can harm generalization, especially when the datasets have very different characteristics or include conflicting annotations of the same expression.
In this scenario, our idea for controlling the cross-dataset bias is to construct features that are robust to the dataset type and to utilize a pseudo label for emotion that is trained in reverse on the dataset label. We design two components: 1) a feature extractor, which learns image features using a Residual Network (ResNet) and reduces specificity between different datasets while increasing the discriminative ability for emotion classes, and 2) an emotion label extractor using a Convolutional Neural Network (CNN), which is used to reduce annotation inconsistency between datasets. We utilize a dataset classifier with a Gradient Reversal Layer (GRL) [32], which can be viewed as a control tower that minimizes biases between datasets. The GRL trains both the feature extractor and the label extractor in reverse so that they do not distinguish different datasets, while the emotion classifier is trained to accurately predict emotions based on the trained features. The overall architecture of our proposed method with CDA is shown in Figure 2. There are multiple emotion classifiers (three in our experiments), i.e., $e = (e_1, e_2, e_3)$, each assigned to an individual dataset within the full multi-site dataset. The details of our approach are described below.
FIGURE 2. The overall architecture of our framework. Features derived from facial images $x$ and altered labels (pseudo labels $g(y)$) from ground truths $y$ are input to separate emotion classifiers $e_1$, $e_2$, $e_3$ and a dataset classifier $d$ to minimize emotion classification error and dataset bias based on the proposed Adaptive Cross Entropy (ACE) loss.

Given input images $x$ and their labels $y$, we constitute a feature extractor $f(x; \theta_f)$ with parameters $\theta_f$ and a label extractor $g(y; \theta_g)$ with parameters $\theta_g$. Then, a CDA component is defined as $\mathrm{CDA}(x, y; \theta_f, \theta_g, \theta_e, \theta_d)$, which combines an emotion classifier $e(f(x); \theta_e)$ and a dataset classifier $d(f(x), g(y); \theta_d)$, where $\theta_e$ and $\theta_d$ are the trainable parameters of the emotion classifier and the dataset classifier, respectively. The loss function $L_e$ for the emotion classifier is defined as

$$L_e(x, y; \theta_f, \theta_e, \theta_g) = L\big(e(f(x; \theta_f); \theta_e),\, g(y; \theta_g)\big) + L_2(\theta_g), \tag{1}$$

where $L$ is the error between the output of the emotion classifier $e(f(x))$ and the pseudo label from the label extractor $g(y)$, and $L_2$ is an $\ell_2$-regularizer on the weights $\theta_g$ of the label extractor that keeps the pseudo label stable. The loss function $L_d$ for the dataset classifier is formulated with the dataset label $z$ and parameters $\theta_f$, $\theta_g$, $\theta_d$ as

$$L_d(x, y, z; \theta_f, \theta_g, \theta_d) = L\big(d(f(x; \theta_f), g(y; \theta_g); \theta_d),\, z\big), \tag{2}$$

which takes $(f(x), g(y))$ jointly to train the dataset classifier $d$. Combining the two losses $L_e$ and $L_d$, our multi-task loss $E$ for CDA becomes

$$E(x, y, z; \theta_f, \theta_g, \theta_e, \theta_d) = L_e(x, y; \theta_f, \theta_e, \theta_g) + L_d(x, y, z; \theta_f, \theta_g, \theta_d), \tag{3}$$

and (3) is further reformulated with the GRL at the backpropagation stage into (4), where $\lambda_1$ and $\lambda_2$ are user parameters that balance the effects of the dataset classifier on the feature extractor and the label extractor, respectively. Minimizing (4) jointly trains a DL model such that it correctly classifies the facial expressions in the images while learning features from images and pseudo labels for which intra-dataset variation is maximized. Training is performed via backpropagation with the partial derivatives in (5)-(7), where $\mu$ is the learning rate of the overall DL structure. Notice that $\theta_f$ is jointly trained by both $L_e$ and $L_d$ to satisfy the conditions of both the emotion classifier and the dataset classifier. $\theta_g$ is trained by $L_2$ and $L_d$; specifically, the update with respect to $L_d$ is achieved by implementing the GRL. The gradients of $L_d$ are applied to $\theta_f$ and $\theta_g$ in the opposite direction, so that they degrade the precision of the dataset classifier and lead the model to treat all datasets as a single dataset.
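The reversed-gradient updates described above can be illustrated with a minimal scalar sketch. It assumes plain gradient descent, treats every parameter as a single number, and takes the partial derivatives as given inputs; the names `Params`, `Grads`, `grl_backward`, and `cda_step` are hypothetical and are not taken from the paper. It also assumes that the emotion classifier is updated by $L_e$ only and the dataset classifier by $L_d$ only, which the text implies but does not state.

```python
from dataclasses import dataclass


@dataclass
class Params:
    theta_f: float  # feature extractor
    theta_g: float  # label extractor
    theta_e: float  # emotion classifier
    theta_d: float  # dataset classifier


@dataclass
class Grads:
    dLe_df: float  # dL_e / d theta_f
    dLe_de: float  # dL_e / d theta_e
    dL2_dg: float  # dL_2 / d theta_g
    dLd_df: float  # dL_d / d theta_f
    dLd_dg: float  # dL_d / d theta_g
    dLd_dd: float  # dL_d / d theta_d


def grl_backward(grad: float, lam: float) -> float:
    """Backward pass of a gradient reversal layer.

    The forward pass is the identity; on the way back the gradient is
    multiplied by -lam, so the upstream module is pushed to *increase*
    the dataset-classification loss.
    """
    return -lam * grad


def cda_step(p: Params, g: Grads, mu: float, lam1: float, lam2: float) -> Params:
    """One gradient-descent step under the assumptions stated in the lead-in."""
    return Params(
        # theta_f: descends L_e, ascends L_d (reversed, scaled by lam1)
        theta_f=p.theta_f - mu * (g.dLe_df + grl_backward(g.dLd_df, lam1)),
        # theta_g: descends L_2, ascends L_d (reversed, scaled by lam2)
        theta_g=p.theta_g - mu * (g.dL2_dg + grl_backward(g.dLd_dg, lam2)),
        # theta_e: assumed to descend L_e only
        theta_e=p.theta_e - mu * g.dLe_de,
        # theta_d: assumed to descend L_d only, i.e. it still tries to tell datasets apart
        theta_d=p.theta_d - mu * g.dLd_dd,
    )
```

With `lam1 = lam2 = 0` the reversed terms vanish, so the extractors simply ignore the dataset classifier. In an automatic-differentiation framework the same effect is usually obtained by inserting a layer whose backward pass applies `grl_backward`.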

B. ADAPTIVE CROSS ENTROPY
Collecting a sufficient number of samples for each emotion category is difficult because of the nature of the domains where the data are collected. Many datasets from research labs, i.e., in-the-lab datasets, collect posed facial expression images in a constrained laboratory environment (e.g., consistent pose, angle, and illumination). These datasets usually have equally distributed samples across facial expression categories, acquired in ideal environments. Although ML models may perform reasonably on in-the-lab datasets, models trained on such data do not extend adequately to real-life conditions, where the majority of pictures are taken in the wild. This can be easily verified by testing a model pre-trained on in-the-lab data with in-the-wild datasets, which results in a significant performance drop for facial expression classification [33].
We must therefore turn to in-the-wild datasets, which provide better generality to a trainable model. Unfortunately, a typical in-the-wild dataset has a significant class imbalance problem. That is, the majority of images in the dataset are categorized as Happiness or Neutral, whereas the Fear, Disgust, Surprise, Sadness, and Anger classes have far fewer samples because of the inherent properties of emotional states. A simple solution to such a class imbalance problem is to balance the number of samples in all classes based on the sample size of a minority class. However, this may lead to a considerable reduction in overall data volume, which can significantly degrade classification performance, especially with DL algorithms.
In such a scenario, the Adaptive Cross Entropy (ACE) proposed in this article is designed to reflect the characteristics of a dataset by constructing a loss function based on the class distribution and the prediction precision for each class. Specifically, we first define in (8) a conventional Categorical Cross Entropy (CCE) for training a DL model with a total of $n$ samples and $c$ classes in a multi-class classification problem, where $x_i$ denotes the $i$-th training sample, $y_i$ denotes the corresponding label, and $\theta$ denotes the parameters of the DL model. The prediction probability in (8) is the output of a soft-max function applied to the feed-forward result of the network, and the function $\delta_j$ creates the one-hot vector encoding for multi-class classification, as defined in (9).

The CCE loss function in (8) produces a loss value by summing the logarithms of the prediction probabilities of each class. The class whose prediction probability contributes is selected by the function $\delta_j(y_i)$, which is determined by the label $y_i$. That is, since the log-loss is accumulated in proportion to the frequency of the labels during training, it may negatively affect the result of the backpropagation algorithm on an imbalanced dataset. To compensate for this shortcoming of the conventional categorical cross entropy, we propose the ACE loss in (10), which contains two weight terms, $w_p^j$ and $w_d^j$, in addition to (8). We make these weights behave adaptively according to the precision and the skewed class distribution in the emotion prediction. The weights are defined in (11)-(13), where $n_j$ is the number of samples of the $j$-th class, and $h_j$ denotes the number of true positive (hit) samples of the $j$-th class during training. To improve the accuracy of minority classes, which is degraded by class imbalance, we adopt a "precision compensation" weight term $w_p^j$ in (13), which is inversely proportional to the precision for each class.
Although expressed differently, the intuition is similar to that of Focal Loss [34]: if the precision with respect to a minor class is low, the weight should become larger to reduce the prediction error and balance the accuracy of the minor class. Conversely, when the precision with respect to a major class is high, the weight is decreased to avoid over-fitting towards the major class.
What differentiates ACE from the conventional approach is the distribution compensation weight term in (11), i.e., $w_d^j$, which alleviates the asymmetry in the loss caused by the label imbalance. It increases the loss value when the number of samples $n_j$ in the $j$-th class is small, and decreases it when $n_j$ is large relative to other classes. This is an important aspect of our CDA framework: since the numbers of samples in the datasets differ significantly, we address this issue by balancing the numbers with $w_d^j$. Moreover, because the number of samples for each emotion class is also skewed, we apply the same scheme to both the dataset classifier and the emotion classifiers. The function $f(w_p^j w_d^j, \alpha)$ uses $\alpha$ to adjust how much to concentrate on imbalanced data, interpolating between CCE and ACE (it becomes CCE when $\alpha$ is 0). We expect these adaptive parameters to improve overall accuracy by balancing the precision across classes.
In practice, when training a DL algorithm, the loss function is typically calculated for each mini-batch. When the two weight terms of the ACE loss are defined per mini-batch, $n$ is the size of the mini-batch. The weight values of the ACE loss are then determined by the label distribution and class accuracy within each mini-batch, and thus change adaptively from one mini-batch to the next.
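A per-mini-batch weighted cross entropy in the spirit of ACE can be sketched as follows. The exact forms of the weights and of $f$ are given by the paper's own equations and are not reproduced here; this sketch substitutes illustrative choices that only satisfy the properties stated in the text. All of the following are assumptions: `w_d` is an inverse-frequency weight $n / (c \cdot n_j)$, `w_p` is a smoothed ratio $(n_j + \epsilon)/(h_j + \epsilon)$, and $f(w, \alpha) = w^{\alpha}$, which reduces to plain CCE at $\alpha = 0$. The function names are hypothetical.

```python
import math
from collections import Counter


def softmax(logits):
    """Numerically stable soft-max over one sample's class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ace_weights(labels, preds, num_classes, alpha, eps=1.0):
    """Per-class weights for one mini-batch (illustrative forms, see lead-in)."""
    n = len(labels)
    n_j = Counter(labels)                                      # samples per class
    h_j = Counter(y for y, p in zip(labels, preds) if y == p)  # hits per class
    weights = []
    for j in range(num_classes):
        # Assumed distribution compensation: larger for rare classes.
        w_d = n / (num_classes * max(n_j.get(j, 0), 1))
        # Assumed precision compensation: larger when few samples of class j are hit.
        w_p = (n_j.get(j, 0) + eps) / (h_j.get(j, 0) + eps)
        # Assumed f(w, alpha) = w ** alpha, so alpha = 0 gives weight 1 (plain CCE).
        weights.append((w_p * w_d) ** alpha)
    return weights


def ace_loss(batch_logits, labels, alpha):
    """Mean weighted negative log-likelihood over the mini-batch."""
    probs = [softmax(row) for row in batch_logits]
    preds = [max(range(len(p)), key=p.__getitem__) for p in probs]
    weights = ace_weights(labels, preds, len(batch_logits[0]), alpha)
    total = sum(weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return -total / len(labels)
```

Because the weights are recomputed from the labels and predictions of each mini-batch, a class that is rare or poorly predicted in the current batch contributes more to the loss than it would under plain CCE.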
Combining the above two ideas, the final proposed method uses the ACE as the loss of the emotion classifier in the CDA structure. That is, $L$ in (1) and (2) is replaced with (10) to complete our model.

III. EXPERIMENTAL RESULTS
Our experiments were performed on three different in-the-wild datasets of facial expression images, which are the most widely used in facial expression recognition. The datasets, the experimental setting, the results, and their validation are described below.

A. DATASETS
We used three in-the-wild datasets with 7 different emotion classes. The 7 classes consist of 6 basic emotions, i.e., Happiness, Sadness, Disgust, Anger, Fear, and Surprise, plus Neutral. The three datasets that we used consist of facial expression images that were collected from the web by searching with emotion-related keywords.
RAF Dataset: The Real-world Affective Faces (RAF) dataset contains images collected from Flickr (https://www.flickr.com) by searching with emotion keywords related to the 7 emotion classes. A total of 315 annotators labeled the images based on their knowledge of psychology. Specifically, each image and its label were validated by 40 independent annotators to increase the credibility of the dataset. The dataset consists of 15,339 annotated images, of which 12,271 belong to the training set and the remaining 3,068 are assigned to the testing set [22].
ExpW Dataset: Facial Expression in-the-Wild (ExpW) is a dataset collected by a search engine using emotion-related keywords. It consists of 88,600 facial expression images and does not offer separate train and test sets [24]. Thus, we randomly separated 3,500 images as a test set and used the remaining 85,100 images as a train set.
AffectNet Dataset: AffectNet is a facial expression dataset collected by searching 1,250 emotion-related keywords on search engines such as Google, Bing, and Yahoo. Each image was assigned to and labeled by one annotator. There are a total of 287,401 images related to the 7 classes, comprising a train set of 283,901 images and a validation set of 3,500 images [23]. A test set has not been made public yet, and thus we used this validation set as a test set in all our experiments.
Figure 3 shows the facial expression class distributions in the training and testing sets of the three datasets. In the case of RAF and ExpW, class imbalance is reflected in both the training set and the testing set. In the case of AffectNet, however, while the class label distribution in the training set is unbalanced, the class labels in the testing set are distributed evenly. We focus on the test set of AffectNet since we want a classifier that is not biased towards a few specific expressions.

B. EXPERIMENTAL DESIGN
We used a 50-layer ResNet [40] as our backbone network in all experiments. Throughout our experiments, the batch size was 2048, the number of training epochs was set to 120, and stochastic gradient descent was used as the optimizer on an NVIDIA TITAN RTX GPU. In order to evaluate dataset bias only and remove other nuisance factors, we did not use the aligned images provided with each dataset; we only used the face region information provided with each dataset and cropped the facial images from the original images. Then, all images were resized to 100 × 100 and used as the inputs to our framework. For data augmentation, random cropping and horizontal flipping were applied to all images so that the quality of all images becomes randomized. Only for the experiment in Section III-C, we set the batch size to 350.
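For reference, the settings listed above can be collected into a small configuration object. This is only a sketch: the class name, field names, and string labels are hypothetical, while the values are those stated in the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ExperimentConfig:
    backbone: str = "resnet50"        # 50-layer ResNet
    batch_size: int = 2048
    epochs: int = 120
    optimizer: str = "sgd"            # stochastic gradient descent
    input_size: int = 100             # images resized to 100 x 100
    augmentations: tuple = ("random_crop", "horizontal_flip")
    use_aligned_faces: bool = False   # faces cropped from provided face regions instead


# Default setting used throughout the experiments.
DEFAULT = ExperimentConfig()

# Section III-C differs only in the batch size.
SECTION_III_C = replace(DEFAULT, batch_size=350)
```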

C. CLASSIFYING FACIAL EXPRESSIONS IN IMAGES
We first present the main result of a facial expression recognition experiment comparing the performance of our method to that of state-of-the-art baselines. Table 1 summarizes the performances. We combined the three datasets (i.e., RAF, AffectNet, and ExpW) to create a large-scale multi-site dataset and trained our model on it. In this experiment, we used the ExpW dataset as part of the training data but excluded it from the evaluation and the comparisons with other methods, because it does not provide a public testing set. As seen in Table 1, our framework showed the best performance on the AffectNet test set, with a 2∼9% improvement in recognition accuracy. Considering that the distribution of emotion labels is balanced in the AffectNet test set while all the models compared here are trained with a skewed label distribution in the training set, the result shows that our model is able to learn and generalize properly even with substantial class imbalance. Moreover, notice that the accuracy of our model is even better than that of the state-of-the-art methods (with pre-training) trained on AffectNet only. This is important since training a classical DL model on the multi-site data decreases the performance of the model, as shown in a separate experiment (Table 4 in Section III-E). Here, we still achieve good accuracy with the combined multi-site dataset by minimizing dataset biases.
We obtained accuracy comparable to the state-of-the-art result on the RAF test data [39], with only a 0.17% difference. We attribute the difference to the conventional face alignment used in other state-of-the-art baselines, which is typically applied to reduce large unwanted variations in in-the-wild datasets. We did not apply face alignment in the preprocessing since it may act as a separate covariate affecting the original biases in each dataset.

D. IDENTIFYING BIAS IN MULTI-SITE FACIAL EXPRESSION DATASET
In order to confirm that there exists clear bias between different datasets, we first performed 1) dataset classification and 2) cross-dataset recognition experiments to quantify the bias introduced by the three facial expression datasets. In the dataset classification experiments, we combined the three datasets, i.e., RAF, ExpW, and AffectNet, and used 381,272 images as a training set and 10,068 images as a test set.
TABLE 3. Architecture of the label extractor, emotion classifier, and domain classifier. Note that the input dimension of the dataset classifier is 1,007, which is the dimension of the concatenated vector of an image feature (1,000) and a label (7).

Since these datasets were collected from natural scenes (i.e., from the wild), we first hypothesized an ideal condition in which there is no dataset bias between them, which would yield a random classification result (33% accuracy). We performed the dataset classification with ResNet-50, which serves as the feature extractor in our framework. Unexpectedly, we obtained 91.74% dataset classification accuracy even though the classification was performed on in-the-wild datasets. Even though the differences caused by image quality and alignment in each dataset were minimized, this undesirably high classification performance clearly demonstrates the existence of dataset selection bias in each in-the-wild dataset. In particular, the dataset bias of RAF was the largest among the three datasets, followed by AffectNet and ExpW, as shown in the resulting confusion matrix in Figure 4. In the case of the RAF dataset, 40 annotators validated each image and its expression; perhaps the annotators were made to follow a policy for determining expression labels, which strengthened the dataset bias through the repeated validation process.
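The sample counts and the chance level quoted above follow directly from the split sizes given in Section III-A, as the short check below shows. The variable names are illustrative; the majority-split baseline is an extra reference point computed from the same numbers, not a figure reported in the text.

```python
# (train, test) sizes per dataset, as stated in Section III-A.
SPLITS = {
    "RAF": (12_271, 3_068),
    "ExpW": (85_100, 3_500),
    "AffectNet": (283_901, 3_500),
}

total_train = sum(train for train, _ in SPLITS.values())
total_test = sum(test for _, test in SPLITS.values())

# Accuracy of a uniformly random guess among the three datasets.
chance_accuracy = 1 / len(SPLITS)

# Accuracy of always predicting the dataset with the largest test split.
majority_accuracy = max(test for _, test in SPLITS.values()) / total_test

print(total_train, total_test)
print(f"chance: {chance_accuracy:.2%}, majority split: {majority_accuracy:.2%}")
```

Both reference points are close to one third, so the reported 91.74% dataset classification accuracy is far above what an unbiased collection would allow.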
In addition, we performed experiments on cross-dataset recognition (i.e., training and testing on different datasets) to validate the generality of each dataset and the baseline performance of ResNet-50 trained on each dataset. Table 2 shows the results of facial expression classification in the cross-dataset setting on the three in-the-wild datasets.
We observed a large performance drop when training and testing were performed on different datasets as shown in the off-diagonals of Table 2. Conversely, when training and testing processes were done on the same dataset, as shown in the diagonal elements of Table 2, we obtained much higher classification performances.
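The cross-dataset protocol (train on one dataset, test on every dataset) can be written as a small harness. This is a sketch with hypothetical names: `train_fn` and `eval_fn` stand in for whatever training and evaluation routines are used, and the gap measure is one simple way to summarize the diagonal versus off-diagonal difference, not a metric defined in the paper.

```python
def cross_dataset_matrix(names, train_fn, eval_fn):
    """Return results[src][tgt] = accuracy of a model trained on src, tested on tgt."""
    results = {}
    for src in names:
        model = train_fn(src)  # train once per source dataset
        results[src] = {tgt: eval_fn(model, tgt) for tgt in names}
    return results


def generalization_gap(results):
    """Within-dataset accuracy minus mean accuracy on the other datasets."""
    gaps = {}
    for src, row in results.items():
        off_diagonal = [acc for tgt, acc in row.items() if tgt != src]
        gaps[src] = row[src] - sum(off_diagonal) / len(off_diagonal)
    return gaps


if __name__ == "__main__":
    # Stub routines with placeholder numbers, only to show the call pattern;
    # they are not results from the paper.
    names = ["RAF", "ExpW", "AffectNet"]
    stub_train = lambda src: src
    stub_eval = lambda model, tgt: 0.8 if model == tgt else 0.5
    matrix = cross_dataset_matrix(names, stub_train, stub_eval)
    print(generalization_gap(matrix))
```

A large positive gap for a source dataset corresponds to the steep off-diagonal drop described above.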
In particular, the model trained on the RAF dataset showed the steepest drop when tested on the test sets of the other datasets, i.e., the lowest generalization capability, whereas the model trained on AffectNet showed the smallest performance drop. This was expected considering the sizes of the training sets, which decrease in the order of AffectNet, ExpW, and RAF. Taking the mean classification accuracy as a measure of the classification difficulty of each dataset, the test sets ranked from most to least difficult as AffectNet, ExpW, and RAF. Interestingly, we obtained the lowest classification performance on AffectNet; this may be because the class distributions of the training sets and the test set are significantly different. In other words, since the class distributions of the train sets of the three datasets are skewed while the class distribution of the AffectNet test set is balanced, this mismatch between the train sets and the test set may negatively affect overall performance in the test experiments with the AffectNet dataset.
Through these two experiments, we confirmed that there exist clear biases among these three in-the-wild datasets, as well as other potential challenges arising from skewed class distributions. These are critical challenges for utilizing a cross-dataset (i.e., multi-site dataset) scheme to set up a large-scale facial expression classification experiment that improves the performance of DL algorithms; simply merging different datasets may hurt the performance of the DL framework. Our framework precisely tackles these issues to achieve successful results.

E. ANALYSIS ON RECOGNITION PERFORMANCE WITH MULTI-SITE DATASET
The purpose of this cross-dataset experiment is to validate the improved generalization ability of our framework by running it over heterogeneous images from different datasets. We expect the performance to improve as we utilize a larger number of training samples by combining multiple datasets (i.e., cross-dataset) with the ideas proposed in Section II. Despite the larger dataset size, merging different datasets requires careful control of the dataset bias that comes from different dataset properties. In the following, we show experimental results with a multi-site dataset combining the RAF, ExpW, and AffectNet datasets. For these experiments, we used the CDA architecture with a feature extractor using ResNet-50 as the backbone, a label extractor, three emotion classifiers, and a dataset classifier, as shown in Table 3. $\lambda_1$ for the feature extractor and $\lambda_2$ for the label extractor were set to increase from 0 to 1 over the course of training. The rest of the parameters were set as in Section III-B.
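The text states only that $\lambda_1$ and $\lambda_2$ increase from 0 to 1 during training and does not give the shape of the ramp. Two possible schedules are sketched below as assumptions: a sigmoid-shaped ramp of the kind commonly paired with gradient reversal layers, and a plain linear ramp. The function names and the steepness parameter `gamma` are hypothetical.

```python
import math


def sigmoid_ramp(epoch: int, total_epochs: int, gamma: float = 10.0) -> float:
    """Smooth ramp: exactly 0 at the start, approaching (but not reaching) 1."""
    progress = epoch / total_epochs
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0


def linear_ramp(epoch: int, total_epochs: int) -> float:
    """Linear ramp: exactly 0 at the start and exactly 1 at the end."""
    return epoch / total_epochs


if __name__ == "__main__":
    total = 120  # number of training epochs used in the experiments
    for epoch in (0, 30, 60, 120):
        print(epoch, round(sigmoid_ramp(epoch, total), 4), linear_ramp(epoch, total))
```

Starting from 0 lets the emotion classifier learn useful features first; the dataset-confusion signal is then phased in as the reversed gradient is scaled up.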
Different deep learning models were trained using the training samples from the three in-the-wild datasets. Three models are compared: 1) a baseline method trained only on AffectNet, 2) a baseline method trained on the three datasets, and 3) our proposed method, which uses the CDA structure. As shown in Table 4, the proposed method achieved better performance than the others on all the test sets. Comparing the two baseline methods, although the baseline trained on all three datasets showed better average accuracy than the one trained on the AffectNet training set only, its individual performance on the AffectNet test set decreased. That is, despite the larger sample size obtained by merging the three in-the-wild datasets, the selection and annotation biases of the datasets became pronounced, resulting in poorer performance on the AffectNet test set rather than complementary effects; we concluded that simply combining the training sets of multiple datasets does not produce meaningful results for generalization. On the other hand, with our proposed method, accuracy increased on all the test sets even though the training set was a mixture of different datasets.
In Figure 5, the distributions of trained features from the baseline with the combined dataset (left) and from the proposed method (right) are compared using Uniform Manifold Approximation and Projection (UMAP) [41] to show the effect of the proposed feature extractor. On the left of Figure 5, clear selection bias is observed, with the trained features forming a separate cluster for each dataset, while the features from our feature extractor are mixed so that the three datasets behave as a single global dataset. Figure 6 expands the results on AffectNet in more detail with recall matrices obtained from the three methods. In the recall matrix in the middle column, the per-class true positive rates, i.e., the recall values on the diagonal, are slightly lower than those of the baseline method in the first column (mean recall scores: baseline with AffectNet 0.59 ± 0.20, baseline with the combined dataset 0.57 ± 0.20). Darker green on the diagonal indicates higher recall. Notice that the green colors in the last column of each recall matrix in Figure 6, which corresponds to Neutral, fade across the baseline with the combined datasets and our proposed method. This means that the proposed method improves generalization ability using the training samples from all three datasets, with less dataset bias than the other methods. The proposed method was effectively applied to the AffectNet test set, with higher and more balanced true positive rates than the baselines. The mean recall score of the proposed method was 0.61 ± 0.15.
Notice that the per-class precision of the proposed framework, shown at the bottom of each matrix, is much more balanced than that of the other methods. This is highly desirable considering that the AffectNet dataset has a uniform distribution of labels in its testing set. In particular, we even obtain a balanced number of predictions for the neutral class. We want to emphasize this result, since we hypothesized that the dataset bias in the neutral class may be the strongest among the seven classes of facial expression. This may be because, when the policy for annotating or selecting facial expression images as neutral is ambiguous, it results in relatively small inter-class variation compared to other classes [42]. Using our proposed method, the number of neutral predictions was reduced from 981 in the baseline with the three datasets to 733. This means that the range of predictions for the neutral class is narrower than with the baseline algorithms, i.e., the proposed method applies stricter criteria for predicting the neutral class. Although different datasets have different opinions on what a neutral facial expression is, predictions from our proposed method are narrowed down to the common characteristics of the three independent datasets.
In addition, the precision for the neutral class improved despite the decrease in the number of neutral predictions under our framework. This is because the number of false positives for the neutral class decreased significantly compared to the baseline methods. This result suggests that the proposed method appropriately drove our predictions toward reducing the biases of the different datasets for the neutral class. The overall performance of the proposed method also improved, as seen in the mean precision (Baseline with the three datasets: 0.61 ± 0.11, Proposed method with the three datasets: 0.64 ± 0.11).
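The relation between the number of predictions and precision can be illustrated with a small sketch. Precision for a class is TP / (TP + FP), where TP + FP is the number of times the class was predicted, so precision rises when false positives fall faster than true positives. The helper `per_class_precision` and the 2-class counts below are hypothetical, with class 1 standing in for the neutral class; they are not the counts behind the reported 981 and 733 predictions.

```python
def per_class_precision(confusion):
    """Return (predicted_counts, precisions) for each class.

    Assumes confusion[i][j] counts samples whose true class is i
    and whose predicted class is j, so column sums are prediction counts.
    """
    n = len(confusion)
    predicted = [sum(confusion[r][c] for r in range(n)) for c in range(n)]
    precisions = [
        confusion[c][c] / predicted[c] if predicted[c] else 0.0
        for c in range(n)
    ]
    return predicted, precisions


# Hypothetical counts on a balanced test set (100 samples per class).
# Class 1 plays the role of "neutral"; these are NOT data from the paper.
before = [[60, 40],
          [10, 90]]  # class 1 is over-predicted: 130 predictions, 40 false positives
after = [[85, 15],
         [15, 85]]   # class 1 is predicted 100 times, 15 false positives

pred_before, prec_before = per_class_precision(before)
pred_after, prec_after = per_class_precision(after)

print(f"before: {pred_before[1]} predictions, precision {prec_before[1]:.2f}")
print(f"after:  {pred_after[1]} predictions, precision {prec_after[1]:.2f}")
```

In this toy case the class is predicted less often, yet its precision increases because most of the removed predictions were false positives, which is the pattern described for the neutral class.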
We also visualized the feature vectors using UMAP to see how well the extracted features represent the seven emotions. In Figure 7, the left panel shows the result from the baseline method, where the seven facial expression classes are mixed and not clearly distinguished; only the samples in the happiness class (brown tags) form a cluster. In the right panel showing our result, however, the emotions are distributed separately, with gaps that distinguish the clusters. Notice that the Neutral samples are clustered at the center of the distribution, which is desirable since different emotions manifest from the Neutral state.

IV. CONCLUSION
We designed a cross-dataset adaptation scheme that combines multiple datasets to retain a sample size sufficient to train a DL model while minimizing the biases that exist across the datasets. Our model is designed to learn the "general" representation of data in specific classes that exist across multiple datasets. We applied our framework to a facial expression recognition problem using three independent in-the-wild datasets with large dataset biases and class imbalance problems. We confirmed through dataset classification analyses that there is serious bias between the datasets, and presented extensive empirical results evaluating the generalization ability of our framework. When trained with multiple independent training sets with skewed class distributions, we achieved improved performance over state-of-the-art algorithms on a balanced test set, i.e., AffectNet, as well as comparable results on the RAF test dataset. There is great potential for our method to be applied to various domains where combining multi-site datasets is required to acquire enough data.