RASN: Using Attention and Sharing Affinity Features to Address Sample Imbalance in Facial Expression Recognition

Sample imbalance in expression datasets often leads to poor recognition results for minority classes. To address this problem, we propose a facial expression recognition network called the Residual Attentive Sharing Network (RASN). Different expressions share affinity features, which makes it possible for minority classes to benefit from majority classes during expression feature extraction. Based on this observation, we propose a sharing affinity features module that compensates for the inadequate feature learning of minority classes by sharing affinity features. In addition, an affinity features attention module is added to highlight expression-related affinity features and suppress expression-unrelated ones, enhancing the effect of sharing affinity features. Experiments on the CK+, RAF-DB, and FER2013 datasets validate the robustness of our method to sample imbalance. Our method achieves validation accuracies of 96.97% on CK+, 71.44% on FER2013, and 90.91% on RAF-DB, exceeding current state-of-the-art methods.

As the number of network layers increases, deep neural networks [5], [6], [7] gain greater representational ability and can express semantic information more accurately. Deep learning methods therefore achieve higher recognition accuracies than conventional methods.

However, it is difficult for deep learning methods to achieve the same improvement on datasets such as RAF-DB [8] and FER2013 [9]. This is because humans display different expressions at different frequencies in real scenes, so some expressions are much harder to collect than others. As shown in Figure 1, the distribution of the number of samples per expression category in RAF-DB and FER2013 is extremely unbalanced, a phenomenon called sample imbalance. It leads to insufficient feature learning for the minority classes and reduces recognition accuracy. Recent studies have proposed different methods to address the sample imbalance issue in FER. Data augmentation is the most common. Xie et al. [10] propose TDGAN, which generates minority-class samples with a Generative Adversarial Network to reduce the impact of sample imbalance on classification. In contrast, Shi et al. [11] introduce the concept of affinity features for facial expressions, based on the fact that FER is a face-specific classification task, and reduce the effect of sample imbalance on the model by sharing affinity features. However, the method of Shi et al. [11] still suffers from sample imbalance, which we believe is mainly because their sharing of affinity features is designed to assist the De-albino block rather than to compensate for minority classes directly.

(2) An affinity features attention module is proposed to learn the importance weights of affinity features, so as to further enhance the role of sharing affinity features.

(3) The robustness of our RASN to sample imbalance is validated on three public datasets.

The authors of [12] realize a frame attention network with ResNet18 as the backbone, making the model pay closer attention to important frames and improving its recognition ability. Wang et al. [13] present the Region Attention Network, which is robust to posture and illumination, and Wang et al. [14] propose the Self-Cure Network to make the model robust to uncertain samples. Bargal et al. [15] extract features from VGG13, VGG16, and ResNet18 and then integrate all of them for classification. Zhang et al. [16] propose the MSCNN network, which performs expression recognition with a cross-entropy loss to learn features with large between-expression variation, and face recognition with a contrastive loss to reduce the within-expression variation of features.

The attention mechanism originates from the study of human visual attention: humans usually focus on the important area instead of the whole object to identify its category. Recently, many studies have shown that integrating an attention mechanism into a network to learn correlations between features can improve its effectiveness. For example, the Squeeze-and-Excitation (SE) module proposed by Hu et al. [17] enables the model to selectively emphasize the features of important channels and suppress those of unimportant channels through global information, improving model performance. Woo et al. [18] proposed the CBAM module, which combines spatial attention and channel attention.

Shi et al. [11] propose an Amend Representation Module (ARM) that enhances the model through a Sharing Affinity block and a De-albino block. Since FER can be regarded as a face-specific classification task, different categories of expressions have affinity features, and sharing these features can improve the recognition accuracy of the model. However, because the acquisition and sharing of affinity features in ARM mainly serves the De-albino block, recognition accuracy improves only minimally when the Sharing Affinity block is used alone. To make full use of the affinity features among expressions and address sample imbalance, we propose a new method for sharing affinity features, which differs from ARM in how affinity features are obtained and shared. To better illustrate how the sharing affinity features module abates the influence of sample imbalance, an illustrative graph is shown in Figure 3.

The hyperparameter λ controls the contribution degree of the affinity features F_a to the final features F_gs: the higher its value, the greater the contribution. Here, the first feature extraction layer of ResNet18 (Resnet layer1) is used as an example. First, the original input features F_g extracted by Resnet layer1 are fed into a Sharing-module to obtain F_a, which is multiplied by the hyperparameter λ and then added element-wise to the original features F_g to obtain the feature map F_gs. Finally, F_gs is fed into Resnet layer2, achieving the sharing of affinity features. The remaining three feature extraction layers, from the second to the fourth, share affinity features in the same way as Resnet layer1. In this way, all four feature extraction layers take the affinity features into consideration when extracting features, which enhances the model's representation learning ability and its robustness to sample imbalance, thus raising recognition accuracy.
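The excerpt specifies the data flow (F_a = Sharing-module(F_g), F_gs = F_g + λ·F_a, repeated for all four stages) but not the Sharing-module's internals, so the following is only a minimal PyTorch sketch in which a 1×1-convolution block stands in for the Sharing-module; the names `SharingModule` and `SharedLayer` and the value of λ are illustrative assumptions, not the authors' exact implementation.

```python
import torch.nn as nn
from torchvision.models import resnet18


class SharingModule(nn.Module):
    """Produces affinity features F_a from the input features F_g.

    The excerpt does not detail the module's internals; a 1x1
    convolution block is used here purely as a placeholder.
    """

    def __init__(self, channels):
        super().__init__()
        self.extract = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_g):
        return self.extract(f_g)  # F_a


class SharedLayer(nn.Module):
    """Wraps one ResNet stage: F_gs = F_g + lam * F_a is passed onward."""

    def __init__(self, resnet_layer, channels, lam=0.1):
        super().__init__()
        self.layer = resnet_layer
        self.sharing = SharingModule(channels)
        self.lam = lam  # hyperparameter lambda (0.1 is illustrative)

    def forward(self, x):
        f_g = self.layer(x)          # original features F_g
        f_a = self.sharing(f_g)      # affinity features F_a
        return f_g + self.lam * f_a  # shared features F_gs


# Apply the same sharing scheme to all four ResNet18 stages.
backbone = resnet18(num_classes=7)
for name, ch in [("layer1", 64), ("layer2", 128),
                 ("layer3", 256), ("layer4", 512)]:
    setattr(backbone, name, SharedLayer(getattr(backbone, name), ch))
```

Because each `SharedLayer` simply replaces the corresponding ResNet stage, the backbone's original forward pass is reused unchanged and every stage's output already contains the λ-weighted affinity contribution.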
By sharing affinity features, minority classes can learn features from majority classes, which increases the robustness of the model to sample imbalance. However, there is reason to believe that not every affinity feature is equally important. Some features are unrelated to expression, such as skin color and gender, and giving these expression-unrelated features the same weight as the expression-related ones would undoubtedly harm the effect of sharing affinity features. Therefore, an affinity features attention module is designed to obtain the weight of each channel of the affinity features; its network structure is shown in Figure 4.
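The excerpt describes the Attention-module only as learning per-channel weights for the affinity features (Figure 4 is not reproduced here), so the sketch below uses an SE-style squeeze-and-excitation design as a plausible stand-in; the reduction ratio and layer sizes are assumptions.

```python
import torch.nn as nn


class AffinityAttention(nn.Module):
    """Channel attention over affinity features F_a.

    The exact structure of Figure 4 is not in this excerpt, so a
    squeeze-and-excitation design in the spirit of [17] is assumed.
    """

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global context
        self.fc = nn.Sequential(             # excitation: channel weights
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, f_a):
        b, c, _, _ = f_a.shape
        w = self.fc(self.pool(f_a).view(b, c)).view(b, c, 1, 1)
        return f_a * w  # attended affinity features F_as
```

In the `SharedLayer` sketch above, this module would be registered as, e.g., `self.attention = AffinityAttention(channels)` and applied to `f_a` before the weighted addition, giving F_gs = F_g + λ·F_as.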

The training procedure can be summarized as follows: compute the prediction q(x_i) from F_g and F_as; compute the cross-entropy loss by Eq. 7; compute the gradient of the CNN by Eq. 8; and compute the gradient of the Attention-module by Eq. 10.

The gradient g_c of the CNN is obtained by taking the partial derivative of the loss with respect to θ_c, and the parameter θ_c of the CNN is optimized by gradient descent as follows:

θ_c^{n+1} = θ_c^n − α g_c

where θ_c^{n+1} is the updated CNN parameter, θ_c^n is the pre-update CNN parameter, g_c is the gradient of the CNN, and α is the learning rate.

Finally, the gradient g_a of the Attention-module is obtained by taking the partial derivative of the loss with respect to θ_a, and the parameter θ_a of the Attention-module is optimized by gradient descent as follows:

θ_a^{n+1} = θ_a^n − α g_a

where θ_a^{n+1} is the updated Attention-module parameter, θ_a^n is the Attention-module parameter before the update, g_a is the gradient of the Attention-module, and α is the learning rate.
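Since both updates are plain gradient descent with a shared learning rate α, they can be realized with a single SGD optimizer over two parameter groups. A minimal sketch, continuing the code above and assuming the attention modules are registered under parameter names containing "attention"; the learning rate and batch shape are illustrative:

```python
import torch
import torch.nn as nn

# theta_a: Attention-module parameters; theta_c: everything else (the CNN).
attn_params = [p for n, p in backbone.named_parameters() if "attention" in n]
cnn_params = [p for n, p in backbone.named_parameters() if "attention" not in n]

alpha = 0.01  # learning rate alpha (illustrative value, not from the paper)
optimizer = torch.optim.SGD(
    [{"params": cnn_params},    # theta_c, updated with gradient g_c
     {"params": attn_params}],  # theta_a, updated with gradient g_a
    lr=alpha,
)
criterion = nn.CrossEntropyLoss()  # the cross-entropy loss (Eq. 7)

images = torch.randn(8, 3, 224, 224)        # dummy batch
labels = torch.randint(0, 7, (8,))
loss = criterion(backbone(images), labels)  # prediction q(x_i) -> loss

optimizer.zero_grad()
loss.backward()   # backpropagation yields g_c and g_a
optimizer.step()  # theta^{n+1} = theta^n - alpha * g for both groups
```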

In this section, we first introduce the three public datasets and then present our experimental results.

The main reason is that the Attention-module is only used to learn the weights of affinity features, so it has no effect without the Sharing-module. Third, the Attention-module improves the Sharing-module by 1.76% without pre-training and by 2.51% with pre-training. This is accomplished by assigning high weights to expression-related affinity features and low weights to expression-unrelated ones; as a result, the Attention-module enhances the role of important affinity features, suppresses the role of unimportant ones, and further strengthens the effect of sharing affinity features. The precision, recall, and f1-score results are shown in Table 2, Table 3, and Table 4, respectively.

As shown in Table 2, Table 3, and Table 4, our RASN achieves the highest precision, recall, and f1-score. CBLoss also improves the precision, recall, and f1-score on the three datasets, but it remains lower than RASN. In addition, RAF-DB and FER2013 are collected in the wild, with a large number of low-quality images and noisy labels. Therefore, our RASN achieves greater improvement on datasets with higher sample quality.

To better analyze the effectiveness of RASN, we randomly select samples of different classes from RAF-DB and use Grad-CAM [22] to visualize the main areas attended to by RASN and by the baseline (a traditional ResNet18 without the Sharing-module and Attention-module). The experimental results are shown in Figure 6. From the figure, it is to be noted that for surprise expressions, RASN mainly focuses on the eyes and eyebrows; for happy expressions, on the mouth and the center of the face; and for anger expressions, on the mouth and eyebrow area. These observations are consistent with reality and demonstrate that our RASN does learn the key features for discriminating expressions. We believe this may be because the Attention-module of RASN effectively learns the importance of affinity features.
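For reference, a minimal sketch of how such a Grad-CAM [22] visualization can be produced with plain PyTorch hooks, continuing the sketches above; the choice of `layer4` as the target stage and the 224×224 input size are assumptions, since the paper's exact configuration is not given in this excerpt.

```python
import torch
import torch.nn.functional as F

features, grads = {}, {}
h1 = backbone.layer4.register_forward_hook(
    lambda m, i, o: features.update(value=o))
h2 = backbone.layer4.register_full_backward_hook(
    lambda m, gi, go: grads.update(value=go[0]))

backbone.eval()
image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed face crop
logits = backbone(image)
logits[0, logits[0].argmax()].backward()  # gradient of the predicted class

# Channel weights = global-average-pooled gradients; CAM = ReLU(weighted sum).
w = grads["value"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((w * features["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=image.shape[2:],
                    mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # heatmap in [0, 1]

h1.remove(); h2.remove()
```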

Although much effort has been invested in solving the sample imbalance problem, our RASN still achieves better recognition accuracy than other state-of-the-art methods and the baseline (a traditional ResNet18 without the Sharing-module and Attention-module). Tables 5, 6, and 7 show the test accuracy comparison on CK+, FER2013, and RAF-DB, respectively. As can be seen from the tables, for CK+ the second-highest test accuracy is achieved by Pre-train CNN, which uses transfer learning to overcome the shortage of training samples. For FER2013, the second-highest test accuracy is achieved by SAP [34], a sample-awareness-based expression recognition method in which a Bayesian classifier selects the most appropriate classifier from a set of candidates for the current test sample, which is then used to recognize its expression. For RAF-DB, the second-highest test accuracy is achieved by DACL [19], which combines a center loss with an attention mechanism to selectively penalize features for enhanced discrimination.

Compared with the above methods, our proposed RASN outperforms these state-of-the-art methods, achieving 96.97%, 71.44%, and 90.91% on CK+, FER2013, and RAF-DB, respectively. This shows that RASN performs well on the expression recognition task and has good generalization ability. It is to be noted that our RASN achieves higher accuracy on CK+ and lower accuracy on RAF-DB and FER2013. This is mainly because CK+ is collected in the laboratory, while RAF-DB and FER2013 are collected from real scenes. CK+ samples have good posture, no occlusion, and higher label reliability, whereas RAF-DB and FER2013, collected in the wild, suffer from low sample quality and noisy labels.

Our RASN can also be used as a real-time expression recognition method. To demonstrate its efficiency, we compare RASN with the baseline (a traditional ResNet18 without the Sharing-module and Attention-module) in terms of the number of parameters, the number of floating-point operations (FLOPs), and the frame rate (a measurement sketch is given at the end of this article). As shown in Table 8, compared with the baseline, the number of parameters of RASN increases from 11.2M to 14.4M, and the FLOPs of RASN increase correspondingly.

In future work, we will continue to study how to build a more effective RASN to improve robustness to sample imbalance. In addition, we will test our method in different networks.
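As referenced above, the parameter-count and frame-rate comparison of Table 8 can be approximated with a few lines of PyTorch; FLOPs counting additionally needs an external profiler (e.g., thop or fvcore) and is omitted here. The input resolution and run count are assumptions.

```python
import time
import torch

def count_params_m(model):
    """Total number of parameters, in millions (cf. 11.2M vs. 14.4M)."""
    return sum(p.numel() for p in model.parameters()) / 1e6

@torch.no_grad()
def frame_rate(model, runs=100):
    """Single-image inference throughput in frames per second."""
    model.eval()
    x = torch.randn(1, 3, 224, 224)  # assumed input resolution
    model(x)  # warm-up pass
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    return runs / (time.perf_counter() - start)

print(f"params: {count_params_m(backbone):.1f}M  "
      f"fps: {frame_rate(backbone):.1f}")
```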