Discussion of Different Deep Transfer Learning Models for Emotion Recognition

In recent years, facial emotion recognition (FER) has been a popular topic in affective computing. However, FER still faces many challenges in automatic recognition for several reasons, including quality control of sample data, extraction of effective features, creation of models, and multi-feature fusion, which have not been thoroughly researched and therefore remain hot topics in computer vision. In view of the mature development of deep learning, deep learning methods are increasingly being used in FER. However, because deep learning requires a large amount of data to achieve effective training, many studies have employed transfer learning to compensate for this drawback. Nevertheless, there has been no universal approach to transfer learning in FER. Accordingly, this study used five classic models in FER (i.e., ResNet-50, Xception, EfficientNet-B0, Inception, and DenseNet-121) to conduct a series of experiments on data preprocessing, training type, and the applicability of multi-stage pretraining. According to the results, class weight was the optimal technique for data balancing. In addition, the freeze + fine-tuning training type produced higher accuracy regardless of the size of the dataset. Multi-stage training was also effective. Compared with the model accuracy in previous studies, the accuracy achieved in this study using the proposed transfer learning method was superior for both large and small datasets. Specifically, on AffectNet, the accuracy for the ResNet-50, Xception, EfficientNet-B0, Inception, and DenseNet-121 models increased by 8.37%, 10.45%, 10.45%, 8.55%, and 5.47%, respectively. On FER2013, the accuracy for these models increased by 5.72%, 2%, 10.45%, 5%, and 9%, respectively. These results demonstrated the validity and advantages of the experiments in this study.


Facial expression is essential to non-verbal communication and is one of the most natural ways to convey our internal feelings in interpersonal interactions. In this study, we focused on facial expression recognition (FER) based on computer vision in the field of emotion recognition. Compared with the information from other non-verbal expressions, the facial expression information obtained from image processing enables a machine to operate in a way similar to how our brain processes and recognizes information. The abundant information it provides then offers important clues for the machine to infer human intentions. Therefore, FER can be widely applied in many fields and is indispensable in affective computing.

The associate editor coordinating the review of this manuscript and approving it for publication was Wei Wei.

Most of the early studies of FER used datasets developed in the laboratory, such as JAFFE [1] and CK+ [2]. However, the facial expression data obtained from these datasets were too similar; only posed expressions without any occlusion were included, and they therefore cannot be applied to complex scenarios in real life. To solve this problem, many FER studies have established datasets in the unconstrained wild [3], [4], [5], [6]. Therefore, this study selected AffectNet and FER2013 as the experimental datasets.

One related HAR study performed training and evaluation with the databases of UTD-MHAD, HDM05, and NTU RGB+D 60 to achieve an accuracy of up to 98%. Bhattacharya et al. presented Ensem-HAR, a deep learning ensemble model for HAR, which was designed using four CNN and LSTM models to perform feature extraction, integrating these features and inputting them into a random forest for categorization [14]. When assessed for its accuracy with different databases, this model achieved 98.70% with WISDM, 97.45% with PAMAP2, and 95.05% with UCI-HAR.

The studies cited above indicate that, regardless of the line of research, deep learning models can be integrated, or trained with ensemble learning, to improve their accuracy. This is why the present study conducted experiments using five typical deep learning models. Indeed, when a transfer learning method appropriate for training each of the models to perform FER is determined, these models can have further application.

This study examined the importance of transfer learning to FER and the influence of two decisive factors (source domain and training type) of transfer learning.
Many studies on transfer learning with respect to FER have improved model accuracy by using multi-source transfer learning or training certain models, but no researchers have discussed the influence of different fine-tuning methods executed through transfer learning or the generalizability of transfer learning methods to different models. This study compared different transfer learning methods performed on several models using datasets of different sizes, and analyzed, from different perspectives, the transfer learning methods deemed most appropriate for FER.

In this paper, Section 2 reviews relevant studies on transfer learning, Section 3 describes the experiments conducted in this study, and Section 4 discusses the research outcomes and delineates the contribution of this study to FER. Conclusions are drawn in Section 5.

Thanks to open-source deep learning software, such as Keras, Torch, and TensorFlow, which provides many famous ImageNet-pretrained models (e.g., ResNet50, Xception, and EfficientNet), the threshold for using transfer learning has been lowered, leading to a considerable increase in the application of pretrained models to training in FER.

Both AffectNet and FER2013 are imbalanced. Therefore, this study also compared the effects of data balancing and data augmentation on AffectNet and FER2013. Data recorded in the real world usually constitute small datasets. Therefore, in this study, knowledge is transferred from ImageNet first to AffectNet and then to FER2013 to verify the effect of multi-stage transfer learning, thereby identifying the transfer learning method most suitable for FER.
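As a concrete illustration, the freeze + fine-tuning scheme examined in this study can be sketched with Keras. This is a minimal sketch under stated assumptions: the dataset objects (affectnet_train, affectnet_val) are placeholders, the 8-class head and input size are assumptions, and weights=None is used only so the sketch runs without downloading the pretrained weights (in practice, weights="imagenet" would be passed); the learning rates 0.01 and 0.001 follow the settings reported for freeze and fine-tuning training.

```python
import tensorflow as tf

# Stage 1: load a pretrained backbone and freeze it, so only the new
# classification head is trained on the facial-expression data.
base = tf.keras.applications.ResNet50(
    weights=None, include_top=False, pooling="avg", input_shape=(224, 224, 3))
base.trainable = False  # transferred layers are frozen

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(8, activation="softmax"),  # 8 emotion classes (assumed)
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss="categorical_crossentropy", metrics=["accuracy"])
stage1_trainable = sum(int(tf.size(w)) for w in model.trainable_weights)
# model.fit(affectnet_train, validation_data=affectnet_val, epochs=30)

# Stage 2: unfreeze all layers and fine-tune at a lower learning rate.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
stage2_trainable = sum(int(tf.size(w)) for w in model.trainable_weights)
print(stage1_trainable, stage2_trainable)
```

For the multi-stage variant, the same two steps would simply be repeated, with the AffectNet-trained model serving as the starting point for FER2013.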

• Xception [27] is a popular CNN model. It combines the concepts of GoogleNet and ResNet, but the inception module is replaced with a separable convolutional layer. The separable convolutional layer assumes that spatial and cross-channel patterns can be modeled separately. Therefore, its number of parameters, memory usage, and computational complexity are lower than those of a standard convolutional layer, and it also performs better.
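The parameter savings claimed for the separable convolutional layer can be checked with simple arithmetic. The sketch below compares a standard 3×3 convolution with its depthwise-separable counterpart; the channel sizes (128 in, 256 out) are illustrative assumptions, not Xception's actual configuration, and bias terms are omitted.

```python
# Parameter counts for a k x k convolution mapping c_in -> c_out channels.
def standard_conv_params(k, c_in, c_out):
    # Every output channel has its own k*k*c_in kernel.
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # Depthwise step: one k*k filter per input channel (spatial patterns);
    # pointwise step: a 1x1 convolution mixing channels (cross-channel patterns).
    return k * k * c_in + c_in * c_out

std = standard_conv_params(3, 128, 256)   # 294,912 parameters
sep = separable_conv_params(3, 128, 256)  # 33,920 parameters
print(std, sep, round(std / sep, 1))      # roughly 8.7x fewer parameters
```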

• EfficientNet [28] is a newer network model developed by Google that has become popular due to its features of being light, fast, and accurate. The model design was inspired by MnasNet. It not only used the mobile inverted bottleneck convolution (MBConv) but also introduced the attention mechanism of the Squeeze-and-Excitation Network (SENet).

• Inception [29] introduced a structure called an inception module to use parameters more efficiently; that is, the number of parameters was minimized while the accuracy of the network was ensured. Some variants were subsequently proposed. In this study, Inception-v3, which has been most frequently used in FER, was adopted.
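The Squeeze-and-Excitation attention mechanism that EfficientNet borrows from SENet can be sketched in a few lines of NumPy. This is a generic, illustrative SE block (random weights and an arbitrary reduction ratio of 4), not EfficientNet's exact layer.

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-excitation: re-weight the channels of an (H, W, C) feature map."""
    # Squeeze: global average pooling collapses the spatial dims to one value per channel.
    z = x.mean(axis=(0, 1))                     # shape (C,)
    # Excitation: a bottleneck MLP followed by a sigmoid gate in (0, 1).
    h = np.maximum(0.0, z @ w1 + b1)            # ReLU, shape (C // r,)
    s = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))    # sigmoid, shape (C,)
    # Scale: each channel of x is multiplied by its attention weight.
    return x * s

rng = np.random.default_rng(0)
C, r = 16, 4
x = rng.normal(size=(8, 8, C))
w1 = rng.normal(size=(C, C // r)); b1 = np.zeros(C // r)
w2 = rng.normal(size=(C // r, C)); b2 = np.zeros(C)
y = se_block(x, w1, b1, w2, b2)
print(y.shape)  # (8, 8, 16)
```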

• DenseNet-121 [30] directly connects all the layers that generate feature maps, which enables each convolution layer to access the feature maps output by all preceding convolution layers, thereby realizing dense connections between layers.

To ensure the convergence of all frozen layers, the number of training iterations at both the first and second stages was set at 30. In all the experiments, the model that showed the best performance on the validation set was selected.

As for the equipment, Keras and TensorFlow were adopted for model training. The experiments were conducted on a personal computer with an Intel i7-8700 CPU, NVIDIA GeForce RTX 2060 GPU, and 32.0 GB RAM, and the environment was Anaconda on Windows 10. The learning rates were 0.01 and 0.001 for freeze training and fine-tuning training, respectively. The optimizer was Adam, the batch size was 48, and the decay rate for fine-tuning training was 0.000001. The learning rate was multiplied by 0.01 whenever the accuracy on the validation set failed to improve for three consecutive epochs. For the evaluation criteria, top-1 accuracy was reported, which is the proportion of samples accurately predicted. Because most of the datasets are imbalanced, this study also reported the weighted F1 score, computed as

F1_weighted = Σ_c (#samples_c / #samples) × F1_c,

where #samples is the number of measurements in a dataset and #samples_c and F1_c are the sample count and F1 score of class c. Table 1 shows the ImageDataGenerator parameters for data augmentation.
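The weighted F1 score used here is the standard support-weighted definition, in which each class's F1 is weighted by its share of #samples; a plain-Python sketch, equivalent to scikit-learn's average='weighted' option:

```python
def f1_per_class(y_true, y_pred, label):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def weighted_f1(y_true, y_pred):
    # Each class's F1 is weighted by its share of #samples in the dataset,
    # so minority classes still contribute, unlike plain accuracy.
    n = len(y_true)
    return sum((y_true.count(c) / n) * f1_per_class(y_true, y_pred, c)
               for c in set(y_true))

y_true = ["happy", "happy", "happy", "sad", "fear"]
y_pred = ["happy", "happy", "sad", "sad", "fear"]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.813
```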

The results of data augmentation are shown in Fig. 8, and the corresponding results are listed in Table 2. On the basis of these results, the accuracy and F1 scores of models without class weight balancing showed greater differences, ranging from 1% to 6%, whereas those of models with class weight balancing were similar. This indicated that class weight balancing allowed the models to be trained effectively even on emotion classes with significantly fewer samples. Such differences were not found when the models were trained with FER2013. This suggested that the models could not effectively learn the features extracted from small datasets (because of the limited sizes of the data), thus yielding similar accuracy values and F1 scores (Table 3).
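Class weight balancing as discussed here is typically implemented by weighting each class inversely to its frequency and passing the resulting dictionary to Keras's model.fit(class_weight=...). The sketch below uses made-up class counts for illustration, not the actual AffectNet or FER2013 distributions.

```python
def balanced_class_weights(counts):
    """Weight each class by total / (n_classes * count), so rare classes
    contribute proportionally more to the training loss."""
    total = sum(counts.values())
    n = len(counts)
    return {label: total / (n * c) for label, c in counts.items()}

# Hypothetical, heavily imbalanced emotion-class counts.
counts = {"neutral": 8000, "happy": 6000, "disgust": 400, "contempt": 600}
weights = balanced_class_weights(counts)
print(weights["disgust"] / weights["neutral"])  # 20.0: disgust weighted 20x more
# In Keras, the dictionary would be passed as model.fit(..., class_weight=weights).
```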

An observation of the confusion matrices showed that, when class weight balancing was not performed, none of the models could be trained to predict the Contempt or Disgust class when fed with AffectNet, as shown in Table 4. Both of these emotion classes were limited in size, and the models predicted many of their samples as Neutral. However, after class weight balancing, the models all became more accurate in predicting the Contempt and Disgust classes when trained with AffectNet, as shown in Table 4.

As for the results of training with FER2013, shown in Table 5, none of the models could be trained to predict the Disgust class when no class weight balancing was performed, but they all became more accurate in predicting this class once class weight balancing was applied. In addition, accuracy across all the models declined after class weight balancing, because the models focused less on the classes they had accurately predicted, thereby becoming less accurate in predicting them (e.g., Happy).

After determining how class weight balancing improved model training with datasets of different sizes, further data augmentation was performed to ascertain whether it could improve accuracy, and we obtained surprising results. The accuracy of models trained with FER2013 decreased rather than increased. This was probably because images in all classes were highly similar, and the classes showed negligible differences. For these reasons, the accuracy errors became larger after data augmentation, preventing the models from effectively distinguishing between the classes. This assumption was validated by examining the patterns in the confusion matrices.

AffectNet, which contained massive data and thus allowed class weight balancing to exert its influence, showed limited improvement from data augmentation.

We also observed that the accuracy of the models trained using class weight on the small dataset (FER2013) was unable to exceed the accuracy of the models without data preprocessing. This is because the datasets were too small to allow every class to learn enough features. Therefore, the overall accuracy was low. However, this shortcoming can be compensated for by pretraining in similar source domains.

All the layers were fine-tuned to adapt to the tasks in the target dataset. This technique can also serve as a regularizer to prevent overfitting. In the fine-tuning training of this study, all layers of the model were directly trained to enable the model to be applicable to the task classes in the target domain.

The backpropagation algorithm in a CNN passes the loss (i.e., the gap between the predicted value and the real value) backward and computes the gradient of each layer according to the error sent back, thereby updating the parameters of the designated layers.
In freeze training, the designated layers are not involved in the model training; that is, the parameter updates computed by backpropagation after each iteration do not enter the frozen layers, and therefore the weight parameter values in the kernels composed of neurons are not updated. Thus, the feature extraction method of the designated layers remains unchanged until the model training finishes. This prevents the learned knowledge (weights) from being destroyed when the knowledge is transferred, which could otherwise result in a failure to extract effective features. Although not many FER studies have adopted this technique, it has been widely applied in other fields, such as medicine and malware detection.
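The effect of freezing can be demonstrated with a toy gradient-descent loop in which the frozen parameter is simply excluded from the update, so its value is identical before and after training. This is a schematic one-parameter-per-"layer" illustration, not the CNN training used in the study.

```python
import numpy as np

def train(w_frozen, w_free, x, y, lr=0.1, steps=200):
    """Fit y ≈ w_frozen * x + w_free * x**2, updating only the free weight."""
    for _ in range(steps):
        pred = w_frozen * x + w_free * x**2
        err = pred - y
        # Backpropagation would compute gradients for both weights...
        g_free = 2 * np.mean(err * x**2)
        # ...but the frozen weight never receives its update.
        w_free -= lr * g_free
    return w_frozen, w_free

x = np.linspace(-1, 1, 50)
y = 3.0 * x + 2.0 * x**2
w_frozen, w_free = train(w_frozen=3.0, w_free=0.0, x=x, y=y)
print(w_frozen, round(w_free, 4))  # frozen weight is unchanged at 3.0
```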

The model training conducted in this study consisted of two parts. The first part involved freezing until convergence: all transferred layers were frozen and only the classification layer was trained. The second part entailed fine-tuning all layers of the models until they converged. For model training with AffectNet, accuracy not only remained constant for EfficientNet-B0 but also improved for the fine-tuned models, because the models learned features effectively.

The model most affected by the imbalanced data showed the largest differences in accuracy and F1 score, whereas EfficientNet-B0 sustained the least influence of the data and was subject to less influence than the other models across different source domains and transfer learning methods.

This study explored the effects and verified the necessity of using transfer learning in FER. This section discusses the training process and training methods (training types, data preprocessing, and multiple-source-domain training) and compares the results of this study with those of other studies.

First, this study concluded that it is necessary to use transfer learning in FER. Compared with the models trained from scratch, when the target domain was AffectNet, all the models showed increased accuracy: by 3% for ResNet-50, 3% for Xception, 4% for EfficientNet-B0, 4% for Inception, and 2% for DenseNet-121. When the target domain was FER2013, two baselines were adopted for comparison. The first one is the highest accuracy achieved by the model trained from scratch (i.e., the result of model training without class weight balancing).

VOLUME 10, 2022

Class weight balancing improved the model accuracy in the large dataset (AffectNet), with the accuracy for ResNet-50, Xception, EfficientNet-B0, Inception, and DenseNet-121 increasing by 8%, 9%, 5%, 8%, and 8%, respectively. In the small dataset (FER2013), because the amount of data that can be trained on is low and the model paid the same attention to all classes, the overall accuracy was lower. This study also found that because of FER's high requirement for data quality, data augmentation would not increase model accuracy; instead, the accuracy decreased in most of the models. Accordingly, this study suggested only using class weight balancing for transfer learning in FER model training.

Regarding multiple-source-domain training, most of the models pretrained twice in similar domains achieved higher accuracy than those pretrained in a single source domain. As for the training types for transfer learning, with the large dataset (AffectNet), the accuracy of the models trained through freeze + fine-tuning was higher than that of the models trained through fine-tuning only. For the small dataset (FER2013), the models trained through freeze + fine-tuning also exhibited better accuracy than those trained through fine-tuning only, regardless of the source domains. However, we also noticed that when the source domains were similar, using a different training type would not generate distinct results. This may be because the high-level features of the source domain (AffectNet) were similar to those of the tasks in the target domain; therefore, conducting freeze training would not make much difference. Accordingly, this study suggested conducting freeze + fine-tuning training for FER model training on large datasets. On small datasets, although we also suggest using freeze + fine-tuning training, more effort should be placed on pretraining in multiple similar domains to enhance accuracy.

In terms of training speed, with the small dataset (FER2013), both training types accelerated transfer learning, whereas the speed increase with the large dataset (AffectNet) was not significant.

Moreover, the models differed in how accurately they identified individual emotions (Figs. 13-14). When the models were trained with FER2013, they became the most accurate in identifying Happy and least accurate in identifying Fear (Figs. 19-28). Both types of transfer learning facilitated the training process when the models were trained with the small dataset FER2013, but their facilitation became less noticeable when the large dataset AffectNet was used.

As for which model was the most accurate in identifying which emotion class when AffectNet was used for training, Xception achieved the highest identification accuracy for Disgust and Fear; Inception did so for Anger, Neutral, and Surprise; ResNet50 did so for Sadness; EfficientNet-B0 did so for Happy; and DenseNet121 did so for Contempt, as shown in Table 8.

When all models were trained with FER2013, they became the most accurate in identifying Happy but the least accurate in identifying Fear. It is also observed in Table 9 that ResNet50 achieved the highest identification accuracy for Anger and Fear, EfficientNet-B0 did so for Disgust and Surprise, and Inception did so for Neutral. Overall, Xception was more accurate than the other models when trained with AffectNet.

TABLE 8. Comparison between best models tested with AffectNet regarding their accuracy in identifying emotion classes.

TABLE 9. Comparison between best models tested with FER2013 regarding their accuracy in identifying emotion classes.

The optimal training results involved different source domains and training types in transfer learning, depending on the model (Tables 10 and 11).

Researchers who face a lack of software programs can draw on this study to seek appropriate models and transfer learning methods. We analyzed these CNN models.

They are transfer learning models commonly used across different fields: ResNet-50, Xception, EfficientNet-B0, Inception, and DenseNet121, all of which differ in their frameworks. We conducted exhaustive experiments on datasets of different sizes, using different data preprocessing techniques, different training methods, and different models. We first compared the data preprocessing techniques and decided to use class weight training after taking into account the considerable differences in sample size between the emotion classes of FER2013 and the generalizability to datasets large and small. Next, we examined the influence of different training types in transfer learning and concluded that, with ImageNet used as the source domain, all models achieved the highest accuracy when they underwent freeze + fine-tuning, whether with the large dataset AffectNet or the small dataset FER2013 (except for EfficientNet-B0, whose accuracy remained constant, and DenseNet121, which became less accurate than the other models). Regarding the speed of model training, the models were trained faster with freeze + fine-tuning than with re-training or fine-tuning alone, regardless of the size of the dataset used.

Moreover, an observation of a multi-source transfer learning experiment conducted using FER2013 with source domains related to the target domain indicated that the transfer learning methods exerted less influence, and there were no significant differences in training results when these methods were performed using AffectNet as the source domain compared with ImageNet. This finding suggested that using the right source domain can lead to more significant improvement for transfer learning models than using the right training method. However, yielding the optimal training results involves different source domains and training types in transfer learning, depending on the model.

We also determined the best models with respect to the different datasets. Specifically, while all models averaged 59% accuracy when trained using AffectNet, Xception was the most accurate (61%), followed sequentially by Inception (60%), EfficientNet-B0 (59%), ResNet-50 (58%), and DenseNet-121 (57%). All models averaged 69.8% accuracy when trained using FER2013; DenseNet-121 was the most accurate (71%), followed sequentially by ResNet-50, Xception, and EfficientNet-B0 (all achieving 70%), and Inception (68%).

Regarding the future direction of our research, we may attempt to further improve model accuracy by taking into account existing ensemble learning methods. We may also focus on the practical application of the models by building datasets, training the models with the data through transfer learning, and using them for real-time identification.