Thermography for Emotion Recognition Using Deep Learning in Academic Settings: A Review

Understanding students’ emotional states during the learning process is an important aspect of improving learning quality. Measurements of emotion in an academic setting can be performed manually or automatically using a computer. However, developing an emotion recognition method based on an imaging modality that is contactless, harmless, and illumination-independent is challenging. Thermography, as a non-invasive emotion recognition method, can recognize emotional variance during learning by observing the temperature distribution across facial regions. Deep learning models, such as convolutional neural networks (CNNs), can be used to interpret thermograms: CNNs can automatically classify emotion thermograms into several emotional states, such as happiness, anger, sadness, and fear. Despite this promising ability, CNNs have not been widely used in emotion recognition. In this study, we aimed to summarize previous work and progress in emotion recognition in academic settings based on thermography and CNNs. We first discussed previous work on emotion recognition to provide an overview of the available modalities with their advantages and disadvantages. We also discussed the potential of emotion thermography in the academic context by checking whether the available emotion thermal datasets contain any information on the subjects’ educational backgrounds. Emotion classification using the proposed CNN model was described step by step, including an illustration of feature learning. Lastly, we proposed future research directions: developing a representative dataset for academic settings, feeding segmented images, assigning a good kernel, and building a CNN model to improve recognition performance.

temperature changes. These changes can be interpreted as signals to understand the human body and mind.

The human face has been widely chosen as a local area for emotion recognition because, as a part of the body, it is highly responsive to emotions [20]. It can express more than 30 emotional states [21], can be easily recorded, and is naturally exposed to social stimuli [22]. This makes it suitable for a classroom setting, where the face is the most exposed body part during the learning process. Thermal changes in facial regions have also been the dominant focus of research into human affective states, considering that the human face consists of a number of micro-muscle units [23] that cause temperature changes whenever they are activated [24].

Recently, several computer-based methods have been developed to recognize facial expressions through thermograms […] manually and not specifically designed for the education sector. However, so far, no study has focused on developing non-invasive emotional expression thermography using a Deep Neural Network (DNN), especially for the education sector.

Considering the current limited resources, it can be said that work on emotion recognition using facial thermography based on DNNs for the education sector is still at an early stage. Hence, significant effort is required to initiate the development of a reliable non-invasive technology that enables the recognition of emotional expressions for academic purposes. The research can be directed at substantial issues identified during our review to provide a better understanding of the most suitable approach to implement.

In this study, we aimed to review the current progress in emotion expression recognition using Deep Learning (DL) and the use of thermography as a non-invasive approach.
We also highlighted necessary future research directions to improve the accuracy of emotion recognition using thermal imaging and DL in the academic context. The contributions of this study are arranged as follows:

• Section II presents the review strategy for selecting the references used in this paper.
• Section III describes an overview of emotions in academic settings.

• Section IV presents the current measurements of emotions in academic settings.
• Section V presents the state of the art of CNNs as image classifiers in the DNN model for emotion recognition.
• Section VI discusses previous research on emotion classification using the available algorithms and CNN models.

• Section VII proposes recommendations for future work.
• Section VIII summarizes future directions and their challenges to improve accuracy and processing speed.

In this study, we considered articles from journals, con[…] to our goal, leaving 123 studies. Figure 1 shows the distribution of studies based on publication source.

In this research, each paper was classified into one of […]

Academic emotions are defined as emotions experienced by students in a learning environment [33]. Academic emotions strongly correlate with students' achievement in the learning process [34]. Achievement emotions are emotions related to activities or outcomes judged against competency standards [5]. In education, the activities are mostly academic, such as studying, taking exams, doing homework, having class discussions, working on student projects, and succeeding or failing in these activities. Emotions can also be caused by the cognitive load of information and the time taken to process information related to the knowledge-generating aspects of cognitive activities [35]. During the learning process, a student can experience various types of emotion depending on the focus of attention. In addition, emotion can be stimulated by the topic being discussed and can influence students' and teachers' interest and motivation in an academic environment [36]. Lastly, social emotions strongly influence students' engagement during class interactions, as do emotions caused by events outside school, such as problems in the family [37].

Emotions are multicomponent structures that can be differentiated from one another. These structures help us identify the emotions that play a role in learning and teaching, the emotions that should be encouraged or discouraged, and the ways to regulate emotions in educational settings [5].

Emotions consist of multiple components, namely subjective feeling, action tendency, appraisal, motor activity, and a physiological component [38], [39], [40]. Each component is associated with a different function. Subjective feeling is associated with a monitoring function, action tendency with […] written thoughts [49]. Reading and writing activities involve positive and negative emotions, which may cause anxiety [50].

Emotions also appear in daily classroom life. Emotions in the relationship between students and teachers play a central role in supporting learning achievement [51]. Cultural backgrounds may uniquely shape emotions depending on race, ethnicity, and identity during the learning process [52].

The number of AC studies in the education domain has increased moderately since 2010. They are grouped into five categories, namely textual, visual, vocal, physiological, and multimodal channels [17]. The methods used to assess emotional states vary from self-reporting to expert observation [53], [54], [55], [56] […] preferences.

The visual channel also offers several benefits. First, it is naturally exposed. Second, it can be observed visually. Third, it is practical to use. Last, the equipment it requires is affordable. However, noise, image-processing complexity, and privacy concerns remain issues with this modality type.

Being natural, noticeable, accurate, and practically deployable are the advantages of the vocal modality. However, it also has some limitations, such as relying on dialogue-based systems […]

Basically, visible cameras mimic how human eyes work: they are sensitive only to the narrow visible-light range of the electromagnetic spectrum. They collect data from objects through the radiation in the visible spectrum that the objects' surfaces emit or reflect when hit by a light source [77]. This means that without emission from visible light sources, such as the sun or incandescent bulbs, this vision system is generally unable to sense objects.

Thermal cameras, however, are designed to capture infrared radiation, while visible cameras are not. According to Planck's law, every object above absolute zero temperature emits thermal radiation. Most of the emitted radiation falls in the infrared spectrum range (0.9-14 µm) rather than the visible spectrum range (380-780 nm) […]

First, as a non-invasive method, thermography provides a better opportunity to capture actual emotions; wearing a contact-based device may prevent the elicitation of genuine emotions [18]. This makes it suitable for capturing emotions during the learning process. Second, it is a risk-free monitoring system, whereas other measurements, such as sound and magnetic force, can harm our health [103]. Third, it requires only a low-cost thermal camera, which is already available on the market, unlike other methods that require expensive equipment operating in higher-frequency ranges of the electromagnetic spectrum, such as gamma rays, X-rays, and ultraviolet [104]. Last, thermography does not depend on illumination because it relies only on the thermal emission from an object, whereas a visible camera is light-sensitive [105].

In deep learning, a dataset can be processed by a computer for analytic and prediction purposes.
This paper attempts to explore the available datasets of emotion thermography and to identify their relevance to education by investigating the educational backgrounds of the human subjects in these datasets. Table 3 presents the available emotion datasets together with the educational backgrounds of their human subjects.
Table 3 shows that only two datasets contain information on the human subjects' educational backgrounds, namely the USTC-NVIE and KTFE databases. Although all datasets were made for general purposes, these two are the most ready to be applied in the academic context. Comparing the two, we found USTC-NVIE superior to KTFE for several reasons. First, USTC-NVIE represents more general features because it has a greater number of participants: it includes 215 students, while KTFE has only 26. Second, USTC-NVIE covers a single adult age group (17-31 years old), whereas KTFE spans more diverse age groups, ranging from children to adults (12-32 years old). Children are not small adults: unlike that of adults, children's neurological development is still actively progressing [115].

The main goal of feature extraction is to obtain the most relevant information from the original data and represent that information in a lower-dimensional form [116]. For the computational process, when the data to be input to an algorithm are too large and can potentially be reduced, transforming them into a reduced representative set of features is necessary. […] To retain temperature information for data analysis [107], thermogram images are segmented manually into five regions to ensure consistent segmentation, as shown in Figure 5.
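As a minimal sketch of this kind of region-based feature reduction, the snippet below collapses a rectangular patch of a thermogram into a single mean-temperature value. The 4x4 temperature grid and the "forehead" bounding box are hypothetical placeholders, not taken from the study's five-region segmentation.

```python
# Reduce a thermogram region to one representative feature (its mean).
# The thermogram is modeled as a 2-D list of temperatures in Celsius.

def region_mean(thermogram, top, left, height, width):
    """Mean temperature inside a rectangular region of interest."""
    values = [thermogram[r][c]
              for r in range(top, top + height)
              for c in range(left, left + width)]
    return sum(values) / len(values)

# Toy 4x4 "thermogram" (values illustrative only).
frame = [
    [33.0, 33.5, 34.0, 34.2],
    [33.1, 33.6, 34.1, 34.3],
    [32.9, 33.4, 33.9, 34.1],
    [32.8, 33.3, 33.8, 34.0],
]

# Hypothetical ROI: the top-left 2x2 patch (e.g. a "forehead" region).
print(round(region_mean(frame, 0, 0, 2, 2), 2))  # 33.3
```

Each of the five facial regions would contribute one such feature, turning a high-dimensional image into a short feature vector.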

A three-step ANOVA analysis using five statistical parameters was used. The first step is to ensure which sta[…] to analyze which emotional states differ most in each facial sub-region.
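The core of such an analysis can be illustrated with a one-way ANOVA F statistic, sketched here in pure Python. The two emotion groups and their temperature samples are invented for illustration only; a real analysis would use a statistics library and the study's measured data.

```python
# One-way ANOVA F statistic: ratio of between-group variance to
# within-group variance, used to test whether group means differ.

def one_way_anova_f(groups):
    """Return the F statistic for a list of sample groups."""
    k = len(groups)                            # number of groups
    n = sum(len(g) for g in groups)            # total sample count
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares.
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical sub-region temperatures (Celsius) for two emotional states.
happy = [34.1, 34.3, 34.2]
angry = [34.8, 35.0, 34.9]
print(round(one_way_anova_f([happy, angry]), 1))  # 73.5
```

A large F value, as here, indicates that the mean temperatures of the two emotional states differ far more than the variation within each state.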

An Artificial Neural Network (ANN) mimics the physiology and functioning of the human brain. Like a biological neuron, each artificial neuron receives inputs and performs a dot product with weights, plus a bias. A weight describes the strength of the connection between two nodes, whereas a bias is an external value that shifts the network input of the activation function [120]. Nodes are the individual processing units in each layer. Figure 6 illustrates the mathematical model of how an NN operates.
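A single neuron of this kind can be sketched in a few lines: a dot product of inputs with weights, plus a bias, passed through an activation function (sigmoid here; all numeric values are illustrative only).

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes z into (0, 1)

# Illustrative values: z = 0.5*0.8 + (-1.0)*0.2 + 0.1 = 0.3
out = neuron(inputs=[0.5, -1.0], weights=[0.8, 0.2], bias=0.1)
print(round(out, 3))  # 0.574
```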
The activation function defines a linear combination of the input x with respect to neurons and parameters, followed by an element-wise non-linearity. The function also decides whether a neuron is active or inactive based on the weighted sum of its input signals. […] produce a system with more accurate measurement and lower computational cost in the future.

The ability to shift from hand-crafted feature extraction to automatic learning through Neural Networks (NNs) has brought advantages for thermal-to-visible image translation [142], [143], [144] and automated vector extraction for facial emotion recognition [145]. Early work on the implementation of thermal FER in Deep Learning (DL) began in 2014. Table 5 summarizes the studies of thermal FER in DL.

VOLUME 10, 2022

[…] because CNN is a deep network that imitates how the brain processes and recognizes images [150]. CNNs enable feature extraction to learn patterns from high-dimensional inputs. […]

In the convolution operation, the sizes of the stride and padding must be taken into account. Stride is the parameter that determines the steps taken along the horizontal positions followed by the vertical positions. For instance, if the stride size is 2, the kernel steps 2 pixels horizontally and 2 pixels vertically [126]. A smaller stride produces more detailed information retrieval. However, a smaller stride does not always yield better performance. The output dimension will always be smaller than the input dimension, except when the kernel size is 1 × 1 and the stride is 1. Since the output is fed as input to the next layer, border information is progressively lost. To overcome this obstacle, a padding parameter is applied to the input.
Padding is the parameter determining the number of pixels added at each side of the input to manipulate the output dimension of the feature map. By applying padding to all input sides, the output dimension can be kept equal to the input dimension. This allows deeper convolutional layers to be applied, which results in more features being extracted. The padding step may improve DNN performance by allowing the convolution filter to identify true information among the zero values.

The feature map from the convolutional layer is then fed into the pooling layer. The pooling layer comprises one filter with a certain stride size. Whereas the convolutional layer expands the representation, the pooling layer reduces the dimension of the feature map to avoid overfitting. Two pooling operations are commonly used in this layer: max pooling and average pooling. Max pooling selects the maximum value within each window of the feature map, whereas average pooling takes the average value.
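Both operations can be illustrated with non-overlapping 2x2 windows over a small feature map (the values below are arbitrary):

```python
# Max pooling and average pooling over non-overlapping 2x2 windows.

def pool2x2(feature_map, mode="max"):
    pooled = []
    for r in range(0, len(feature_map), 2):
        row = []
        for c in range(0, len(feature_map[0]), 2):
            window = [feature_map[r][c], feature_map[r][c + 1],
                      feature_map[r + 1][c], feature_map[r + 1][c + 1]]
            row.append(max(window) if mode == "max" else sum(window) / 4)
        pooled.append(row)
    return pooled

fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
print(pool2x2(fmap, "max"))      # [[4, 2], [2, 8]]
print(pool2x2(fmap, "average"))  # [[2.5, 1.0], [1.25, 6.5]]
```

Either way, the 4x4 map shrinks to 2x2, halving each spatial dimension.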

CNN layers are commonly followed by a non-linear activation function. The activation function takes a real-valued input and transforms it into a small range, such as [0, 1] or [-1, 1]. The activation function allows NNs to learn non-linear mappings. It works like a switch that decides whether a neuron is activated when provided with certain inputs. The Sigmoid, Tanh, and ReLU activation functions are widely used in DNNs [126].
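The three functions named above can be defined in a few lines: sigmoid squashes into (0, 1), tanh into (-1, 1), and ReLU clips negative inputs to zero.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

# Compare behavior on a negative and a positive input.
for f in (sigmoid, tanh, relu):
    print(f.__name__, round(f(-2.0), 3), round(f(2.0), 3))
```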

In feature learning, CNNs iterate the convolution and max-pooling processes several times to recognize the features of the input. Figure 7 illustrates the convolutional process using facial expression thermograms as the input images. Since each input has three channels (RGB), each kernel also comprises three channel slices, and the number of kernels determines the number of feature maps. […] The training process in which each thermogram is paired with a given emotional-state label is known as supervised learning [153].
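The multi-channel convolution described above can be sketched as follows: the kernel has one 2-D slice per input channel, the per-channel products are summed into a single feature-map value, so N kernels yield N feature maps. The toy image and kernel values are illustrative only.

```python
# One convolution (cross-correlation) step on a 3-channel input.

def conv_at(image, kernel, r, c):
    """Sum of per-channel kernel responses at position (r, c)."""
    total = 0.0
    for ch in range(len(image)):                 # over channels
        for i in range(len(kernel[0])):          # kernel rows
            for j in range(len(kernel[0][0])):   # kernel cols
                total += image[ch][r + i][c + j] * kernel[ch][i][j]
    return total

# Toy 3-channel 3x3 image and one 3-channel 2x2 kernel.
image = [[[1, 0, 2], [0, 1, 0], [2, 0, 1]]] * 3
kernel = [[[1, 0], [0, 1]]] * 3

print(conv_at(image, kernel, 0, 0))  # 3 channels x (1*1 + 1*1) = 6.0
```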

A CNN produces probabilistic values over categories, which decide the type of emotion being displayed in a thermogram. The output categories form an array of numbers between 0 and 1. One common output model is the softmax function, which calculates the probability of an output image over the possible target classes [152].
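The softmax operation turns raw class scores into a probability distribution that sums to 1; the scores below for (happiness, anger, sadness, fear) are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities summing to 1."""
    shifted = [s - max(scores) for s in scores]  # shift for numeric stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for (happiness, anger, sadness, fear).
probs = softmax([2.0, 1.0, 0.1, -1.0])
print([round(p, 3) for p in probs])  # highest probability -> "happiness"
```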

Backpropagation starts from the final layer of a CNN and is only used during the training process. With backpropagation, NNs learn from errors made during training. This process iteratively updates the weights and biases based on the difference between the target output and the predicted output.
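The update loop can be sketched on a single sigmoid neuron with squared error: the gradient of the loss with respect to the weighted sum is found via the chain rule, then the weight and bias move against it. All numeric values are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, target = 1.0, 1.0        # one training example and its label
w, b, lr = 0.0, 0.0, 0.5    # initial parameters and learning rate

for _ in range(100):
    y = sigmoid(w * x + b)            # forward pass
    d = (y - target) * y * (1.0 - y)  # dLoss/dz via the chain rule
    w -= lr * d * x                   # gradient-descent updates
    b -= lr * d

print(round(sigmoid(w * x + b), 2))  # prediction has moved toward 1.0
```

The same pattern, applied layer by layer from the output backward, is what trains a full CNN; optimizers such as SGD vary only in how the computed gradients are applied.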

An optimization algorithm is needed to reduce the loss. Several algorithms have recently been applied as optimizers, such as stochastic gradient descent (SGD) [154], limited-memory BFGS [155], and parallelized SGD [156] […]

FIGURE 8. Visualization of the segmentation process of a facial emotion thermogram; modified from a previous study [151].

The availability of a representative dataset is important for the training process. A good dataset will increase the robustness of training performance. Several factors must be considered when working with a dataset. The first is its quantity: a large number of samples provides more accurate mean values and reduces the margin of error. The second is its quality, described in terms of data reliability and feature representation [159]. […] better learning system. Thermography has been proposed considering its advantages over other computer-based emotion recognition methods. Thermography enables emotions to be interpreted from signals generated by internal physiological activity represented in the thermal distribution.

The thermal distribution over facial regions can be evaluated using computer-assisted technology to measure emotional states. This technology can perform feature extraction automatically to minimize errors. Our review has shown that current NN models have achieved high accuracy rates in emotion recognition classification. Nevertheless, the performance of these NN models still has to be improved.

Further research needs to work toward improved classification of facial emotion thermograms in the academic context. This will require providing representative datasets, preparing suitable ROIs, assigning good kernels, and implementing lightweight models. These objectives will improve performance in terms of computation-time efficiency and increase classification accuracy rates. A suitable thermography-based method can then be proposed for self-evaluation and for monitoring the learning process in the classroom.