An Emotion Recognition Method Based on Eye Movement and Audiovisual Features in MOOC Learning Environment

In recent years, more and more people have begun to use massive open online course (MOOC) platforms for distance learning. However, due to the space–time isolation between teachers and students, the negative emotional states of students in MOOC learning cannot be identified in a timely manner. Therefore, students cannot receive immediate feedback about their emotional states. In order to identify and classify learners' emotions in video learning scenarios, we propose a multimodal emotion recognition method based on eye movement signals, audio signals, and video images. In this method, two novel features are proposed: the feature of coordinate difference of eye movement (FCDE) and the pixel change rate sequence (PCRS). FCDE is extracted by combining the eye movement coordinate trajectory with the video optical flow trajectory and represents the learner's degree of attention. PCRS is extracted from the video images and represents the speed of image switching. A feature extraction network based on a convolutional neural network (CNN), called FE-CNN, is designed to extract the deep features of the three modals. The extracted deep features are input into the emotion classification CNN (EC-CNN) to classify the emotions into interest, happiness, confusion, and boredom. In single modal identification, the recognition accuracies corresponding to the three modals are 64.32%, 74.67%, and 71.88%. The three modals are fused by feature-level, decision-level, and model-level fusion methods, and the evaluation results show that decision-level fusion achieves the highest emotion recognition score of 81.90%. Finally, the effectiveness of the FCDE, FE-CNN, and EC-CNN modules is verified by ablation experiments.

decision-making, learning, or cognitive activities. Emotion recognition is an interdisciplinary issue involving computer science, psychology, and cognitive science. An important application of emotion recognition is online education based on massive open online courses (MOOC). Since the first year of MOOC in 2012 [1], MOOC has developed rapidly. The three major platforms, Coursera, Udacity, and edX, have cooperated with universities around the world, making MOOC very popular. Online MOOC learning platforms have become even more important since the COVID-19 outbreak in 2020, as learners rely more than ever on distance learning for their education. Compared with traditional teaching methods, the core advantage of MOOC is that it is not constrained by time and space, which makes learning more flexible and promotes the sharing of high-quality educational resources. However, MOOC also faces a number of problems, one of the biggest being the high dropout rate. According to the University of Pennsylvania, the average dropout rate of MOOC is as high as 90% [2]. Most current studies have focused on the impact of course quality, course evaluation, and students' messages on students' grades. Only a few studies have focused on the impact of students' emotional states on learning outcomes, especially in online video learning [3]. In fact, students' learning outcomes are affected by their emotional state. Generally speaking, a positive emotional state promotes students' learning outcomes, whereas negative states reduce learners' efficiency. Therefore, emotion plays an important role in MOOC learning [4].
At present, the common modals used for emotion recognition are physiological signals, facial expressions, voice signals, text, and so on. Most physiological signals, such as EEG signals, are collected by wearable sensors, which may cause the participant to react unnaturally. For students who are learning, wearing sensors is distracting and interferes with their learning. It is therefore hard to distinguish whether the emotions are caused by the electrode cap or by the learning materials. An eye tracker can record the characteristics of eye movements while a person is processing visual information. As a nonintrusive device, a screen-based eye tracker does not interfere with the learners' learning experience and is thus suitable for MOOC learning environments. The generation of emotion is closely related to the context of the stimulating material. In the MOOC learning environment, students' emotional state is mainly stimulated by video learning materials. If the speaker has a flat tone, the students may feel uninterested, while a tone that fluctuates appropriately will attract the students' attention and enhance their interest. Monotonous video images will make students feel bored, while fast-changing images will evoke students' interest in the subject matter. Therefore, the audio and visual features of the instructional video, as part of the MOOC learning scenario, are just as important as the eye movement features for inferring the emotional state of students. However, few studies have analyzed students' emotions by combining eye movement signals and audiovisual features.
This article proposes an emotion recognition method based on the fusion of eye movement signals and audiovisual features in a video learning scenario. This method focuses on the audio and video images of learning videos and the eye movement signals generated by learners and integrates them through a variety of methods. The main contributions of this article are given as follows.
1) This study proposes that not only learners' physiological signals, such as eye movement features, but also learning scenario features, such as instructional video features and audio features, should be considered for learner emotion classification in MOOC learning scenarios. The combination of physiological features and scenario features helps to improve the classification accuracy.
The experimental results show that learning scenario features are closely related to the learners' emotional states and can improve the recognition accuracy of the model.

A. Eye Movement-Based Approaches
Research on emotion recognition using eye movement signals has attracted more and more attention in recent years. Li et al. [5] proposed a method that combined eye movement signals with other physiological signals to identify depression and achieved good classification accuracy. The method shows that eye behavior is one of the main features of depression, which has an important reference value for establishing automatic diagnostic systems for clinical applications. Tarnowski et al. [6] evoked people's emotions by presenting 21 dynamic film video clips. The emotion categories are high arousal and low valence; low arousal and moderate valence; and high arousal and high valence. The highest average classification accuracy obtained with eye movement features is 80%. In [7], a method of combining the electrooculogram (EOG) signal and the eye movement signal is proposed. According to the eye movement track, the invalid EOG signal can be removed, which improves both the quality of the EOG signal and the accuracy of classification. Eye movement is used in many fields, but at present, it is mostly used to supplement EEG and EOG signals [8]. In [9], EEG signals and eye movement features were fused by deep canonical correlation analysis (DCCA) to achieve a good recognition effect. It was also found that EEG signals and eye movement signals are complementary to each other in distinguishing positive and negative emotions. In [10], [11], and [12], the stability of EEG and eye movement signals over time was studied. The two types of signals were fused at different levels, and the results showed that the fusion can provide more supplementary information for identifying emotions. For the identification of neutral emotions and fear, eye movement signals have obvious advantages. In addition, in order to obtain high arousal emotion, most experiments use movie clips with strong stimulation [6], [9], [10], [11]. However, the academic emotions induced in MOOC learning scenarios and their intensity are different, which needs further research.

B. Audio-Based Approaches
Speech is a complex signal that contains a variety of information, such as the message to be conveyed and the speaker's language, gender, and emotion. Speech emotion recognition is very important for natural human-computer interaction. In [13], acoustic features are extracted from the speech signal, and the mel frequency cepstral coefficients (MFCCs) are extracted to recognize the speaker's emotion. The average recognition accuracy of this method for happiness, sadness, and anger is about 80%. In [14], a speech emotion recognition model based on an improved deep belief network (DBN) is proposed to enhance the representation ability of speech signals and the accuracy of speech emotion recognition. The model uses ReLU instead of the traditional DBN activation function and takes the short-time energy, short-time zero-crossing rate, fundamental frequency, formants, and MFCCs of the speech signal as the basic features. Using these features as the input, the model can automatically recognize six emotions: anger, fear, joy, calmness, sadness, and surprise, with an average recognition accuracy of about 60%. In [15], a new speech emotion recognition technology based on the combination of deep and shallow neural networks is proposed. A parallel training sample set is established, and the DBN is used to automatically extract and recognize the speech emotional features. The shallow neural network is used to obtain the final recognition results. The five emotions of sadness, surprise, anger, happiness, and neutral were classified, and the average recognition accuracy was 89.8%. At present, the audio modal is widely used in emotion recognition, but it is rarely used as the stimulus signal and combined with subjects' physiological signals.

C. Video-Based Approaches
At present, video is the most common stimulus material used to evoke subjects' emotions. Mao et al. [16] proposed a multimodal local-global attention network (MMLGAN) for affective video content analysis, which extends the attention mechanism to multilevel fusion and includes a multimodal fusion unit to obtain a global representation of affective video. The effectiveness of the method was demonstrated on public datasets. In [17], a variety of machine learning algorithms and neural network models are used to fuse video features and EEG signals. The results show that the video emotion classification accuracy reaches 96.79% for valence (positive/negative) and 97.79% for arousal (high/low). In [18], audiovisual features were obtained by a 3-D CNN and then fused into DBNs. This method performs well on three common datasets. From the above research, it can be seen that movie clips are currently selected in most studies, but learning videos in MOOC are seldom studied as stimulating materials.
The rest of this article is arranged as follows. Section III introduces the data collection experiment in the MOOC learning scenarios. Section IV introduces the process of data preprocessing and feature extraction, including FCDE and PCRS. In Section V, the single modal and multimodal emotion classification experiments are introduced and the experimental results are analyzed. Section VI draws the conclusion.

III. DATA COLLECTION
This section describes the data collection experiment related to this article. In the experiment, we selected four instructional videos of different types as stimulating materials. The eye tracking device is a Tobii TX300, which includes an eye movement module and a 23-in display module, and the sampling frequency is 60 Hz. An HP desktop computer connected to the eye tracking device is used to control the experimental procedure and record the data. The 68 subjects are all college students aged 20-23, with a male-to-female ratio of 1:1.
The experiment was carried out in a laboratory environment with constant brightness. The eye tracker was calibrated for each individual subject. Then, the subjects were asked to stare at a cross in the center of the screen for 30 s to obtain the baseline value of the pupil diameter. While learning from the video, the subjects could press the corresponding button on the keyboard to mark the moment when they felt bored, interested, happy, or confused. After learning, the subjects were asked to review the learning videos and the videos of their facial expressions recorded during the study and to expand each marked point into an emotional event.
Each emotional event has a pair of start and end points and an emotional intensity value on a five-level scale, from strong (A5) to weak (A1).
After excluding the subjects whose pupil calibration accuracy is less than 80% and the subjects whose data loss rate is more than 25%, the eye movement data of 59 subjects are finally selected. The subjects' eye movement data and the synchronized audio and video image data of the instructional videos are extracted to build the datasets for this study.

IV. DATA PREPROCESSING AND FEATURE EXTRACTION

A. Data Preprocessing
There might be a deviation at the start time and the end time of the emotional events labeled by the subjects during the review, so the first 30 frames and the last 30 frames of each interval are removed. The characteristics of the data with weak emotional intensity are not obvious enough, so we extracted the data with emotional intensity between A3 and A5 to build the dataset for this experiment.
The most commonly used indicators of eye movement signals are pupil diameter and eye movement coordinates (also called gaze points), which are the pixel coordinates of the eye on the screen calculated by the Pupil-CR technology [19]. The common eye states are eye blinking, fixation, saccade, and so on. During the data acquisition process, pupil diameter and eye movement coordinate data might be lost due to eye blinking, excessive head movements, and low calibration accuracy. Next, the methods of data completion for missing eye state, pupil diameter, and gaze points are introduced.
Pupil diameter is a kind of physiological information and, like other physiological signals, changes continuously over time. In order to avoid changing the data structure and to reduce the standard deviation of the data, the linear interpolation method is used to complete the lost pupil diameter data, as in the following equation:

$$\frac{x - x_0}{y - y_0} = \frac{x_1 - x_0}{y_1 - y_0} \tag{1}$$

where $(x_0, x_1)$ is a known pupil diameter pair, $(y_0, y_1)$ is the known time pair corresponding to $(x_0, x_1)$, $x$ is the lost pupil diameter data, and $y$ is the time corresponding to the lost pupil data. To find the missing value $x$, Formula (1) is transformed to get the following equation:

$$x = x_0 + (y - y_0)\,\frac{x_1 - x_0}{y_1 - y_0} \tag{2}$$

Gaze points are not physiological information, and the coordinate changes are all controlled by the subjects. The change of gaze points has no obvious rule, and the distribution of gaze points is not continuous. Thus, linear interpolation is not suitable for the missing gaze points. Instead, the average interpolation method is used to supplement the missing gaze points, as in Formulas (3) and (4), where $n$ is the number of missing values, $i$ denotes the $i$th missing value ($i = 1, 2, \ldots, n$), $z$ is the difference between a known value and an adjacent missing value, $(gp_{x0}, gp_{y0})$ is the eye movement coordinate before the first missing value, $(gp_{x1}, gp_{y1})$ is the eye movement coordinate after the last missing value, and $(gp_x, gp_y)$ is the missing eye movement coordinate data. The eye movement coordinate is the position of the subject's gaze on the screen.

An eye state persists over an interval; for example, during a fixation the subject gazes at one location for a certain time and the eye state remains unchanged. Therefore, lost eye state data are copied from the previous frame.
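The three completion rules described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: missing samples are assumed to be marked with NaN (or None for eye states), and the gaze step $z$ is assumed to be the gap between the two known neighbours divided by $n + 1$, so that the $n$ missing points are evenly spaced. The function names are hypothetical.

```python
import numpy as np

def fill_pupil_linear(times, pupil):
    """Fill NaN pupil diameters by linear interpolation over time (Formula (2))."""
    times = np.asarray(times, dtype=float)
    pupil = np.asarray(pupil, dtype=float).copy()
    missing = np.isnan(pupil)
    # np.interp evaluates the straight line through the nearest known samples.
    pupil[missing] = np.interp(times[missing], times[~missing], pupil[~missing])
    return pupil

def fill_gaze_average(gaze):
    """Fill each run of NaN gaze points with equal steps between its known neighbours.

    Assumption: step z = (point after run - point before run) / (n + 1).
    """
    gaze = np.asarray(gaze, dtype=float).copy()
    missing = np.isnan(gaze).any(axis=1)
    i = 0
    while i < len(gaze):
        if not missing[i]:
            i += 1
            continue
        start = i
        while i < len(gaze) and missing[i]:
            i += 1
        end = i  # index of the first known point after the run
        if start == 0 or end == len(gaze):
            continue  # no known neighbour on one side: leave the run untouched
        n = end - start
        before, after = gaze[start - 1], gaze[end]
        z = (after - before) / (n + 1)
        for k in range(1, n + 1):
            gaze[start + k - 1] = before + k * z
    return gaze

def fill_state_previous(states):
    """Copy the previous frame's eye state into missing (None) entries."""
    filled = list(states)
    for k in range(1, len(filled)):
        if filled[k] is None:
            filled[k] = filled[k - 1]
    return filled
```

For instance, `fill_state_previous(["fixation", None, "saccade"])` carries the fixation label into the second frame, which matches the copy-from-previous-frame rule stated in the text.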
Eye movement signals are physiological signals, and the baseline values of physiological signals in a calm state are different for each individual [20], [21], [22]. Therefore, we calculated the pupil diameter baseline value of the 59 subjects in a calm state, as shown in Fig. 1. As can be seen from the figure, the pupil diameter of different subjects varied greatly. The largest pupil diameter baseline value is 1.72 times the smallest. Therefore, in order to eliminate individual differences in pupil diameter [23], the individual pupil diameter baseline value is subtracted from the pupil diameter of each subject, as in the following equation:

$$x = x_0 - \frac{1}{n}\sum_{i=1}^{n} x_i \tag{5}$$

where $x$ is the baseline-corrected pupil diameter, $x_0$ is the original pupil diameter, $x_i$ is the pupil diameter in frame $i$ while watching the cross, and $n$ is the number of data frames in the process of watching the cross; thus, the average pupil diameter $\sum_{i=1}^{n} x_i / n$ is used as the individual pupil diameter baseline value.
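As a minimal sketch of the baseline correction, the following Python function subtracts a subject's mean pupil diameter during the cross-fixation period from that subject's recorded pupil diameters. The function name and argument layout are illustrative assumptions.

```python
import numpy as np

def remove_pupil_baseline(pupil, baseline_frames):
    """Subtract the subject's calm-state baseline from the pupil diameter.

    pupil: pupil diameters recorded while the subject watches the video.
    baseline_frames: pupil diameters recorded while the subject watches the cross.
    """
    baseline = float(np.mean(baseline_frames))  # individual baseline value
    return np.asarray(pupil, dtype=float) - baseline
```

Because the baseline is computed per subject, the function would be called once for each subject's recording rather than on the pooled data.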

B. Feature Extraction of Eye Movement Signal
Wang et al. [24] studied intention recognition through pupil diameter, saccade amplitude, and fixation time, and found that the more eye movement indicators there were, the higher the recognition accuracy would be, and that the position characteristics of fixation points had a great influence on the accuracy. Therefore, we extract statistical features from the four indexes of fixation, saccade, eye blink, and pupil diameter as a part of the eye movement signal features. Correlation analysis shows that these features are highly correlated with emotional state. In addition to the above features, this article proposes the FCDE feature. Eye movement coordinates alone cannot provide enough effective information, so FCDE is extracted by combining these data with the corresponding video data. In terms of the different states of eye movement, FCDE is divided into two features, $\mathrm{FCDE}_s$ and $\mathrm{FCDE}_f$. $\mathrm{FCDE}_s$ is the FCDE in a saccade process, and $\mathrm{FCDE}_f$ is the FCDE in fixation.

$\mathrm{FCDE}_s$ is calculated with Formula (6). In a saccade process, the eye movement coordinate of the first frame $(x_{p1}, y_{p1})$ is taken as the starting point of the eye movement trajectory. The point at the same position as $(x_{p1}, y_{p1})$ in the video image is called the feature point (also known as the corner point), denoted by $(x_{v1}, y_{v1})$. The trajectory of the saccade is obtained by successively connecting the eye movement coordinate points in consecutive frames. The trajectory of the feature point in the image is obtained by successively connecting the feature point positions. The new coordinate of the feature point in the next video frame $(x_{vi}, y_{vi})$ can be obtained by using the optical flow method [25]. From frame $i = 1$ to $n$, the distances between every pair of corresponding coordinate points of the two trajectories are calculated and summed, and their mean value is taken:

$$\mathrm{FCDE}_s = \frac{1}{n}\sum_{i=1}^{n}\sqrt{(x_{pi} - x_{vi})^2 + (y_{pi} - y_{vi})^2} \tag{6}$$

where $n$ indicates the number of frames sampled during the saccade. Two examples of the feature point trajectory and the eye movement coordinate trajectory are shown in Fig. 2. In Fig. 2(b), in a state of confusion, the two tracks are in the same direction, which means that the subject still focuses on a particular object on the screen. However, in Fig. 2(a), in a state of boredom, the two tracks start in the same direction, but then there is a large deviation, which indicates that the subject's attention has drifted away from the particular object.
$\mathrm{FCDE}_f$ is calculated with Formula (7). During fixation, the eye movement coordinate $(x, y)$ remains the same. Taking this coordinate as the feature point of the first frame in the video image, the new coordinate of the feature point in the next video frame $(x_{vi}, y_{vi})$ is obtained by the optical flow method, as in the saccade process. The distances between the new coordinates and the gaze point are then summed, and the average value is calculated:

$$\mathrm{FCDE}_f = \frac{1}{n}\sum_{i=1}^{n}\sqrt{(x - x_{vi})^2 + (y - y_{vi})^2} \tag{7}$$

Two examples of the gaze point trajectory and the feature point trajectory under the fixation state are shown in Fig. 3. Gaze points are red and feature points are green. In Fig. 3(a), in a state of boredom, the trajectory of feature points has a large deviation from the trajectory of gaze points. In Fig. 3(b), in a state of interest, the feature point trajectory is very close to the trajectory of gaze points, and the overlapping part of the two trajectories is yellow.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
In Formulas (6) and (7), $(x, y)$ is the original coordinate of eye movement, and $(x_{pi}, y_{pi})$ and $(x_{vi}, y_{vi})$ are the coordinates of eye movement in frame $i$ and the feature point coordinates in frame $i$, respectively. The smaller $\mathrm{FCDE}_s$ or $\mathrm{FCDE}_f$ is, the more similar the trajectory of gaze points and the trajectory of feature points are, which means that the subject's attention is more concentrated. If there is more than one fixation or saccade process in a sample window, the average value of each of $\mathrm{FCDE}_s$ and $\mathrm{FCDE}_f$ over these processes is taken as the value of FCDE.
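The two FCDE variants can be sketched in Python as follows. The sketch assumes that the feature point trajectory has already been tracked across frames by an optical flow method and is passed in as an array, and that the distance between corresponding points is Euclidean. The function names are hypothetical.

```python
import numpy as np

def fcde_saccade(gaze_track, feature_track):
    """Mean distance between gaze points and tracked feature points during a saccade.

    gaze_track, feature_track: arrays of shape (n, 2), one (x, y) pair per frame.
    """
    gaze_track = np.asarray(gaze_track, dtype=float)
    feature_track = np.asarray(feature_track, dtype=float)
    distances = np.linalg.norm(gaze_track - feature_track, axis=1)
    return float(distances.mean())

def fcde_fixation(fixation_point, feature_track):
    """Mean distance between a fixed gaze point and the tracked feature points."""
    fixation_point = np.asarray(fixation_point, dtype=float)
    feature_track = np.asarray(feature_track, dtype=float)
    distances = np.linalg.norm(feature_track - fixation_point, axis=1)
    return float(distances.mean())
```

A small value from either function means that the gaze and the tracked image content move together, which the text interprets as concentrated attention.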
The effectiveness of FCDE features is verified from two aspects: correlation and recognition accuracy. Table II lists the correlation coefficients between FCDE features and emotional state, as well as the average correlation coefficients between other eye movement features and emotional state. From this table, we can see that the correlation between FCDE features and emotional state is much higher than the average correlation between other eye movement features and emotional state. This indicates that FCDE features, compared with most other features, are more suitable for emotion classification in learning scenarios with video.

C. Video Feature Extraction
Instructional video is used as stimulus material to induce the learners' emotional states, so audio and image features are closely related to the emotional state of learners. We extract the features of the audio and image modals from the instructional video for further emotion classification. For audio feature extraction, the mel frequency cepstral coefficients (MFCCs) have good robustness and accuracy in emotion recognition [26], so for each frame, we extract the MFCCs of the audio and calculate their first-order difference, which further reflects the dynamic characteristics of the audio.
Using the method in [27], for each frame, we obtained the 13 MFCCs from the audio spectrum through a set of mel filters using different parameters. Because the characteristics of the signal are difficult to observe in the time domain, the Fourier transform is used to transfer the signal to the frequency domain. The spectrum is smoothed and the influence of harmonics is eliminated by the mel filters. The MFCCs are then obtained by the discrete cosine transform (DCT), and their first-order differences are calculated. For a given time window, a total of 15 statistical features are extracted for each coefficient: the maximum, minimum, mean, range, median, standard deviation, and variance of the MFCC, the same statistical values of the first-order difference of the MFCC, and the kurtosis factor. A total of 195 features (13 × 15) are thus extracted from the 13 MFCCs.
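The step from per-frame MFCCs to the 195 window-level statistics can be sketched as follows. The sketch takes an already computed MFCC matrix as input and makes three assumptions that the text does not settle: the first-order difference is a plain frame-to-frame difference, the kurtosis factor is computed on the MFCC itself (not on its difference), and kurtosis is the non-excess population form. All function names are hypothetical.

```python
import numpy as np

def _seven_stats(x):
    """Maximum, minimum, mean, range, median, standard deviation, variance."""
    return [x.max(), x.min(), x.mean(), x.max() - x.min(),
            np.median(x), x.std(), x.var()]

def kurtosis(x):
    """Population kurtosis m4 / m2^2 (assumed definition of the kurtosis factor)."""
    centered = x - x.mean()
    variance = np.mean(centered ** 2)
    return float(np.mean(centered ** 4) / variance ** 2) if variance > 0 else 0.0

def mfcc_window_features(mfcc):
    """Map an (n_frames, 13) MFCC matrix for one window to a 195-dim vector."""
    mfcc = np.asarray(mfcc, dtype=float)
    delta = np.diff(mfcc, axis=0)  # assumed form of the first-order difference
    features = []
    for c in range(mfcc.shape[1]):
        features.extend(_seven_stats(mfcc[:, c]))   # 7 statistics of the MFCC
        features.extend(_seven_stats(delta[:, c]))  # 7 statistics of its difference
        features.append(kurtosis(mfcc[:, c]))       # 1 kurtosis factor
    return np.array(features)
```

With 13 coefficients and 15 statistics each, the output length is 13 × 15 = 195, which matches the feature count given in the text.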
For video image feature extraction, PCRS, which is based on the sequence of pixel change rates, is proposed. Most studies use exciting movie clips to induce the subjects' emotions, in which the high-speed switching of scenes and the fast movement of targets can elicit strong emotion. However, there are few high-speed scene switches and fast-moving targets in instructional videos. The PCRS feature is assumed to be related to learners' emotional arousal. PCRS is calculated with Formulas (8) and (9). $Z$ in Formula (8) represents the level of pixel change between two adjacent frames, and $C$ in Formula (9) indicates how fast the images are switched in a time window:

$$Z = \frac{1}{n}\sum_{i}\sum_{j}\left|A_t(i, j)\right|, \quad A_t = X_t - X_{t-1} \tag{8}$$

In Formula (8), $A_t$ is the difference matrix between two adjacent grayscale images, and the absolute value of each of its entries is taken. $X_t$ represents the pixel gray value matrix in frame $t$, $n$ is the total number of pixels in the grayscale image, and $X_t(i, j)$ is the element in row $i$ and column $j$ of the matrix, which is a pixel gray value. $Z$ is obtained by summing the absolute values of all the elements in the matrix $A_t$ and dividing the sum by $n$, so $Z$ represents the pixel change level between the two images. The value of $Z$ is calculated for every two adjacent images in the window to get a sequence of pixel change levels, where $u$ is the number of image frames in the window. This sequence reflects the change of the video image. Usually, a slowly changing sequence makes people feel dull or bored, and a fast-changing sequence causes higher emotional arousal. We also calculate the total number of changed pixels for adjacent images in a sampling window, which is represented by $C$ in the following equation:

$$C = \sum_{t=2}^{m} R * A_t \tag{9}$$

In Formula (9), $A_t$ is the difference matrix between two adjacent grayscale images, $m$ is the number of grayscale images in a sampling window, and $R * A_t$ is the number of nonzero elements in the matrix $A_t$. Therefore, the greater the value of $C$, the greater the pixel change of the image in the time window. A total of seven video image features are used: the mean, maximum, minimum, median, standard deviation, and variance of the sequence $Z$, and the number of nonzero pixels $C$.
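The PCRS computation for one window can be sketched as follows. The sketch assumes that the window is given as a stack of grayscale frames with shape (u, height, width); the function name is hypothetical. Frames are converted to floating point before subtraction so that 8-bit images do not wrap around.

```python
import numpy as np

def pcrs_features(frames):
    """Compute the Z sequence, the count C, and the seven PCRS features.

    frames: array of shape (u, H, W) holding the grayscale frames of one window.
    """
    frames = np.asarray(frames, dtype=float)      # avoid uint8 wraparound
    diffs = np.abs(frames[1:] - frames[:-1])      # |A_t| for each adjacent pair
    z = diffs.mean(axis=(1, 2))                   # sum of |A_t| divided by pixel count
    c = int(np.count_nonzero(diffs))              # total number of changed pixels
    stats = [z.mean(), z.max(), z.min(), np.median(z), z.std(), z.var()]
    return z, c, np.array(stats + [c], dtype=float)
```

A window of `u` frames yields `u - 1` adjacent pairs, so the returned `z` has one entry fewer than the number of frames.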

D. Feature Dimension Reduction
Principal component analysis (PCA) is used to reduce feature dimensions and eliminate redundant information. In order to keep the test set from interfering with the training set, the data of 12 subjects randomly selected from the 59 subjects are used as the test set. First, dimension reduction is carried out on the data of the remaining 47 subjects and the feature matrix is obtained. Then, the same feature matrix is used to reduce the dimension of the test set. Fig. 4 shows the contribution rate and accumulated contribution rate of the principal components of the eye movement signal features [Fig. 4(a)], the MFCC features [Fig. 4(b)], and the video image features [Fig. 4(c)]. Since the features in the same modal are highly correlated, we reduced the dimensions of the three modals by retaining the principal components whose contribution rate is greater than 1%. Finally, 19 principal components are retained from the 52 eye movement features, 19 from the 195 audio features, and three from the seven video image features.
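The train-then-apply procedure can be sketched with NumPy as follows. This is a minimal illustration, assuming that the features are only mean-centered (the text does not say whether they are also standardized) and that the contribution rate is the explained variance ratio. The function names are hypothetical.

```python
import numpy as np

def fit_pca(train, min_ratio=0.01):
    """Fit PCA on the training subjects and keep components above min_ratio."""
    train = np.asarray(train, dtype=float)
    mean = train.mean(axis=0)
    # Rows of vt are the principal directions; s**2 is proportional to variance.
    _, s, vt = np.linalg.svd(train - mean, full_matrices=False)
    ratio = s ** 2 / np.sum(s ** 2)     # contribution rate of each component
    keep = ratio > min_ratio
    return mean, vt[keep], ratio

def apply_pca(data, mean, components):
    """Project data with the mean and components learned from the training set."""
    return (np.asarray(data, dtype=float) - mean) @ components.T
```

The test subjects are projected with `apply_pca` using the training mean and components, so no statistic of the test set enters the fitted transform.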

V. EXPERIMENT
This section includes a single modal emotion recognition experiment and a multimodal emotion recognition experiment. Shen et al. [28] found that time windows of different lengths had a great impact on experimental results, so we determined the optimal window through experiments. In [29], it was found that the best window for emotion recognition from eye movement signals is 2-3 s. Zhang et al. [30] used an audio dataset with a time window of 3 s to identify five emotional states with an accuracy of 87.8%. In [31], a generative adversarial network (GAN) was used to generate 3-5-s speech samples to enhance the dataset, which improved the speech emotion recognition accuracy by 5%. Therefore, the dataset is divided with time windows of 2, 3, and 5 s in this study. The experiment uses the PyTorch open-source framework to train the network on Windows 10 with an NVIDIA GeForce RTX 2080 GPU. In the parameter settings, Adam is selected as the optimizer, the learning rate is set to 0.005, the batch size is 32, and the dropout is 0.5. In the single modal experiment, the optimal window size is selected from the three time window sizes, and the features of the three modals under the optimal window are analyzed further. In the multimodal experiment, the best fusion strategy is selected from three methods: feature-level fusion, decision-level fusion, and model-level fusion.
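The division of a labeled emotional event into fixed-length windows can be sketched as follows. The sketch combines the 30-frame trimming from Section IV-A with the 60-Hz sampling frequency from Section III, and it assumes non-overlapping windows with any incomplete trailing window discarded; the text does not state whether windows overlap. The function name is hypothetical.

```python
import numpy as np

def split_into_windows(event, window_s, rate_hz=60, trim=30):
    """Trim an emotional event and cut it into non-overlapping windows.

    event: per-frame samples of one labeled emotional event.
    window_s: window length in seconds (2, 3, or 5 in this study).
    """
    event = np.asarray(event)
    trimmed = event[trim:len(event) - trim]   # drop the first and last 30 frames
    size = int(window_s * rate_hz)            # frames per window
    count = len(trimmed) // size              # incomplete tail is discarded
    return [trimmed[k * size:(k + 1) * size] for k in range(count)]
```

For a 600-frame event, trimming leaves 540 frames, which gives four 2-s windows, three 3-s windows, or one 5-s window.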

A. Single Modal Emotion Recognition
1) Construction of FE-CNN and EC-CNN:
Single modal emotion classification uses the combination of shallow features and deep features to classify the emotional state. The specific emotional categories include interest, happiness, boredom, and confusion. The original features after PCA dimension reduction are shallow features, which contain more detailed information but have less semantic information and higher noise. The deep feature is the feature vector extracted from the original features by the FE-CNN network. Compared with shallow features, deep features are more abstract, can represent the internal relations between different features, and contain more latent information. The FE-CNN network parameters are shown in Table III. Inspired by the residual idea of the ResNet network, in the fusion of shallow features and deep features, after each extraction of deep features, the shallow features are combined with the deep features by addition and concatenation, so that the original shallow features are retained while the deep features are extracted. In this way, the shallow features and deep features are fully extracted and fused to provide high-quality input for the classifier. The FE-CNN network is only used for feature extraction, without feature reduction or emotion classification.
Fig. 5 shows the process of fusion using original features and deep features. Taking eye movement features as an

In this experiment, a convolutional neural network called the emotion classification CNN (EC-CNN) is designed for classification. Because the shallow features in the feature sequence are obtained by PCA dimension reduction, which removes the redundant information, and the deep features have been extracted twice by FE-CNN, the classifier does not need many network layers. The EC-CNN model has four convolution layers, and a BatchNorm1d layer and a ReLU activation function are added between convolution layers. The parameters of EC-CNN are shown in Table IV.
2) Experimental Results and Evaluation: As mentioned before, 2, 3, and 5 s are used as time window sizes for the data segmentation of the three modals. Cross validation is suitable for small sample datasets, so we use tenfold cross validation as the evaluation method [32]. The best time window size is determined based on the average recognition accuracy on the test set. Table V shows the recognition accuracy of the three modals with four different machine learning models and three different time window sizes. The results show that the accuracy of emotion recognition using only audio or video images is higher than that using only eye movement signals, which confirms that the audio and video image features of MOOC videos are closely related to the emotions generated by learners. Combining individual eye movement features with environmental features can further improve the accuracy of emotion recognition. This will be discussed in Section V-B.
In terms of modal, the audio signal has the highest recognition accuracy. In terms of time window size, the eye movement signal and the video image achieve the highest accuracy in the 3-s window. The recognition effect of the eye movement signal is particularly prominent in the 3-s window. The audio signal has the best result in the 2-s window, but its recognition accuracy of 74.94% is very close to the 74.67% obtained in the 3-s window, so we chose 3 s as the time window size to build the dataset of all modals.
For each modal, ten models were obtained from the tenfold cross validation and sorted according to their recognition accuracy on the test set, and the fifth model, in the middle of the ranking, was selected for analysis. We choose the model whose accuracy ranks in the middle so that the reported results are more representative of typical performance [33].
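The fold construction and the choice of the middle-ranked model can be sketched as follows. The sketch assumes a random sample-wise split into ten folds and a ranking in descending order of accuracy; the text does not specify either detail. The function names are hypothetical, and the accuracies in the usage note are toy values, not results from this study.

```python
import numpy as np

def tenfold_indices(n_samples, n_folds=10, seed=0):
    """Randomly partition sample indices into n_folds nearly equal folds."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    return np.array_split(order, n_folds)

def pick_middle_model(accuracies, rank=5):
    """Return the index of the model ranked `rank`-th by accuracy (descending)."""
    ranked = np.argsort(accuracies)[::-1]
    return int(ranked[rank - 1])
```

For example, with toy accuracies `[0.70, 0.72, 0.64, 0.75, 0.69, 0.71, 0.66, 0.73, 0.68, 0.74]`, the fifth-ranked model is the one at index 5, with accuracy 0.71.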
Fig. 6 shows the confusion matrices of the classification results obtained with the eye movement signal, the audio signal, and the video image using the selected model. Table VI shows the precision (P), recall (R), specificity (S), and F1 score of each modal. As can be seen from Fig. 6 and Table VI, in all three modals the classification accuracy and the other evaluation results for boredom are very high, so boredom is the easiest to identify compared with interest, happiness, and confusion. In the audio and video image modals, confusion and happiness also have high recognition accuracy. However, the accuracy for interest is low in all three modals. In the eye movement modal, interest is easily misidentified as happiness or confusion, and in the audio and video image modals, interest is mostly misidentified as happiness.
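For reference, the four per-class metrics can be derived from a confusion matrix with the standard one-vs-rest definitions, as in the sketch below. The matrix values are illustrative and are not the values in Fig. 6; the class order is an assumption.

```python
EMOTIONS = ["interest", "happiness", "confusion", "boredom"]  # assumed order


def per_class_metrics(cm):
    """Compute precision, recall, specificity, and F1 for each class.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    metrics = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append({"precision": precision, "recall": recall,
                        "specificity": specificity, "f1": f1})
    return metrics


# Toy confusion matrix (rows: true class, columns: predicted class).
toy_cm = [
    [5, 3, 2, 0],
    [1, 8, 1, 0],
    [1, 0, 9, 0],
    [0, 0, 0, 10],
]
toy_metrics = per_class_metrics(toy_cm)
```

In this toy matrix, half of the interest samples are predicted as another class, so the recall for interest is 0.5 while boredom scores 1.0 on every metric.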
As can be seen from Table VI, the F1 scores of happiness, confusion, and boredom in the audio and video image modals are all high, indicating that these two modals are effective in recognizing happiness, confusion, and boredom. In addition, the precision and recall of these three emotions in the audio and video image modals are all greater than 0.9, and in the eye movement modal, the precision and recall of confusion and boredom are also high. This indicates that the model has a strong ability to classify confusion and boredom. Compared with the other three emotions, the precision and recall of interest in the three modals are relatively low, indicating that interest is easily misidentified as other emotions by this model.

Fig. 7 shows the receiver operating characteristic (ROC) curves of the three modals. As can be seen from Fig. 7, the average area under the ROC curve (AUC) of each of the three modals is greater than 0.8, which indicates that the three modals are relatively stable and have strong generalization ability. Each modal also has between one and three per-emotion AUC values greater than 0.9, which means that each modal recognizes one or several emotions particularly well. The average AUC of the audio modal is the largest, while that of the eye movement modal is the smallest. The AUC values of boredom in the three modals are all greater than 0.9. The AUC values of confusion and happiness in the audio modal are both greater than 0.9, and their AUC values in the video image modal are also higher than 0.8. Consistent with the confusion matrices, the recognition of boredom is the most stable. Confusion and happiness also show good performance with audio signals and video images.
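A per-emotion AUC of the kind reported in Fig. 7 can be computed one-vs-rest from class scores, as in the sketch below. It uses the rank interpretation of AUC (the probability that a positive sample scores higher than a negative one, with ties counted as one half). Whether the paper's average AUC is a macro average of such values is an assumption; the function names are hypothetical.

```python
def binary_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(score_rows, true_classes, n_classes):
    """Per-class AUC: class c against all other classes."""
    return [
        binary_auc([row[c] for row in score_rows],
                   [1 if t == c else 0 for t in true_classes])
        for c in range(n_classes)
    ]


def macro_average(values):
    """Unweighted mean of the per-class AUC values."""
    return sum(values) / len(values)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a small dataset but would be replaced by a sort-based computation for larger ones.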
Pupil diameter is a very important index in the eye movement signal, so we further analyze the original pupil diameter data. Fig. 8 compares the pupil diameter in the state pairs confusion-interest, happiness-interest, and confusion-boredom. The pupil diameter data of each frame in an emotional state pair are plotted with the pupil diameter of the left eye on the vertical axis and the pupil diameter of the right eye on the horizontal axis. The overlap of the two colors in the figure is the overlap of the pupil diameters of the two emotions. As can be seen from Fig. 8(a) and (b), the pupil diameter distribution of the interest state overlaps considerably with those of the confusion and happiness states. This is consistent with the confusion matrix, where interest is easily misidentified as happiness or confusion.
According to Fig. 8(a)-(c), the pupil diameter distribution under boredom has less overlap with the pupil diameter distributions of the other three emotions. This is consistent with the previous analysis, in which boredom performs well on every evaluation index.
Both audio signal features and video image features belong to the video content. In these two modals, a large part of "interest" is misidentified as "happiness." In terms of stimulus material, a video that makes people feel happy will usually also be interesting, but a video that makes people feel interested will not necessarily make them feel happy and may be confusing. The confusion matrix also confirms this point. The emotion of interest is most easily misidentified as happiness, followed by confusion, while only a small part of happiness is misidentified as interest.
3) Computational Complexity Analysis: The recognition accuracy of the deep learning model is significantly better than that of the machine learning models. Therefore, we analyze the computational complexity of the deep learning model. We compared floating-point operations (FLOPs), memory access cost (MAC), and the number of parameters (NP) between the FE-CNN+EC-CNN model and the ResNet18 model. FLOPs is the number of floating-point operations in one training turn and is used to measure the time complexity of the model. One MFLOP equals one million FLOPs. MAC is used to evaluate the memory usage of the model at runtime. NP is the total number of parameters inside the model and is used to measure the size of the model.
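As a rough illustration of how NP and the operation count arise for a single one-dimensional convolution layer, the following sketch applies the standard closed-form expressions. The layer sizes are hypothetical (the actual values are in the paper's tables), and the figures are hand calculations, not the output of the tool used by the authors. Conventions differ on whether one multiply-add counts as one or two FLOPs, so the sketch reports multiply-adds.

```python
def conv1d_cost(c_in, c_out, kernel, length_in, stride=1, padding=0, bias=True):
    """Parameter count and multiply-add count for one 1-D convolution layer."""
    length_out = (length_in + 2 * padding - kernel) // stride + 1
    # Each output channel has c_in * kernel weights plus an optional bias.
    params = c_out * (c_in * kernel + (1 if bias else 0))
    # Each output element needs c_in * kernel multiply-adds.
    mult_adds = c_out * length_out * c_in * kernel
    return {"length_out": length_out, "params": params, "mult_adds": mult_adds}


# Hypothetical layer: 1 input channel, 16 output channels, kernel size 3,
# input length 41, padding 1.
toy_layer = conv1d_cost(c_in=1, c_out=16, kernel=3, length_in=41, padding=1)
```

Summing these quantities over all layers gives a first-order estimate of model size and time complexity; memory access cost additionally depends on the sizes of the input and output feature maps.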

B. Multimodal Fusion Emotion Recognition
The multimodal fusion method is the core of multimodal emotion recognition. The common fusion methods include feature-level fusion, decision-level fusion, and model-level fusion [34]. We studied these three fusion methods to determine the best fusion model. Fig. 9 shows the three fusion methods, where IDSF denotes the process of integrating deep and shallow features shown in Fig. 5.
1) Feature-Level Fusion: The feature-level fusion is shown in Fig. 9(a). After feature extraction, the principal components of the three modals are obtained. Concatenating the principal components of the three modals gives the feature vector $F_0 = [\mathrm{PCA1}_e, \ldots, \mathrm{PCA19}_e, \mathrm{PCA1}_a, \ldots, \mathrm{PCA19}_a, \mathrm{PCA1}_v, \mathrm{PCA2}_v, \mathrm{PCA3}_v]$, where the subscripts e, a, and v denote the eye movement, audio, and video image modals. $F_0$ is sent into the IDSF module for the fusion of shallow features and deep features, and the fused feature vector is then sent into the EC-CNN module for classification. Finally, a recognition accuracy of 76.02% is obtained.
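The concatenation step can be sketched as below. The component counts (19 eye movement, 19 audio, and 3 video image principal components) are read from the index ranges of the concatenated vector above; the function name and the placeholder values are hypothetical.

```python
EYE_DIMS, AUDIO_DIMS, VIDEO_DIMS = 19, 19, 3  # read from the index ranges of F0


def build_f0(pca_eye, pca_audio, pca_video):
    """Concatenate the per-modal principal components into one vector."""
    expected = (EYE_DIMS, AUDIO_DIMS, VIDEO_DIMS)
    actual = (len(pca_eye), len(pca_audio), len(pca_video))
    if actual != expected:
        raise ValueError(f"expected component counts {expected}, got {actual}")
    return list(pca_eye) + list(pca_audio) + list(pca_video)


# Placeholder component values, used only to show the layout.
f0 = build_f0([0.1] * 19, [0.2] * 19, [0.3] * 3)
```

The resulting vector has 41 entries, with the eye movement components first, then audio, then video image.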
2) Decision-Level Fusion: The decision-level fusion is shown in Fig. 9(b). The classification vectors (shape: [1,4]) obtained from the three modals are weighted and fused according to a certain proportion to obtain the fused classification vector.
The vectors fc1, fc2, and fc3 output by the fully connected (FC) layer of the EC-CNN network in the three single-modal models are assigned different weights w1, w2, and w3 and fused to get the final classification vector output = fc1 * w1 + fc2 * w2 + fc3 * w3, where w1 + w2 + w3 = 1. After normalization by the softmax function, the final classification result of the fused model is obtained.
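The weighted sum followed by softmax can be sketched as follows. The FC output values and the weights in the example are made up for illustration and are not the values used in the experiments; the emotion order is also an assumption.

```python
import math

EMOTIONS = ["interest", "happiness", "confusion", "boredom"]  # assumed order


def softmax(values):
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_decisions(fc_vectors, weights):
    """Weighted sum of per-modal FC outputs, then softmax.

    fc_vectors: one length-4 vector per modal; weights must sum to 1.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    n_classes = len(fc_vectors[0])
    fused = [sum(w * fc[k] for fc, w in zip(fc_vectors, weights))
             for k in range(n_classes)]
    return softmax(fused)


# Toy FC outputs for eye movement, audio, and video image (illustrative).
fc1 = [2.0, 1.0, 0.0, -1.0]
fc2 = [0.0, 3.0, 0.0, 0.0]
fc3 = [0.0, 0.0, 1.0, 0.0]
probs = fuse_decisions([fc1, fc2, fc3], weights=[0.2, 0.7, 0.1])
predicted = EMOTIONS[probs.index(max(probs))]
```

In this toy case the eye movement model alone favors the first class, but the more heavily weighted audio output shifts the fused prediction to the second.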
w1, w2, and w3 are set empirically and adjusted according to the experimental results. First, the three modals are fused in pairs. The recognition accuracies of the fusion models in Table VIII show that fusing two modals gives a better recognition effect than a single modal, which indicates that the three modals are complementary to some extent. The decision-level fusion of eye movement and audio achieved the highest recognition accuracy of 80.68%.
According to Models 4 and 5 in Table VIII, when the eye movement features are combined with the audio and video image features of the learning scenario, the recognition accuracy of learning emotion is increased by 10.68% and 16.36%, respectively. This indicates that combining scenario features can effectively improve the accuracy of emotion recognition in MOOC scenarios.
Then, the weights of the three modals in weighted decision fusion are considered. The weights of the modals of eye [...] In Formula (11), w1 is the weight of the eye movement model, w2 is the weight of the audio model, w3 is the weight of the video model, and w4 is the new weight of the eye movement and audio fusion model. The experimental results show that with w4 = 0.9 and w3 = 0.1, combined with the optimal ratio in Table VIII, the decision-level fusion (w1 = 0.225, w3 = 0.1, and w2 determined by the constraint w1 + w2 + w3 = 1) achieves the optimal recognition accuracy of 81.90%.
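The reported weights can be checked against the sum-to-one constraint stated for decision-level fusion. The short derivation below assumes that this constraint applies to the three-modal weights and that w4 is the total weight given to the eye movement and audio pair; both are readings of the text rather than statements from Formula (11), which is not reproduced here.

```latex
\begin{align*}
w_2 &= 1 - w_1 - w_3 = 1 - 0.225 - 0.1 = 0.675,\\
w_1 + w_2 &= 0.225 + 0.675 = 0.9 = w_4,\\
\frac{w_1}{w_4} &= \frac{0.225}{0.9} = 0.25, \qquad
\frac{w_2}{w_4} = \frac{0.675}{0.9} = 0.75 .
\end{align*}
```

Under these assumptions, the eye movement and audio models share the pair weight in a 0.25 : 0.75 ratio, and the video image model receives the remaining 0.1.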
3) Model-Level Fusion: The model-level fusion is shown in Fig. 9(c). The features of the three modals are processed by the corresponding IDSF modules to get the feature vectors. Then, the feature vectors are sent to the corresponding EC-CNN modules, which have no FC layer, and the vectors D1, D2, and D3 containing classification information are obtained. The complete classification information vector D is obtained by concatenating D1, D2, and D3. Then, the local information in vector D is captured through one convolution layer. The input dimension and output dimension of the convolution are both 1, and kernel_size is 3. The output vector of the convolution is sent to an FC layer for final emotion recognition, and the recognition accuracy is 80.16%.

Fig. 10 shows the recognition accuracy of the three fusion methods. It can be seen from the figure that the recognition accuracy of feature-level fusion is the lowest among the three methods, indicating that the feature-level fusion method cannot make full use of the complementary information in the three modals. The decision-level fusion method is more suitable.
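The data flow of the model-level fusion head (concatenate, convolve with a kernel of size 3, then apply an FC layer) can be sketched in plain Python as below. The vector lengths, kernel values, FC weights, and the absence of padding are all hypothetical; in the actual model the convolution and FC parameters are learned.

```python
def conv1d_single_channel(x, kernel, bias=0.0):
    """1-D convolution with one input and one output channel, no padding."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]


def fully_connected(x, weights, biases):
    """Dense layer: one weight row and one bias per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def model_level_fusion(d1, d2, d3, kernel, fc_weights, fc_biases):
    """Concatenate D1, D2, D3, capture local information, then classify."""
    d = list(d1) + list(d2) + list(d3)
    hidden = conv1d_single_channel(d, kernel)
    return fully_connected(hidden, fc_weights, fc_biases)


# Toy inputs: three length-2 vectors and an identity FC layer (illustrative).
toy_hidden = conv1d_single_channel([1, 0, 0, 1, 1, 1], kernel=[1, 0, -1])
identity = [[1 if r == c else 0 for c in range(4)] for r in range(4)]
toy_logits = model_level_fusion([1, 0], [0, 1], [1, 1],
                                kernel=[1, 0, -1],
                                fc_weights=identity,
                                fc_biases=[0, 0, 0, 0])
```

Because the convolution slides across the boundaries between D1, D2, and D3, it mixes information from adjacent modals before the final classification.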

C. Discussion
In this section, we discuss the impact of different modals and components on recognition results.
The experimental results show that the fusion of any two modals performs better than a single modal, so the fusion method is effective. In single modal emotion recognition with each of the three modals, the recognition accuracy of interest is the lowest, and the recognition accuracy of happiness is average.
After the decision-level fusion, the recognition accuracy of the model on happiness is significantly improved. The confusion matrix and ROC curve are shown in Fig. 11. In addition, it can be seen from Fig. 11(b) that the AUC of the fused model is 0.87. Compared with the single modals in Fig. 7 (AUC of 0.81 for the eye movement modal, 0.83 for the video image modal, and 0.84 for the audio signal modal), the stability and generalization ability of the fused model are better. However, the interest recognition effect of the fused model is still poor, which indicates that the three modals have limitations in recognizing the emotion of interest.
In order to evaluate the effectiveness and necessity of the FCDE features, the FE-CNN model, and the EC-CNN model, a group of ablation experiments is designed.
The experimental results are shown in Table IX. The configuration for each experiment is described as follows. Experiments 1 and 7 show that the recognition accuracy with input data containing FCDE features is better than that with input data without FCDE features. After adding FCDE features, the accuracy of the model is improved by 0.7%. Experiments 2, 3, 4, and 7 show that the model with two FE-CNN modules can fully extract the deep features and fully integrate them with the shallow features, and it achieves higher accuracy than the models with zero, one, or three FE-CNN modules. Experiments 5-7 show that the EC-CNN model is more effective in emotion classification than some baseline machine learning methods.
The experimental results in Table VIII show that the emotion recognition effect of modal fusion is better than that of a single modal. The ablation experiments in Table IX show that all the modules, including the FCDE features, FE-CNN, and EC-CNN modules, are effective.

VI. CONCLUSION
In this article, the single modal experiments show that the eye movement signal, audio signal, and video image are all suitable for identifying learning emotion in MOOC learning scenarios, and that the best performing modal is the audio signal. Among the four learning emotions, interest is the most difficult one to identify. In the multimodal fusion experiments, the effect of feature-level fusion is not ideal, while decision-level fusion can make better use of the complementary information of the three modals, achieving 81.90% recognition accuracy. The fusion model has a strong ability to distinguish between confusion, boredom, and happiness.
Future work can focus on the following aspects. First, in the data collection stage, more stimulus materials that can induce the state of interest should be added to improve the sample quality of the interest category. Second, more modals that can be obtained by noninterventional means, such as micro-expression and photoplethysmograph (PPG) signals, can be adopted for emotion recognition. Third, the video semantic information and learners' cognitive state can also be combined to further analyze emotion recognition, especially in learning scenarios.

Fig. 2. Feature point trajectory and saccade trajectory. Green is the feature point track and red is the saccade track. (a) Track in a state of boredom. (b) Track in a state of confusion.

Fig. 3. Gaze points and feature points trajectory in fixation state. (a) Trajectory in boredom state. (b) Trajectory in interest state.

Fig. 4. Principal component contribution rates and accumulated contribution rates of PCA after dimensionality reduction. (a) Eye movement signal feature. (b) MFCC coefficient feature. (c) Video image feature.

Fig. 5. Flowchart of original feature and deep feature fusion of eye movement signal, audio signal, and video image.

Fig. 6. Confusion matrix of classification results by using different modal features. (a) Confusion matrix of eye movement modal. (b) Confusion matrix of audio modal. (c) Confusion matrix of video image modal.

Fig. 7. ROC curve of classification results by using different modal features. (a) ROC curve of eye movement modal. (b) ROC curve of audio signal modal. (c) ROC curve of video image modal.

Fig. 11. Confusion matrix and ROC curve of decision-level fusion model: (a) confusion matrix and (b) ROC curve of decision-level fusion.

1) ALL: Complete configuration of the optimal model, including eye movement features (including FCDE), audio and video features, IDSF module (deep feature and shallow feature fusion twice), and EC-CNN classification module.

TABLE I FEATURES EXTRACTED FROM EYE MOVEMENT SIGNALS
All the eye movement features used in this study are shown in Table I. The sampling window size is a fixed time interval, such as 2 s.

TABLE II CORRELATION COEFFICIENT BETWEEN EYE MOVEMENT FEATURES AND EMOTIONAL STATE

TABLE IV STRUCTURE AND PARAMETERS OF EC-CNN MODEL

TABLE V EMOTIONAL STATES RECOGNITION ACCURACY WITH DIFFERENT SINGLE MODALS/DIFFERENT TIME WINDOWS/METHODS

TABLE VI EVALUATION OF SINGLE MODAL EMOTION RECOGNITION MODEL IN DIFFERENT AFFECTIVE STATE

TABLE VII COMPUTATIONAL COMPLEXITY ANALYSIS OF THE MODELS
The three metrics are calculated using the open-source library ptflops. As shown in Table VII, our model is better than ResNet18 in time complexity, memory usage, and NP.

TABLE VIII EMOTIONAL STATES RECOGNITION ACCURACY OF SINGLE MODAL/BIMODAL

TABLE IX RESULTS OF ABLATION EXPERIMENT OF THE MODEL
2) ALL/FCDE: Remove FCDE features from the eye movement features.
3) ALL + FE-CNN: Based on the IDSF module, add an FE-CNN module to extract the deep features again.
4) ALL/IDSF: Remove the IDSF module.
5) ALL/FE-CNN: Remove the second FE-CNN module.
6) EC-CNN/SVM: EC-CNN is replaced by SVM.
7) EC-CNN/ResNet18: EC-CNN is replaced by ResNet18.
Table IX lists the ablation experimental results for each experiment.