A Lightweight Convolutional Neural Network for Real-Time Facial Expression Detection

This paper proposes a lightweight convolutional neural network (CNN) for detecting facial emotions in real time and in bulk, with the aim of improving classification performance. We verify the effectiveness of the model by building a real-time vision system. The system first employs multi-task cascaded convolutional networks (MTCNN) to detect faces and passes the resulting face coordinates to our facial emotion classification model, which then performs emotion classification. MTCNN has a cascaded detection structure whose stages can be used individually, thereby reducing memory occupation. Our expression classification model employs Global Average Pooling to replace the fully connected layer of the traditional deep convolutional neural network. Each channel of the feature map is associated with the corresponding category, which eliminates the black-box characteristics of the fully connected layer to a certain extent. In addition, our model combines residual modules with depth-wise separable convolutions, which greatly reduces the number of parameters and makes the model more portable. Finally, the model is tested on the FER-2013 dataset. It occupies only 3.1% of the 16 GB memory, that is, 0.496 GB, to complete the task of classifying facial expressions. The model can be stored in an 872.9-kilobyte file, and its accuracy reaches 67% on the FER-2013 dataset. It also shows good detection and recognition performance on images outside the dataset.


I. INTRODUCTION
With the rapid development of human-computer interaction and pattern recognition, coupled with the rapid update of computer hardware, people can hand complex work over to computers to meet various everyday and market needs, which brings great convenience. Facial expression recognition is an intelligent human-computer interaction method that has emerged in recent years. It has a wide range of applications, such as VR games, medical care, online education, driving, and security. Many cameras now offer a smile mode: when a smile is detected, a photo is taken automatically without manually pressing the shutter, which improves the user experience. In some European countries, facial expression recognition is used to capture the mood fluctuations of elementary school students in class, so as to analyze their learning status and treat students as individuals. Some models of Toyota's high-end brand Lexus monitor the driver's eyes and facial expressions to detect fatigued driving and thus avoid traffic accidents.
The associate editor coordinating the review of this manuscript and approving it for publication was Nizam Uddin Ahamed.
Facial expression is one of the most important ways in which people express their emotions, and a person's inner thoughts can often be read from their expressions. The main function of facial expression recognition is to capture the emotional changes of the subject through the face. Compared with other means of communication, facial expressions are more diverse and more likely to reveal a person's true feelings inadvertently.
In 1971, Ekman [1] first divided expressions into six basic forms: sadness, happiness, fear, disgust, surprise, and anger. The FER-2013 dataset [2] adds a normal (neutral) expression. Fig. 1 shows samples of the expressions from the FER-2013 dataset [2]. As can be seen, it is difficult to sort them manually: human beings classify these face images into the seven emotions with an accuracy of only 63%±5%.
The most advanced methods for image tasks such as image classification and object detection are based on convolutional neural networks. Lawrence and Giles [3] propose a hybrid neural network that combines local image sampling, a self-organizing map (SOM) neural network, and a convolutional neural network for human face recognition. Shin et al. [4] use deep convolutional neural networks for computer-aided detection problems; their models involve 5 thousand to 160 million parameters, which places high demands on computer hardware. Chang et al. [5] construct a convolutional neural network for extracting the features of the input images, and use a complexity-aware classification algorithm to divide the dataset into a simple classification sample subspace and a complex classification sample subspace, which reduces the complexity of facial expression recognition caused by environmental factors. Georgescu et al. [6] combine the automatic features learned by a convolutional neural network with handcrafted features computed by the bag-of-visual-words model, and use support vector machines as classifiers to predict the class label. Du and Gao [7] present a method that performs segmentation with a multiscale convolutional neural network: it analyzes each input image at multiple scales, obtains the feature mapping of the focused and defocused regions, and achieves the best fusion performance in both qualitative and quantitative terms. Uddin et al. [8] propose a new robust method for feature extraction, called local directional position pattern (LDPP), which yields more robust facial features. A novel depth camera-based method put forward by Uddin et al. [9] can extract feature values more robustly.
The CNN architectures used for these tasks require millions of parameters [10], which makes them difficult to deploy on embedded devices. In GoogLeNet [11] and AlexNet [12], the use of large convolution kernels on high-dimensional feature maps to directly reduce dimensionality does not generate excessive calculations. Using consecutive small convolution kernels in place of large ones can reduce the complexity of the model and further compress the number of parameters. In this paper, we propose and design a convolutional neural network framework for detecting facial emotions in real time and in bulk. Our model employs a Global Average Pooling layer instead of a fully connected layer and combines the residual module with depth-wise separable convolution to remove a large number of parameters and simplify the network structure. Moreover, the recognition accuracy reaches 67% on the FER-2013 dataset.

II. RELATED WORK
Current facial expression recognition methods fall mainly into two categories: traditional handcrafted methods and network models based on deep learning. Although the traditional methods are widely used, they are very limited in practical applications [13], [14].
Deep learning approaches to facial expression classification usually rely on strongly supervised methods [15]-[19] to learn representations of emotional features from large sample sets. These datasets mainly come from four papers [20]-[23]. Barros [25]. Xiao et al. [26] combined the Region of Interest (ROI) and the K-Nearest Neighbor algorithm for facial expression recognition and addressed the poor generalization ability of deep neural networks when data are scarce. Liu et al. [27] proposed a deep learning method based on a geometric model of the facial region for facial expression recognition. Zhao et al. [28] proposed a lightweight expression detection model that can solve the delay problem under natural conditions. Abate et al. [29] proposed a neural network model for face attribute recognition based on transfer learning, which groups faces according to common facial features.
In some common classic CNN models, the part used for feature extraction usually ends with a set of fully connected layers, and the number of parameters in these layers is often extremely large. For example, VGG Net [30] holds approximately 90% of all its parameters in its last fully connected layers. The main contribution of VGG16 is to show that increasing the depth of the network can affect its final performance to a certain extent. An improvement of VGG16 over AlexNet is the use of several consecutive 3 × 3 convolution kernels to replace the larger convolution kernels in AlexNet (11 × 11, 7 × 7, 5 × 5). For a given receptive field, stacked small convolution kernels are better than large convolution kernels, because multiple nonlinear layers increase the depth of the network, allowing it to learn more complex patterns with fewer parameters. The network structure of VGG16 is shown in Fig. 2; it contains 13 convolution layers and 3 fully connected layers. The entire network uses the same convolution kernel size (3 × 3) and maximum pooling size (2 × 2).
Secondly, a combination of several convolution layers with small filters (3 × 3) is better than one convolution layer with a large filter (5 × 5 or 7 × 7). Thirdly, it is verified that performance can be improved by continuously deepening the network structure. However, VGG-Net also has shortcomings. It consumes more computing resources and uses more parameters, resulting in more memory consumption. Most of the parameters come from the first fully connected layer.
In recent years, an open-source model, Inception V3 [31], has reduced the number of parameters in its last layers by adding a Global Average Pooling operation. Fully connected layers integrate the feature representation and output it, which greatly reduces the impact of feature location on classification. However, they have several drawbacks: they contain too many parameters, slow down training, and overfit easily. Global Average Pooling reduces each feature map to a scalar value by taking the average over all of its elements. The averaging operation forces the network to extract global features from the input image. A more recent CNN architecture, Xception [32], further reduces the number of parameters by utilizing deep residual learning [33] and depthwise separable convolutions [34]. Separating the processes of feature extraction and composition within the convolution layer can improve its performance further.
Both VGG16 and Inception V3 improve their accuracy by increasing the depth of the network. However, the first problem with increasing the depth is that the signal for parameter updates must travel back through the added layers. Because the gradient is propagated backward, the gradient of the front layers becomes small as the network gets deeper. The learning of these layers then basically stalls, which is the vanishing gradient problem. The second problem of deep networks is training. A deeper network has a larger parameter space, so the optimization problem becomes more difficult. Therefore, simply increasing the depth of the network leads to higher training error. The emergence of ResNet solves this problem. Fig. 3 shows the core idea of ResNet, the Shortcut Connection. Similar to GoogLeNet, ResNet ends with a Global Average Pooling layer. With the residual module, a 152-layer residual network can be trained.
Our model incorporates the idea of GoogLeNet: it uses a Global Average Pooling layer at the end and reduces the number of network layers. A 2-norm regularization term is added to control the weight coefficients. With these improvements, the model has strong anti-disturbance ability and a good recognition rate.

III. APPROACH
A. DATASET
This paper adopts the open-source dataset FER-2013 [2]. The original dataset is in CSV format, so we use pandas to parse it and extract the images. After parsing, the dataset consists of 35,887 facial expression images. Among them, the training set contains 28,709 images, and the Public validation set and the Private validation set contain 3,589 images each. Each sample is a grayscale image with a fixed size of 48 × 48. There are 7 expressions, which correspond to the numeric labels 0-6: 0, anger; 1, disgust; 2, fear; 3, happy; 4, sad; 5, surprised; 6, normal. In the training set, the seven expressions have 3,995, 436, 4,097, 7,215, 4,830, 3,171, and 4,965 images respectively.
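The parsing step described above can be sketched as follows. This is a minimal illustration rather than the code used in this work: the column names `emotion`, `pixels`, and `Usage`, the split value `Training`, and the function name `parse_fer2013` are assumptions based on the commonly distributed CSV release of FER-2013.

```python
import io

import numpy as np
import pandas as pd

# Label order as listed in the text (0 = anger, ..., 6 = normal).
EMOTIONS = ["anger", "disgust", "fear", "happy", "sad", "surprised", "normal"]


def parse_fer2013(csv_source, usage="Training", size=48):
    """Return (images, labels) for one split of a FER-2013-style CSV.

    Assumed layout (not confirmed by the paper): one row per image with an
    integer `emotion` label, a `pixels` string of space-separated gray
    values, and a `Usage` column naming the split.
    """
    frame = pd.read_csv(csv_source)
    subset = frame[frame["Usage"] == usage]
    # Each pixel string holds size * size gray values in row-major order.
    images = np.stack([
        np.array([int(v) for v in row.split()], dtype=np.uint8).reshape(size, size)
        for row in subset["pixels"]
    ])
    labels = subset["emotion"].to_numpy(dtype=np.int64)
    return images, labels


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real file, using 2 x 2 "images".
    demo = "emotion,pixels,Usage\n3,0 255 10 20,Training\n6,1 2 3 4,PublicTest\n"
    imgs, labs = parse_fer2013(io.StringIO(demo), usage="Training", size=2)
    print(imgs.shape, [EMOTIONS[i] for i in labs])
```

With the real file, the same call would use the path to the CSV and the default `size=48`.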
The WIDER FACE dataset is a benchmark dataset for face detection that contains 32,203 images and 393,703 faces. These faces vary widely in scale, pose, and occlusion. The images in WIDER FACE are mainly drawn from the public dataset WIDER, from which its creators at the Chinese University of Hong Kong selected 61 event categories. For each category, 40%, 10%, and 50% of the data are randomly selected as the training, validation, and testing sets. Fig. 4 shows some samples of the WIDER FACE dataset.
The Karolinska Directed Emotional Faces (KDEF) is a set of 4,900 pictures of human facial expressions and is therefore a small-sample dataset. The pictures show 70 people, 35 women and 35 men, each displaying 7 different emotional expressions: neutral, happy, angry, afraid, disgusted, sad, and surprised. The subjects had no makeup, beards, jewelry, or glasses when the pictures were taken, and their expressions were rehearsed in advance. The image noise in the KDEF dataset is very low, which makes it well suited as training data for expression recognition. Fig. 5 shows some samples of the KDEF dataset.

B. MODEL
This paper proposes and designs an emotion recognition model combined with the MTCNN [35] detection method. We abandon the traditional OpenCV face detection and replace it with MTCNN, which uses a cascade detection method and has shown good detection performance in recent years. This choice gives good results in the final experimental test. We eliminate the interference of multiple faces in the image, so that emotion recognition is greatly improved. In the expression recognition model, we borrow the idea of Xception [32], which fuses deep residual learning with depthwise separable convolutions. The main purpose of this design is to achieve the best ratio of recognition accuracy to number of parameters.
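The hand-off from the detector to the classifier can be illustrated with a small sketch. It assumes that each detected face arrives as a box `(x, y, w, h)` in pixel coordinates and that the classifier expects a 48 × 48 grayscale input scaled to [0, 1]. The box format, the scaling, the nearest-neighbour resize, and the name `crop_face` are illustrative assumptions, not details taken from the system described here.

```python
import numpy as np


def crop_face(gray_frame, box, size=48):
    """Crop one detected face from a 2-D grayscale frame and resize it.

    box is assumed to be (x, y, w, h); the crop is clipped to the frame,
    resized by nearest-neighbour sampling, and returned as float32 values
    in [0, 1] with shape (size, size, 1).
    """
    x, y, w, h = box
    frame_h, frame_w = gray_frame.shape
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(frame_w, x + w), min(frame_h, y + h)
    if x1 <= x0 or y1 <= y0:
        raise ValueError("box lies outside the frame")
    face = gray_frame[y0:y1, x0:x1]
    # Map each output pixel to its nearest source row and column.
    rows = np.arange(size) * face.shape[0] // size
    cols = np.arange(size) * face.shape[1] // size
    resized = face[rows][:, cols]
    return (resized.astype(np.float32) / 255.0)[..., np.newaxis]


def crop_all_faces(gray_frame, boxes, size=48):
    """Build one classifier batch from all boxes found in a frame."""
    return np.stack([crop_face(gray_frame, b, size) for b in boxes])
```

Cropping every detected box separately is what lets the classifier treat each face in a multi-face image independently.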
Our initial model uses Global Average Pooling to remove the fully connected layer completely. This is achieved by giving the final convolution layer as many feature maps as there are classes and applying the softmax activation function to handle the classification problem. Our model is trained with the ADAM optimizer [36]. The structure of our expression classification model is shown in Fig. 6. The Network in Network model proposed by Lin et al. [37] uses Global Average Pooling to replace the fully connected layer of the traditional deep convolutional neural network and achieves good results on the CIFAR-100 dataset. Global Average Pooling gives the channels of the network output layer a clear meaning: each channel of the feature map is associated with the corresponding classification category. To a certain extent, this eliminates the black-box characteristics [38] of the fully connected layer. This paper therefore draws on this idea, employing the Global Average Pooling operation to average each feature map of the fused features and using the result as a new feature map. Global Average Pooling links to global information, strengthens the connection between spatial information, and learns more detailed and comprehensive facial expression features. At the same time, the pooling layer contains no parameters, which reduces the number of network parameters and helps avoid overfitting.
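The classification head described above can be sketched in a few lines of NumPy. The sketch assumes that the last convolution layer outputs one feature map per class in (height, width, channels) layout; the function names are illustrative and the feature maps in the demonstration are made-up values.

```python
import numpy as np


def global_average_pooling(feature_maps):
    """Reduce an (H, W, C) stack of feature maps to C per-class scores."""
    return feature_maps.mean(axis=(0, 1))


def softmax(scores):
    """Numerically stable softmax over a 1-D score vector."""
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def classify(feature_maps):
    """Global Average Pooling followed by softmax; no trainable parameters."""
    return softmax(global_average_pooling(feature_maps))


if __name__ == "__main__":
    # Seven 6 x 6 maps, one per expression; channel 3 responds most strongly.
    maps = np.zeros((6, 6, 7))
    maps[:, :, 3] = 5.0
    probabilities = classify(maps)
    print(int(probabilities.argmax()), float(probabilities.sum()))
```

Because channel k feeds only score k, the map that drives a prediction can be inspected directly, which is the sense in which the head is less of a black box than a fully connected layer.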
In the convolution operation, the size of the output image is calculated by (1), where W is the matrix width, H is the matrix height, F is the width and height of the convolution kernel, P is the padding, and S is the stride:

$$W_{out} = \frac{W - F + 2P}{S} + 1, \qquad H_{out} = \frac{H - F + 2P}{S} + 1 \tag{1}$$

The padding mode used in this paper is Same, because Same-mode padding keeps the size of the feature map unchanged after convolution. When the pooling operation is used, the size of the output image is calculated by (2), and the final result is rounded down:

$$W_{out} = \left\lfloor \frac{W - F}{S} \right\rfloor + 1, \qquad H_{out} = \left\lfloor \frac{H - F}{S} \right\rfloor + 1 \tag{2}$$

As shown in Fig. 7, in Same mode, the orange part is the image, the blue part is the filter, and the white part is filled with 0. The filter starts convolving the image when its center (K) coincides with the corner of the image. It can be seen that the range of motion of the filter is smaller than before. Same mode keeps the size of the feature map unchanged during forward propagation.
Although our group has expanded the dataset, the number of pictures is still relatively small and there are many duplicate faces. In addition, the resolution of the pictures in the FER-2013 dataset is exceptionally low, and some of them are noisy. To avoid overfitting, we add a 2-norm penalty on the weight coefficients. We choose the 2-norm rather than the 1-norm because the 2-norm yields parameters with small values. The 2-norm not only prevents overfitting but also makes the optimal solution stable and fast to obtain. The fitting process then tends to keep the weights as small as possible and finally constructs a model in which all parameters are relatively small. When the parameters are small enough, the model has strong anti-disturbance capability. The 2-norm loss function is (3), which can be written as (4). Equation (5) gives the gradient descent update of the parameters: whatever the value of w, the update tries to make it smaller.
This is tantamount to multiplying each weight matrix by a coefficient $(1 - \alpha\lambda/m)$, which is less than 1. For this reason, 2-norm regularization is also called "Weight Decay".
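The equivalence between the 2-norm penalty and multiplication by the coefficient (1 − αλ/m) can be checked numerically. The sketch below assumes the usual form of the regularized loss, in which the data loss is augmented by (λ/2m) times the squared 2-norm of the weights, so that the penalty contributes (λ/m)·w to the gradient. That form, and the names `alpha`, `lam`, `m`, and `grad_data`, are assumptions chosen to match the coefficient quoted in the text.

```python
import numpy as np


def l2_regularized_step(w, grad_data, alpha, lam, m):
    """One gradient descent step on: data loss + (lam / (2 * m)) * ||w||^2."""
    grad = grad_data + (lam / m) * w  # the penalty adds (lam / m) * w
    return w - alpha * grad


def weight_decay_step(w, grad_data, alpha, lam, m):
    """The same step written as: shrink w first, then apply the data gradient."""
    return (1.0 - alpha * lam / m) * w - alpha * grad_data


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(3, 4))
    grad_data = rng.normal(size=(3, 4))  # stand-in for the backpropagated gradient
    a = l2_regularized_step(w, grad_data, alpha=0.01, lam=0.5, m=128)
    b = weight_decay_step(w, grad_data, alpha=0.01, lam=0.5, m=128)
    print(np.allclose(a, b))
```

Both functions compute the same update; the second form makes the shrinking of the weights at every step explicit.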
Adding the 2-norm prevents the weight matrix from becoming too large. Intuitively, if the regularization parameter λ is set large enough, the weight matrix W is pushed close to 0, which amounts to setting the weights of many hidden units to 0 and largely eliminating their impact. In this case, the greatly simplified neural network behaves like a very small network, as small as a logistic regression unit, although it still contains many layers. The network then moves from the state of "High variance" toward the state of "High bias"; when an intermediate value of λ is given, the network can be in the intermediate state, "Just right", as shown in Fig. 8. In practice, W does not become exactly 0. Although it is intuitive to think that a large number of hidden units are completely eliminated, all hidden units of the neural network still exist; their impacts become smaller and the neural network becomes simpler, so that overfitting is less likely to occur.
VOLUME 9, 2021
In addition, this model combines residual modules with depth-wise separable convolutions. A depth-wise separable convolution decomposes the traditional convolution into a depth-wise convolution followed by a 1 × 1 convolution. Fig. 9 (a) shows the standard convolution. Let us assume that the size of the input feature map is $D_F \times D_F \times M$, the size of the convolution kernel is $D_K \times D_K \times M$, and the size of the output feature map is $D_F \times D_F \times N$; the number of parameters of the standard convolution layer is then $D_K \times D_K \times M \times N$. Fig. 9 (b) shows the depth-wise convolution, and Fig. 9 (c) shows the point-by-point convolution. The combination of these two convolutions is the depth-wise separable convolution. The depth-wise convolution is responsible for filtering: it uses M kernels of size $(D_K, D_K, 1)$, one for each input channel. The point-by-point convolution is responsible for converting the channels: it uses N kernels of size $(1, 1, M)$, which act on the output feature maps of the depth-wise convolution. The number of parameters of the depth-wise convolution is $D_K \times D_K \times 1 \times M$, and the number of parameters of the point-by-point convolution is $1 \times 1 \times M \times N$. The ratio of the number of parameters of the depth-wise separable convolution to that of the standard convolution is therefore $(D_K^2 M + M N)/(D_K^2 M N) = 1/N + 1/D_K^2$.
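The parameter counts above can be checked with a short calculation. The layer sizes used in the demonstration (a 3 × 3 kernel, 64 input channels, 128 output channels) are illustrative values, not the configuration of the model in this paper, and biases are ignored as in the text.

```python
import math


def standard_conv_params(dk, m, n):
    """Weights of a standard convolution: DK x DK x M x N."""
    return dk * dk * m * n


def separable_conv_params(dk, m, n):
    """Depth-wise part (DK x DK x 1 x M) plus point-by-point part (1 x 1 x M x N)."""
    return dk * dk * m + m * n


def parameter_ratio(dk, m, n):
    """Separable parameters as a fraction of standard parameters."""
    return separable_conv_params(dk, m, n) / standard_conv_params(dk, m, n)


if __name__ == "__main__":
    dk, m, n = 3, 64, 128  # illustrative sizes only
    print(standard_conv_params(dk, m, n))   # 73728
    print(separable_conv_params(dk, m, n))  # 8768
    # The measured ratio agrees with the closed form 1/N + 1/DK^2.
    print(math.isclose(parameter_ratio(dk, m, n), 1 / n + 1 / dk ** 2))
```

For a 3 × 3 kernel the ratio is dominated by the 1/9 term, so the separable layer needs roughly one ninth of the weights once N is large.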
Our final model is a neural network containing four residual depth-wise separable convolutions, combined with MTCNN detection, to achieve facial expression recognition. Each of these four convolutions is followed by a batch normalization operation and a ReLU [39], [40] activation function. The last layer adopts Global Average Pooling and a softmax activation function for classification. This architecture has a total of 58,423 parameters, of which 56,951 are trainable.
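The forward pass of one residual depth-wise separable block can be sketched in NumPy to make the data flow concrete. This is a simplified illustration under stated assumptions: stride 1, Same padding, an identity shortcut (so the numbers of input and output channels must match), and no batch normalization. The actual block layout of the model is given in Fig. 6 and may differ; all function names are illustrative.

```python
import numpy as np


def depthwise_conv_same(x, kernels):
    """Filter each channel of x (H, W, M) with its own K x K kernel (K odd)."""
    k = kernels.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))  # zero "Same" padding
    h, w, _ = x.shape
    out = np.zeros_like(x, dtype=float)
    # Accumulate one shifted copy of the input per kernel position.
    for i in range(k):
        for j in range(k):
            out += xp[i:i + h, j:j + w, :] * kernels[i, j, :]
    return out


def pointwise_conv(x, weights):
    """1 x 1 convolution: mix M channels into N with an (M, N) matrix."""
    return x @ weights


def residual_separable_block(x, dw_kernels, pw_weights):
    """Depth-wise conv, point-by-point conv, ReLU, then the shortcut addition."""
    y = pointwise_conv(depthwise_conv_same(x, dw_kernels), pw_weights)
    y = np.maximum(y, 0.0)  # ReLU
    return x + y            # identity shortcut (assumes N == M)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(48, 48, 8))
    out = residual_separable_block(
        x, rng.normal(size=(3, 3, 8)), rng.normal(size=(8, 8))
    )
    print(out.shape)
```

The shortcut addition gives the gradient a direct path around the convolutions, which is the property that the residual module contributes.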
We test on the FER-2013 dataset, where the accuracy of emotion classification reaches 67%. At the same time, the weights of our final recognition architecture can be saved in an 872.9-kilobyte file.

IV. RESULTS
Our experiments are run on an Intel(R) Core(TM) i5-8400 CPU @ 2.80 GHz processor with 16 GB of memory, an NVIDIA GeForce GTX 1060 GPU, and the Ubuntu 16.04 operating system.
We retrain the MTCNN model on the WIDER FACE dataset and save the obtained weight parameters in the graph file for face detection. Fig. 10 shows the true positive rate of the MTCNN we have trained, which reaches about 95%. Using the MTCNN model instead of the traditional Application Programming Interface (API), OpenCV, gives good face detection results. In Fig. 11, (a) shows the result of detecting faces with OpenCV and (b) shows the result of detecting faces with MTCNN. The detection in (b) is clearly improved: it not only removes the noise in the picture but also detects and classifies the faces that are missed in (a).
Table 1 gives the recognition rate of the seven expressions based on the normalized confusion matrix of our network architecture. As can be seen, the accuracy for the fear category is still deficient. The main reason is that the fear samples in the dataset have low diversity: most of them are European faces, and samples of other types are lacking.
As can be seen from Fig. 12, the accuracy on the training set reaches about 71%, and the accuracy on the validation set reaches 67%. The result of our final model on the test set is shown in Fig. 13, where the accuracy converges to a range between 66.8% and 67.0%. After merging deep residual learning and depth-wise separable convolutions, our model has a relatively high degree of accuracy. Because we use Global Average Pooling rather than the fully connected layer and add the 2-norm, the number of parameters in the model is reduced, which ultimately makes our model more portable.
Table 2 compares the experimental results of our model with those of several other models. Since the code is not published for some of the literature, we cannot retrain their models and can only take the accuracy of their models on the test set from their papers. We plot Fig. 14 from Table 2 to show the differences in accuracy between these models more intuitively. It can be seen from Fig. 14 that the accuracy of our model is higher than that of the other models except DenseNet-1. Table 3 compares the parameters and the complexity of our model with those of the other models. From Table 3, we can see that DenseNet-1 has nearly twice as many parameters as our model, while our accuracy is only about 3% lower. Our model therefore still has certain advantages in terms of being lightweight.
We also train three models, VGG-Net, ResNet-50, and CNN, on the FER-2013 dataset, and we control their number of layers so that it is basically similar to that of our model. The training results are shown in Fig. 15, Fig. 16, and Fig. 17. It can be seen from these figures that the accuracy of VGG-Net is 59.32%, the accuracy of ResNet-50 is 57.48%, and the accuracy of CNN is 58.90%. The accuracy of these three models is basically the same. However, the loss value of ResNet-50 suddenly increases during training, which causes the gradient to become too large, so overfitting may occur later.
To compare the relationship between the parameters and the accuracy of each model, we plot Fig. 18. From Fig. 18, we can see that, with the fewest parameters, our model ranks second in accuracy, behind only DenseNet-1. It can be seen from Table 3 that our model has more pooling layers than the other models. This is because the pooling layer does not contain parameters. We use the global average pooling layer to replace the fully connected layer, which significantly reduces the number of parameters. Experiments have shown that removing the fully connected layer does not affect the accuracy of our model to a large extent.
We also monitor the hardware resource occupation of the four models in the training and testing stages. For training, we record the CPU usage, the percentage of CPU occupied by user space, the percentage of CPU occupied by kernel space, the percentage of physical memory occupied, and the total memory for each of the 4 models. The results are shown in Table 4. Table 5 records the hardware resources occupied by the 4 models during testing. Although the papers behind the models in Table 2 provide no data on hardware resource occupation, an analysis of their model structures shows that they are more complicated than the model we designed. It can therefore be inferred that their hardware resource occupation is higher than ours. When we test the 4 models, ResNet-50 has the highest CPU usage and CNN has the largest memory usage. For our model, the CPU usage is modest and the memory occupation is extremely low, at only 3.1% of the memory. It can also be seen in Table 2 that the accuracy of our model is almost the highest. This allows us to deploy our model on embedded devices while still obtaining a relatively accurate recognition rate. In contrast, as a lightweight model, ResNet-50 does not perform prominently: its recognition rate is below the average and its dependence on the CPU is heavy. It is therefore not desirable to use ResNet-50 as a lightweight detection model.
We also train our model on the KDEF dataset and compare it with other recent models, as shown in Table 6. Fig. 19 shows the curves of the loss value and the accuracy of our model on the KDEF dataset. The accuracy of our final model on the test set reaches 87.71%. Fig. 20 shows some representative faces that we collected, such as a face covered on the left or right side, a face with both hands on the cheeks, a face with a teacup covering the chin, a face with a cap, and a face tilted to one side. For these kinds of faces, the model also achieves good detection results.

V. CONCLUSION
In this paper, we propose and design a lightweight convolutional neural network for recognizing facial expressions. Our network model reduces the number of parameters by eliminating the fully connected layer and adopting residual depth-wise separable convolutions, and it adds a 2-norm regularization term to limit overfitting. These changes have no obvious adverse effect on detection and classification. The model obtains good detection results on images outside the dataset, which shows that it is suitable for multi-class classification of facial expressions. In general, we have realized a vision system with a greatly reduced number of parameters that can be integrated on devices with low computing power to perform facial expression classification. Compared with models from recent years, the accuracy of our model is higher than that of all compared models except DenseNet-1.
Although our model has achieved some results, facial expressions captured in real life may contain a lot of noise, such as images with lighting that is too strong or too dark, blurred images, faces that are mostly occluded, and other factors that hinder detection. Solving these problems requires further work.