Recognizing Spontaneous Micro-Expression Using a Three-Stream Convolutional Neural Network

Micro-expression recognition (MER) has attracted much attention with various practical applications, particularly in clinical diagnosis and interrogations. In this paper, we propose a three-stream convolutional neural network (TSCNN) to recognize MEs by learning ME-discriminative features in three key frames of ME videos. We design a dynamic-temporal stream, static-spatial stream, and local-spatial stream module for the TSCNN that respectively attempt to learn and integrate temporal, entire facial region, and facial local region cues in ME videos with the goal of recognizing MEs. In addition, to allow the TSCNN to recognize MEs without using the index values of apex frames, we design a reliable apex frame detection algorithm. Extensive experiments are conducted with five public ME databases: CASME II, SMIC-HS, SAMM, CAS(ME)2, and CASME. Our proposed TSCNN is shown to achieve more promising recognition results when compared with many other methods.


I. INTRODUCTION
A special type of facial expression, a micro-expression (ME) is a rapid facial movement that is not subject to people's conscious recognition and can reveal someone's genuine emotion [1]. Compared with typical macro-expressions, MEs are of short duration (typically only 1/25 s to 1/3 s) and low intensity (the muscle movements emerge only in small facial regions) [2]. For these reasons, MER is very difficult for a human to perform, and Ekman suggests that people without training perform only slightly better than chance on average in MER tasks [3]. Thus, an automatic and reliable MER method should be developed to assist people in recognizing MEs accurately, particularly for application in fields such as clinical diagnosis and interrogation.
The associate editor coordinating the review of this manuscript and approving it for publication was Inês Domingues.
Recently, MER has become a popular research topic, and many effective approaches have been proposed to perform this task. Typical MER approaches have two main components: facial feature extraction, which aims to extract useful information from facial videos to describe MEs, and ME classification, which designs a classifier based on the features extracted in the first step. Designing reliable facial features that can effectively describe the subtle changes of MEs would improve performance in MER tasks [8]. Facial feature extraction has therefore attracted increasing attention from researchers. Among these feature extraction methods, local binary patterns on three orthogonal planes (LBP-TOP) and its variants are widely applied in video-based MER and other computer-vision tasks [9]-[11]. In addition, the studies in [5], [12] observe that the temporal dynamics of video sequences can improve MER performance because they effectively represent the motion across a sequence of ME frames. Some researchers have employed optical flow (OF) based techniques to extract spatiotemporal motion-dependent information from MEs, and many studies [13]-[15] have demonstrated their effectiveness for MER problems. For ME classification, the classifiers are mainly based on machine learning, such as support vector machine (SVM), relaxed K-SVD, and group sparse learning (GSL). Specifically, Zong et al. [16] proposed a kernelized GSL to facilitate the process of learning a set of weights from hierarchical spatiotemporal descriptors that can aid the selection of the important blocks from various facial blocks. Zheng et al. [17] proposed a relaxed K-SVD that learns a sparse dictionary to distinguish different MEs by minimizing the variance of sparse coefficients.
In recent years, researchers have also investigated deep learning methods to address the MER problem. For example, Kim et al. [18] proposed a deep learning method based on LSTM for MER tasks with spatiotemporal information extracted by a CNN. Another study that uses a similar deep spatiotemporal structure is ELRCN [19], which uses optical flow features for the VGG-Faces model and then passes them on to recurrent layers. In [20], Xia et al. proposed a spatiotemporal extension of RNNs to jointly learn from both spatial and temporal cues of the ME samples to recognize MEs. A recent study [21] by Reddy et al. proposed two 3D-CNN-based models (MicroExpSTCNN and MicroExpFuseNet) to recognize MEs by extracting both the spatial and temporal information simultaneously by applying a 3D convolution operation to ME videos. Zhi et al. [22] similarly suggested a 3D convolutional neural network (3D-CNN) architecture for self-learning feature extraction to represent facial MEs. Khor et al. [23] proposed a lightweight dual-stream shallow network as a pair of truncated CNNs with heterogeneous input features in MER tasks. Gan et al. [24] proposed OFF-ApexNet to recognize MEs by learning optical flow features from some key frames of an ME video. In [25], Zhou et al. proposed a dual-inception network for MER. These deep learning methods perform well in MER tasks and outperform manual features and shallow classifiers.
Inspired by the success of these methods with MER, we proposed a novel MER method called the three-stream CNN (TSCNN) in our conference paper [26]. The TSCNN consists of three major convolutional recognition streams that learn the static-spatial, local-spatial, and temporal features from three different cues in ME videos, respectively. Some recent studies [8], [27], [28] showed that the apex frame contains more ME-aware information, and thus we design a static-spatial stream CNN in the TSCNN to learn the static-spatial feature from the gray image of the apex frame for MER. The local-spatial stream CNN is inspired by recent findings in [9], [16], [29], [30], which show that facial local region information contributes to distinguishing different MEs. Finally, following the studies in [19], [24], [25], [31], a dynamic-temporal stream CNN is also included in the TSCNN to learn the temporal features from the optical flow field. This design explicitly describes both facial texture and the subtle changes between ME video frames, reducing the complexity of ME recognition. Decoupling the static-spatial and temporal recognition streams also allows us to exploit the availability of large amounts of annotated image data by pretraining the static-spatial recognition stream on large databases such as FER2013 or ImageNet. In addition, our proposed method analyzes only three key frames (the onset, apex, and offset frames) instead of spotting facial micro-movements in all frames of a short ME video. This avoids the interference of uninformative frames with MER accuracy and reduces redundant data. Thus, the TSCNN can fit its parameters in a short time, which opens the possibility of real-time application.
This paper is an extended version of our conference paper. We reinvestigate some problems in MER and extend our conference work. In addition to the contributions of our preliminary work, this paper makes the following main contributions:
1) A reliable apex frame detection algorithm is designed for the TSCNN so that it does not need the index values of apex frames given with the ME videos in the databases. Furthermore, we investigate the influence of the parameter λ, which is used when locating apex frames in ME videos, on the accuracy of the TSCNN in MER tasks.
2) More extensive experiments are conducted to evaluate the TSCNN on five public ME databases: CASME II, SAMM, SMIC-HS, CAS(ME)2, and CASME. In addition, we investigate networks built from different combinations of the three separate recognition streams using the above databases.
The rest of the paper is organized as follows. Our proposed method for MER is presented in Section II. Then, we discuss the experimental results for our method in Section III. Finally, conclusions are drawn in Section IV.

II. PROPOSED METHOD
In this section, we present our proposed method for recognizing MEs in detail. As shown in Fig. 1, our proposed method can be divided into three main parts: apex frame location, spatiotemporal feature extraction, and TSCNN modeling. In the first part, we introduce a reliable apex frame detection algorithm designed for the TSCNN in MER tasks that does not use the indices of apex frames given with the ME videos. In the second part, we introduce the process of extracting spatiotemporal features from the perspective of feature fusion on the network layers of the TSCNN, as well as the form of these features. In the third part, we present the proposed TSCNN, including its detailed structure and how it deals with MER tasks. The details of each part are described in the following subsections.

A. IDENTIFYING APEX FRAMES
Index values of some key frames are typically given with the ME videos in the databases. For example, as shown in Fig. 2, the starting frame when an ME occurs is called the onset frame. The apex frame is the frame where ME intensity reaches its maximum. The ending frame is called the offset frame. The apex frame carries more spatial information about facial muscle micro-movement than other frames because micro-expression intensity reaches its maximum in this frame. The changes of optical flow between these three frames are the most obvious in the whole video. We therefore suggest that the proposed TSCNN can learn significant features from the spatiotemporal information carried by these three frames. In addition, some studies [8], [28] also suggest that the apex frame is typically the most expressive in an ME video, making it more discriminative and effective for ME recognition. For these reasons, we only need to analyze three frames (the onset, apex, and offset frames) instead of spotting facial micro-movements in all frames of a short ME video. Thus, apex frame location plays a critical role in MER tasks, especially when analyzing ME videos without using the indices of apex frames given by the databases. In this subsection, we introduce an approach that can locate apex frames in ME videos for the TSCNN.
To avoid interference from blank regions without MEs, we use a face-detection method based on the work of Rowley et al. [32] to segment the facial region in each frame of the ME videos, and then use the landmark algorithm in [33] to locate 68 facial landmarks using an ensemble of regression trees (ERT). To remove the influence of head posture, we first align the facial regions. As shown in Fig. 3, we locate the two inner eye corners, $(x_1, y_1)$ and $(x_2, y_2)$, and calculate the rotation matrix $R$ from these two points.
In-plane rotation and facial size variations within the facial region are corrected based on
\[ (x', y') = (x, y) R^{T}. \]
After facial alignment, we obtain the inner eye corners $(x_1, y_1)$, $(x_2, y_2)$ and the nasal spine point $(x_3, y_3)$. Based on these three points, we determine the width $\omega = (x_2 - x_1)/2$ and the height $h = (y_3 - y_1)/2$ of every division block and the starting point $(2x_1 - x_2, 2y_1 - y_3)$. Then, the facial region is divided into 6 × 6 equal-sized blocks, as shown in Fig. 3.
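The alignment and block division can be sketched as follows. This is only an illustration: it assumes that R is a plain 2D rotation that levels the line joining the inner eye corners (no scale correction is modeled), and the function names `rotation_from_eye_corners`, `align_points`, and `block_grid` are hypothetical.

```python
import numpy as np

def rotation_from_eye_corners(p1, p2):
    """Rotation matrix R (assumed form) that makes the eye line horizontal."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    theta = np.arctan2(dy, dx)            # tilt of the inter-ocular line
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s], [-s, c]])    # rotates by -theta

def align_points(points, R):
    """Apply (x', y') = (x, y) R^T to an array of row vectors."""
    return np.asarray(points, dtype=np.float64) @ R.T

def block_grid(eye1, eye2, nose, n=6):
    """Return n*n blocks as (x_min, y_min, x_max, y_max), row by row."""
    x1, y1 = eye1
    x2 = eye2[0]
    y3 = nose[1]
    w = (x2 - x1) / 2.0                   # block width
    h = (y3 - y1) / 2.0                   # block height
    x0, y0 = 2 * x1 - x2, 2 * y1 - y3     # starting point (top-left corner)
    return [(x0 + c * w, y0 + r * h, x0 + (c + 1) * w, y0 + (r + 1) * h)
            for r in range(n) for c in range(n)]

# Example with aligned landmarks: eye corners at y = 50, nasal spine at y = 70.
grid = block_grid((40, 50), (60, 50), (50, 70))
```

With these example landmarks, each block is 10 × 10 pixels and the 6 × 6 grid spans x from 20 to 80 and y from 30 to 90, so the eye corners and the nasal spine point sit inside the grid rather than on its border.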
To distinguish the relevant peaks from local magnitude variations in each frame of a short ME video and determine when the ME reaches its maximum, we analyze facial texture and shape appearance. In many studies, LBP and its variants are preferred for this purpose [9]. Ojala et al. [34] proposed a uniform pattern LBP (UP-LBP) to reduce the sparsity caused by high feature dimensions and to improve the statistical properties of facial features. Thus, we calculate the UP-LBP histogram of each block in the facial area of each frame to describe facial texture and shape appearance. For the i-th frame of the input video, we calculate the UP-LBP histogram with P = 8 and R = 3 for each of its 36 blocks as $H_{i,0}, H_{i,1}, \ldots, H_{i,35}$. The dimension of each histogram is 10; thus, each frame corresponds to 36 10-dimensional vectors, $H_i = \{H_{i,0}, H_{i,1}, \ldots, H_{i,35}\}$.
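A 10-bin histogram for P = 8 is what a rotation-invariant uniform LBP produces (P + 2 bins), so the sketch below assumes that variant. It is a simplified illustration, not the implementation used in the paper: neighbours on the circle of radius R are sampled at the nearest pixel instead of by bilinear interpolation, and the function name `riu2_lbp_histogram` is hypothetical.

```python
import numpy as np

def riu2_lbp_histogram(block, P=8, R=3):
    """Normalized (P + 2)-bin rotation-invariant uniform LBP histogram."""
    block = np.asarray(block, dtype=np.float64)
    h, w = block.shape                      # block must exceed 2R pixels per side
    center = block[R:h - R, R:w - R]
    bits = []
    for p in range(P):
        angle = 2.0 * np.pi * p / P
        dy = int(round(-R * np.sin(angle)))  # nearest-pixel offset (assumption)
        dx = int(round(R * np.cos(angle)))
        neigh = block[R + dy:h - R + dy, R + dx:w - R + dx]
        bits.append((neigh >= center).astype(np.int64))
    bits = np.stack(bits, axis=0)           # shape (P, h - 2R, w - 2R)
    # Number of 0/1 transitions around the circular pattern.
    transitions = np.abs(bits - np.roll(bits, 1, axis=0)).sum(axis=0)
    ones = bits.sum(axis=0)
    # Uniform patterns are coded by their number of ones, all others by P + 1.
    codes = np.where(transitions <= 2, ones, P + 1)
    hist = np.bincount(codes.ravel(), minlength=P + 2).astype(np.float64)
    return hist / hist.sum()

flat_hist = riu2_lbp_histogram(np.full((16, 16), 7.0))
```

Applying this function to each of the 36 blocks of a frame would give an array of shape (36, 10) for that frame.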
Feature difference (FD) analysis compares the differences in the appearance-based features of sequential video frames within a specified interval and provides information about the spatial location of the identified facial movements. To capture the greatest changes, we use FD values to roughly locate the apex frame (i.e., the highest-intensity frame of a rapid facial movement). The FD value between $H_i$ of the i-th frame and $H_j$ of the j-th frame is computed block by block from the values of the same dimension α in the k-th UP-LBP histograms corresponding to $H_i$ and $H_j$, which yields one distance per block. Only the largest λ values among the 36 distances are used in the calculation, because the occurrence of an ME results in larger distance values in some (but not all) blocks between two adjacent frames. More details on the use of different values of λ are presented and discussed in Section III.D.
Finally, we calculate $H_{onset}$ of the onset frame and $H_{offset}$ of the offset frame in a short ME video, and then calculate their average $\bar{H}$. The k-th frame whose FD value between $H_k$ and $\bar{H}$ is the greatest is taken as the apex frame of the whole video. As shown in Fig. 4, a higher FD value between $H_k$ and $\bar{H}$ indicates that a muscle movement with a larger amplitude exists in the facial area of the frame.
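The apex search described above can be sketched as follows. The text does not specify the exact distance between block histograms, so this sketch assumes a chi-square distance per block and averages the λ largest block distances; both choices, and the function names, are assumptions made for illustration.

```python
import numpy as np

def block_distances(H_a, H_b, eps=1e-12):
    """Chi-square distance (assumed metric) per block; inputs have shape (36, 10)."""
    num = (H_a - H_b) ** 2
    den = H_a + H_b + eps
    return (num / den).sum(axis=1)          # one distance per block

def feature_difference(H_a, H_b, lam):
    """Mean of the lam largest block distances (assumed aggregation)."""
    d = np.sort(block_distances(H_a, H_b))
    return float(d[-lam:].mean())

def locate_apex(H, lam):
    """H has shape (n_frames, 36, 10); frame 0 is the onset, the last is the offset."""
    H_ref = 0.5 * (H[0] + H[-1])            # average of onset and offset histograms
    fd = np.array([feature_difference(H[k], H_ref, lam) for k in range(len(H))])
    return int(np.argmax(fd)), fd

# Toy sequence: a bump in 5 of the 36 blocks grows and then fades.
base = np.full((36, 10), 0.1)
bump = np.zeros((36, 10))
bump[:5, 0] = 0.5
H_seq = np.stack([base + a * bump for a in (0.0, 0.2, 1.0, 0.3, 0.0)])
apex_index, fd_curve = locate_apex(H_seq, lam=5)
```

In the toy sequence the third frame (index 2) carries the largest bump, so it is the frame the search selects.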

B. SPATIOTEMPORAL FEATURE EXTRACTED BY THE TSCNN FOR MER
A spatiotemporal feature is characterized by the type of information it encodes in space and time, which allows it to describe MEs and represent subtle expressions in videos more efficiently. In this paper, our proposed spatiotemporal feature consists of three components: the static-spatial, local-spatial, and temporal components. Details of each component are described below.

1) STATIC-SPATIAL COMPONENT
Static-spatial information, especially appearance and overall outline information, has gained increasing attention in facial image analysis and has been shown to be effective in tackling MER problems [9], [35], [36]. The static appearance and overall outline of a whole face are the most intuitive cues, since some facial expressions are strongly associated with particular facial muscle contractions. In an ME video, the apex frame carries more spatial information because the facial muscle micro-movement in this frame is more obvious than in other frames. Thus, we use the gray image of the whole face in the apex frame, cropped to 48 × 48 pixels, as the input of the static-spatial recognition stream in the TSCNN. Finally, the static-spatial feature extracted from the whole face is fused with the feature vectors from the other two recognition streams at the second fully connected layer of the TSCNN.
2) LOCAL-SPATIAL COMPONENT
However, static-spatial components alone are not sufficient to represent all characteristics of ME videos when performing ME recognition. Since ME muscle movements appear only in local facial regions (e.g., mouth, cheek, eyebrows, and eyes), the motion changes in these regions convey meaningful information about different MEs. Block-based segmentation of a face is a common practice for extracting facial local features; the blocks can be assigned to regions that contain key facial features with the goal of enhancing recognition power [16]. Some studies [9], [16], [29], [30] have shown that facial local region information contributes to distinguishing different MEs.
However, many methods use block-based segmentation of a face without considering the effects of block size. The contribution of each block in a face can vary considerably across different grid divisions of the face. Thus, we use spatial grids of multiple sizes n × n ∈ {2 × 2, 3 × 3, 4 × 4} to divide the grayscale image of the apex frame into several facial blocks and then stack them to obtain a facial block sequence that serves as the input of the local-spatial stream CNN in the TSCNN; the division detail is shown in Fig. 5. Specifically, the gray image of the apex frame in an ME video sample is scaled to 48n × 48n pixels before image segmentation. The facial block sequence used as input is an $n^2$-channel gray image, and the size of each channel is 48 × 48 pixels. Finally, we obtain the local-spatial component at the last fully connected layer of the local-spatial stream in the TSCNN.
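The block stacking can be sketched as follows. The sketch assumes the apex image has already been scaled to 48n × 48n pixels and that blocks are ordered row by row along the channel axis; the channel ordering and the function name `blocks_to_channels` are assumptions.

```python
import numpy as np

def blocks_to_channels(img, n, block=48):
    """Split a (block*n, block*n) gray image into n*n blocks stacked as channels."""
    img = np.asarray(img)
    assert img.shape == (block * n, block * n), "image must be scaled to 48n x 48n first"
    # Axes after reshape: (block row, pixel row, block column, pixel column).
    tiles = img.reshape(n, block, n, block)
    # Reorder to (pixel row, pixel column, block row, block column).
    tiles = tiles.transpose(1, 3, 0, 2)
    return tiles.reshape(block, block, n * n)

# 3 x 3 division of a synthetic 144 x 144 image.
image = np.arange(144 * 144, dtype=np.float64).reshape(144, 144)
stacked = blocks_to_channels(image, n=3)
```

For n = 3 the result has shape (48, 48, 9), and channel 4 holds the central block of the face image.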

3) TEMPORAL COMPONENT
Compared with still images, videos provide additional data for classification, and their temporal components provide additional information for MER. Many muscle movements emerge in facial regions and can be reliably recognized based on motion information [31]. For example, we select the onset frame $F_{21}$, the apex frame $F_{54}$, and the offset frame $F_{76}$ in the sub01/EP04 02.avi sample of the CASME II database to calculate the horizontal and vertical optical flow fields and visualize them (see Fig. 6). The FACS label of this sample (AU4) indicates a frowning action. Using this image of the optical flow field, we can observe the muscle movements in the subject's eyebrows from the occurrence to the disappearance of an angry micro-expression, although the amplitude of the facial muscle motion between adjacent frames is very small. Thus, using only the spatial component does not capture the motion in ME videos well. In this section, we describe the process of extracting the temporal component from ME videos using the dynamic-temporal stream in the TSCNN. Many studies have used methods based on optical flow (OF) [37]-[42] to characterize the local dynamics of a temporal texture and detect motion information between adjacent frames. The optical flow field is a set of displacement vector fields between pairs of consecutive frames. The horizontal and vertical components of the vector field can be thought of as image channels, which makes them suitable inputs for deep networks to learn high-level features.
Optical flow fields between the three frames (the onset, apex, and offset frames) are calculated by the approach in [43], in which the function flow($F_1$, $F_2$) takes two frames as inputs and returns a horizontal optical flow field $X$ and a vertical optical flow field $Y$:
\[ (X_1, Y_1) = \mathrm{flow}(F_{onset}, F_{apex}), \qquad (X_2, Y_2) = \mathrm{flow}(F_{apex}, F_{offset}), \]
where $F_{onset}$, $F_{apex}$, and $F_{offset}$ represent the onset frame, the apex frame, and the offset frame in an ME video, respectively. Two sets of optical flow fields are obtained in this way. Each set contains two optical flow fields (horizontal and vertical) that describe pixel motion in the x- and y-directions, respectively. Thus, the two sets of optical flow fields represent the ME movement from occurrence to peak and then from peak to termination. Since data in optical flow fields are represented as float64 values, we normalize each optical flow matrix via min-max normalization:
\[ H_{norm} = \frac{H_{org} - \min(H_{org})}{\max(H_{org}) - \min(H_{org})}, \]
where $H_{org}$ and $H_{norm}$ are the matrix before and after normalization, respectively. This linear transformation maps all elements into the [0, 1] interval. We thus obtain two sets of normalized optical flow fields for each ME video in a given database and stack them in the same way as for the local-spatial component, which gives a 4-channel image of size 48 × 48 pixels. Finally, we take the 4-channel image as the input of the dynamic-temporal recognition stream and obtain the temporal component, which is a 1024-dimensional vector.
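The normalization and stacking step can be sketched as follows. The optical flow computation itself is not shown; the sketch assumes the four flow fields have already been computed and resized to 48 × 48, that each field is normalized independently, and that a constant field maps to zeros. The function names are hypothetical.

```python
import numpy as np

def min_max_normalize(h_org):
    """Linearly map a matrix into [0, 1]."""
    h_org = np.asarray(h_org, dtype=np.float64)
    lo, hi = h_org.min(), h_org.max()
    if hi == lo:                             # constant field: avoid division by zero
        return np.zeros_like(h_org)
    return (h_org - lo) / (hi - lo)

def stack_flow_fields(x1, y1, x2, y2):
    """Stack the onset->apex and apex->offset flow fields as a 4-channel image."""
    fields = [min_max_normalize(f) for f in (x1, y1, x2, y2)]
    return np.stack(fields, axis=-1)

# Synthetic flow fields standing in for flow(F_onset, F_apex) and flow(F_apex, F_offset).
rng = np.random.default_rng(0)
x1, y1, x2, y2 = (rng.normal(scale=0.3, size=(48, 48)) for _ in range(4))
temporal_input = stack_flow_fields(x1, y1, x2, y2)
```

The resulting array has shape (48, 48, 4), matching the 4-channel input described for the dynamic-temporal stream.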

C. TSCNN MODEL FOR ME RECOGNITION
The proposed TSCNN builds on research on CNNs. It is composed of multiple processing layers in a multi-stream architecture and can learn representations of data at multiple levels of abstraction. The TSCNN consists of three CNN streams, namely the static-spatial stream (S), the local-spatial stream (L), and the dynamic-temporal stream (T), which learn discriminative features for recognizing MEs from three different cues in the three key frames of ME videos. Its detailed structure, and how it deals with MER tasks, is shown in Fig. 7.
To reduce redundant parameters and realize parameter sharing, each stream module in the TSCNN has the same structure. This design aims to let the TSCNN fit its parameters in a short time and reduce the amount of training. As shown in Table 1, each recognition stream is a simplified network that uses 2D convolution kernels and pooling cells to automatically represent the properties of subtle facial movements. The three recognition streams are then combined by late fusion in a fully connected layer.
Among the three recognition streams in our TSCNN, the static-spatial recognition stream (S) operates on individual video frames (here, the apex frame), effectively performing recognition using still images. We use the gray image of the apex frame as the input of this recognition stream. The local-spatial recognition stream (L) operates on the $n^2$-channel gray image obtained by stacking the n × n blocks of the gray image of the apex frame. The input to the dynamic-temporal stream (T) contains the optical flow displacement fields between the three frames (the onset, apex, and offset frames), whose center frame is the apex frame. We use the dynamic-temporal stream with optical flow sequences so that the TSCNN can acquire further higher-level features. Such inputs explicitly describe the motion between video frames, which significantly improves accuracy and makes ME recognition easier.
Each recognition stream is compact, with only 9 layers: 5 convolutional layers, 3 pooling layers, and 1 fully connected layer. For the first convolutional layer in each recognition stream, the kernel size is 5 × 5, the stride is 1, and the padding mode is "valid". For the other four convolutional layers in each recognition stream, we use a kernel size of 3 × 3 with a stride of S = 1 and zero padding of 1. The number of kernels (N) for the five layers is 64, 64, 64, 128, and 128, respectively. The N value of the last two convolutional layers is larger than that of the other layers, which increases the computational complexity of the network. However, many studies [9], [44] have demonstrated that a large N can cover more abstract features of certain important facial regions, such as the eye or mouth region, and thus improve the performance of MER.
The three pooling layers of each stream are used to downsample the spatial dimensions of the input; they comprise one max pooling layer with a window size of 5 × 5 and two mean pooling layers with a window size of 3 × 3. The stride of each pooling layer is 2, and the number of kernels (N) is 64, 64, and 128, respectively. This design is important in real applications because there is so far no agreed standard frame rate for recording micro-expressions (i.e., ME videos may be recorded at various frame rates). The design of different network streams can adapt to different frame rates, which may make the whole network robust to the frame rate of the input data.
The final layer of each recognition stream is a fully connected layer, and all three have the same configuration. Their output dimensions are all set to 1024 to reduce the number of parameters in the model and prevent overfitting. Then, the outputs of the three recognition streams are merged into a 3072-dimensional feature vector. In the final layer of the TSCNN, we transform this feature vector into one that has the same dimension as the number of ME classes in the MER task. Thus, the output dimension of the final layer differs between databases.
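The late-fusion step can be illustrated with a small NumPy sketch. It is not the trained model: the three 1024-dimensional stream outputs and the weights of the final layer are random placeholders, a softmax output is assumed because the network is trained with cross entropy, and the names `late_fusion`, `W`, and `b` are hypothetical.

```python
import numpy as np

def late_fusion(f_static, f_local, f_temporal, W, b):
    """Concatenate three stream features and map them to class probabilities."""
    fused = np.concatenate([f_static, f_local, f_temporal])   # 3 * 1024 = 3072
    logits = W @ fused + b                                    # one score per ME class
    exp = np.exp(logits - logits.max())                       # numerically stable softmax
    return fused, exp / exp.sum()

rng = np.random.default_rng(0)
n_classes = 5                                 # depends on the database
stream_features = [rng.standard_normal(1024) for _ in range(3)]
W = rng.standard_normal((n_classes, 3072)) * 0.01
b = np.zeros(n_classes)
fused, probs = late_fusion(*stream_features, W, b)
predicted_class = int(np.argmax(probs))
```

Only the shape of `W` changes between databases, since the number of rows equals the number of ME classes.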
All hidden layers are equipped with the Parametric Rectified Linear Unit (PReLU) function, which is defined as follows:
\[ f(x_i) = \max(0, x_i) + a_i \min(0, x_i), \]
where $i$ denotes the channel, and $a_i$ is a parameter obtained during training. Compared with traditional activation functions (sigmoid, tanh, ReLU, etc.), the PReLU can improve the classification performance of the CNN model at the cost of some additional overfitting risk and computational complexity. Cross entropy is used as the loss function of the TSCNN, which can be defined as:
\[ L = -\frac{1}{N} \sum_{n=1}^{N} \sum_{j=1}^{Y} \mathbb{1}\{y_n = j\} \log P_{n,j}, \]
where $N$ denotes the number of training samples, $Y$ is the number of emotion types, $y_n$ is the label of the n-th training sample, and $P_{n,j}$ represents the predicted probability that the n-th training sample belongs to the j-th class. We use the backpropagation (BP) algorithm to minimize the loss function of the TSCNN and update the weight parameters. The training optimizer is the stochastic gradient descent (SGD) algorithm with Nesterov momentum, where α denotes the learning rate. The attenuation of the weight parameters is set to $10^{-5}$, and the correction factor is set to 0.9.
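The activation, loss, and optimizer step can be sketched in NumPy as follows. The update shown is the standard Nesterov momentum form; the sketch assumes that the "correction factor" of 0.9 is the momentum coefficient and that the "attenuation" of 10^-5 is an L2 weight decay, neither of which is confirmed by the text. The toy quadratic objective and all function names are illustrative only.

```python
import numpy as np

def prelu(x, a):
    """PReLU: identity for positive inputs, slope a for negative inputs."""
    x = np.asarray(x, dtype=np.float64)
    return np.maximum(0.0, x) + a * np.minimum(0.0, x)

def cross_entropy(P, y):
    """Mean negative log-probability of the true class; P is (N, Y), y is (N,)."""
    P = np.asarray(P, dtype=np.float64)
    y = np.asarray(y)
    return float(-np.mean(np.log(P[np.arange(len(y)), y])))

def nesterov_step(w, v, grad_fn, lr=1e-3, momentum=0.9, weight_decay=1e-5):
    """One SGD step with Nesterov momentum (gradient taken at the look-ahead point)."""
    look_ahead = w + momentum * v
    g = grad_fn(look_ahead) + weight_decay * look_ahead
    v_new = momentum * v - lr * g
    return w + v_new, v_new

# Toy check on f(w) = 0.5 * ||w||^2, whose gradient is w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = nesterov_step(w, v, grad_fn=lambda x: x, lr=0.1)
```

On this toy objective the iterate contracts toward the minimizer at the origin; in the actual network, `grad_fn` would be the gradient of the cross-entropy loss obtained by backpropagation.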

III. EXPERIMENTS
In this section, we present experimental results of our proposed method in detail, including the datasets we used, the implementation details, and the comparison of experimental results, etc.

A. DATABASES AND EXPERIMENT SETTING
In this section, we conduct extensive MER experiments to evaluate our proposed TSCNN method on the CASME II [45], SMIC-HS [46], SAMM [47], CAS(ME)2 [48], and CASME [49] databases.
... (57), contempt (12), happiness (26), surprise (15), and others (26), where the numbers in brackets are the numbers of corresponding MEs present in the database.
4) The CAS(ME)2 database (Chinese Academy of Sciences Macro- and Micro-expression) was established by the Chinese Academy of Sciences. The database includes both spontaneous macro (300) and micro (57) expression video sequences of 22 subjects (13 females and 9 males). These videos were captured by a camera with a 500-ms shutter speed, and the recorder's resolution was set to 640 × 480 pixels at 30 frames per second. By extracting more than 600 AUs, these image sequences are categorized into three emotion classes: anger, happy, and disgust. In our experimental setup, we selected 341 image sequences of macro- and micro-expressions: anger (102), happy (151), and disgust (88), where the numbers in brackets are the numbers of corresponding expressions present in the database. To ensure fair comparisons with other methods, such as the 3D-CNN-based techniques in [21], we also report the recognition results under the same conditions as in that work.
5) In the CASME database, all samples were coded with onset, apex, and offset frames, with action units (AUs) marked and emotions labeled. There are 8 classes of micro-expressions in this database: tense, disgust, repression, surprise, happiness, fear, sadness, and contempt. Since the three classes of happiness, fear, and sadness contain very few samples, we chose the following four classes in our experiment: tense (69), disgust (44), repression (38), and surprise (20).
For all experiments on the above five public databases, the leave-one-subject-out (LOSO) protocol is used, and the recognition accuracy and mean $F_1$-score are calculated to report the performance of the MER methods. In each fold, the samples of one subject are used as the test set, while the remaining samples are used for training. This protocol prevents samples from the same subject from appearing in both the training and test sets, thus ensuring the reliability of the experimental results.
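The LOSO protocol can be sketched as a simple index generator. The subject identifiers below are made up for illustration, and `loso_splits` is a hypothetical helper name.

```python
import numpy as np

def loso_splits(subject_ids):
    """Yield (subject, train_idx, test_idx), holding out one subject per fold."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_idx = np.flatnonzero(subject_ids == subject)
        train_idx = np.flatnonzero(subject_ids != subject)
        yield subject, train_idx, test_idx

# Six samples recorded from three subjects.
subjects = [1, 1, 2, 3, 3, 3]
folds = list(loso_splits(subjects))
```

Each subject appears in exactly one test fold, so the number of folds equals the number of subjects in the database.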
The accuracy rate is calculated from $T_i$ and $N_i$, the number of correct predictions and the number of testing samples, respectively, when the samples of the i-th subject are used as the test set. The accuracy rate shows the average "hit rate" across all classes and therefore does not, on its own, evaluate the performance of the algorithm objectively.
The CASME II, SMIC-HS, SAMM, CAS(ME)2, and CASME databases are highly imbalanced [19], [48]-[52], which means that some types of micro-expression samples are significantly more or less abundant than others. The accuracy rate does not reveal how well the classifier handles each emotion class. Thus, we calculate an $F_1$-score to describe the classification effect for each class, and use it along with the accuracy rate as a criterion to measure network performance. The $F_1$-score can be defined as:
\[ F_1 = \frac{1}{c} \sum_{i=1}^{c} \frac{2 p_i r_i}{p_i + r_i}, \]
where $p_i$ and $r_i$ are the precision and recall of the i-th micro-expression class, respectively, and $c$ is the number of micro-expression classes.
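Both metrics can be sketched as follows. The accuracy here pools the correct predictions and test samples over all LOSO folds, which is one common convention and an assumption about the exact formula; the mean $F_1$-score averages the per-class $F_1$ values, with a class that is never predicted or never present contributing 0. Function names are hypothetical.

```python
import numpy as np

def loso_accuracy(correct_per_subject, total_per_subject):
    """Pooled accuracy: sum of T_i over sum of N_i (assumed convention)."""
    return float(np.sum(correct_per_subject)) / float(np.sum(total_per_subject))

def mean_f1(y_true, y_pred, n_classes):
    """Average of per-class F1 = 2 * p * r / (p + r)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for i in range(n_classes):
        tp = np.sum((y_pred == i) & (y_true == i))
        fp = np.sum((y_pred == i) & (y_true != i))
        fn = np.sum((y_pred != i) & (y_true == i))
        p = tp / (tp + fp) if tp + fp > 0 else 0.0    # precision of class i
        r = tp / (tp + fn) if tp + fn > 0 else 0.0    # recall of class i
        scores.append(2 * p * r / (p + r) if p + r > 0 else 0.0)
    return float(np.mean(scores))

acc = loso_accuracy([3, 1], [4, 2])
f1 = mean_f1([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
```

In the small example, class 0 has precision 1 and recall 0.5, and class 1 has precision 2/3 and recall 1, so the two per-class scores differ even though a single accuracy figure would hide that difference.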

B. IMPLEMENTATION DETAILS
We set the input image size of each recognition stream in the TSCNN to 48 × 48, and the facial block number in the local-spatial stream is set to 3 × 3. The base learning rate is set to $10^{-3}$ in the experiment because of the difficulties related to the subtlety of MEs. The attenuation of the weight parameters is set to $10^{-5}$, and the correction factor is set to 0.9. Dropout is used on all fully connected layers in the TSCNN model to avoid overfitting. The λ values used in the experiments with the CASME II, SMIC-HS, SAMM, CAS(ME)2, and CASME databases are 25, 21, 23, 25, and 20, respectively, when locating apex frames in ME videos.
To train the TSCNN model to distinguish MEs, a large amount of training data is needed. However, only a few key expression frames can be selected for training from an ME video. We thus expand the number of training samples by applying a horizontal flip and clockwise/counterclockwise rotations in 5 or 10 degree increments to the original samples, a total of 10 times, as shown in Fig. 8. This process yields 2470, 1360, 1640, 3410, and 1710 samples from the CASME II, SAMM, SMIC-HS, CAS(ME)2, and CASME databases, respectively. Once the training data are ready, we train the TSCNN.
We pretrain the static-spatial recognition stream on the large facial expression database FER2013 and use the obtained weights for initialization. The weights of the local-spatial recognition stream and the dynamic-temporal recognition stream are randomly initialized. Mini-batches are not used in the experiment because of the small sample size. Our TSCNN model is trained with early stopping over 500 epochs in each fold. When the validation loss curve is generally stable, training for the fold stops, and our TSCNN model outputs the emotion classification label.

C. COMPARISON WITH THE STATE-OF-THE-ART METHODS
In this subsection, we compare the best result achieved by our method with those of other state-of-the-art methods [5], [9], [16], [18]-[25], [29], [50]-[84] using the five public ME databases (CASME II, SMIC-HS, SAMM, CAS(ME)2, and CASME). The LOSO protocol was used for all the methods. In Tables 2 through 6, TSCNN-II represents the results achieved by the TSCNN when using the apex frame given by the databases. TSCNN-I represents the results achieved by the TSCNN when using the apex frame located by our proposed approach in Section II.A.
From Tables 2 through 6, our TSCNN is shown to yield an accuracy of 80.97% and an $F_1$-score of 0.8070 with the CASME II database; 71.76% and 0.6942 with the SAMM database; 75.41% and 0.7463 with the CAS(ME)2 database; and 73.88% and 0.7270 with the CASME database when we use the index values of key frames given by these databases in MER tasks. Thus, our TSCNN shows significant improvement in recognition compared to other methods. Additionally, our TSCNN model achieves improved classification results in MER tasks even when assuming that these databases do not provide index values and that apex frames must be located. In this case, the accuracies and $F_1$-scores with the CASME II, SMIC-HS, SAMM, CAS(ME)2, and CASME databases are 74.05% and 0.7327; 72.74% and 0.7236; 63.53% and 0.6065; 71.62% and 0.7129; and 70.73% and 0.6736, respectively.
As described above, the experimental performance of the TSCNN-I in MER is worse than that of the TSCNN-II. This result agrees with our expectations, because accurately locating an apex frame in an ME video is difficult and may decrease the performance of the deep learning method. Additionally, many other methods [8], [28], [85], [86] only locate apex-feature time intervals roughly.
Next, to analyze the recognition performance of our TSCNN in MER tasks, we only compare the results of the TSCNN-I and other methods when simulating the MER problem without using true indices given by the databases.
As shown in Table 5, our TSCNN-I yields a recognition accuracy of 71.62% and an F1-score of 0.7129 when a total of 341 image sequences of macro- and micro-expressions are selected. Compared with the results of other state-of-the-art approaches [65], [76], [77], our method is very competitive on this database. We also compare the TSCNN with two 3D-CNN methods (MicroExpSTCNN and MicroExpFuseNet) that were proposed in [21] using only micro-expression videos that have the same number of samples as  (LTOGP (with FS) [82], LTOGP (without FS) [82], FDM [54], DiSTLBP-RIP [83], and STCLQP [53]), our method exhibits an improvement of 2.09%, 9.66%, 14.59%, 6.4%, and 13.42% in accuracy. Thus, our TSCNN-I yields better recognition and outperforms other methods, especially in the absence of index values given by the databases.
We also calculate the confusion matrix for each of the five databases to determine the recognition performance of the TSCNN for each emotion label, as shown in Fig. 9. For the CASME II database, the TSCNN yielded improved recognition, especially on the "surprise" and "others" labels. However, the method still encountered a bottleneck with the "repression" label, because repression emotions have a relatively small range of muscle motion and are thus more difficult to detect and classify correctly. For the SMIC-HS database, our TSCNN yielded an improved recognition result for all labels. For the SAMM database, the network performed well on two labels (anger and others) but did not perform well on the "contempt" and "surprise" labels. This result agrees with our expectations, because all poorly performing labels have a small sample size, which hinders such deep-learning methods. For the CAS(ME)2 database, the network performed well on two labels (anger and disgust) but did not perform well on the "happy" label. For the CASME database, the TSCNN performed well on three labels (disgust, surprise, and tense) but did not perform well on the "repression" label.
The above results show that, for MER tasks, the recognition performance of our TSCNN is superior to that of the image feature extraction methods used by most previous researchers.
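The per-label analysis above relies on a confusion matrix. A minimal sketch of how such a matrix and the per-label recall (the row-normalized diagonal) can be computed is given below; the function names and the toy predictions are hypothetical and are not results from the paper.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_label_recall(matrix, labels):
    """Fraction of each true label that was classified correctly."""
    recalls = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recalls[label] = matrix[i][i] / row_total if row_total else 0.0
    return recalls


# Toy predictions for illustration only.
labels = ["disgust", "repression", "surprise"]
y_true = ["disgust", "disgust", "repression", "surprise", "surprise"]
y_pred = ["disgust", "disgust", "disgust", "surprise", "repression"]
matrix = confusion_matrix(y_true, y_pred, labels)
print(matrix)
print(per_label_recall(matrix, labels))
```

Reading the matrix row by row makes it easy to see which labels are confused with each other, which is how weak labels such as those with few samples can be identified.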

D. PARAMETER ANALYSIS
In this subsection, we analyze the parameters of the proposed method and evaluate their impact individually. The block pattern of the input image for the local-spatial stream, the number of recognition streams in the network, and the effect of λ on the TSCNN are reported and discussed in this section. The following experimental results are obtained under the assumption that the five databases do not provide apex frame indices.

1) EVALUATION OF DIFFERENT BLOCK PATTERNS
The facial block sequence that serves as the input of the local-spatial recognition stream in the TSCNN differs depending on which spatial grid (2×2, 3×3, or 4×4) is used to divide the gray image of the apex frame. To test which block pattern is optimal, we compared the performance of the TSCNN under these three cases. The experimental results are shown in Table 7: the TSCNN yields the best results (74.05% for the CASME II database, 72.74% for the SMIC-HS database, 63.53% for the SAMM database, 71.62% for the CAS(ME)2 database, and 70.73% for the CASME database) when the gray image of the apex frame is divided into 3×3 image blocks and used as the input of the local-spatial recognition stream in the TSCNN.
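The block patterns compared above can be illustrated with a short sketch that divides a grayscale image into a grid of blocks. This is only an assumption about how the division might be done: the function name `split_into_blocks`, the row-major block order, and the handling of image sizes that are not multiples of the grid size are hypothetical choices.

```python
def split_into_blocks(image, grid):
    """Split a 2-D grayscale image (list of rows) into grid x grid blocks.

    Blocks are returned in row-major order. When the image size is not a
    multiple of `grid`, integer arithmetic spreads the remainder so that
    block sizes differ by at most one pixel.
    """
    height, width = len(image), len(image[0])
    if grid < 1 or grid > min(height, width):
        raise ValueError("grid must be between 1 and the smaller image side")
    row_edges = [r * height // grid for r in range(grid + 1)]
    col_edges = [c * width // grid for c in range(grid + 1)]
    blocks = []
    for r in range(grid):
        for c in range(grid):
            rows = image[row_edges[r]:row_edges[r + 1]]
            block = [row[col_edges[c]:col_edges[c + 1]] for row in rows]
            blocks.append(block)
    return blocks


# Toy 6x6 "image" whose pixel value encodes its position.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
for grid in (2, 3, 4):
    print(grid, len(split_into_blocks(image, grid)))
```

A finer grid yields more, smaller blocks, so each block covers less of a facial region; the 3×3 pattern reported as best sits between the coarse and fine extremes.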

2) EVALUATION OF THE TSCNN ARCHITECTURE
To analyze our network's structure in depth and find the most prominent module, we compare the results of the TSCNN with those of networks that retain two recognition streams and those of networks that retain only a single stream. We set the block pattern to 3×3 for the local-spatial stream.
The performance of the TSCNN is significantly better than that of the single-stream and two-stream networks, with the dynamic-temporal stream being the most prominent module. These results agree with our assumptions, because the image of the optical flow field describes the two-dimensional projection of an ME movement intuitively and makes it easy to distinguish ME emotion categories. Additionally, the results show that the three streams of our TSCNN make better use of various forms of effective characteristics for ME recognition, yielding better performance in MER tasks than single characteristics.
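The set of network variants in such an ablation can be enumerated mechanically, as in the sketch below. The helper name `stream_configurations` is hypothetical, and the sketch only lists the seven possible stream combinations (three single-stream, three two-stream, and the full three-stream network); it does not reproduce any result from the paper.

```python
from itertools import combinations

STREAMS = ("dynamic-temporal", "static-spatial", "local-spatial")


def stream_configurations(streams=STREAMS):
    """List every non-empty combination of recognition streams."""
    configs = []
    for size in range(1, len(streams) + 1):
        configs.extend(combinations(streams, size))
    return configs


for config in stream_configurations():
    print(len(config), " + ".join(config))
```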

3) THE IMPACT OF PARAMETER λ ON THE TSCNN
In this subsection, we analyze how λ (see Section II.B) affects the proposed TSCNN model for MER tasks. Its effect is evaluated using the CASME II, SMIC-HS, SAMM, CAS(ME)2, and CASME databases. Specifically, we vary the value of λ and observe the recognition results of the TSCNN on the five databases, as shown in Table 8. The MER accuracy remains stable when λ varies within a given range. This stability is consistent with the fact that the occurrence of an ME is a process of gradual change in facial expression intensity: if the apex frame located by our method falls on a frame adjacent to the real apex frame, the classification performance of the TSCNN in MER tasks remains stable and satisfactory when the located frames are applied to the TSCNN.

IV. CONCLUSION
In this paper, we propose a three-stream convolutional neural network (TSCNN) for ME recognition. Experiments are conducted on five public spontaneous ME databases (CASME II, SMIC-HS, SAMM, CAS(ME)2, and CASME) to evaluate the proposed method. The experimental results show that our method effectively improves recognition accuracy in MER tasks compared with other methods on the same five databases. This paper also summarizes problems that have not yet received sufficient attention in research but are crucial for feasible MER interpretations. Incorporating static-spatial, local-spatial, and temporal information associated with MEs is shown to be important for describing MEs and aids in distinguishing them. In our method, the dynamic-temporal recognition stream plays a critical role and depends on the calculation of optical flow. However, this calculation has a high computational cost and thus must be performed offline; this is the key bottleneck in applying the method. In the future, we plan to study faster optical flow calculation methods to facilitate the use of the proposed method in real-time identification. We also plan to design a simpler network structure with multiple recognition tubes to handle ME details and to use different datasets of spontaneous MEs with various kinds of metrics.

FIGURE 1. The framework of our proposed method for micro-expression recognition.

FIGURE 2. A demonstration of an ME short video. The FACS label of this sample is AU4, which indicates anger. The apex frame is the 54th frame of this video. A subtle frowning action can easily be noticed on the apex frame when observing each frame of the ME video with the naked eye.

FIGURE 3. An example of 36 facial blocks yielded by a 6×6 grid on a frame in the ME short video.

FIGURE 4. An illustration of how to identify apex frames from ME short videos. The top shows the UP-LBP histogram of each block in the facial regions, which describes the facial texture and shape appearance of each frame. The bottom shows the FD that is calculated to locate the apex frame (i.e., the highest-intensity frame of the rapid facial movement).

FIGURE 6. The horizontal and vertical optical flow fields and their visualization.

5) The CASME database was built by the Chinese Academy of Sciences. The database contains two datasets, A and B, with 195 ME samples from 19 subjects; videos were recorded at 60 fps. The video clips in dataset A were recorded at a resolution of 1280 × 720 pixels in natural light. The samples in dataset B were recorded at a resolution of 640 × 480 pixels under LED illumination.

TABLE 1. The configuration of the TSCNN network.
FIGURE 7. The architecture of the TSCNN network.
are used in our experiments as they are widely used spontaneous ME databases. Details of the five spontaneous ME databases used in this paper are listed below.
1) The CASME II database was collected by Yan et al. from the Institute of Psychology, Chinese Academy of Sciences. The database includes 247 ME samples with high spatial and temporal resolutions from 26 subjects. (27) Face videos were recorded at 200 fps, with an average face size of 280×340. These samples are categorized into 5 ME classes: happiness (32), surprise (25), disgust (64), repression (27), and others (99), where the numbers in parentheses are the numbers of corresponding ME samples in the database. We use all ME samples in the CASME II database for experimentation.

TABLE 2. Comparison between our method and some state-of-the-art methods on the CASME II database.

TABLE 3. Comparison between our method and some state-of-the-art methods on the SMIC-HS database.

TABLE 4. Comparison between our method and some state-of-the-art methods on the SAMM database.

TABLE 5. Comparison between our method and some state-of-the-art methods on the CAS(ME)2 database.

TABLE 6. Comparison between our method and some state-of-the-art methods on the CASME database.

TABLE 8. The influence of parameter λ on the TSCNN model.