Cross-Database Micro-Expression Recognition Based on a Dual-Stream Convolutional Neural Network

Cross-database micro-expression recognition (CDMER) under semi-supervised conditions is a difficult task in which the target (testing) and source (training) samples come from different micro-expression (ME) databases. The resulting inconsistency between their feature distributions degrades the performance of many existing MER methods. To address this problem, we propose a dual-stream convolutional neural network (DSCNN) for CDMER tasks. In the DSCNN, two stream branches are designed to learn temporal and facial region cues in ME samples with the goal of recognizing MEs. In addition, during training, a domain discrepancy loss is used to encourage the target and source samples to have similar feature distributions in some layers of the DSCNN. Extensive CDMER experiments are conducted to evaluate the DSCNN. The results show that the proposed DSCNN achieves higher recognition accuracy than some representative CDMER methods.

The above methods are evaluated in an ideal scenario in which the testing and training samples are drawn from the same database. In this case, the training and testing samples can be assumed to follow the same or similar feature distributions. However, in many applications, the testing and training samples come from different databases (i.e., the target database and the source database) that were recorded with different cameras, subjects, and stimulus materials under different environments. Theoretical and empirical results [21]-[26] have shown that training and testing databases can have a large feature distribution difference and that the test error increases in proportion to it. This leads to a new topic in micro-expression analysis, i.e., cross-database micro-expression recognition (CDMER), in which the training and testing samples come from two different micro-expression databases collected by different cameras or under different environments [27]. CDMER can be viewed as a domain adaptation (DA) problem. For two different databases, traditional classifiers learned in a source domain do not necessarily transfer well to target domains. DA methods aim to learn feature representations that are both discriminative and domain invariant.
Many classical domain adaptation methods for cross-database recognition can be applied to CDMER. Zong et al. [28] proposed a domain adaptation method based on a target sample regenerator (TSRG) to deal with the CDMER problem. Hassan et al. [29] proposed an importance-weighted SVM (IW-SVM) to eliminate the feature distribution mismatch between different samples and improve the classification accuracy across databases. Long et al. [30] proposed transfer kernel learning (TKL) to learn a domain-invariant kernel that eliminates the feature distribution difference between samples from different databases. Gong et al. [31], [32] proposed the geodesic flow kernel (GFK) to bridge two different databases and narrow their gap with a well-designed geodesic flow kernel on a Grassmann manifold. Chu et al. [33], [34] proposed a selective transfer machine (STM) to model the relationship between the training samples and their AU information; it learns a group of weight values so that the testing samples have a feature distribution similar to that of the training samples. Fernando et al. [35] proposed subspace alignment (SA), which seeks a mapping function that aligns the subspace of the source samples with that of the target samples. Pan et al. [36] proposed transfer component analysis (TCA), which is based on a reproducing kernel Hilbert space and eliminates the distribution difference between samples from different domains by seeking transfer components across domains. Li et al. [37] proposed a target-adapted least-squares regression (TALSR) method, which learns a regression coefficient matrix from the source samples and their label information that also suits the target ME database.
Benefiting from the above methods, we propose a dual-stream CNN (DSCNN) to address the CDMER task by learning a group of weight values from the labeled source samples and the unlabeled target samples. We calculate the MMD value [38] between the output distributions of the two domains on some fully connected layers of the DSCNN and use it as the domain discrepancy loss during training, which reduces the feature distribution difference between samples from the two domains. The two stream branches of the DSCNN jointly learn spatio-temporal features from different input cues in ME samples, with the aim of improving the representation of MEs and optimizing cross-database micro-expression classification. In addition, we visualize the feature maps of intermediate activations that are output by various convolution and pooling layers in the DSCNN. Extensive cross-database experiments are conducted under the protocol designed in [39], and the experimental results are compared with those of some representative methods for CDMER tasks. The results indicate that the DSCNN has advantages over these representative methods.
The rest of the paper is organized as follows. In Section II, we describe the dual-stream convolutional neural network (DSCNN) model for CDMER in detail. Extensive experiments and analyses are given in Section III. Finally, the conclusion is drawn in Section IV.

II. PROPOSED METHOD
The DSCNN consists of two stream branches, which jointly learn spatio-temporal features from two separate input cues in ME video samples. Each branch in the DSCNN is a convolutional neural network built from 2D convolution kernels, pooling cells, and fully connected cells, and the two branches have the same structure. Using identical branches allows the DSCNN to fit its parameters in a short time by reducing redundant parameters and realizing parameter sharing. Specifically, each stream branch in the DSCNN consists of 9 network processing layers: 1 fully connected layer, 3 pooling layers, and 5 convolutional layers, as shown in Table 1.
For the 5 convolutional layers in each branch, the number of convolutional kernels (N) is set to 64, 64, 64, 128, and 128, respectively. The N value of the last two convolutional layers is larger than that of the first three. Many studies [40], [41] show that gradually increasing N with depth helps the network learn more abstract features from important facial regions related to expression, such as the mouth or eye region. For the first convolutional layer, we use a kernel size of 5 × 5 with a stride of S = 1 and "valid" padding (i.e., no zero padding). For the other four convolutional layers, the kernel size is 3 × 3, the stride is 1, and the zero padding is 1.
For the 3 pooling layers in each branch, the number of kernels (N) is set to 64, 64, and 128, respectively. For the max pooling layer, we use a window size of 5 × 5 with a stride of 2 and a zero padding of 2. For the 2 average pooling layers, we use a window size of 3 × 3 with a stride of 2, and the zero padding is set to 0 and 1, respectively. The three pooling layers downsample the features that are learned from spatio-temporal cues in ME video samples.
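The kernel, stride, and padding settings above determine how the spatial size of the feature maps shrinks through a branch. The following is a minimal sketch of that arithmetic, assuming the common floor convention for output sizes and a 48 × 48 input. The layer ordering in `LAYERS` is hypothetical and chosen only for illustration (the actual order is the one given in Table 1), and the names `out_size`, `trace_sizes`, and `LAYERS` are not from the paper.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv/pooling layer (floor convention, an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding). The ORDER is a hypothetical arrangement;
# only the per-layer settings come from the text.
LAYERS = [
    ("conv1 5x5 valid", 5, 1, 0),
    ("maxpool 5x5", 5, 2, 2),
    ("conv2 3x3", 3, 1, 1),
    ("avgpool 3x3 pad 0", 3, 2, 0),
    ("conv3 3x3", 3, 1, 1),
    ("conv4 3x3", 3, 1, 1),
    ("avgpool 3x3 pad 1", 3, 2, 1),
    ("conv5 3x3", 3, 1, 1),
]


def trace_sizes(input_size, layers):
    """Return [(layer name, output size)] for a square input of side input_size."""
    sizes = []
    size = input_size
    for name, kernel, stride, padding in layers:
        size = out_size(size, kernel, stride, padding)
        sizes.append((name, size))
    return sizes


if __name__ == "__main__":
    for name, size in trace_sizes(48, LAYERS):
        print(f"{name:>20}: {size} x {size}")
```

Under these assumptions, the 3 × 3 convolutions with padding 1 preserve the spatial size, so only the first convolution and the three pooling layers reduce it.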
For the final fully connected layer in each branch, the output dimension is set to 1024, which reduces the number of parameters in the DSCNN. At the end of the two recognition stream branches, the outputs are merged into a 2048-dimensional feature vector. In the last fully connected layer of the DSCNN, the output dimension is set equal to the number of sample categories in the ME databases. All hidden layers of the DSCNN are equipped with the PReLU function in [42], which is defined as follows:
$$f(y_i) = \max(0, y_i) + a_i \min(0, y_i)$$
where y_i is the input to the activation on the i-th channel, i is the channel number, and a_i is a parameter obtained in the training process. Compared with other activation functions, such as sigmoid, tanh, and ReLU, the PReLU activation function can improve the classification ability of the CNN model at almost no additional computational cost or risk of overfitting.
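As a small illustration of the PReLU activation used in the hidden layers, the sketch below evaluates it for a single value. The function name `prelu` and the slope value 0.25 in the usage lines are illustrative assumptions; in the DSCNN the slope a_i is learned per channel during training.

```python
def prelu(y, a):
    """PReLU: identity for positive inputs, slope `a` for negative inputs."""
    return max(0.0, y) + a * min(0.0, y)


if __name__ == "__main__":
    # The slope 0.25 is only an example value, not a value from the paper.
    print(prelu(3.0, 0.25))
    print(prelu(-2.0, 0.25))
```

With a = 0 this reduces to ReLU, which is why PReLU can be seen as a learnable generalization of it.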
Micro-expressions are transient in an ME video, and the facial muscle actions emerge in only small regions of the face during a very short time. In the training process of the DSCNN, to reduce data redundancy and improve computational efficiency, we use only three important frames (i.e., the onset, apex, and offset frames) of each ME video. We use two stream ConvNets in the DSCNN to learn feature representations from the spatial and temporal cues in these three frames. In each ME video, the three frames are resized to 48 × 48 after face alignment and face cropping. The apex frame of each ME video can be selected by the automatic apex frame spotting strategy in [43]; this frame has the largest facial action amplitude and carries more expression information because its facial muscle micro-movement is more obvious than that of other frames. The spatial stream ConvNet in the DSCNN operates on the gray image of the resized apex frame, learning useful cues associated with particular facial actions from the single frame. The input to the temporal stream ConvNet is the optical flow displacement field between the three resized frames, which is calculated by the method in [44]. Such input explicitly describes the motion between video frames, so the network does not need to implicitly estimate a series of subtle facial movements throughout the whole ME video. The temporal stream ConvNet ensures that the DSCNN can further learn higher-level features from temporal cues in ME videos for MER tasks.
To ensure that the DSCNN has sufficient training samples, we expand the number of samples by taking the gray image of the resized apex frame and the optical flow displacement field obtained from each ME video and applying a horizontal flip and clockwise/counterclockwise rotation in 5 or 10 degree increments a total of 10 times. When these samples are ready, we begin to train the DSCNN.
In the CDMER task, the testing samples and training samples may come from different databases, which can introduce a large domain discrepancy and make most MER methods perform unsatisfactorily [21]-[26]. Hence, directly training the DSCNN by using only the source samples often leads to overfitting to the distribution of the source samples, causing a significant reduction in the recognition performance in the target domain. As outlined in Fig. 1, for two biased datasets (left), traditional classifiers learned in a source domain do not necessarily transfer well to target domains. To address the feature distribution difference between these two databases, we can use domain adaptation (DA) to learn feature representations in another feature space (right) that are discriminative and domain invariant.
Benefiting from DA, in the training process of the DSCNN, to ensure that source and target samples have similar feature distributions at the outputs of the layers in the DSCNN, we should choose a proper metric to measure the feature distribution difference. There are many metrics for measuring the difference in feature distribution between two different databases in some subspace, e.g., MMD [45], Wasserstein distance [46], KLD [47], and A-distance [22]. In this paper, to measure the feature distribution difference between two domains on some layer of the DSCNN, we use the maximum mean discrepancy (MMD) from [45] as the metric. The MMD value is computed with respect to a kernel mapping operator, φ(·). In our DSCNN model, we define the output of deep features in the fully connected layer as φ(·), which operates on the source data points, x_s ∈ X_S, and the target data points, x_t ∈ X_T. Then an empirical approximation to this distance between the feature distributions of the source and target data on the connected layer can be defined as:
$$\mathrm{MMD}(X_S, X_T) = \left\| \frac{1}{|X_S|} \sum_{x_s \in X_S} \phi(x_s) - \frac{1}{|X_T|} \sum_{x_t \in X_T} \phi(x_t) \right\|$$
The smaller the MMD value, the more similar the distributions of the features obtained from the source samples and the target samples in the corresponding layer of the DSCNN.
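The empirical MMD described above compares the mean feature vector of the source samples with that of the target samples. The sketch below computes this quantity for features that are already the layer outputs φ(x), given as plain Python lists. The function names `mean_vector` and `mmd` and the toy feature values are illustrative assumptions, not part of the paper.

```python
import math


def mean_vector(features):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]


def mmd(source_feats, target_feats):
    """Euclidean norm of the difference between source and target feature means."""
    mu_s = mean_vector(source_feats)
    mu_t = mean_vector(target_feats)
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(mu_s, mu_t)))


if __name__ == "__main__":
    # Toy 2-D "layer outputs"; real features would be, e.g., 1024-dimensional.
    source = [[0.0, 0.0], [2.0, 2.0]]
    target = [[4.0, 5.0], [4.0, 5.0]]
    print(mmd(source, target))
```

Because only the means are compared, two sets with the same mean feature vector yield an MMD of zero under this linear form, even if their higher-order statistics differ.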
The DSCNN is trained jointly on all labeled source data and unlabeled target data, as shown in Fig. 2. Three MMD values (i.e., MMD_1, MMD_2, and MMD_3) are calculated based on the outputs of the source data and target data on three selected connection layers in the DSCNN. To ensure that source and target samples have similar feature distributions in these layers of the DSCNN, the domain discrepancy loss L_domain is defined as a combination of the three MMD terms, each weighted by a hyperparameter λ_i (i = 1, 2, 3). Here, MMD_1(X_S, X_T) denotes the feature distribution distance in the fully connected layer FC_1 of the spatial stream ConvNet, MMD_2(X_S, X_T) denotes the feature distribution distance in the fully connected layer FC_1 of the temporal stream ConvNet, and MMD_3(X_S, X_T) denotes the feature distribution distance in the fully connected layer FC_2. The hyperparameter λ_i determines how strongly we would like to confuse the two domains; during the training process, the values of these parameters are chosen according to the best recognition results in the CDMER tasks.
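To make the role of the weights λ_i concrete, the following sketch combines three MMD values into a single domain discrepancy loss. It assumes a plain weighted sum of the MMD values; the exact form used by the authors (for example, whether the MMD terms are squared) cannot be recovered from the text here, and the function name `domain_loss` and the numeric values are purely illustrative.

```python
def domain_loss(mmd_values, lambdas):
    """Weighted sum of MMD terms (assumed form: sum_i lambda_i * MMD_i)."""
    if len(mmd_values) != len(lambdas):
        raise ValueError("need one weight per MMD term")
    return sum(lam * m for lam, m in zip(lambdas, mmd_values))


if __name__ == "__main__":
    # Hypothetical MMD values on spatial FC_1, temporal FC_1, and FC_2.
    mmds = [0.5, 1.0, 2.0]
    # Hypothetical weights; the paper selects them by recognition results.
    lams = [1.0, 0.5, 0.25]
    print(domain_loss(mmds, lams))
```

Setting a weight to zero removes the corresponding layer from the adaptation, which is one simple way to check how much each selected layer contributes.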
In contrast, only labeled source samples are used to compute the classification loss of the DSCNN, which can be defined as:
$$L_{classification} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{j=1}^{Y} \mathbb{1}\{y_n = j\} \log P_{n,j}$$
where N denotes the training sample size, Y denotes the number of ME categories, y_n denotes the label of the n-th training sample, and P_{n,j} denotes the predicted probability that the n-th training sample belongs to the j-th category.
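The classification loss on labeled source samples can be sketched as an average cross-entropy over predicted class probabilities. This is a minimal illustration that assumes the standard cross-entropy form; labels are 0-based class indices here, and the function name `classification_loss` and the toy probabilities are assumptions for the example.

```python
import math


def classification_loss(probs, labels):
    """Mean negative log-probability assigned to the true class of each sample."""
    n = len(labels)
    return -sum(math.log(probs[i][labels[i]]) for i in range(n)) / n


if __name__ == "__main__":
    # Two samples, three classes (e.g., positive, negative, surprise).
    probs = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
    labels = [0, 1]
    print(classification_loss(probs, labels))
```

Only the probability of the true class of each sample enters the sum, which is what the indicator over y_n expresses.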
To ensure that the feature representations adapt well in CDMER tasks, the joint loss function used in the DSCNN can be defined as:
$$L = L_{classification} + L_{domain}$$
where L_classification denotes the classification loss on the labeled source data, and L_domain denotes the domain discrepancy loss between the source data, X_S, and the target data, X_T, on the three selected connection layers in the DSCNN. We consider that such representations can offer strong semantic separation and have domain invariance in CDMER tasks. The DSCNN uses the value of the joint loss function as a feedback signal to adjust the weights by a small amount in the direction that lowers the loss value for examples in CDMER tasks. This adjustment is the job of the "optimizer", which implements what is called the "backpropagation" algorithm (BP) [48]. We use the stochastic gradient descent algorithm with Nesterov momentum as the training optimizer. In the iterative update of the weights during training, α denotes the learning rate, the correction factor is set to 0.9, and the attenuation of the weight parameters is set to 10^-5. We use this strategy to minimize the value of the joint loss function in the DSCNN and gradually update the weight parameters to learn transferable feature representations between samples from the two domains. The optimization stops once the recognition accuracy of the DSCNN in CDMER tasks becomes stable.
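A single optimizer step of stochastic gradient descent with Nesterov momentum can be sketched as follows. This uses one common formulation of the Nesterov update and is not necessarily the exact rule used by the authors. It also assumes that the "correction factor" of 0.9 is the momentum coefficient and that the "attenuation" of 10^-5 is an L2 weight decay; the function name `nesterov_sgd_step` and the toy values are illustrative.

```python
def nesterov_sgd_step(weights, grads, velocity, lr, momentum=0.9, weight_decay=1e-5):
    """One SGD step with Nesterov momentum and L2 weight decay (assumed form).

    weights, grads, velocity are lists of floats of equal length.
    Returns the updated (weights, velocity).
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w      # add the weight-decay term to the gradient
        v = momentum * v + g          # update the momentum buffer
        new_velocity.append(v)
        # Nesterov look-ahead: step along the gradient plus the momentum-scaled buffer
        new_weights.append(w - lr * (g + momentum * v))
    return new_weights, new_velocity


if __name__ == "__main__":
    # Toy example: one weight, gradient of the joint loss assumed to be 2.0.
    w, v = nesterov_sgd_step([1.0], [2.0], [0.0], lr=0.1, weight_decay=0.0)
    print(w, v)
```

In practice, the gradient passed to such a step would be the gradient of the joint loss with respect to the weights, obtained by backpropagation over a mini-batch of source and target samples.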
Once the optimal weight parameters in the DSCNN are learned, we can use the DSCNN to address the CDMER tasks.

A. EXPERIMENTAL SETTING
In this section, we conduct experiments with several domain adaptation (DA) methods to investigate the CDMER problem. In these experiments, we compare the proposed DSCNN with some representative methods, including importance-weighted support vector machine (IW-SVM) [29], transfer kernel learning (TKL) [30], geodesic flow kernel (GFK) [31], selective transfer machine (STM) [33], subspace alignment (SA) [35], transfer component analysis (TCA) [36], target sample regenerator (TSRG) [28], DR in the Label Space (DRLS) [49], and region selective transfer regression (RSTR) [39]. For these DA methods, we employ the temporal interpolation model (TIM) [50] to normalize the frame number of all the micro-expression video clips to 16 and resize each frame image to 112 × 112. We compute uniform LBP-TOP [51] with fixed parameters using the four types of spatial grids (1 × 1, 2 × 2, 3 × 3, and 4 × 4) in [39] to serve as the micro-expression features. For uniform LBP-TOP, the neighboring radius R and the number of neighboring points P for the LBP operator on the three orthogonal planes are fixed at 3 and 8, respectively.
In our experiments, we choose uLSIF [52] to learn the importance weights for IW-SVM, which has shown excellent performance in CDMER [28], [49]. For TKL, we determine the optimal value of ζ by searching from the parameter space [0.1 : 0.
In these experiments, we evaluate our DSCNN model using the same settings as in [39]. Two publicly available ME databases, CASME II [54] and SMIC [55], which are often used in CDMER studies, are used to build the CDMER tasks. To conduct cross-database experiments on the two databases, CASME II and SMIC must share the same ME labeling. We select the samples of happiness, surprise, disgust, and repression from CASME II and then relabel them with the ME labels used in SMIC. The samples of happiness are relabeled as positive, and the samples of disgust and repression are relabeled as negative. The labels of surprise samples remain unchanged. The sample statistics of SMIC and relabeled CASME II can be found in Table 2.
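The relabeling of CASME II samples into the SMIC label set can be expressed as a simple lookup. The sketch below assumes each sample is a (sample id, CASME II label) pair; the sample ids, the name `CASME2_TO_SMIC`, and the function `relabel_casme2` are hypothetical, while the mapping itself follows the description above. Samples whose label is not among the four selected categories are dropped.

```python
# Mapping from the selected CASME II labels to the SMIC label set.
CASME2_TO_SMIC = {
    "happiness": "positive",
    "disgust": "negative",
    "repression": "negative",
    "surprise": "surprise",
}


def relabel_casme2(samples):
    """Keep only the selected CASME II categories and map them to SMIC labels."""
    return [
        (sample_id, CASME2_TO_SMIC[label])
        for sample_id, label in samples
        if label in CASME2_TO_SMIC
    ]


if __name__ == "__main__":
    # Hypothetical sample ids for illustration only.
    samples = [
        ("clip_a", "happiness"),
        ("clip_b", "repression"),
        ("clip_c", "others"),
        ("clip_d", "surprise"),
    ]
    print(relabel_casme2(samples))
```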
In this paper, we conduct two types of CDMER experiments (TYPE-I and TYPE-II) based on relabeled CASME II and the subsets of SMIC (see Table 3).

B. RESULTS AND ANALYSIS FOR CDMER
The mean F1-score and accuracy are chosen as the evaluation metrics in the experiments. SVM is chosen as a baseline method to compare with the DA methods. The results of the TYPE-I and TYPE-II experiments are shown in Table 4 and Table 5, respectively. Compared with the SVM without domain adaptation, the DA methods achieve a clear improvement in recognition ability in all the experiments. The results in Table 4 and Table 5 indicate that DA methods are effective ways to narrow the feature distribution gap between samples from different ME databases when dealing with the CDMER problem. In addition, we observe that the DSCNN achieves the most promising results among all the representative DA methods selected for comparison. The DSCNN achieves an average mean F1-score/accuracy of 0.7795/78.09% in the TYPE-I experiments and 0.6956/70.77% in the TYPE-II experiments, which are significantly higher than those of most of the compared DA methods. The performance of the DSCNN can be attributed to the design of its two streams and to the idea of DA based on the domain discrepancy loss. From Table 4 and Table 5, we also observe significant differences between the average results of each method in the TYPE-I and TYPE-II experiments. TSRG, DRFS-T, and RSTR achieve average mean F1-score/accuracy values of 0.6991/70.05%, 0.7128/71.23%, and 0.7381/73.98% in the TYPE-I experiments, which are much higher than their results (0.5348/56.22%, 0.5498/57.65%, and 0.5587/57.74%) in the TYPE-II experiments. This shows that the TYPE-II experiments are significantly more difficult than the TYPE-I experiments.
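The two evaluation metrics can be sketched as follows, assuming that the mean F1-score is the unweighted average of the per-class F1-scores (macro F1), which is the usual choice when classes are imbalanced. The function names `accuracy` and `mean_f1` and the toy labels are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1-scores (assumed macro-F1 definition)."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    classes = ["positive", "negative", "surprise"]
    y_true = ["positive", "negative", "surprise", "negative"]
    y_pred = ["positive", "negative", "negative", "negative"]
    print(accuracy(y_true, y_pred), mean_f1(y_true, y_pred, classes))
```

In this toy case the accuracy stays fairly high while the mean F1-score is pulled down by the class that is never predicted, which illustrates why both metrics are reported for class-imbalanced ME databases.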
When SMIC (NIR) is used as the target database, i.e., in Expt.3, Expt.5, and Expt.11, we observe that the average performance of all the DA methods reaches 0.6888/69.08% and 0.7213/73.40% in Expt.3 and Expt.5, whose source databases, SMIC (HS) and SMIC (VIS), are relatively class-balanced. In contrast, the average performance drops to 0.5443/55.43% in Expt.11, whose source database is the relabeled CASME II, which is very class-imbalanced.
From the results of the TYPE-I and TYPE-II experiments, we notice that the three subsets of SMIC (i.e., HS, VIS, and NIR) used in the TYPE-I experiments share the same subjects, stimulus materials, and recording environments and differ only in their cameras, which results in a relatively small feature distribution difference. Meanwhile, compared with the three subsets of SMIC, the relabeled CASME II used in the TYPE-II experiments has substantially different subjects, stimulus materials, recording environments, and cameras, which results in a relatively large feature distribution difference. Therefore, the performance of all DA methods is affected by the class imbalance or heterogeneity between the source and target databases when dealing with the CDMER tasks.
To test the structure of the DSCNN and its ability to learn salient characteristics from the ME samples, we compare the results between the DSCNN and OSCNN-I (or OSCNN-II), which only retains a single stream. We notice that the DSCNN achieves better performance than the single-stream networks in the TYPE-I and TYPE-II experiments. The result shows that the dual-stream structure in DSCNN can better utilize various forms of effective spatio-temporal characteristics for CDMER tasks, achieving better performance than some single-stream networks, such as OSCNN-I, and OSCNN-II.

C. DSCNN VISUALIZATION
In this section, to understand how the convolution and pooling layers of the two stream ConvNets in the DSCNN transform their input, we visualize intermediate activations, that is, we display the feature maps that are output by various convolution and pooling layers in the DSCNN. This gives a view into how an input is decomposed into the different filters learned by the DSCNN.
We randomly choose a CDMER task from either TYPE-I or TYPE-II as an example of intermediate activation visualization, such as Expt.8: HS → CAS. When the training of the DSCNN is completed, we randomly choose an ME video from the target domain (i.e., CASME II) as the input, and visualize intermediate activations on various convolution and pooling layers in the DSCNN, as shown in Fig. 3 and Fig. 4.
Firstly, we observe from Fig. 3 and Fig. 4 that the feature maps extracted by a layer become increasingly abstract with the depth of the layers in the DSCNN. The intermediate activations of higher layers carry less and less information about the specific input being seen, and more and more information about the class of the target: positive, negative, or surprise.
Secondly, in Fig. 3, we can observe that some intermediate activations in layers of the spatial stream ConvNet clearly show appearance and outline information of the whole face. This indicates that the spatial stream ConvNet in the DSCNN, which operates on the gray image of the resized apex frame, learns useful spatial cues associated with particular facial texture information from the single frame. Facial expressions are strongly associated with this facial texture information, which is the most intuitive cue.
Thirdly, in Fig. 4, we can observe that some intermediate activations in layers of the temporal stream ConvNet clearly show the muscle movements in the subject's eyebrows from the occurrence to the disappearance of a disgust micro-expression, although the amplitude of the facial muscle motion between adjacent frames is very small. This indicates that the temporal stream ConvNet in the DSCNN, which operates on the optical flow displacement field between the three resized frames, learns useful temporal cues associated with the facial muscle actions during a short time.
Based on the above observations, the two stream ConvNets in the DSCNN effectively act as an information distiller: raw data go in and are repeatedly transformed so that irrelevant information is filtered out, while useful information from the spatial and temporal cues in the three frames of each ME video is magnified and refined.

IV. CONCLUSION
In this paper, we propose a dual-stream convolutional neural network called DSCNN to address CDMER tasks. Our method is novel in that it combines a domain discrepancy loss with a classification loss to minimize the feature distribution difference between the source and target domains. The two streams in the DSCNN jointly learn spatio-temporal features from different input cues in ME samples to optimize cross-database micro-expression classification. To evaluate the performance of the DSCNN, we conduct TYPE-I and TYPE-II experiments on relabeled CASME II and three subsets of SMIC (i.e., HS, VIS, and NIR). Compared with some representative DA methods, the proposed DSCNN has an overall superior performance. We observe that the performance of DA methods is affected by the class imbalance or heterogeneity between the source and target databases when handling the CDMER tasks. In the future, we could focus on designing a better spatio-temporal feature extraction method for CDMER tasks and on studying faster optical flow calculation methods. In addition, we plan to design a simpler network structure with multiple recognition tubes to cope with CDMER tasks and to verify the effectiveness of the proposed model on more ME databases.

VOLUME 4, 2016
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3185132