Micro Expression Recognition Using Convolution Patch in Vision Transformer

Humans possess an intrinsic ability to hide their true emotions. Micro-expressions are subtle, involuntary changes in facial muscles that reveal these emotions but are difficult to detect. To address this problem, several machine and deep learning models have been proposed in the past few years. The convolutional neural network (CNN) is a deep learning method that has been widely adopted in vision-related tasks due to its remarkable performance. However, CNN suffers from overfitting due to its large number of trainable parameters. Additionally, CNN cannot capture global information with respect to an input image. Furthermore, the identification of regions important for the classification of micro-expressions is a challenging task. The self-attention mechanism addresses these issues by focusing on key areas, and specific transformers, known as vision transformers, are now widely explored in vision-related applications. However, existing vision transformers divide an input image into a fixed number of patches, due to which the local correlation of image pixels is lost. Further, a vision transformer relies on the self-attention mechanism, which effectively captures global dependencies but does not exploit the local spatial relationships in an image. In this work, we propose a vision transformer based on convolution patches to overcome this problem. The proposed algorithm generates $c$ feature maps from input images using $c$ filters through a convolution operation. These feature maps are then provided to a transformer model as fixed-size image patches to perform classification. Thus, the proposed architecture leverages the advantages of both convolutional layers and the transformer, capturing spatial information and global dependencies respectively, leading to improved performance.
The performance of the proposed model is evaluated on three benchmark datasets, CASME-I, CASME-II, and SAMM, and compared with state-of-the-art machine and deep learning models; the proposed model achieves classification accuracies of 95.97%, 98.59%, and 100%, respectively.


I. INTRODUCTION
Micro-expressions (ME) are involuntary, subtle facial muscle movements which represent the true emotions of a person [1]. There is a variety of possible applications for micro-expression recognition (MER), including forensics, security, surveillance, education, entertainment, and healthcare systems [2]. However, identification and classification of ME is a challenging task for a variety of reasons. Typically, ME appear for a very short duration of time, i.e., 0.04 to 0.50 seconds [3]. Furthermore, ME show very subtle changes in facial muscles, due to which identification and spotting of ME become difficult. (The associate editor coordinating the review of this manuscript and approving it for publication was Prakasam Periasamy.)
Traditional machine learning methods, such as local binary patterns [4] and histograms of oriented gradients (HOG) [5], [6], [7], depend on handcrafted features for classification. This dependency has been avoided by the use of deep learning models. The convolutional neural network (CNN) is a deep learning method which has recently demonstrated remarkable performance in several vision-based applications and outperformed both handcrafted features and shallow classifiers [8]. A deep fusion-based CNN model proposed by [9] has been implemented for facial expression recognition, which shows the impact of transfer learning and feature fusion on the performance of the model. Similarly, CNN and transfer learning are incorporated by [10] to determine the level of engagement of hearing-impaired and hard-of-hearing students by analyzing their facial expressions and categorizing these expressions as highly engaged, nominally engaged, or not engaged.
CNN requires a large training dataset; however, most of the publicly available MER datasets are small in size, so [11] used a data augmentation technique for CNN to increase the size of the facial expression datasets. Similarly, a CNN-based MER model proposed by [12] exploits optical flow information related to subtle muscle movements through the apex and reference frames. This information is then passed to a CNN model for classification of an emotion. In the past few years, the performance of CNN has been elevated by using it in stream- or branch-based networks.
However, the implementation of CNN models in MER is limited for a variety of reasons: (i) CNN requires a large number of trainable parameters, (ii) CNN-based models often suffer from overfitting, (iii) the convolution operation only captures the local receptive field of a pixel and is incapable of handling the global receptive field, (iv) CNN does not effectively handle sparse spatio-temporal information, and (v) ME consist of subtle movements of facial muscles which are difficult to handle.
As mentioned above, CNN is incapable of handling spatio-temporal information. Hence, 3D CNN has been explored by [13], [14], and [15] to address this issue for MER. A Siamese 3D CNN (MERSiamC3D) proposed by [13] is based on two-stage learning. The first stage applies an optical flow estimation technique to capture the spatio-temporal information, followed by a Siamese CNN model. The second stage adjusts the network parameters obtained from the first stage. Similarly, [15] also exploits a 3D CNN in combination with SqueezeNet. Another work, proposed by [16], incorporates Squeeze-and-Excitation Networks with a 3D DenseNet to exploit spatio-temporal features.
The ability of the attention mechanism to concentrate on certain locations makes it effective. The attention mechanism is either employed in conjunction with CNN or replaces certain components of CNN. Accurate detection of ME plays a vital role in improving the performance of an MER model, and the attention mechanism can be used to effectively detect the presence of a micro-expression in a video frame. A dual attention network known as LGAttNet was proposed by [17] for automatic detection of micro-expressions. Similarly, the micro-expression analysis network (MEAN) proposed by [18] is used for simultaneous spotting and recognition of ME.
In this work, effective and accurate classification is performed by exploiting a vision transformer, which depends on the self-attention mechanism. In the past few years, vision transformers have attained remarkable results on vision-related classification tasks with substantially fewer computational resources. A simple vision transformer typically divides an image into fixed-size patches. These non-overlapping patches form a linear embedding which is provided to the vision transformer. This architecture captures global dependencies but cannot capture spatial information. In contrast, the proposed vision transformer architecture takes convolution feature maps as input patches; these feature maps contain spatial information and are provided as the input patches for the subsequent transformer layers. Thus, the proposed architecture leverages the advantages of both convolutional layers and the transformer, capturing both spatial information and global dependencies for improved performance. Due to its remarkable performance, the proposed model can have a wide variety of real-life applications across different domains. For instance, it can be used for early detection and diagnosis of mental health issues such as anxiety and depression. It can also be used in security and law enforcement, where security personnel can improve their ability to recognize possible threats by identifying ME associated with suspicious behavior. Further, MER can also play a vital role in applications based on human-computer interaction and cross-cultural studies.
The major contributions of this paper are as follows: 1) We propose a deep learning framework for MER through a vision transformer with low computational cost.

II. RELATED WORKS

A. MICRO-EXPRESSION RECOGNITION
Based on input data, MER models can be broadly categorized into single-image-based and sequence-image-based systems. Datasets such as AffectNet [19] and FER2013 [20] are single-image-based datasets, whereas CASME-I [21], CASME-II [22], SAMM [23], and SMIC [24] are sequence-image-based datasets. Sequence-image-based datasets are widely adopted for spotting and recognition of micro-expressions because they provide better insight into the data. However, early sequence-image-based datasets such as USF-HD [25] and Polikovsky's [26] are not adopted at present because they contain image sequences of posed expressions and hence cannot be used for practical implementations. In contrast, state-of-the-art ME datasets contain spontaneous image sequences captured in a laboratory-controlled environment. Due to the availability of these datasets, research in the MER domain has significantly accelerated.
Primitive approaches for MER rely on hand-crafted, low-level features such as local binary patterns (LBP) [4], gradient features, and optical flow. Local binary patterns from three orthogonal planes (LBP-TOP) is a commonly used feature for MER which considers the horizontal and vertical directions. However, LBP-TOP cannot capture muscle movements in oblique directions, which is essential for MER. To address this problem, [27] proposed a new feature called LBP-FIP which can easily capture dynamic textures from images calculated through five intersecting planes. Similarly, [28] proposed an invisible emotion magnification algorithm (IEMA) which effectively magnifies the strength of facial muscle movement for better classification of micro-expressions.
However, it is difficult to accurately interpret and represent ME through low-level features. Thus, a combination of several low-level features, forming high-level features, can be exploited for a better representation of ME. High-level feature representations can be obtained by deep learning models such as CNN. At an early stage, researchers exploited only spatial features [29], [30] through CNN; however, studies demonstrate that MER involves facial movement which can be captured through long image sequences. Thus, state-of-the-art MER models exploit both spatial and temporal information. A Deep 3DCNN-ANN model proposed by [31] performs micro-expression recognition by learning spatio-temporal features from image sequences, combining a deep 3DCNN and an ANN through a feature called visual associations. However, it has been observed that CNN cannot capture the relationship of an entity with its parent in an image. To address this issue, [32] proposed CapsuleNet, based on an agreement routing mechanism, for the MNIST dataset. Inspired by its success, [33] experimented with CapsuleNet for MER on the SMIC, CASME-II, and SAMM datasets. It has also been observed that a model trained on a particular dataset may not necessarily perform well on other datasets. Thus, to experiment with cross-dataset MER, [34] proposed a dual-inception network which exploits the horizontal and vertical components extracted through optical flow.

B. TRANSFORMERS
The transformer model was originally designed for text-based applications [35], where it has exhibited remarkable results. Inspired by its success, it has also been applied to vision tasks [36]. Vision transformers (ViT) take an image as input and represent it as a series of fixed-size image patches, as shown in Figure 1. The obtained image patches are flattened and subjected to a lower-dimensional linear embedding. Due to the flattening of patches, the correlation between adjacent patches might be lost; therefore, positional embedding is added to keep the correlation information intact. Furthermore, vision transformers rely on the self-attention mechanism, which provides a global receptive field, unlike CNN, which yields a local receptive field.
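The patchify-and-embed step described above can be sketched in a few lines of NumPy. The sizes here (a 256 × 256 RGB image, 16 × 16 patches, a 64-dimensional embedding) are illustrative choices, and the random projection matrix stands in for a learned one:

```python
import numpy as np

# Illustrative sizes (assumptions for this sketch, not the paper's exact config).
H = W = 256          # image height/width
P = 16               # patch size
C = 3                # channels
D = 64               # embedding dimension

rng = np.random.default_rng(0)
image = rng.standard_normal((C, H, W))

# 1) Split the image into non-overlapping P x P patches and flatten each one.
n = H // P           # patches per side -> n*n patches in total
patches = image.reshape(C, n, P, n, P).transpose(1, 3, 0, 2, 4)
patches = patches.reshape(n * n, C * P * P)      # (256, 768) flattened patches

# 2) Linear embedding: one projection shared by all patches.
W_embed = rng.standard_normal((C * P * P, D)) * 0.02
tokens = patches @ W_embed                        # (256, 64) patch tokens

print(tokens.shape)   # (256, 64)
```

Note that each flattened patch loses the 2-D arrangement of its pixels, which is exactly the local-correlation loss the paper argues against.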
Considering the limitations of CNN models, vision transformers have been widely adopted for MER. A late-fusion-based vision transformer proposed by [37] exploits motion features through optical flow. The late-fusion and optical flow mechanisms allow the model to deal with small ME datasets. Similarly, a muscle motion-guided network proposed by [38] exploits subtle muscle motion features for accurate classification of ME through a two-branch model. The first branch comprises a continuous attention block, which focuses on modeling muscle movement, whereas the second branch comprises a position calibration module consisting of a vision transformer.
Studies show that MER is difficult due to the fact that ME are highly dynamic in nature and appear in localized facial regions. To solve this problem, [39] proposed a sparse transformer which exploits multi-head attention for sparse representation of emotions appearing in localized facial regions, whereas temporal attentional fusion is employed to deal with the dynamic nature of ME. Furthermore, studies [40] show that a combination of local and global spatio-temporal patterns can improve the classification accuracy of MER; to address the spatial patterns, a spatial encoder is employed, whereas a temporal aggregator models the temporal patterns.
Another work, proposed by [41], exploits two Swin vision transformers, F_transformer and S_transformer, placed in two parallel streams. F_transformer exploits short-term motion dynamics through optical flow sequences, whereas long-term motion dynamics are captured by S_transformer. Feature fusion is then performed on the features obtained from these two streams for classification of emotions.
However, the existing vision transformer models for MER divide the input image into n patches, due to which the local correlation of pixels with their neighboring pixels is lost. To address this issue, in this work we exploit feature maps generated by the convolution operation. The convolution operation helps to capture the local receptive field, while the self-attention mechanism in the vision transformer allows the model to capture the global receptive field.

III. PROPOSED METHODOLOGY
Existing vision transformer models [36], [42] create fixed-size patches from an input image, which are flattened and provided to the transformer for classification. However, this technique limits the performance of vision-based algorithms because image pixels exhibit correlation with their neighboring pixels, and dividing images into fixed-size patches deteriorates this correlation. Thus, a major limitation of this technique is that it cannot handle the correlation among pixels in an image. To address this issue, the proposed algorithm generates c feature maps by applying c filters to an input image. These feature maps are treated as fixed-size image patches and passed to the transformer model for classification.

A. PRE-PROCESSING AND CONVOLUTION PATCH
Thereafter, another convolution operation is applied, which takes the 64 feature maps and applies 3 filters with stride 1. Next, the GELU activation function is applied to the obtained feature maps of dimension 16 × 3 × 256 × 256, which are reshaped to obtain 16 × 256 × 256 × 3 feature maps.
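One common reading of the convolution-patch stage in this subsection is that the first convolution, whose stride equals the patch size (16) and which applies 64 filters, acts as a convolutional patch embedding. Under the assumption that the kernel size also equals the stride (the paper does not state the kernel size, and it reports feature maps that retain the 256 × 256 resolution, so the exact layer configuration may differ), the operation reduces to a patch-wise matrix product, as this sketch shows:

```python
import numpy as np

# Assumed configuration for this sketch: 64 filters whose kernel size and
# stride both equal the patch size (16), applied to a 3 x 256 x 256 image.
P, C, D, H = 16, 3, 64, 256
rng = np.random.default_rng(1)
image = rng.standard_normal((C, H, H))
kernels = rng.standard_normal((D, C, P, P)) * 0.02   # 64 filters of shape 3x16x16

# Because stride == kernel size, each output position sees exactly one
# non-overlapping patch, so the convolution is a patch-wise matmul.
n = H // P
patches = image.reshape(C, n, P, n, P).transpose(1, 3, 0, 2, 4).reshape(n * n, -1)
feature_maps = (patches @ kernels.reshape(D, -1).T).T.reshape(D, n, n)

print(feature_maps.shape)   # (64, 16, 16): 64 feature maps, one per filter
```

The design point is that, unlike plain patch flattening, each filter response is computed from a spatially coherent neighborhood, so local pixel correlation is preserved inside every feature map.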

B. VISION TRANSFORMER
Conventional vision transformer models divide an image of dimension h × w pixels into n × m fixed-size patches (as shown in Figure 1), where each patch has dimension h/n × w/m pixels. Thereafter, these patches are flattened and passed through a linear projection.
In the proposed work, however, we exploit the local correlation of images through the convolution operation, as shown in Figure 2. Here, feature maps of dimension 16 × 256 × 256 × 3 are flattened to form a 16 × 256 × 768 feature vector. To maintain the order of the sequence, we add positional embedding and perform a reshape operation to generate a feature vector of shape 257 × 16 × 256. It is then passed to six subsequent transformer encoders. The final feature vector, of shape 257 × 16 × 256, is passed to a multi-layer perceptron (MLP) head for classification of emotions. Figure 3 illustrates a single transformer encoder, which incorporates multi-head attention, which is in turn based on the self-attention mechanism.
The attention mechanism was introduced in the encoder-decoder block of a neural sequence transduction model by [44]. It enables a content-based summary of data from a variable-length sentence. The attention mechanism is widely adopted because it has the ability to learn to focus on key areas. The self-attention mechanism, also called intra-attention [35], allows the model to identify the inputs to which more attention should be paid. It is used by [45] for facial expression recognition to deal with intra-class variation and inter-class similarity. It computes a weighted average of sequence elements, where the weights are dynamically determined from the elements themselves. Attention is computed by Equation 2.

$$\mathrm{Attention}(Query, Key, Value) = \mathrm{Softmax}\left(\frac{Query \cdot Key^{T}}{\sqrt{d_k}}\right) Value \qquad (2)$$

where $1/\sqrt{d_k}$ is the scaling factor, used to control the variance of the attention values. In Equation 2, Query and Key are two vectors with variance $\sigma^2$; when a dot product is applied to Query and Key, it generates a scalar with $d_k$ times higher variance. Thus, the variance needs to be scaled back down to $\sigma^2$; otherwise, softmax will make one random element saturate to 1 and the other elements saturate to 0. Therefore, we divide by $\sqrt{d_k}$ to maintain the optimal variance of the attention values.
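A minimal NumPy sketch of the scaled dot-product attention of Equation 2, with random inputs standing in for real queries, keys, and values, also illustrates the variance argument above:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    """Scaled dot-product attention (Equation 2)."""
    d_k = query.shape[-1]
    scores = query @ key.T / np.sqrt(d_k)     # scale variance back to ~sigma^2
    return softmax(scores, axis=-1) @ value

# Dot products of d_k-dimensional unit-variance vectors have variance ~d_k,
# which would saturate the softmax if left unscaled.
rng = np.random.default_rng(0)
d_k = 256
q = rng.standard_normal((8, d_k))
k = rng.standard_normal((8, d_k))
v = rng.standard_normal((8, d_k))

out = attention(q, k, v)
print(out.shape)                                  # (8, 256)
print((q @ k.T / np.sqrt(d_k)).var())             # roughly 1, not ~d_k
```

Printing the variance of the unscaled scores `(q @ k.T).var()` instead would give a value near 256, confirming why the $1/\sqrt{d_k}$ factor is needed.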
A network can pay attention to a particular sequence with scaled dot-product attention. However, this does not allow sequence elements to attend to different features, which can be achieved through multi-head attention. Here, the key, query, and value matrices are split into h sub-keys, sub-queries, and sub-values, respectively. Each of these sub-components is then independently passed through a scaled dot-product attention head $h_i$ with weight matrices $W_i^Q$, $W_i^K$, and $W_i^V$. Thereafter, these h heads are concatenated and projected through an output weight matrix to generate the final result, where D is the input dimensionality.
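The multi-head mechanism described above can be sketched as follows. The per-head split and the output projection follow the standard transformer formulation; all weight matrices here are random placeholders, and the head count of 8 matches the value selected in Table 3:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, h, weights):
    """Split D-dim tokens into h heads, attend per head, concatenate, project.

    `weights` holds projection matrices W_Q, W_K, W_V, W_O of shape (D, D);
    each head works on a d_k = D/h slice of the projected tokens.
    """
    n, D = x.shape
    d_k = D // h
    q = (x @ weights["W_Q"]).reshape(n, h, d_k).transpose(1, 0, 2)  # (h, n, d_k)
    k = (x @ weights["W_K"]).reshape(n, h, d_k).transpose(1, 0, 2)
    v = (x @ weights["W_V"]).reshape(n, h, d_k).transpose(1, 0, 2)
    scores = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_k), axis=-1)
    heads = scores @ v                                              # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, D)                 # concatenate heads
    return concat @ weights["W_O"]                                  # output projection

rng = np.random.default_rng(0)
n, D, h = 16, 256, 8                 # 8 heads, as selected in Table 3
weights = {name: rng.standard_normal((D, D)) * 0.05
           for name in ("W_Q", "W_K", "W_V", "W_O")}
x = rng.standard_normal((n, D))
out = multi_head_attention(x, h, weights)
print(out.shape)                     # (16, 256): same shape as the input tokens
```

Because each head attends over its own d_k-dimensional subspace, different heads can specialize in different features of the sequence, which single-head attention cannot do.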

IV. EXPERIMENTAL RESULTS AND DISCUSSION

A. DATASETS
The performance of the proposed model has been tested on three benchmark datasets: CASME-I [21], CASME-II [22], and SAMM [23]. Sample images from these datasets are shown in Figure 4. Table 1 details these datasets in terms of number of video samples, subjects, ethnicity, frames per second (FPS), resolution (in pixels), and number of emotion labels. Figure 5 illustrates the unbalanced nature of the emotion samples in the datasets. Furthermore, the class-wise sample distribution is given in Table 2. Video sequences containing the onset frame, progressing toward the apex emotion, and ending with the offset frame are used to train the model.

B. EXPERIMENTAL SETUP AND TRAINING HYPERPARAMETERS
The proposed model is trained using an Nvidia A100 provided by Google Colab Pro+. The Adam optimizer is used for optimization of the model weights, the learning rate is set to 0.0003, and the batch size is 16. We initially tuned the number of heads for training the proposed vision transformer model (Table 3). Deep models of this size are prone to overfitting; in order to rule out this possibility, we have used layer normalization (LayerNorm) as a regularization technique, as shown in Figure 3, and also applied dropout with a rate of 0.2. For further analysis, we have plotted the training and validation curves in Figure 10 (where the x-axis represents epochs and the y-axis represents accuracy) to closely monitor the learning behavior of the model.

The obtained evaluation metrics are shown in Tables 7, 8, and 9 for the CASME-I, CASME-II, and SAMM datasets, respectively. Because of the severe class imbalance in the CASME-I and SAMM datasets, the F1-score is more reliable when comparing the performance of the proposed model. Figures 6-9 depict the training loss curve, validation accuracy curve, validation loss curves, and confusion matrices, respectively, where the number of iterations during training or validation is represented by the x-axis, and the y-axis represents loss in Figures 6 and 8. Table 7 shows the evaluation metrics for the CASME-I dataset. Based on the F1-scores, it can be inferred that the proposed model correctly classifies the contempt and fear emotions, which contain the fewest training samples, i.e., 52 and 63 respectively, compared to the other emotions (shown in Table 2). Thus, it can be concluded that the proposed model addresses the issue of the large number of training samples required by state-of-the-art deep learning models. However, the sadness emotion also contains few training samples, i.e., 79, yet the model correctly classifies only 75% of its samples. Figure 9(a) shows the confusion matrix obtained for the proposed model on the CASME-I dataset. It can be observed that 11 samples of the sadness emotion are wrongly classified as disgust. This is due to the low inter-class variation between these two classes. Emotions such as disgust (802), happiness (234), repression (777), surprise (393), and tense (1495) generate F1-scores of 93%, 94%, 99%, 91%, and 98%, respectively. The lower recognition rate might be because of overfitting of the model for emotions with a higher number of training samples. The overall classification accuracy of the proposed model on the CASME-I dataset is 95.97%.
Table 8 shows the evaluation metrics for the CASME-II dataset. It can be observed that the proposed model achieves 98.59% classification accuracy on CASME-II. The model correctly classifies the fear (121) and sadness (108) emotions. The F1-scores for repression (251), disgust (373), happiness (266), other (298), and surprise (241) are 99%, 98%, 98%, 98%, and 98%, respectively. It can be observed that as the number of samples increases, the performance of the model drops for specific emotions. The reason behind this might be overfitting of the model.
Table 9 shows the evaluation metrics for the SAMM dataset. It can be observed that the proposed model achieves the highest possible accuracy, i.e., 100%, on SAMM. This is because of the availability of a large number of training samples. Moreover, Table 2 shows that the SAMM dataset is highly unbalanced, yet the proposed model still outperforms existing state-of-the-art models. Thus, it can be inferred that our model can easily handle the unbalanced nature of the training datasets.

2) COMPARATIVE ANALYSIS
We contrast our proposed convolution-patch-based vision transformer model with a number of state-of-the-art methods, including machine and deep learning algorithms such as principal component analysis (PCA), CNN, CNN-LSTM, graph-CNN, and transformer models. From Tables 10-12, it can be observed that the proposed model outperforms several advanced deep learning models and generates 95.97%, 98.59%, and 100% classification accuracy for the CASME-I, CASME-II, and SAMM datasets, respectively. Deep learning models have shown remarkable performance as compared to machine learning models; thus, for a fair comparison, we have also compared our proposed model with state-of-the-art CNN models. A 3D flow CNN proposed by [14] exploits a 3D convolution operation to extract spatio-temporal feature information along with optical flow. In this method, overfitting is avoided by using a dropout mechanism and batch normalization; it generates 59.11% classification accuracy on the CASME-II dataset. To identify and analyse spatio-temporal deformations of ME, a recurrent CNN was proposed by [51], which generates 80.30% and 78.60% classification accuracy on the CASME-II and SAMM datasets, respectively. Another recurrent architecture, long short-term memory in conjunction with CNN, proposed by [29], generates 47.30% classification accuracy on CASME-II.
A vision transformer based model, the muscle motion-guided network (MMNet) proposed by [38], exploits a two-branch network. The main branch of MMNet extracts motion-pattern-related features through a continuous attention block, whereas a transformer encoder is exploited as a sub-branch of the model to generate positional embeddings. The positional embeddings are then added to the motion-pattern features, generating 88.35% and 80.14% classification accuracy for the CASME-II and SAMM datasets, respectively. Another vision transformer model, based on optical flow and late fusion, proposed by [37], generates a classification accuracy of 70.68% on the CASME-II dataset.
VOLUME 11, 2023. Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

V. CONCLUSION AND FUTURE WORK
When an existing vision transformer is exploited for micro-expression recognition, it divides the input image into small patches, and a sequence of patch embeddings is created by linearly embedding each patch. Due to this approach, the model may not exploit the local spatial relationships present in an image. To address this issue, in this work a novel vision transformer based on convolution patches is proposed for micro-expression recognition, which captures the local receptive field through patches generated by the convolution operation, while the global receptive field is captured through a vision transformer based on the self-attention mechanism.
While implementing the proposed network architecture, a number of practical problems had to be handled. However, experiments show that the performance is still limited by the following factors. (i) The existing micro-expression datasets are highly unbalanced in nature. As is evident from Table 2, the CASME-II dataset is fairly balanced when compared to the CASME-I dataset; thus, CASME-II yields a better classification accuracy of 98.59%, as compared to 95.97% for CASME-I. Hence, it can be inferred that the sample distribution plays a significant role in the performance of the model. It is to be noted that the SAMM dataset is also not balanced (as shown in Table 2), but it contains a larger number of image samples for training than CASME-I and CASME-II, leading to the best possible classification accuracy, i.e., 100%. It is therefore implied that a large number of training samples can improve the performance of the model and help it overcome the unbalanced nature of a dataset. In future work, we will address this issue by using data augmentation techniques to generate a larger number of samples for the CASME-I and CASME-II datasets. (ii) Most of the existing MER datasets are laboratory-controlled, which limits the implementation of MER in real-life applications; thus, there is a need for in-the-wild datasets which contain a wide variety of images of individuals belonging to different age groups, genders, races, and cultural backgrounds. (iii) The existing deep learning models can only perform emotion classification based on pre-defined classes; to address this issue, deep continual learning can be explored, which can identify an unknown emotion category [66].

2) Conventional vision transformers divide the input image into fixed-size patches, due to which it becomes difficult for the model to exploit the local correlation of pixels. The proposed model addresses this issue by maintaining the correlation of the target pixel with its neighbors through a local receptive field, using convolution patches. 3) We exploit global as well as local correlation in an image through the vision transformer and the convolution patch, respectively, which improves the overall classification performance of the model. 4) Extensive experiments have been performed on three benchmark datasets, and comparison with existing state-of-the-art models validates the effectiveness of the proposed model. The remaining sections of the paper are arranged as follows. Section II discusses the related work. Section III presents the proposed vision transformer for MER. Section IV provides a description of the datasets, experimental setup, and hyperparameters for training the model, the results, and a comparison of the proposed model with current state-of-the-art models. Conclusions are provided in Section V.

FIGURE 1. Flattening of image patches in a conventional vision transformer.

Figure 2 presents the detailed network architecture of the proposed model. First, the input sequence frames are provided to the network through a pre-processing stage. The input frames are subjected to pre-processing operations such as horizontal flipping, normalization, and resizing to 256 × 256 pixels. After pre-processing, images of dimension 3 × 256 × 256 pixels are generated. Next, to exploit local correlation, two subsequent convolution operations are applied. The first convolution operation takes images of dimension 16 × 3 × 256 × 256, where 16 is the batch size, and applies 64 filters with stride equal to the patch size, i.e., 16. Then, the Gaussian error linear unit (GELU) activation function proposed by [43] is applied, where GELU is computed by Equation 1.
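Equation 1, the GELU activation of [43] referenced above, can be written as

$$\mathrm{GELU}(x) = x\,\Phi(x) = \frac{x}{2}\left[1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right] \qquad (1)$$

where $\Phi$ is the standard normal cumulative distribution function. A direct implementation:

```python
from math import erf, sqrt

def gelu(x):
    """Exact GELU: the input times the standard normal CDF."""
    return 0.5 * x * (1.0 + erf(x / sqrt(2.0)))

# GELU behaves like the identity for large positive inputs and
# smoothly gates small or negative inputs toward zero.
print(round(gelu(3.0), 3))    # 2.996
print(round(gelu(-3.0), 3))   # -0.004
```

Unlike ReLU, GELU is smooth everywhere, which is the usual motivation for using it in transformer blocks.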

FIGURE 2. Detailed architecture of the proposed model for MER using a vision transformer.

FIGURE 5. Unbalanced nature of emotion samples in the datasets.

FIGURE 6. Training loss curve for the proposed convolution-patch-based vision transformer.

FIGURE 7. Validation accuracy curve for the proposed convolution-patch-based vision transformer.

FIGURE 8. Validation loss curve for the proposed convolution-patch-based vision transformer.

FIGURE 9. Confusion matrices obtained using the proposed vision transformer for the CASME-I, CASME-II, and SAMM datasets, respectively.
The y-axis represents accuracy in Figure 7. The validation accuracy and validation loss curves for the CASME-II dataset in Figures 7(ii) and 8(ii) depict higher fluctuations compared to the other datasets. This might be due to the lower number of training samples in the CASME-II dataset. Figures 7(iii) and 8(iii) show fewer fluctuations for the SAMM dataset than for CASME-II. However, despite its large number of samples, the fluctuations for SAMM are higher than for the CASME-I dataset, which is due to the unbalanced training samples in the SAMM dataset.
(i) Due to the large number of trainable parameters, self-attention-based operations, and long training time, high-performance computational resources are needed for training a vision transformer; thus, an Nvidia A100 provided by Google Colab Pro+ was utilized for training the model. (ii) Existing deep learning models are prone to overfitting; thus, we have employed layer normalization and a dropout mechanism to avoid the overfitting usually caused by limited training data. The performance of the model is evaluated in terms of standard evaluation metrics such as precision, recall, F1-score, and classification accuracy. It has been demonstrated that the proposed model outperforms several state-of-the-art machine and deep learning models on three benchmark datasets, i.e., CASME-I, CASME-II, and SAMM.
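The two regularizers named above, layer normalization and dropout with the rate of 0.2 used in our experiments, can be sketched as follows (simplified: the learned scale and shift of LayerNorm are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def dropout(x, p=0.2, rng=None, training=True):
    """Inverted dropout with the rate used in our experiments (0.2)."""
    if not training or p == 0.0:
        return x                     # dropout is disabled at inference time
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(x.shape) >= p  # keep each activation with probability 1-p
    return x * mask / (1.0 - p)      # rescale so the expected activation is unchanged

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 256)) * 5 + 2   # tokens with arbitrary scale/offset
y = layer_norm(x)
z = dropout(x, p=0.2, rng=rng)
print(np.allclose(y.mean(axis=-1), 0, atol=1e-6))   # True: per-token zero mean
```

The rescaling by 1/(1-p) ("inverted" dropout) means no adjustment is needed at test time, which is why the `training=False` branch simply returns the input.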

TABLE 2. Number of frames for each emotion in the (a) CASME-I, (b) CASME-II, and (c) SAMM datasets, used for training the proposed model.

TABLE 3. Comparison of the number of heads in the transformer encoder.

To ensure a fair comparison, the same number of heads is used for the ablation experiments. We have investigated the model with 1, 2, 4, 8, and 16 heads. As shown in Table 3, the selection of 8 heads outperformed the other variants. Thus, 8 heads are used in the multi-head attention module of the transformer encoder for all experiments. Other parameters used in the proposed transformer encoder are listed in Table 4.

TABLE 4. Parameter values for the transformer encoder.

TABLE 5. Comparison of different models on the basis of number of parameters and GFLOPS.

TABLE 7. Classification report over the CASME-I dataset for 8 classes.

TABLE 8. Classification report over the CASME-II dataset for 7 classes.

TABLE 9. Classification report over the SAMM dataset for 8 classes.

TABLE 10. Comparison of the proposed method with existing models for the CASME-I dataset in terms of classification accuracy.

A machine learning method proposed by [50] addresses two important characteristics of ME: low facial movement intensity and short duration. The first issue is dealt with by exploiting robust PCA, while the sparse nature of ME in the temporal domain is addressed by using local spatio-temporal directional features. This method generates 63.41% classification accuracy on the CASME-II dataset.

TABLE 11. Comparison of the proposed method with existing models for the CASME-II dataset in terms of classification accuracy.

TABLE 12. Comparison of the proposed method with existing models for the SAMM dataset in terms of classification accuracy.