Haphazard Cuboids Feature Extraction for Micro-Expression Recognition

Facial micro-expressions can reveal a person's true mental state and emotions, so they have crucial applications in many fields, such as lie detection, clinical medicine, and defense security. However, conventional methods extract features on pre-designed facial regions to recognize micro-expressions and often fail to cover the critical micro-expression regions, since micro-expressions are localized and asymmetric. Consequently, we propose the Haphazard Cuboids (HC) feature extraction method, which generates target regions by a haphazard sampling technique and then extracts micro-expression spatio-temporal features. HC consists of two modules: spatial patches generation (SPG) and temporal segments generation (TSG). SPG generates localized facial regions, and TSG generates temporal intervals. Through extensive experiments, we demonstrate the superiority of the proposed method. We then analyze the two modules with conventional and deep-learning methods and find that each can significantly improve the performance of micro-expression recognition. In particular, we embed the SPG module into deep learning and experimentally demonstrate the effectiveness and superiority of our proposed sampling method in comparison with state-of-the-art methods.
Furthermore, we analyze the TSG module with the maximum overlapping interval (MOI) method and find that its output coheres with the maximum interval of the apex frame distribution in CASME II and SAMM. Therefore, analogous to the spatial regions of interest (ROIs) of the human face, micro-expressions also inherit a similar ROI in the temporal dimension, whose position is highly relevant to the moment of highest intensity, i.e., the apex frame.


I. INTRODUCTION
Expressions are visual reflections of human emotions, and non-verbal communication through facial expressions has proven effective and has become a powerful means for humans to deliver affective messages. Facial expressions can be divided into two categories: macro-expressions and micro-expressions. Macro-expressions are considered regular facial expressions, such as happiness, anger, or surprise; they are consistent with a person's deeds and last from 1/2 s to 4 s [1]. Micro-expressions are spontaneously generated expressions that emerge when a person tries to conceal his or her genuine emotions. (The associate editor coordinating the review of this manuscript and approving it for publication was Jon Atli Benediktsson.)
Compared to macro-expressions, micro-expression research had a relatively late start. The concept of micro-expressions was initially introduced by Haggard et al. in 1966 [6]. Shortly after, Ekman et al. [7] reported a case of micro-expressions: observing a video of a conversation between a psychiatrist and a depressed patient, they noticed that while the patient tried to conceal his suicidal thoughts by smiling, several frames of extremely painful expression occasionally appeared. Researchers refer to such rapid, unconscious, spontaneous facial motions, occurring when a person is undergoing intensive emotions, as micro-expressions. Ekman and his colleagues have developed the Facial Action Coding System (FACS) [8] and the Micro-Expression Training Tool (METT) over decades of research. FACS describes facial activity based on facial action units (AUs), and METT is used to train coders to detect and recognize micro-expressions. Nevertheless, even trained coders achieve poor recognition accuracy, with only 47% reported in the literature [9]. In addition, relying on humans to recognize micro-expressions is limited by professional training and time cost, making it challenging to put into practice at scale. With the successful development of computer vision in recent years, many researchers have therefore employed it to explore micro-expression recognition methods.
As a typical pattern recognition task, micro-expression recognition can be roughly divided into two parts. The first part is micro-expression feature extraction, which aims to extract features of micro-expressions from a micro-expression clip. The second part is micro-expression classification, which categorizes micro-expressions with a designed classifier, e.g., a support vector machine (SVM). In the last decade, most researchers in micro-expression recognition have dedicated themselves to the feature extraction part. Such an approach is undoubtedly essential, since designing reliable micro-expression feature extraction methods that capture tiny facial movements can effectively advance the recognition task. The feature extraction part can be further subdivided into two steps: first, selecting specific facial regions, and second, applying spatio-temporal descriptors to the selected regions to extract micro-expression features. In traditional micro-expression recognition, facial region selection has for many years been based either on designed n × n non-overlapping blocks [10], [11], which better describe local variations, or on ROIs [12], [13] selected based on a priori knowledge. These methods can yield promising results in the field of macro-expressions. Nevertheless, micro-expressions are rapid and asymmetric and derive from local facial variations. Therefore, we assume that employing the whole face, selecting a specific ROI, or using the whole micro-expression clip may not be the optimal scheme. Exploiting the spatio-temporal characteristics of micro-expressions is a critical future breakthrough. Accordingly, we explore the facial region selection problem in micro-expression recognition and investigate the temporal frame selection problem from a new perspective, making the following contributions.
1) We propose an HC feature extraction method containing SPG and TSG modules, which facilitates the extraction of local micro-expression motions by haphazard sampling on the spatial and temporal domains of micro-expressions, and extensive experiments demonstrate that our method is superior to traditional facial selection with non-overlapping blocks and ROIs.
2) We embed the facial haphazard sampling method into deep learning and experimentally demonstrate the effectiveness and superiority of our proposed sampling method in comparison with state-of-the-art methods.
3) Analyzing the TSG module with the MOI method, we find that micro-expressions also inherit a similar ROI in the temporal dimension, whose position is highly relevant to the moment of highest intensity, i.e., the apex frame.
This paper reviews the related work on micro-expression recognition in Section II. Then, in Section III, the proposed framework and our HC feature extraction method are presented in depth. Section IV covers the experiments conducted, the experimental results, and the discussion, and finally Section V concludes this work.

II. RELATED WORK
Micro-expression analysis is a relatively new field. Although it has attracted much attention in the last decade, relatively little work has been completed to date. Much of the early work was borrowed from macro-expression recognition; however, due to the short duration and subtle variations of micro-expressions, the performance of these borrowed methods has been less impressive.
The work of Pfister et al. [10] was one of the very first attempts at automatic micro-expression recognition. Their approach is highly emblematic and serves as a comparative benchmark for subsequent work on micro-expression recognition; it extracts dynamic texture features of micro-expressions by applying local binary patterns from three orthogonal planes (LBP-TOP) [14]. Considering the redundancy and sparsity of the LBP-TOP method, many refined methods [15], [16], [17], [18] have been proposed. Zong et al. [19] designed a hierarchical spatial division scheme and employed a kernelized group sparse learning (KGSL) model to process the hierarchical-scheme-based spatio-temporal descriptors so that they are more effective for capturing low-intensity facial muscle movements. Afterward, in [20], they proposed a kernelized two-groups sparse learning (KTGSL) model to learn two groups of weights for two groups of features and select more discriminative features according to the learned weights. Huang et al. [21] proposed an effective EDLPP algorithm, which uses the matrix exponential to preserve discriminant and local information. More recently, in MEGC 2019, Liu et al. [22], the first-place winners, adopted special data enhancement strategies while utilizing deep neural networks with adversarial training and expression amplification to boost recognition performance. Zhou et al. [23] proposed the Dual-Inception network, which combines the vertical and horizontal components of optical flow for micro-expression recognition. Liong et al. [24] proposed STSTNet, which encodes the horizontal and vertical components of the optical flow, in addition to the optical strain, to learn more compelling features. Quang et al. [25] applied Capsule Networks based on the apex frames to recognize micro-expressions. Wang et al. [26] proposed a 2+1D spatio-temporal convolutional network, which uses 2D convolution to extract spatial features and 1D convolution to extract temporal features.
However, expressions are composed of basic facial action units [8], [27], which correspond to specific facial muscles and correlate with different facial regions. In other words, different facial regions contribute to facial expression unequally, which is particularly valid for micro-expressions derived from local, faint facial motions of muscles. Accordingly, many researchers have explored methods related to facial region selection.
Conventional handcrafted feature extraction methods for micro-expression recognition [19], [28] usually separate the whole face into several equal overlapping blocks to better portray local variations. Some researchers instead reduced the influence of useless regions by extracting features from ROIs [11], such as the forehead, eyebrows, eyes, nose, and mouth, as shown in Fig. 1. Merghani et al. [13], [29] opted for ROIs based on FACS [8], and to eliminate the noise caused by blink movements, glasses, and stationary regions, Le et al. [30] proposed masking the eye and cheek regions. However, eye movements contribute to micro-expressions; for example, narrowed eye orbits correlate highly with disgust. Later, in [31], Li et al. constructed a deep local-holistic network to learn features from local and global face regions. Cen et al. [32] proposed a joint temporal local cube binary pattern and multi-task facial activity patterns learning framework to explore the relationship between action units and emotional states. In recent years, many researchers have used deep learning to explore micro-expressions due to its excellent performance in computer vision tasks. Xia et al. [33] found, by analyzing heatmap differences on a micro-expression dataset, that the regions around the eyes, nose, and mouth are mostly micro-expression-intensive regions that can be selected as ROIs. Song et al. [34] used a three-stream convolutional neural network to recognize micro-expressions, in which block-based segmentation of the face is designed to better learn local spatial features. Zhang et al. proposed a key facial sub-region delineation method based on AU locations and facial feature points [35]. In [36], a fusion model was proposed to introduce the action-unit matrix into the learned facial graph representation to capture subtle local variations. In [37], a micro-attention mechanism cooperating with a residual network was proposed to focus on facial areas of interest covering different action units.
Meanwhile, Li et al. [38] proposed extracting small facial patches centered on facial landmarks. In this way, the dimensionality of the learning space can be significantly reduced, which facilitates deep model learning on micro-expression datasets. In [39] and [40], graph attention convolutional neural networks were employed to model the relationship between feature points and local optical-flow patches to improve micro-expression recognition performance.
The above research demonstrates that micro-expressions are local in the spatial dimension. In the meantime, some traits exist in the temporal dimension: from the onset to the apex frame, the motion of a micro-expression becomes increasingly magnified, and from the apex to the offset frame, it gradually weakens. In [41], the authors proposed utilizing only two images per video, namely the apex frame and the onset frame. Afterward, [42] found that the apex frame correlates significantly with the amplitude change in the frequency domain and used a deep convolutional neural network (DCNN) on the apex frame to recognize micro-expressions. Much subsequent work [24], [43] is based on the apex and neutral frames. In this paper, we analyze both the spatial and temporal traits of micro-expressions by leveraging the two modules of the proposed HC on the CASME II [44], SAMM [45], and 3DB-combined [46] datasets.

III. PROPOSED METHOD
The overall framework of our proposed micro-expression recognition is shown in Fig. 2. Firstly, we use dlib software [47] to detect face landmarks, perform face alignment, and simultaneously normalize the face to eliminate various confounding factors unrelated to facial expressions. Then we extract micro-expression features using the proposed HC method and finally train an SVM model for classification. The input of the framework is a micro-expression clip, and the output is a specific micro-expression category.

A. FACE ALIGNMENT AND NORMALIZATION
Face alignment is the task of identifying the geometric structure of faces in digital images and attempting to obtain a canonical alignment of the face based on translation, scale, and rotation. Suppose DB = {V 1 , V 2 , . . . , V n } is the selected database, where n is the total number of micro-expression clips and V j = {V j,1 , V j,2 , . . . , V j,m }, with m denoting the total number of frames of the jth micro-expression clip. First, we use the dlib tool to detect 68 landmarks (as adopted in [40]) in the first frame of each micro-expression clip, denoted as Φ j . Then, the transformation matrix T is calculated from the coordinates of the detected landmarks of the left and right eyes, so that the two eyes are aligned on the same horizontal line. Since micro-expressions involve few head movements within a clip, T can be applied uniformly to all frames of the clip, i.e., V j,t = T j × V j,t , t = 1, 2, . . . , m, where T j denotes the transformation matrix computed from the first frame of the jth micro-expression clip.
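The alignment step above can be sketched in a few lines of NumPy. This is a sketch, not the paper's exact implementation: it assumes dlib's standard 68-point layout (points 36-41 and 42-47 outline the two eyes) and builds the rotation directly from the inter-ocular angle.

```python
import numpy as np

def eye_alignment_transform(landmarks):
    """Build a 2x3 affine matrix that rotates the face so the eye
    centers lie on the same horizontal line. Landmark indices follow
    dlib's 68-point model (assumption): points 36-41 and 42-47 are
    the left and right eyes."""
    left_eye = landmarks[36:42].mean(axis=0)
    right_eye = landmarks[42:48].mean(axis=0)
    dx, dy = right_eye - left_eye
    angle = np.arctan2(dy, dx)            # tilt of the inter-ocular line
    center = (left_eye + right_eye) / 2   # rotate about the eye midpoint
    c, s = np.cos(-angle), np.sin(-angle)
    R = np.array([[c, -s], [s, c]])
    t = center - R @ center               # keep the midpoint fixed
    return np.hstack([R, t[:, None]])     # 2x3 affine matrix T

def apply_transform(T, points):
    """Apply the affine transform T to an (n, 2) array of points."""
    return points @ T[:, :2].T + T[:, 2]
```

In practice the same matrix T would be applied to every frame of the clip, as described above.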
Face normalization is the processing of facial images to eliminate various confounding factors irrelevant to facial expressions. To this end, we likewise determine the facial cropping region from the first frame of each micro-expression clip and then employ the delineated region of the first frame as the facial cropping region of the whole clip to reduce the effect of background noise. Suppose Φ j is the set of 68 landmark points of the first frame of the jth micro-expression clip. We choose the extreme values of the landmark coordinates x and y, i.e., Φ j,1,x , Φ j,17,x and Φ j,9,y , as the left, right, and bottom boundaries of the face region. For the upper boundary, in order to eliminate the effect of stray hair, we set it to Φ j,30,y − (Φ j,9,y − Φ j,30,y )/1.3 (based on the nose and chin landmarks), following the facial structure and the appearance of the experimentally cropped images. After performing the above steps, the normalized database DB′ = {V′ 1 , V′ 2 , . . . , V′ n } is obtained.

B. HAPHAZARD CUBOIDS FEATURE EXTRACTION METHOD
Our proposed Haphazard Cuboids (HC) feature extraction method consists of the following steps. At first, the spatial patches generation (SPG) module is initialized to generate haphazard patches on the cropped image. Simultaneously, the temporal segments generation (TSG) module is initialized to generate haphazard intervals in the temporal dimension. Then, each haphazard patch is assembled with the temporal segments to form haphazard cuboids, which are used to extract LBP-TOP features from the normalized database. Ultimately, we concatenate the features of each haphazard patch to compose the final feature vector. To further illustrate the effect of the TSG module, we employ a degraded version of HC, i.e., Haphazard Patches (HP), which uses the entire continuous micro-expression clip, as an algorithmic comparison. In parallel, we investigate how distinct sizes of HC and ranges of temporal segments influence the performance of micro-expression recognition. The specific feature extraction process is described below.

1) SPATIAL PATCHES GENERATION
Given an image I ∈ R w×h , where w is the width and h is the height, we assign a patch size factor ρ, which is employed to generate the width and height of the haphazard patches; ρ is a percentage of w and h, and ρ ∈ (0, 0.5). To cover approximately the same area as the whole face, we calculate the number of haphazard patches using formula (1).
Under natural conditions, the human face narrows from top to bottom, so the cropped face region contains partial background noise. To overcome this effect, we horizontally divide the face into three equal bands; for each band, we further constrain the horizontal interval of the patches by the maximum and minimum horizontal coordinates of the face landmarks in that region, generate patches within the target area, and finally merge the patches of all bands. The procedure is described in lines 2 to 9 of Alg. 1. SPG with ρ ∈ {0.1, 0.15, 0.2} is illustrated in Fig. 3.
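The SPG sampling above can be sketched as follows. Since formula (1) is not reproduced here, we assume a patch count of n = round(1/ρ²), so that the patches jointly cover roughly the same area as the face; `band_xbounds` is a hypothetical stand-in for the landmark-derived horizontal limits of the three bands.

```python
import numpy as np

def spatial_patches(w, h, rho, band_xbounds, rng=None):
    """Sketch of the SPG module. Each patch is rho*w wide and rho*h
    high; n = round(1 / rho**2) is our assumption for formula (1).
    The face is split into three horizontal bands, and
    band_xbounds[i] = (x_min, x_max) constrains where patches in band
    i may start (standing in for the landmark-derived bounds)."""
    if rng is None:
        rng = np.random.default_rng()
    pw, ph = int(rho * w), int(rho * h)
    n = round(1 / rho ** 2)
    band_h = h // 3
    patches = []
    for i in range(n):
        band = i % 3                      # spread patches over the bands
        x_min, x_max = band_xbounds[band]
        x = rng.integers(x_min, max(x_min + 1, x_max - pw))
        y_lo = band * band_h
        y = rng.integers(y_lo, max(y_lo + 1, y_lo + band_h - ph))
        patches.append((x, y, pw, ph))    # (left, top, width, height)
    return patches
```

With ρ = 0.15 this yields 44 patches of roughly 2.25% of the face area each, i.e., about one face-area in total.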

2) TEMPORAL SEGMENTS GENERATION
For a micro-expression clip, we assume its length is t. Analogous to SPG, we set the temporal segment factor η, where η is a percentage of t. To match the generated segment length to the clip length, the number of segments is calculated using formula (2).
First, we scale the length of all micro-expression clips to unit length, then generate segments within the unit interval, and eventually map them back to the original clips. The procedure is described in lines 10 to 12 of Alg. 1. Temporal segments generation with η ∈ {0.3, 0.5, 0.8} is illustrated in Fig. 4.
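The TSG procedure can be sketched similarly. Formula (2) is likewise not reproduced above, so the segment count n = round(1/η) is our assumption, chosen so that the segments jointly span roughly the clip length.

```python
import numpy as np

def temporal_segments(t, eta, rng=None):
    """Sketch of the TSG module. Segments of relative length eta are
    drawn haphazardly on the unit interval and mapped back to the t
    frames of the clip; n = round(1 / eta) is our assumption for
    formula (2)."""
    if rng is None:
        rng = np.random.default_rng()
    n = round(1 / eta)
    seg_len = max(1, int(eta * t))
    segments = []
    for _ in range(n):
        start = rng.uniform(0.0, 1.0 - eta)   # segment fits in [0, 1]
        lo = int(start * t)                   # map to frame indices
        hi = min(t, lo + seg_len)
        segments.append((lo, hi))
    return segments
```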

3) HAPHAZARD PATCHES FEATURE EXTRACTION METHOD
To explore the temporal aspects of micro-expressions, we use the HP method, which performs feature extraction on the unbroken succession of frames in a micro-expression clip. HP is initialized with the SPG module and then uses spatio-temporal descriptors to extract features. As described in Section II, the LBP-TOP method is adopted in this paper to extract the texture appearance and variations of micro-expressions, owing to its typicality and applicability. A video has three dimensions, XYT, referring to width, height, and time. It can be visualized as a stack of XY planes along the T dimension, XT planes along the Y dimension, or YT planes along the X dimension. The XY plane depicts spatial features, while the XT and YT planes deliver dynamic motion features in the temporal aspect. LBP-TOP extracts LBP histogram features in the XY, XT, and YT planes, respectively, and finally concatenates the histograms of the three planes into a final feature vector. The details are as follows. Assume that HPF is the set of extracted HP features for the whole database, and hpf i = {sp i,1 , sp i,2 , . . . , sp i,m } is the feature of the ith micro-expression clip, where m is the number of patches and sp i,j denotes the feature of the jth spatial patch of the ith clip. The patch features are then concatenated to form the feature of the ith clip, i.e., hpf i = sp i,1 ⊕ sp i,2 ⊕ . . . ⊕ sp i,m . We extract hpf i for each clip as above and obtain the final features HPF = {hpf 1 , hpf 2 , . . . , hpf n }. The process is visualized in Fig. 5.
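A compact sketch of LBP-TOP as described above; for brevity, it pools all 256 basic LBP codes per plane rather than the uniform-pattern subset (R = 1, P = 8) actually adopted in the experiments, so the feature layout but not the exact dimensionality matches the paper's setup.

```python
import numpy as np

def lbp_hist(img):
    """Normalized 256-bin histogram of basic 8-neighbour LBP codes
    (radius 1) over one 2D plane."""
    c = img[1:-1, 1:-1]
    code = np.zeros(c.shape, dtype=np.uint8)
    # 8 neighbours, clockwise from the top-left pixel
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy:img.shape[0] - 1 + dy,
                 1 + dx:img.shape[1] - 1 + dx]
        code |= (nb >= c).astype(np.uint8) << bit
    hist = np.bincount(code.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

def lbp_top(volume):
    """LBP-TOP on a (h, w, t) cuboid: average LBP histograms over the
    XY, XT and YT plane stacks and concatenate them (3 x 256 dims)."""
    h, w, t = volume.shape
    hists = []
    for planes in ([volume[:, :, k] for k in range(t)],   # XY planes
                   [volume[y, :, :] for y in range(h)],   # XT planes
                   [volume[:, x, :] for x in range(w)]):  # YT planes
        acc = sum(lbp_hist(p) for p in planes)
        hists.append(acc / len(planes))
    return np.concatenate(hists)
```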

4) HAPHAZARD CUBOIDS FEATURE EXTRACTION METHOD
With SPG and TSG modules initialized, we hereafter embark on extracting HC features using spatio-temporal descriptors. The details are illustrated below.
Suppose HCF is the set of HC features extracted from the whole database. For the ith micro-expression clip, the feature is hcf i = {sp i,1 , sp i,2 , . . . , sp i,m }, where sp i,j = {ts i,j,1 , ts i,j,2 , . . . , ts i,j,k } is the feature of the jth patch of the ith clip and k is the number of generated temporal segments. We concatenate ts i,j,1 , ts i,j,2 , . . . , ts i,j,k to form sp i,j , i.e., sp i,j = ts i,j,1 ⊕ ts i,j,2 ⊕ . . . ⊕ ts i,j,k . Then we similarly concatenate sp i,1 , sp i,2 , . . . , sp i,m to form the feature of the ith clip, denoted hcf i . We extract hcf i for each clip according to the above steps and obtain the final features HCF = {hcf 1 , hcf 2 , . . . , hcf n }; the detailed algorithm is shown in Alg. 1. The flow is visualized in Fig. 6.
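The assembly of hcf i from patches and segments can be sketched as below, with a placeholder descriptor standing in for LBP-TOP (an assumption made so the sketch stays self-contained).

```python
import numpy as np

def hc_features(clip, patches, segments, extractor):
    """Sketch of HC feature assembly: every spatial patch is combined
    with every temporal segment into a cuboid, the descriptor (e.g.
    LBP-TOP) is applied to each cuboid, and the per-segment features
    are concatenated per patch, then across patches."""
    per_patch = []
    for (x, y, pw, ph) in patches:
        ts_feats = [extractor(clip[y:y + ph, x:x + pw, lo:hi])
                    for (lo, hi) in segments]       # ts_{i,j,1..k}
        per_patch.append(np.concatenate(ts_feats))  # sp_{i,j}
    return np.concatenate(per_patch)                # hcf_i
```

Replacing `segments` with the single full-length interval [(0, t)] degrades HC to the HP method described above.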

IV. EXPERIMENTS
In this section, we describe the databases used, the preprocessing process, the LBP-TOP parameters, the evaluation metrics, the model hyperparameter tuning, and the experimental results.

A. DATASETS
Considering the brief duration of micro-expressions, we selected databases with high frame rates (frames per second, fps) from the existing micro-expression datasets to better explore micro-expressions in the temporal dimension.

1) SMIC [48]
The SMIC database contains three sub-databases: SMIC-HS (recorded by a high-speed camera at 100 fps), SMIC-VIS (recorded by a normal visual camera at 25 fps), and SMIC-NIR (recorded by a near-infrared camera). SMIC-HS has 164 micro-expression clips from 16 subjects, while SMIC-VIS and SMIC-NIR consist of 71 samples from 8 participants each. The samples are annotated as negative, positive, and surprise. In particular, SMIC-HS was used for our experiments.

Algorithm 1 HC Feature Extraction
Input: video V ∈ R h×w×t , patch size factor ρ, segment factor η, patches collection SPG , segments collection TSG and feature extractor
Output: feature hcf i
1 Detect 68 landmarks of the 1st frame of V as Φ;
2 Compute the number of patches using (1);

2) CASME II [44]
The CASME II database [44], recorded by a 200 fps high-speed camera, contains 255 micro-expression samples from 26 participants. These clips contain natural micro-expressions labeled with onset, apex, and offset frames, AUs, and seven emotional categories. Each face was cropped to a resolution of 280 × 340 pixels. Since CASME II provides high temporal resolution but remains subject to category imbalance, we only considered clips labeled happiness, disgust, repression, surprise, and others. The total number of micro-expressions selected for the experiments is 246 (5 classes); the category details are shown in Table 1.

3) SAMM [45]
Analogous to the CASME II database, SAMM was collected in a well-controlled laboratory environment and recorded with a high-speed camera at 200 fps. The face resolution was cropped to 400 × 400 pixels. It contains 159 micro-expression samples from 32 subjects with emotion classes including contempt, disgust, fear, anger, sadness, happiness, and surprise. The total number of micro-expressions selected from SAMM was 136 (5 classes), and the breakdown of categories is shown in Table 2.

4) 3DB-COMBINED [46]
The 3DB-combined database was proposed for composite database evaluation (CDE) by MEGC 2019 [46]; all micro-expression samples from the above three datasets are merged into a composite dataset and aggregated into three categories, i.e., negative (including repression, anger, contempt, disgust, fear, and sadness), positive (happiness), and surprise. The sample distributions of the datasets and their combination are shown in Table 3.

B. DATA PREPROCESSING
As mentioned in Section III-A, we carried out face alignment and normalization on the micro-expression clips and resized the shape of each frame to 224 × 224 pixels for better feature extraction.

C. EXPERIMENTAL SETUP
To evaluate the effectiveness of each module of our proposed method, we designed two types of experiments, covering the traditional hand-crafted method and the deep-learning method. Moreover, extensive experiments were conducted on single datasets (CASME II and SAMM, 5 classes) and on the 3DB-combined dataset (3 classes) to validate the effectiveness of our method. All experiments were conducted with Leave-One-Subject-Out (LOSO) cross-validation, where the samples of one subject are held out as the testing set while all remaining samples are used for training. Accuracy and F1 score were employed to measure the performance of micro-expression recognition on the single datasets (CASME II and SAMM, 5 classes). Additionally, following the MEGC 2019 CDE protocol [46], Unweighted F1-score (UF1) and Unweighted Average Recall (UAR) are used to measure the performance of the various methods on the composite and individual databases with 3 classes.
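The UF1 and UAR metrics of the CDE protocol can be computed as below. This is a straightforward sketch assuming class labels are encoded as integers 0..n_classes−1: both metrics average per-class scores with equal class weight, which makes them robust to the class imbalance noted above.

```python
import numpy as np

def uf1_uar(y_true, y_pred, n_classes):
    """Unweighted F1 (UF1) and Unweighted Average Recall (UAR):
    per-class F1 / recall averaged with equal class weight, as in
    the MEGC 2019 CDE protocol."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s, recalls = [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
        recalls.append(tp / (tp + fn) if (tp + fn) else 0.0)
    return float(np.mean(f1s)), float(np.mean(recalls))
```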

1) HAND-CRAFTED FEATURE
We conducted experiments on the publicly released databases CASME II [44], SAMM [45], and the 3DB-combined dataset [46]. To validate our proposed method, we took the classical feature extraction algorithm LBP-TOP [14] as our baseline and applied LBP uniform patterns to extract LBP-TOP features, with parameters R x = R y = R t = 1 and P = 8 adopted in all experiments below. An SVM with a radial basis function (RBF) kernel was deployed to categorize the micro-expressions, with penalty factor C ∈ {1, 2, . . . , 50} and γ = 1/(num_features × var(X)), where num_features denotes the number of features and var(X) the variance of the input X. Hyperparameters were obtained by grid search, and all experiments were implemented in Python 3.9.4.
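The stated γ heuristic matches the 'scale' default of common SVM implementations such as scikit-learn's SVC. As a sketch of what the classifier operates on, the resulting RBF kernel matrix can be computed directly:

```python
import numpy as np

def rbf_kernel_scale(X, Z):
    """RBF kernel with gamma = 1 / (num_features * var(X)), the same
    heuristic stated in the text. X is (n, d), Z is (m, d); the result
    is the (n, m) Gram matrix exp(-gamma * ||x - z||^2)."""
    gamma = 1.0 / (X.shape[1] * X.var())
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Z ** 2, axis=1)[None, :]
          - 2.0 * X @ Z.T)                # squared Euclidean distances
    return np.exp(-gamma * sq)
```

The grid search over C then amounts to refitting the SVM on this fixed kernel for each candidate value.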

2) DEEP-LEARNING FEATURE
To further validate the competitiveness of our method, we embedded the SPG module of HC into a deep-learning method and performed extensive experiments. Almost all models in recent papers on deep learning-based micro-expression recognition take entire sequences, grayscale images, or optical flows as input, whereas our discussion focuses on the impact of the facial region selection scheme on recognition performance. Thus, for a fair comparison, we surveyed recent papers on deep learning-based micro-expression recognition and found only one suitable candidate, namely [34].
In [34], Song et al. proposed the Three-Stream Convolutional Neural Network (TSCNN), which extracts the spatio-temporal features of micro-expressions with three components: the static-spatial component (extracting the appearance and overall outline information of the gray-scale apex frame), the local-spatial component (extracting local features from non-overlapping blocks of the apex frame), and the temporal component (extracting the dynamic deformation of micro-expressions). To verify the effectiveness of the SPG module, we replicated the full TSCNN architecture, only altering the input of its local-spatial component to our facial sampling scheme. For more details, please refer to [34].
Since the apex frame annotation is not available in SMIC, the D&C-RoIs method is applied to spot the apex frame (for details, please refer to [49]). For the CASME II and SAMM datasets, the ground-truth apex frame is used directly.
As in [34], we set the input image size of each recognition stream to 48 × 48 pixels, the base learning rate to 10−3, and the number of epochs to 30.

D. RESULTS AND DISCUSSION
We conducted baseline experiments (using the LBP-TOP [14] method) based on 5 × 5, 8 × 8, and 10 × 10 non-overlapping blocks and 8 ROIs using the preprocessed micro-expression clips; the experimental results on CASME II and SAMM are shown in Tables 4 and 5. The baselines on the 3DB-combined dataset are presented within the corresponding configuration-specific tables for convenient comparison.

1) ANALYSIS OF SPG MODULE
a: HAND-CRAFTED FEATURE
Furthermore, to demonstrate the feasibility and effectiveness of our method step by step, we performed experiments using the degraded version of HC, i.e., the HP method. To begin with, we initialized the SPG module with different parameters ρ and ran ten experiments for each ρ on the two databases. To then examine the validity of the HC method, we selected the SPG parameters ρ with promising recognition performance based on the HP results; for a fair comparison of the two methods, we reused the patches already generated by the SPG module in HP and added the TSG module on top. Finally, we initialized the TSG module by setting the parameter η and carried out a series of experiments for each η separately to rule out chance effects.
First of all, we performed experiments on the single datasets (CASME II and SAMM) using the HP method, setting ρ ∈ {0.1, 0.15, 0.2} for the SPG module and performing 10 sets of experiments for each value. The detailed results are shown in Tables 6 and 7, respectively, where the maximum accuracy and F1 score for each group of experiments are in bold, quantifying the potential of the corresponding parameters to generate effective patches. From Table 6, i.e., the CASME II results, when ρ = 0.2 the best results of the ten sets diverged only marginally from our baseline results and were occasionally notably worse, bringing no performance gain. This is largely attributed to the failure to extract the localized features of the micro-expressions, i.e., the spatial representation regions. However, recognition performance improved significantly when ρ = 0.1 and ρ = 0.15; in particular, ρ = 0.15 brought the largest improvement, with both accuracy and F1 score enhanced by 4%. From Table 7, i.e., the SAMM results, recognition performance likewise failed to improve for ρ = 0.15 and ρ = 0.2, whereas for ρ = 0.1 the experiments of group 9 exhibited a relatively large performance gain. Considering the 60 sets of experimental results on the two databases jointly, certain groups performed relatively poorly because the patches were too large or the representative regions of the micro-expressions were not captured. Therefore, to better explore the feasibility and effectiveness of the proposed HC method, in the following experiments we selected the SPG module with ρ ∈ {0.1, 0.15}, which has the potential to yield promising results.

b: DEEP-LEARNING FEATURE
We conducted experiments with two sizes of non-overlapping blocks (8 × 8 and 10 × 10) as the baseline and compared the two facial configuration options (the SPG module and non-overlapping blocks) using the identical TSCNN model. Ten sets of experiments were conducted for each group. The detailed experimental results are shown in Tables 8 and 9.
Published papers such as those of MEGC 2019 [22], [23], [24], [25] obtained good performance by using the network, sometimes aided by data augmentation, to learn features from the entire micro-expression optical flows. However, our work focuses on the facial region selection scheme, which differs substantially from theirs. We therefore discard data augmentation and employ purely deep learning to extract features, so as to verify the superiority of our proposed SPG module over the baseline (non-overlapping blocks).
The experimental results show that the SPG module improves performance considerably over the baseline for both configurations. When ρ = 0.15, the best experimental results showed an improvement of about 7% in UF1 and 8% in UAR over the baseline, and even the worst results showed an improvement of about 3% in both UF1 and UAR. When ρ = 0.1, the best group of experiments showed a 4% improvement in UF1 and UAR versus the baseline, while the worst results showed no improvement over the baseline, which is attributed to the fact that SPG happened not to generate an effective micro-expression-active region. We also present the confusion matrices when ρ = 0.15: Figure 7 shows the recognition performance of the baseline facial configuration, and Figure 8 shows the results of the SPG module. From the confusion matrices, it is clear that extracting features from the facial patches generated by the SPG module through TSCNN is more discriminative for Positive and Surprise samples. From the two groups of experimental results, it is apparent that for deep learning, ρ = 0.15 yields stronger results. Moreover, embedding our proposed facial sampling method into a comparable model under the same experimental conditions can outperform SOTA algorithms of the same type.

2) ANALYSIS OF TSG MODULE
The HC method contains two modules: the SPG module and the TSG module. First, based on the experimental results of HP, we noticed that with the SPG module, the recognition performance improved effectively when SPG generated the critical local regions of micro-expressions. As described in the previous paragraph, ρ ∈ {0.1, 0.15} were used to initialize the SPG module. For each ρ, we performed 5 rounds of experiments. We then assigned the TSG parameter η the values 0.3, 0.5, and 0.8; for each η, 5 trials were executed, and the average accuracy and F1 score were calculated. In other words, for each ρ, 15 experiments were performed on a single database. To counter the randomness of the HC method, 150 experiments were carried out separately on each dataset (CASME II and SAMM), for a total of 300 groups. The detailed experimental results are shown in Tables 10, 11, 12 and 13, which show that compared to the best baseline results, the HC method achieved a 4.9% improvement in accuracy and a 5.5% improvement in F1 score on CASME II, and a 3.7% improvement in accuracy and a 4.6% improvement in F1 score on SAMM. For a fairer performance comparison, we took the patch size as a criterion and
calculated the mean value of multiple experiments. Under this rule, 10 × 10 non-overlapping blocks correspond to ρ = 0.1 and 8 × 8 to ρ = 0.15. The mean value of the experiments on each single dataset was calculated. On CASME II, when ρ = 0.15, the accuracy was 0.4143 and the F1 score was 0.4011; compared with 8 × 8 non-overlapping blocks, the accuracy was equal, but the F1 score improved by 3.2%. When ρ = 0.1, the average accuracy was 0.4125 and the average F1 score was 0.3647; compared with 10 × 10 non-overlapping blocks, the accuracy improved by 3% and the F1 score by 5.8%. Similarly, on SAMM, when ρ = 0.15, the accuracy improved by 2% with an equal F1 score, and when ρ = 0.1, the accuracy improved by 0.8% and the F1 score by 1.3%. From the above results, it is clear that the HC method achieved an effective performance improvement. Likewise, the results on the 3DB-combined database are shown in Tables 14 and 15, from which it can be seen that for combined micro-expression datasets of different ethnic groups, our facial sampling and time-domain selection schemes, i.e., SPG and TSG, are superior to the non-overlapping blocks and the entire micro-expression clip. When ρ = 0.1, UF1 improves by nearly 3% and UAR by 2% over the baseline at most; when ρ = 0.15, UF1 improves by up to 4.5% and UAR by up to 8%. Evidently, ρ = 0.15 and η ∈ {0.5, 0.8} bring more promising results. It is thus evident that our proposed method is effective for a broader range of datasets.
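As a minimal illustrative sketch of the temporal sampling idea behind TSG (not the authors' code), the function below assumes that η specifies the segment length as a fraction of the clip and that the start frame is drawn uniformly at random; all names and parameters are hypothetical:

```python
import random

def tsg_sample_segment(n_frames, eta, seed=None):
    """Haphazardly sample one temporal segment (illustrative sketch).

    Assumes eta is the segment length as a fraction of the clip,
    with the start frame drawn uniformly at random so the segment
    lies fully inside the clip.
    """
    rng = random.Random(seed)
    seg_len = max(1, int(round(eta * n_frames)))
    start = rng.randint(0, n_frames - seg_len)
    return start, start + seg_len  # half-open frame interval [start, end)

# e.g., eta = 0.5 on a 100-frame clip yields a random 50-frame segment
start, end = tsg_sample_segment(100, 0.5, seed=1)
```

Features would then be extracted only from the sampled interval instead of the entire micro-expression clip.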
From another perspective, we compared the HP method with the corresponding HC method. From Tables 10, 11, 12 and 13, it was evident that the micro-expression recognition performance obtained by the HC method was generally higher than that of the HP method, especially on CASME II. The experiments indicated that, compared with the HP method, HC could achieve a relatively substantial improvement in both accuracy and F1 score by incorporating the TSG module. For instance, in Table 10, the accuracy reached up to 0.4634, with a corresponding F1 score of up to 0.4409. For the comparative results on the CASME II database, Tables 10 and 11 suggested that the TSG parameter η ∈ {0.3, 0.5} further improves the accuracy and F1 score significantly. Likewise, for the SAMM database, Table 13 showed that the recognition performance of micro-expressions gained an overall promotion, despite some groups showing no performance gains. When ρ = 0.1, HC reached its peak.
Next, to further investigate why the HC method works, we visualized some well-performing micro-expression segments generated by the TSG module in the HC method. The normalized temporal segments on CASME II and SAMM are shown in Figs. 9 and 10. On the CASME II database (Fig. 9), we statistically obtained an MOI ranging from 0.45 to 0.49, whereas on the SAMM database (Fig. 10), the MOI was 0.36 to 0.51. The different overlapping intervals may stem from the sources of the database samples, such as ethnicity, age, etc. Moreover, the TSG module was stochastically initialized and may occasionally omit the crucial segments owing to the relatively limited number of experiments. In addition, we tallied the apex frame distribution of the samples in both databases, as shown in Figs. 11 and 12. The intervals with the highest frequency of apex frame positions for CASME II and SAMM were [0.45, 0.55] and [0.3, 0.40], respectively. We found that the maximum-frequency interval of the apex frame distribution tends to be consistent with the interval obtained by the MOI method on CASME II and SAMM. The consistency bias on SAMM arises because our TSG module generates a relatively large range of temporal frames that nonetheless contains the critical region of micro-expressions.
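The MOI statistic above can be computed with a standard endpoint sweep. The sketch below assumes the MOI is the sub-interval covered by the largest number of normalized segments; the segment values are invented for illustration:

```python
def maximum_overlapping_interval(segments):
    """Return the first interval covered by the largest number of segments.

    segments: list of (start, end) pairs normalized to [0, 1].
    Endpoint sweep; at equal positions, ends sort before starts,
    so merely touching segments do not count as overlapping.
    """
    events = sorted([(s, 1) for s, _ in segments] +
                    [(e, -1) for _, e in segments])
    # First pass: find the maximum overlap depth.
    depth = max_depth = 0
    for _, d in events:
        depth += d
        max_depth = max(max_depth, depth)
    # Second pass: report the first interval reaching that depth.
    depth = 0
    for i, (pos, d) in enumerate(events):
        depth += d
        if depth == max_depth:
            return (pos, events[i + 1][0])  # until the next event point

# e.g., three hypothetical well-performing segments
moi = maximum_overlapping_interval([(0.2, 0.6), (0.4, 0.8), (0.45, 0.49)])
# -> (0.45, 0.49)
```

Applying such a sweep to the well-performing TSG segments yields the intervals reported for CASME II and SAMM.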
Based on the performance gain brought by the TSG module, we demonstrated the critical interval of micro-expressions in the temporal dimension, which holds significant representative information of micro-expressions. From another angle, we also found that the choice in [43] of considering only the apex and reference frames when extracting optical flow features is reasonable, since the apex frame portrays the most intense muscle motion of micro-expressions. The results suggested that, similar to the ROI in the spatial dimension, micro-expressions equally possess an ROI in the temporal domain.

V. CONCLUSION
In this paper, we propose the HC feature extraction method comprising the SPG and TSG modules, which facilitates the extraction of local micro-expression movements by haphazardly sampling the spatial and temporal dimensions of micro-expressions. Extensive experiments (applying LBP-TOP features [10]) demonstrate that our method is superior to the baseline (i.e., traditional facial selection with non-overlapping blocks and ROIs). Afterwards, we embed the SPG module into a deep-learning method (i.e., TSCNN [34]) and experimentally demonstrate its superiority in comparison with state-of-the-art methods. In the temporal aspect, we assume that, like facial-specific ROIs, micro-expressions have specific ROIs, and investigate the temporal frame selection problem from a new perspective.
Further, comparing our proposed method (HC) with its degraded version (HP) over 300 groups of experiments, we found that the general superiority of the HC method over the HP method is attributed to the introduction of the TSG module. From the overall experimental results, the best performance configuration is obtained when ρ = 0.15 and η = 0.5. We then selected the groups in which the HC method outperformed the corresponding HP method, tallied the MOI of the temporal segments in the TSG module, and compared it with the apex frame distribution of the database. The results show an intrinsic consistency between the two. In other words, analogous to the ROI of the human face, micro-expressions also inherit a similar ROI in the temporal domain, whose position is highly relevant to the moment when the most intense facial motion of the micro-expression occurs, i.e., the apex frame.