Complex Emotion Profiling: An Incremental Active Learning Based Approach With Sparse Annotations

Generally, in-the-wild emotions are complex in nature. They often occur in combinations of multiple basic emotions, such as fear, happy, disgust, anger, sadness and surprise. Unlike the basic emotions, annotation of complex emotions, such as pain, is a time-consuming and expensive exercise. Moreover, there is an increasing demand for profiling such complex emotions as they are useful in many real-world application domains, such as medical, psychology, security and computer science. The traditional emotion recognition systems require a significant amount of annotated training samples to understand the complex emotions. This limits the direct applicability of those methods for complex emotion detection from images and videos. Therefore, it is important to learn the profile of the in-the-wild complex emotions accurately using limited annotated samples. In this paper, we propose a deep framework to incrementally and actively profile in-the-wild complex emotions, from sparse data. Our approach consists of three major components, namely a pre-processing unit, an optimization unit and an active learning unit. The pre-processing unit removes the variations present in the complex emotion images extracted from an uncontrolled environment. Our novel incremental active learning algorithm along with an optimization unit effectively predicts the complex emotions present in-the-wild. Evaluation using multiple complex emotions benchmark datasets reveals that our proposed approach performs close to the human perception capability in effectively profiling complex emotions. Further, our proposed approach shows a significant performance enhancement, in comparison with the state-of-the-art deep networks and other benchmark complex emotion profiling approaches.


I. INTRODUCTION
Humans convey their emotions as different nuanced expressions through their face. Although humans regularly express continuous and complex emotions through the face, previous studies have mainly focused on detecting Ekman's six basic emotions, namely happy, sad, surprise, fear, disgust and anger [1]-[3]. Accurately predicting complex emotions (e.g., micro-emotions, pain and compound emotions) is essential for responding appropriately to a situation in many domains, such as medical, education and military. For instance, deception detection during a legal investigation is a good example of the use of extracted complex emotions, such as micro-emotions. The detection of complex emotions assists an investigator in profiling the subject's emotional state more precisely. Another significant application of complex emotion analysis is the detection of pain during a medical observation.
The associate editor coordinating the review of this manuscript and approving it for publication was Yiming Tang.
Du et al. [4] define 21 emotion categories, comprising the six basic emotions and 15 compound emotions, that humans express in-the-wild. They have also demonstrated the correlations between those emotions and the action units (AUs), which are groups of muscles responsible for the facial expressions. For example, as illustrated in the top row of images in Figure 1, the compound emotion happily surprised is revealed by the presence of several muscle movements, namely AUs 1, 2, 5, 12, 25 and 26, as shown in the left image, and AUs 1, 2, 5, 6, 12 and 25, as shown in the right image. Further, the second row of images in Figure 1 shows two complex emotions, happily surprised (left) and happily disgusted (right), which are primarily formed as a combination of the basic emotion happy with surprise and disgust, respectively. A definition of another complex emotion, pain, is provided in [5]. Micro-emotion is another significant complex emotion, which occurs for a fraction of a second and is hard to recognize in-the-wild with the naked eye.
The existing techniques for complex emotion profiling have limitations in terms of detecting them and often require a large dataset for training the emotion profiling model. Besides preliminary definitions, the complex emotions have not been analyzed deeply in the past due to two more significant reasons.
First, the complex emotions can only be extracted from a continuous observation in-the-wild, which is challenging due to the variations present in an uncontrolled environment, such as lighting and pose. Although in-the-wild facial analysis in highly uncontrolled environments has become the center of attention in recent research studies [6], [7], in-the-wild extraction of complex emotions has not been well addressed in the past. Applications such as deception detection and pain estimation require feedback in uncontrolled environments to enhance the quality of service. Hence, an in-the-wild analysis of complex emotions demands an extensive examination.
Second, the unavailability or insufficiency of labeled datasets with annotations of complex emotions poses a significant limitation for the training of current deep networks. There are only a few benchmark datasets available with annotations of complex emotions, such as micro-emotions, compound emotions and pain. Due to this limitation, the relatively new attempts at complex emotion recognition, such as pain intensity estimation, have focused on hand-crafted feature extraction techniques [8]. However, the minimal changes in facial muscles during a complex facial expression lead to poor discriminative capability for hand-crafted feature extractors. On the other hand, the rise of deep learning techniques (e.g., the Convolutional Neural Network (CNN)) in recent years has enabled computer vision systems to achieve highly efficient outcomes [9]. In addition, deep CNN architectures have been widely utilized in recognizing the basic emotions through facial expressions. However, state-of-the-art deep learning techniques demand a large and balanced training dataset to perform optimally.
In order to address the research gap identified above, in this work, we propose a novel Active Hybrid Deep CNN framework with a fusion mechanism, named AHDCNN, to automatically predict complex emotions from facial expressions in-the-wild. In AHDCNN, we introduce a cost-effective active learning (AL) based approach to improve the performance and accelerate in-the-wild recognition of complex emotions with a small amount of initial training data. Recent successes of AL-based approaches in computer vision motivate complex emotion recognition with a small amount of annotated training data, which provides a less expensive way to train the model. AL is capable of providing a competitive classifier with a small number of initial training samples integrated with a progressive learning process in various in-the-wild image classification problems [10]. Further, the recently emerged AL approaches have also demonstrated reduced labeling cost for training instances and improved performance [10]-[15]. However, the integration of AL into deep architectures for image classification problems is limited by challenges such as the unavailability of techniques to define the optimal size of the initial training data for deep network architectures and inefficient active selection algorithms.
Inspired by these two practical issues of integrating AL with deep network frameworks, we propose an enhanced AL technique that optimizes the initial training dataset. In particular, we utilize an image augmentation process. In addition, we propose an improved active selection algorithm that incorporates a wide range of samples, ranging from informative to non-informative, in the model updating process. We then propose an image pre-processing method to alleviate the variations present in uncontrolled in-the-wild environments. A variety of image pre-processing tasks have been proposed in the past. However, those conventional image pre-processing approaches have limitations for emotion profiling tasks, because they are not fine-tuned for specific facial emotions and have low robustness in unknown, in-the-wild environments. Motivated by these two image pre-processing related issues, we propose a comprehensive image pre-processing technique for the in-the-wild facial emotion profiling task. This image pre-processing task is integrated as an internal component of our proposed framework.

A. KEY CONTRIBUTIONS
In summary, the key contributions in this paper are as follows:
• First, we develop an incremental active learning-based end-to-end deep CNN framework that performs accurate in-the-wild prediction of various complex emotions, such as micro-emotions, pain and compound emotions. In our deep framework, we introduce an improved cost-effective AL mechanism with a continuous and fully automated feedback mechanism. Moreover, our end-to-end framework is capable of estimating the optimized emotion dataset for initial training.
• Second, we propose a comprehensive image pre-processing mechanism, which is specifically designed for facial emotion images, to handle the inconsistency of an uncontrolled environment.
• Third, we show that the proposed framework yields state-of-the-art emotion prediction accuracies with small training sets in profiling the complex emotions in-the-wild. To validate this, we have compared the prediction accuracy with existing complex emotion recognition methods discussed in the literature and five other fine-tuned state-of-the-art deep networks.
The remainder of this paper is structured as follows. In Section II, the preliminaries of the AL approach and complex emotions are reviewed. The methodology is introduced in Section III. Section IV then describes the extensive experiments and evaluation. Lastly, Section V provides the conclusion and future directions of our work.

II. RELATED WORK
In this section, we describe the recently proposed related works on complex emotion profiling and active learning techniques.

A. COMPLEX EMOTION PROFILING
Apart from the basic emotions, humans express many complex emotions during continuous conversations. Although most of the existing works have focused on the six basic human emotions, a list of complex emotions, such as pain, micro-emotions and compound emotions, has been identified in the past due to their significance in many applications, such as medical interventions, human-computer interaction, sociable robots and social conversations.

1) COMPOUND EMOTIONS
Compound facial emotions are formed from a combination of a few existing basic emotions (e.g., happily surprised is a combination of basic emotions happy and surprise). Du et al. [4] defined 22 emotion categories, including the six basic emotions with neutral and 15 compound emotions. These compound emotions have not been analyzed in-depth using deep learning approaches due to the insufficient amount of labeled data to train the model. In particular, with 10-fold cross-validations, the authors of [4] have achieved classification accuracies of 73.61%, 70.03% and 76.91% when using shape, appearance and combined features respectively. Similarly, for the leave-one-out cross-validation, 72.09%, 67.48% and 75.09% classification accuracies were reported for shape, appearance and combined features. During the comparison, authors have reported that their shape and appearance-based model outperformed the multi-class SVM proposed in [16]. However, due to insufficient labeled data, authors have not compared the performance of their model with any of the existing deep networks.

2) MICRO EMOTIONS
Micro-emotion appears for a short duration with low intensity, which is also considered as one of the complex emotions since it is difficult to recognize in-the-wild. Numerous handcrafted feature extraction techniques have been proposed in the past to recognize the micro-emotions from videos. However, in recent years, a few prominent research works such as [17]- [19] have shown potential improvements in micro-emotion recognition using deep techniques.
In [17], the authors have used a dual temporal scale CNN architecture to recognize the micro-emotions spontaneously. To avoid overfitting while training the deep model with a sparse dataset, a dual architecture has been constructed based on two shallow CNN networks. Further, to acquire higher-level features, the authors have used optical flow frames instead of raw images. The experimental results show that the proposed architecture achieved 10% better accuracy than the state-of-the-art techniques. In [18], another significant study on micro-emotion recognition is presented, where the authors have utilized an enriched long-term RCNN. In this approach, the CNN modules are used to extract the features, and a long short-term memory (LSTM) is used to predict the micro-emotions. This approach also outperformed existing micro-emotion recognition techniques. However, the approaches proposed in [17] and [18] were not tested with sparse raw image training samples to recognize the micro-emotions.
Peng et al. [19] have proposed a transfer learning-based approach to recognize the micro-emotions with a small training dataset. The ResNet10 [20] deep network, pre-trained on the ImageNet [21] dataset, has been used for transfer learning on a small micro-emotion dataset. This approach achieved prediction accuracy rates of 70.59% and 75.68% on the SAMM [22] and CASME II [23] datasets, respectively. Although it works well with small datasets, a major limitation of this approach is its poor prediction accuracy compared to existing state-of-the-art micro-emotion recognition techniques.

3) PAIN
Highly social species, including humans, use the face to express emotional states, such as pain, during social and medical interaction. In the past, researchers have mainly focused on classifying pain into binary classes, namely having pain or not. However, a large number of recent studies have focused on estimating the intensity of pain at a fine-grained level rather than a simple twofold classification. The facial action coding system (FACS) provides a standard way of defining pain intensity estimation, where FACS is a method based on the movement of facial components, which is effective for representing emotions with rich expression states, such as pain. Numerous researchers have used FACS to estimate the pain intensities in the past. However, the Prkachin and Solomon Pain Intensity (PSPI) [5] metric has been widely used to estimate the pain intensities on a sixteen-level ordinal scale from a combination of six action units (AUs).
VOLUME 8, 2020
The majority of the existing pain estimation approaches are based on typical handcrafted feature-based techniques [8], [24], [25]. The lack of labeled data and standard rules makes pain intensity estimation based on automatic feature extraction challenging. Due to the limited deviations in painful facial expressions between subsequent PSPI scales, researchers tend to curtail the number of pain intensity classes to improve detection performance. One notable work carried out by Hammal and Cohn [8] estimated the pain intensity in four levels using a handcrafted feature extraction method, defined as PSPI = 0 (none), PSPI = 1 (trace), PSPI = 2 (weak) and PSPI ≥ 3 (strong). In this work, the canonical appearance (C-APP) derived from the active appearance model (AAM) is passed through Log-Normal filters in order to extract the features used to classify the pain intensity classes. Additionally, four separate support vector machines (SVMs) have been trained using both 5-fold and leave-one-out cross-validation techniques. The classification rates achieved for the 5-fold and leave-one-out cross-validations are (97, 61), (96, 72), (96, 79) and (98, 80) for the pain intensity levels none, trace, weak and strong, respectively.
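The four-level grouping of PSPI scores described above can be sketched as a small mapping. This is only an illustration of the stated thresholds; the function name `pain_level` is hypothetical and integer PSPI scores are assumed.

```python
def pain_level(pspi):
    """Map an integer PSPI score to the four levels quoted above:
    0 -> none, 1 -> trace, 2 -> weak, >= 3 -> strong."""
    if pspi < 0:
        raise ValueError("PSPI scores are non-negative")
    if pspi >= 3:
        return "strong"
    return ("none", "trace", "weak")[pspi]


levels = [pain_level(score) for score in (0, 1, 2, 3, 7)]
```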
Roy et al. [24] designed another novel framework to estimate the pain intensity levels in four classes, as defined in [8]. Gabor filtering was used for feature extraction, and Principal Component Analysis (PCA) was applied for feature compression. An SVM is then used to classify the various pain intensity levels. The experiment was carried out under frame level and image level settings in order to verify the robustness and accuracy of the framework under person dependent and person independent environments, respectively. Results show that this framework achieved 82.43% average classification accuracy over the four-level pain intensities. In [26], the authors have categorized the pain intensities into six meaningful levels, namely none, mild, discomforting, distressing, intense and excruciating. Zhao et al. [25] studied estimation of the same six-level pain intensities defined in [26]. Among the experiments performed under fully supervised, semi-supervised and unsupervised settings, the maximum estimation accuracy was achieved under the supervised setting. As observed in [8], [24] and [25], the classification rate obtained for leave-one-out is significantly low, which is identified as a major limitation. It leads to a generalization issue for the proposed models, which are not effective across a range of different datasets.
Although deep learning approaches have shown promising results in recent years in various applications including computer vision, only a few works have used deep learning techniques with automatic feature extraction, especially in the automatic pain detection area [27]-[29]. In most cases, a limitation observed is the unavailability of annotated data for distinct pain intensity levels. Despite this limitation, deep techniques still achieved comparable performance in this domain. In [27], Martinez et al. proposed a Recurrent Neural Network (RNN) based approach to estimate the pain intensity using the visual analog scale (VAS). A Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) [30] was used as the core of this model. Although it achieved higher accuracies on lower intensity levels, the limited training data caused poor average accuracies on higher intensity levels. Wang et al. [28] and Zhou et al. [29] have recently attempted to use recurrent CNNs to estimate the pain intensity automatically. In [28], the authors fine-tuned a pre-trained network to transfer the knowledge for pain estimation. Conversely, in [29], a five-layer convolutional network has been used to train the system. In both cases, the proposed models were trained using the whole pain dataset and showed low classification accuracy.

B. ACTIVE LEARNING
Recent deep learning-based architectures heavily rely on accurately annotated training datasets to learn accurate models. In particular, as discussed before, complex emotions are challenging to annotate, which leads to insufficient annotated training data. In order to mitigate this problem, AL techniques have been proposed for use in a range of computer vision tasks. AL-based models are usually trained with sparse data and actively improved using the most informative samples. Various AL algorithms in conjunction with deep networks have been proposed in the past for vision tasks [10], [31]-[34].
Most of the existing works consider only the most informative or minority samples after an active user labeling is performed [32], [33], adopting common AL methods, such as least confidence, margin sampling and entropy. In [31], Li et al. proposed an adaptive AL framework that considers an uncertainty measure and a density measure to select the critical samples. This approach also failed to consider the majority or high confidence samples. In contrast, Wang et al. [10] presented a cost-effective active learning algorithm for deep image classification tasks that selects both majority and minority samples during the active selection process. This approach again depends on a costly and time-consuming human labeling process. Huang et al. [34] proposed a slightly different framework that considers two novel criteria, namely distinctiveness and uncertainty. Although the extensive experiments reported that this framework outperformed other active model adaptation techniques, a limitation is that it only considered binary classification. Further, their approach did not consider a diverse range of samples from the unlabeled data. Thus, the AL algorithm will fail to consider the majority of the samples during the selection process. However, previous studies have shown that considering samples with different confidence values improves the prediction accuracy.

III. METHODOLOGY
Based on the challenges observed in the literature survey above, in this section, we propose a novel incremental AL-based deep framework for complex emotion profiling in-the-wild. The proposed framework consists of three components, namely pre-processing unit, optimization unit and an active learning unit, as illustrated in Fig. 2.

A. PRE-PROCESSING UNIT
In this section, we develop a comprehensive pre-processing mechanism, which is crafted specifically for complex emotion profiling tasks. Normalization and augmentation are two phases of our pre-processing descriptor. This pre-processing technique extends the one we proposed in [35].

1) NORMALIZATION
The complete overview of the normalization phase in the pre-processing unit is illustrated in Fig. 3. In the normalization step, first, the input video frames are converted into greyscale images in order to reduce the cross-database discrepancy between the video frames.
Rotation Correction: We then make two copies of each greyscale image to perform rotation corrections. This process eliminates the complexity related to rotation variation while extracting the features, thus providing a reliable way to extract emotion features from the face. During the rotation correction, as indicated in Fig. 3, we horizontally align the active appearance model (AAM) facial feature points 37 and 46 of the eyes on the first image, and AAM facial feature points 49 and 55 of the mouth on the second image. After that, the first and second images are used to select the expression centric areas of the eye and the mouth regions, respectively, and are then spatially normalized.
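The geometry behind this alignment can be sketched on landmark coordinates alone. The sketch below assumes landmarks are given as (x, y) pixel coordinates and rotates about the midpoint of the two reference points; the function names are hypothetical, and in practice the same rotation would be applied to the whole image rather than only to the points.

```python
import math


def tilt_angle(p_left, p_right):
    """Angle (radians) of the line through two landmarks relative to the x-axis."""
    return math.atan2(p_right[1] - p_left[1], p_right[0] - p_left[0])


def rotate_about(p, center, angle):
    """Rotate point p about center by the given angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = p[0] - center[0], p[1] - center[1]
    return (center[0] + c * x - s * y, center[1] + s * x + c * y)


def level(points, p_left, p_right):
    """Rotate all points so that p_left and p_right end up on a horizontal line."""
    angle = tilt_angle(p_left, p_right)
    center = ((p_left[0] + p_right[0]) / 2.0, (p_left[1] + p_right[1]) / 2.0)
    return [rotate_about(p, center, -angle) for p in points]


# Illustrative coordinates standing in for feature points 37 and 46.
eye_outer_left, eye_outer_right = (30.0, 50.0), (70.0, 60.0)
aligned_left, aligned_right = level(
    [eye_outer_left, eye_outer_right], eye_outer_left, eye_outer_right
)
```

Rotating about the midpoint keeps the inter-landmark distance unchanged, which matters because the later cropping step is expressed in multiples of that distance.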

ROI Selection and Spatial Domain Normalization: Eliminating insignificant information (e.g., background information) in the input video frames will improve the detection or classification accuracy. The raw images of publicly available complex emotion datasets contain a lot of background information. In the past, facial emotion recognition studies, such as [36], have eliminated the background information and a certain portion of the face from the facial images to reduce the complexity. In our approach, not only is the image cropped to eliminate the background information and some portion of the face, but the expression specific features are also selected by focusing on the eye and mouth regions. The cropping process of the eye and mouth regions is illustrated in Fig. 4.
For the eye region, we define $a$ as the distance between AAM facial feature points 37 and 46. The width and height are then set to factors of 1.2 (extended by 0.1 on each side) and 0.5 (0.3 times above and 1.1 below the eye corner AAM facial feature points) of $a$, respectively. Similarly, for the mouth region, $b$ is defined as the distance between the lip corner AAM facial feature points 49 and 55. The width and height are then set to factors of 1.8 (extended by 0.4 on both the left and right sides of the lip corner AAM facial feature points) and 1 (0.5 times above and 0.5 below the lip corner AAM facial feature points) of $b$, respectively. These are the average sizes of the active eye and mouth regions over all images used in the complex emotion datasets.
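As a hedged illustration of how such a landmark-relative crop box can be computed, the sketch below uses the mouth region factors quoted above (0.4 on each side, 0.5 above and 0.5 below). The function name `roi_box`, the coordinate values and the convention that y grows downwards are assumptions; the eye region would use the same function with its own factors.

```python
import math


def roi_box(p_left, p_right, side, above, below):
    """Return (x_min, y_min, x_max, y_max) around two horizontally aligned landmarks.

    side, above and below are fractions of the inter-landmark distance.
    Image coordinates are assumed, so "above" means a smaller y value.
    """
    d = math.dist(p_left, p_right)
    x_min = min(p_left[0], p_right[0]) - side * d
    x_max = max(p_left[0], p_right[0]) + side * d
    y_ref = (p_left[1] + p_right[1]) / 2.0
    return (x_min, y_ref - above * d, x_max, y_ref + below * d)


# Illustrative lip corner coordinates standing in for feature points 49 and 55.
mouth = roi_box((40.0, 90.0), (90.0, 90.0), side=0.4, above=0.5, below=0.5)
mouth_width = mouth[2] - mouth[0]    # 1.8 times the lip corner distance
mouth_height = mouth[3] - mouth[1]   # 1.0 times the lip corner distance
```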
Intensity Normalization: Variations in image features, such as brightness and contrast, often increase the complexity of classification tasks. Contrast limited adaptive histogram equalization (CLAHE) [37] is one of the techniques that can be used to eliminate the variations in contrast and brightness of an image. CLAHE is a widely used variant of the adaptive histogram equalization algorithms, which can be applied to both color and greyscale images. The slope of the transformation function, which is proportional to the cumulative distribution function (CDF) of the neighborhood pixels, determines the contrast amplification of a pixel value. In CLAHE, before the computation of the CDF, the contrast amplification is constrained by a pre-defined value called the clip limit of the histogram. The clip limit in CLAHE regulates the noise level that has to be smoothed and the contrast that has to be enhanced. The primary advantage of CLAHE is that it redistributes the part of the histogram that exceeds the clip limit among all histogram bins rather than just eliminating it. The clip limit and the α value are set to 0.01 and 1, respectively, for the Rayleigh distribution used in this study. Fig. 5 illustrates an example of a sample image before and after the intensity normalization process.
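The clipping and redistribution step can be illustrated on a single histogram. The sketch below is a simplified single pass, assuming an absolute clip limit given in pixel counts; it is not a full CLAHE implementation (no tiling, no interpolation between tiles and no Rayleigh mapping), and the function name is hypothetical.

```python
def clip_and_redistribute(hist, clip_limit):
    """Clip each bin at clip_limit and spread the clipped excess evenly over all bins.

    Single pass only: bins may end slightly above the limit after redistribution.
    The total number of counted pixels is preserved.
    """
    excess = sum(max(h - clip_limit, 0) for h in hist)
    clipped = [min(h, clip_limit) for h in hist]
    share = excess / len(hist)
    return [h + share for h in clipped]


# A toy four-bin histogram with one dominant bin.
limited = clip_and_redistribute([10, 2, 0, 4], clip_limit=4)
```

Because the excess is redistributed rather than discarded, the histogram still sums to the number of pixels in the region, so the CDF computed from it remains a valid mapping.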
Scale Normalization: As the last step of the normalization phase, we perform a scale normalization, where we downsample the image to 128×128 pixels using linear interpolation. Scale normalization reduces the complexity of the feature extractor by placing identical facial feature points of different images at approximately the same location.
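A minimal sketch of resizing with (bi)linear interpolation is given below, assuming a greyscale image stored as a list of rows and a pixel-centre sampling convention. The function name is hypothetical, and an image library would normally be used for this step.

```python
def resize_bilinear(img, out_h, out_w):
    """Resize a greyscale image (list of rows of floats) with bilinear interpolation."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for r in range(out_h):
        # Map the centre of the output pixel back into input coordinates.
        y = (r + 0.5) * in_h / out_h - 0.5
        y = min(max(y, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for c in range(out_w):
            x = (c + 0.5) * in_w / out_w - 0.5
            x = min(max(x, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# Downsample an illustrative 256x256 constant image to the 128x128 target size.
normalized = resize_bilinear([[7.0] * 256 for _ in range(256)], 128, 128)
```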

2) DATA AUGMENTATION
Deep networks often show better performance with large training sets while performing classification tasks, such as profiling complex emotions accurately. However, in this research, we have used a small portion of the benchmark complex emotion datasets for training purposes. Therefore, we use synthetic data augmentation, which is often utilized to enhance the training set in the field of deep learning. Further, this technique has been widely used for many traditional deep network training purposes. Simard et al. [38] proposed a data augmentation method using elastic deformations (translation, rotation and skewing) on real images. Adopting this approach, we used a 2D Gaussian distribution to add random noise to the eye and mouth regions of the face to produce the synthetic frames separately. The Gaussian standard deviation is carefully engineered, since variations that are too small generate meaningless, near-identical images, whereas variations that are too large create a more complex learning environment for the classifier. Moreover, the augmented samples with large variations are carefully removed once again during the sample selection process. We synthesize all the rotation corrected images (both the eye and mouth region corrected images) and use them to train the initial classifier.
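The noise-based synthesis can be sketched as follows. The standard deviation of 5 grey levels, the fixed seed and the function name are illustrative assumptions, not values taken from this work; pixel values are assumed to lie in [0, 255].

```python
import random


def add_gaussian_noise(img, sigma, rng):
    """Return a copy of a greyscale image with zero-mean Gaussian noise added per pixel.

    Values are clamped to the assumed [0, 255] intensity range.
    """
    return [
        [min(255.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in img
    ]


rng = random.Random(0)                     # fixed seed so the sketch is repeatable
region = [[128.0] * 4 for _ in range(4)]   # stand-in for a cropped eye or mouth region
augmented = [add_gaussian_noise(region, sigma=5.0, rng=rng) for _ in range(3)]
```

A very small sigma would make the synthetic copies almost indistinguishable from the source region, while a very large one would distort the expression, which is the trade-off described above.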

B. SAMPLE SELECTION CRITERIA
In this section, we introduce the active sample selection criteria used in our framework. The main stages in sample selection, namely confidence value calculation criteria, self pseudo labeling with high confidence samples and threshold fine-tuning are described below.

1) CALCULATION OF CONFIDENCE VALUE
In the past, many approaches have been proposed to calculate the confidence value using the predicted class probability $P(y_i = j \mid I_i; \theta)$ of a sample for a given deep CNN model $\theta$. Among them, three commonly used active learning techniques are least confidence [39], margin sampling [40] and entropy [41]. Culotta and McCallum [39] defined the least confidence criterion, which sorts the samples in ascending order according to the classification probability predicted by the current model. Eq. 1 describes the definition of the least confidence criterion.
$$lc_i = \max_{j} P(y_i = j \mid I_i; \theta) \tag{1}$$
where $P(y_i = j \mid I_i; \theta)$ indicates the classification probability of the sample $I_i$ for the $j$-th class under the current model $\theta$.
The classifier is uncertain about a predicted sample when it records a lower confidence value. The margin sampling [40] strategy, on the other hand, measures the confidence value according to the margin between the highest and the second-highest probable classes, as described in Eq. 2.
$$ms_i = P(y_i = j_{first} \mid I_i; \theta) - P(y_i = j_{second} \mid I_i; \theta) \tag{2}$$
where $P(y_i = j_{first} \mid I_i; \theta)$ and $P(y_i = j_{second} \mid I_i; \theta)$ indicate the first and the second-highest classification probabilities of the sample $I_i$ under the current model $\theta$. A smaller margin indicates higher uncertainty of the current classifier about the predicted sample. Thus, the samples are ranked in ascending order.
Inspired by information theory, in the entropy sampling [41] criterion, all the predicted class probabilities are utilized to measure the entropy, which is defined in Eq. 3. Higher entropy values for the predicted samples indicate the uncertainty of the current classifier. Hence, all the samples are arranged in descending order.
$$en_i = -\sum_{j=1}^{m} P(y_i = j \mid I_i; \theta) \log P(y_i = j \mid I_i; \theta) \tag{3}$$
where $en_i$ is computed by summing over the probabilities of all $m$ possible classes (i.e., $j = 1 \ldots m$).
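The three criteria described above can be sketched on a single vector of predicted class probabilities. The function names are hypothetical; the definitions follow the standard forms of least confidence, margin sampling and entropy, with the natural logarithm assumed for the entropy.

```python
import math


def least_confidence(probs):
    """Probability of the most likely class; lower values mean more uncertainty."""
    return max(probs)


def margin(probs):
    """Gap between the two most likely classes; smaller gaps mean more uncertainty."""
    first, second = sorted(probs, reverse=True)[:2]
    return first - second


def entropy(probs):
    """Shannon entropy of the predicted distribution; higher values mean more uncertainty."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


peaked = [0.7, 0.2, 0.1]        # a fairly confident prediction
uniform = [1 / 3, 1 / 3, 1 / 3]  # a maximally uncertain prediction
scores = {
    "peaked": (least_confidence(peaked), margin(peaked), entropy(peaked)),
    "uniform": (least_confidence(uniform), margin(uniform), entropy(uniform)),
}
```

The uniform prediction has the lowest top-class probability, a zero margin and the highest entropy, so all three criteria rank it as the more uncertain of the two samples.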

2) AUTOMATIC PSEUDO-LABELING
High confidence samples from the unlabeled dataset are selected and labeled automatically, and are then included in the labeled set for the next training phase. We adopt the approach proposed in [42], where the authors have utilized the least confidence, margin sampling and entropy criteria in high-confidence sample selection for automatic pseudo-labeling.

Algorithm 1 Optimization sample selection (OSS)
 2: Initialise the CNN parameters to $\theta_I$ with the initial training set $L_I$
 3: while not reached the maximum training iterations do
 4:     if $I_i \in \{L_R\}$ then
 5:         Select a set of random samples $S_R$
...
11:     Fine-tune the CNN parameters $\theta_I$ to $\theta_O$ using Eq. 6
12:     Update the selection threshold $\delta$ using Eq. 5
13: end while
14: return $\theta_O$
15: end procedure
In Eq. 4, $j^*$ is the most probable label of the sample $I_i$ under the current model, and $y_i$ denotes the label with the highest prediction probability, where $lc_i$, $ms_i$ and $en_i$ are calculated using Eqs. 1, 2 and 3, respectively. The classification ability of the model grows incrementally during the active learning process. Thus, the selection threshold needs to be updated to improve the model's reliability with newly added labeled data. We update the selection threshold $\delta$ using Eq. 5,
where the threshold $\delta$ is initially set to $\delta_0$ and updated in each iteration using a learning rate decay $d_r$.
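A minimal sketch of threshold-based pseudo-labeling with a decaying threshold is shown below. It assumes, hypothetically, that a sample counts as high confidence when its entropy falls below the threshold and that the threshold is reduced linearly by the decay rate times the iteration index; the exact selection rule of Eq. 4 and the update of Eq. 5 may differ, and all names and numeric values are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pseudo_label(prob_rows, delta):
    """Return {sample index: most probable class} for samples with entropy below delta.

    ASSUMPTION: an entropy-only test stands in for the combined selection rule.
    """
    labels = {}
    for i, probs in enumerate(prob_rows):
        if entropy(probs) < delta:
            labels[i] = max(range(len(probs)), key=probs.__getitem__)
    return labels


def update_threshold(delta, decay_rate, t):
    """ASSUMPTION: linear decay of the threshold at iteration t."""
    return delta - decay_rate * t


predictions = [[0.98, 0.01, 0.01], [0.40, 0.35, 0.25]]
selected = pseudo_label(predictions, delta=0.2)   # only the first sample is confident
next_delta = update_threshold(0.2, decay_rate=0.01, t=1)
```

Lowering the entropy threshold over time makes the automatic labeling stricter as more pseudo-labeled samples enter the training set.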
In the next subsection, we provide details about the optimization unit proposed in our framework.

C. OPTIMIZATION UNIT
After obtaining the pre-processed data, during the optimization phase, we train the initial optimized model, as illustrated in Fig. 6. To optimize the deep CNN model, we use the labeled dataset $L$, which is a subset of a given complex emotion dataset $D$ (i.e., $L \subset D$). Initially, the deep CNN network is trained using 30% of the labeled data, $L_I$ ($0.3 \times L$), to initialize the deep CNN parameters to $\theta_I$. The rest of the annotated dataset, $L_R$ ($0.7 \times L$), is reserved for model optimization. The samples of the initial training dataset are randomly selected from the labeled data. After the initialization step, as indicated in Eq. 6, the model is updated incrementally using the optimization training set $O$, which is obtained from a combination of selected reserved ($L_R$) and augmented ($L_A$) samples.
In Eq. 6, $\theta_O$ is the optimized model, and $\theta_I^i$ and $\varepsilon_i O_i$ are, respectively, the initial model and the optimizing weights in the $i$-th iteration of the incremental process, where $i = 1 \ldots n$.
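The random division of the labeled data into an initial 30% portion and a reserved 70% portion can be sketched as follows. The function name, the fixed seed and the rounding of the split point are illustrative assumptions.

```python
import random


def split_initial_reserved(labeled, initial_fraction=0.3, seed=0):
    """Randomly split labeled samples into an initial set and a reserved set."""
    items = list(labeled)
    random.Random(seed).shuffle(items)   # random selection of the initial samples
    k = round(initial_fraction * len(items))
    return items[:k], items[k:]


# Ten stand-in sample identifiers: three go to the initial set, seven are reserved.
initial_set, reserved_set = split_initial_reserved(range(10))
```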
We propose a robust sample selection algorithm that can progressively select the samples from the optimization training set for the incremental model updating process. Algorithm 1 explains the steps involved in the optimization sample selection (OSS) algorithm in detail. The reserved data instances of the optimized training set are picked in the model updating process without any conditions. However, from the augmented portion of the optimization training set, only the majority of samples with high prediction confidence (i.e., clearly classified) have been selected for the incremental model updating process. This mechanism helps eliminate the augmented images that are highly deviated from the original images. Generally, augmented samples with high deviation increase the complexity of the deep classifiers.
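The selection behaviour described above can be sketched as a small filter: reserved samples pass unconditionally, while augmented samples pass only when the current model classifies them with high confidence. The callable `predict_proba`, the maximum-probability test and the threshold value are hypothetical stand-ins, not the exact criteria of Algorithm 1.

```python
def select_optimization_samples(reserved, augmented, predict_proba, threshold):
    """Build the optimization training set for one model update.

    reserved:      annotated samples, used without any condition
    augmented:     synthetic samples, kept only if confidently classified
    predict_proba: hypothetical callable returning class probabilities for a sample
    """
    selected = list(reserved)
    for sample in augmented:
        # ASSUMPTION: "high confidence" is taken as top-class probability >= threshold.
        if max(predict_proba(sample)) >= threshold:
            selected.append(sample)
    return selected


# Toy probabilities standing in for the current model's predictions.
toy_scores = {"aug_close": [0.95, 0.05], "aug_distorted": [0.55, 0.45]}
picked = select_optimization_samples(
    ["res_1", "res_2"], ["aug_close", "aug_distorted"],
    predict_proba=lambda s: toy_scores[s], threshold=0.9,
)
```

In this toy run the heavily distorted augmented sample is dropped, which mirrors the stated goal of filtering out augmented images that deviate strongly from the originals.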
In the OSS algorithm, active user participation is not required to select samples from both reserved and augmented datasets. We only use previously annotated data to optimize the model. Hence, the selection of random samples from the reserved dataset is entirely based on the available annotations. Another advantage of OSS is the elimination of active user participation through a high confidence sample selection technique to obtain the training samples from the augmented dataset. Thus, OSS completely eliminates the expense of active user participation in the optimization process.

[Algorithm listing, recovered steps]
while not reached the maximum training iterations do
3: Select the high confidence samples U_H using Eq. 4
4: Fine-tune the CNN parameters O to F using U_H
5: Update the selection threshold δ using Eq. 5
6: end while
8: return F
9: end procedure
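The OSS idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the sample identifiers and the confidence scores are hypothetical, and the rule "confidence ≥ δ" is a stand-in assumption for the selection criterion of Eq. 4.

```python
import random


def initial_split(labeled, init_frac=0.3, seed=0):
    """Randomly split the labeled set L into L_I (initial) and L_R (reserved)."""
    rng = random.Random(seed)
    shuffled = list(labeled)
    rng.shuffle(shuffled)
    cut = round(init_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def select_optimization_samples(reserved, augmented, confidence, delta):
    """Keep every reserved sample; keep an augmented sample only if the
    (assumed) confidence score reaches the threshold delta."""
    kept_augmented = [a for a in augmented if confidence(a) >= delta]
    return list(reserved) + kept_augmented


# Toy usage with hypothetical sample ids and confidence scores.
l_init, l_reserved = initial_split(range(10))
aug_scores = {"aug_a": 0.95, "aug_b": 0.40, "aug_c": 0.85}
opt_set = select_optimization_samples(
    l_reserved, list(aug_scores), aug_scores.get, delta=0.8
)
```

In this toy run, all seven reserved samples are kept, while the low-confidence augmented sample `aug_b` is dropped, mirroring the unconditional use of L_R and the filtered use of L_A.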

D. ACTIVE LEARNING UNIT
The main purpose of the active learning unit in our framework is to actively learn and enhance the classification capability of the obtained optimized model with minimum training data. Fig. 7 shows the proposed active learning unit, which uses a comprehensive active sample selection (ASS) algorithm. The proposed ASS algorithm, which is illustrated in Algorithm 2, utilizes the majority samples (i.e., clearly classified samples with high confidence values) in each iteration, as in other conventional AL approaches. The intuition behind this algorithm is to select the samples with high confidence values, annotate them automatically and add them to the training set. However, our approach additionally considers the minority samples (i.e., informative samples) in subsequent phases of the incremental model updating process. As illustrated in phase 1 of Fig. 7, we use the optimized model O, which is derived from the optimization phase, to select the majority samples from the unlabeled data U. We then utilize the majority samples U_H to update the model. Next, we add Gaussian (G) noise to the minority samples U − U_H, and reserve the resultant samples G(U − U_H) as an input for the next phase along with the updated model. Subsequently, we update the selection threshold using Eq. 5. We repeat this incremental model updating process until no further significant improvement in learner performance is observed, i.e., until the training loss of the classifier has converged.
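One phase of the active sample selection described above could look roughly like the following sketch. It is an illustration under stated assumptions: the helper names are hypothetical, the majority test "top class probability ≥ δ" stands in for Eq. 4, and the subtractive threshold update stands in for Eq. 5, whose exact forms are defined earlier in the paper. The noise standard deviation of 10^−2 and the values δ_0 = 0.8 and d_r = 0.2 × 10^−6 are taken from the implementation settings.

```python
import random


def split_by_confidence(probabilities, delta):
    """Split sample indices into majority (confident) and minority sets."""
    majority, minority = [], []
    for index, probs in enumerate(probabilities):
        if max(probs) >= delta:  # assumed stand-in for Eq. 4
            majority.append(index)
        else:
            minority.append(index)
    return majority, minority


def add_gaussian_noise(pixels, sigma=1e-2, seed=0):
    """Perturb a flat list of pixel values with zero-mean Gaussian noise."""
    rng = random.Random(seed)
    return [p + rng.gauss(0.0, sigma) for p in pixels]


def update_threshold(delta, decay_rate):
    """Assumed stand-in for Eq. 5: shift the threshold by the decay rate."""
    return delta - decay_rate


# Hypothetical softmax outputs for three unlabeled samples.
softmax_outputs = [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25], [0.85, 0.10, 0.05]]
majority, minority = split_by_confidence(softmax_outputs, delta=0.8)
noisy = add_gaussian_noise([0.2, 0.5, 0.7])   # applied to a minority sample
next_delta = update_threshold(0.8, 0.2e-6)
```

Here samples 0 and 2 would be pseudo-labeled and used to update the model, while sample 1 would be perturbed with noise and carried into the next phase.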
Next, we explain the deep CNN architecture used in our approach.

E. DEEP NETWORK
The novel CNN architecture integrated into our framework is illustrated in Figure 8. Our proposed deep network architecture consists of two parallel CNN stacks, each with six convolution layers, which is shallower than the majority of the existing state-of-the-art deep networks. Since the training process starts with small complex emotion datasets, in both the optimization and active stages, very deep networks are vulnerable to overfitting. Hence, as indicated in the figure, we have chosen a network with fewer convolution layers and appropriately placed residual blocks, each adding six extra convolution layers to our network. Residual blocks increase the number of layers in the network while providing the flexibility to skip the training of a few convolution layers, and hence minimize the complexity of the deep network. Generally, the skip connections in the residual block eliminate the degradation problem during the training phase. In addition, after each convolution layer in our primary network, multiple rectified linear units (ReLU), dropout, normalization and pooling layers are attached to improve the stability of the deep network.
In summary, we configure both stacks of our parallel deep CNN identically. Each stack of the network accepts images of size 128 × 128 with 3 channels, where the upper stack accepts the upper face and the lower stack accepts the lower face, as illustrated in Figure 8. The first two convolution layers are implemented with a kernel of size 7 × 7, a stride of size 2 and a padding of size 1. The kernel size and the stride size are set to 5 × 5 and 1 for the third and the fourth convolution layers. For the last two layers, the kernel size is further reduced to 3 × 3 with the stride size of 1. For the last four layers, the padding is set to 0. One ReLU layer is always attached immediately after each convolution layer of our primary network. There are two dropout layers placed after the third and fifth convolution layers. Additionally, other than the first convolution layer, a pooling layer is placed after every other convolution layer in our architecture. After fusing the feature maps, we stack 3 fully connected layers with sizes 4096, 4096 and 512, respectively. The first two fully connected layers are followed by two dropout layers in the network.
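To make the layer configuration above concrete, the following sketch traces the spatial size of a 128 × 128 input through the six convolution layers of one stack using the standard output-size formula, floor((size + 2·padding − kernel) / stride) + 1. It deliberately ignores the pooling layers and the residual block, whose parameters are not fully specified here, so the numbers only illustrate the effect of the convolutions themselves; the function and variable names are illustrative.

```python
def conv_output_size(size, kernel, stride, padding):
    """Standard output-size formula for a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel, stride, padding) for the six convolution layers of one stack,
# as stated in the text; pooling and the residual block are omitted.
CONV_LAYERS = [
    (7, 2, 1), (7, 2, 1),  # layers 1-2
    (5, 1, 0), (5, 1, 0),  # layers 3-4
    (3, 1, 0), (3, 1, 0),  # layers 5-6
]


def trace_sizes(input_size=128):
    """Return the feature-map side length after each convolution layer."""
    sizes = []
    size = input_size
    for kernel, stride, padding in CONV_LAYERS:
        size = conv_output_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


sizes = trace_sizes()
print(sizes)  # [62, 29, 25, 21, 19, 17]
```

With convolutions alone, the feature map shrinks from 128 to 17 pixels per side; any pooling applied between layers would reduce these sizes further.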
As indicated earlier, a residual block, which comprises 3 skip connections, is placed between the third and fourth convolution layers of the main network. The first and last of the six convolution layers implemented in the residual block have a kernel size of 5 × 5 and a stride of 2. The kernel size and the stride of the remaining convolution layers are set to 3 × 3 and 1. The padding size is 0 for all the convolution layers in the residual block.
Finally, a softmax layer is utilized to perform the complex emotion classification.

IV. EXPERIMENTS AND RESULTS
In this section, we present the results and analysis of the extensive experiments carried out on publicly available complex emotion benchmark datasets, to demonstrate the cost-effectiveness of our proposed active incremental learning approach. We report the results separately for three different types of complex emotions, namely compound emotions, micro-expressions and pain, which were discussed earlier.

A. EXPERIMENTAL SETTING

1) COMPLEX EMOTION DATASETS
Here, we evaluate the proposed deep active learning-based approach and report the results on various complex emotion datasets, namely the compound emotion dataset [4], the micro-expression datasets CASME [43], CASME II [23], CAS(ME)² [62] and SAMM [22], and the pain dataset UNBC-McMaster Shoulder Pain Expression Archive (UNBC) [44]. Table 1 summarizes the complex emotion datasets used in our experiments. The compound emotion dataset was collected from 230 subjects and provides annotations for 21 emotion categories, comprising 6 basic and 15 compound emotions. CASME [43] is claimed to be the first spontaneous micro-emotion dataset and provides annotations for 8 micro-emotions, namely amusement, sadness, disgust, surprise, contempt, fear, repression and tense. The authors further extended the dataset to CASME II [23], which provides much more sophisticated annotations for 5 micro-expression classes (i.e., happiness, disgust, surprise, repression and others).
CAS(ME)² [62] is another spontaneous dataset that offers 303 expression samples, including 53 micro-expression sequences of four classes, namely positive, negative, surprise and other. Meanwhile, Davison et al. recently released another spontaneous micro-expression dataset, namely SAMM [22], that contains annotated samples for 7 emotion classes, including 6 basic emotions. Lastly, the UNBC pain dataset is dedicated to pain and its intensity levels. It consists of 200 video sequences with frame-level pain intensity annotations.
In order to perform a fair evaluation against existing complex emotion recognition methods, we use 10-fold and leave-one-subject-out (LOSO) cross-validation techniques to report our results. In both cross-validation techniques, the labeled, unlabeled and test sets are manually sliced and consistently swapped across the whole dataset. For the 10-fold cross-validation, in each iteration, 10% of the samples of the whole dataset are reserved as the test set to report the performance of our model. Additionally, in each complex emotion dataset, we reserve 30% of the annotated samples as the labeled portion for the model optimization purpose, and the rest as the unlabeled portion for the active learning process. In contrast, for the LOSO protocol, we reserve one subject for testing purposes in each iteration and present the average results. We follow the same protocol as in the 10-fold cross-validation to generate the labeled and unlabeled sets for optimization and active learning purposes.
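The 10-fold protocol above can be illustrated with a short sketch. It is only one plausible reading of the description: the function name is hypothetical, the split is random rather than manual, and the 30% labeled portion is assumed to be taken from the samples that remain after the test fold is removed.

```python
import random


def split_fold(samples, fold, n_folds=10, labeled_frac=0.3, seed=0):
    """Return (labeled, unlabeled, test) sets for one cross-validation fold."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    # Every n_folds-th sample (offset by `fold`) forms the ~10% test set.
    test = shuffled[fold::n_folds]
    train = [s for i, s in enumerate(shuffled) if i % n_folds != fold]
    # Assumption: 30% of the non-test samples act as the labeled portion.
    n_labeled = round(labeled_frac * len(train))
    return train[:n_labeled], train[n_labeled:], test


labeled, unlabeled, test = split_fold(range(1000), fold=0)
```

For a dataset of 1000 samples this yields 100 test, 270 labeled and 630 unlabeled samples per fold; iterating `fold` from 0 to 9 swaps the test portion across the whole dataset.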

2) IMPLEMENTATION
In the training phase, the stochastic gradient descent with momentum (SGDM) method is used as the optimizer. Other parameters, such as the learning rate, momentum, weight decay and Gaussian standard deviation, are set to 10^−6, 0.9, 5 × 10^−5 and 10^−2, respectively. The same CNN parameter values are used without any changes in the experiments carried out on all complex emotion datasets. For the active learning environment, we set the initial threshold δ_0 and decay rate d_r as shown below in pairs, for the least-confidence, margin-sampling and entropy-based methods, respectively: [(8 × 10^−1, 0.2 × 10^−6), (8 × 10^−1, 0.2 × 10^−6) and (0.2 × 10^−6, −0.1 × 10^−6)]. These parameters are updated throughout the training process.
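As a small illustration of how the stated hyperparameters enter a single SGDM update, the sketch below applies one step to a scalar weight. The formulation shown (L2 weight decay folded into the gradient, followed by a classical momentum update) is one common variant and is an assumption here; the text does not specify which variant was used.

```python
def sgdm_step(weight, grad, velocity, lr=1e-6, momentum=0.9, weight_decay=5e-5):
    """One SGD-with-momentum step on a scalar weight (assumed formulation)."""
    grad = grad + weight_decay * weight         # L2 weight decay term
    velocity = momentum * velocity - lr * grad  # momentum accumulation
    return weight + velocity, velocity


# Hypothetical weight and gradient values, starting from zero velocity.
w, v = sgdm_step(weight=1.0, grad=0.5, velocity=0.0)
```

With a learning rate of 10^−6 the step is tiny (about −5 × 10^−7 here), which is consistent with fine-tuning a model incrementally rather than training it from scratch.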

3) METRICS
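The metrics used in complex emotion analysis are accuracy, mean squared error (MSE) and Pearson's product-moment correlation coefficient (PCC). Accuracy, which is used to present the majority of the results in our experiments, is defined in Eq. 7.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$
where TP, FP, TN and FN are the numbers of true positives, false positives, true negatives and false negatives, respectively.
Additionally, for the pain intensity estimation experiment, MSE and PCC are used to report the performance of our proposed approach, which are defined in Eq. 8.
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \mathrm{PCC} = \frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(\hat{y}_i - \bar{\hat{y}}\right)^2}} \tag{8}$$
where n is the number of samples in the test set, $y_i$ and $\bar{y}$ are the ground truth of the i-th frame and the mean of $\{y_1, \ldots, y_n\}$, and $\hat{y}_i$ and $\bar{\hat{y}}$ are the predicted pain intensity level of the i-th frame and the mean of $\{\hat{y}_1, \ldots, \hat{y}_n\}$, respectively. A higher value for PCC is better, while a lower value for MSE is better.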
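For readers who want to reproduce the three metrics, the following sketch computes them with the standard definitions of accuracy, MSE and Pearson's correlation. The function names and the toy counts and intensity values are hypothetical and are not taken from the experiments.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of correct decisions among all decisions."""
    return (tp + tn) / (tp + tn + fp + fn)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between ground truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_cc(y_true, y_pred):
    """Pearson's product-moment correlation coefficient."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Toy frame-level pain intensities: one frame is off by a single level.
truth = [0, 1, 2, 3]
preds = [0, 1, 2, 4]
acc = accuracy(tp=40, tn=45, fp=5, fn=10)
mse = mean_squared_error(truth, preds)
pcc = pearson_cc(truth, preds)
```

In this toy case a single prediction that misses by one intensity level gives an MSE of 0.25 while the PCC stays close to 1, which illustrates why a low MSE indicates that errors fall on nearby intensity classes.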

B. EVALUATION FOR COMPOUND EMOTIONS
First, in this experiment, we evaluate the performance of our proposed framework on classifying the neutral face and the 21 emotion categories defined in the compound emotion dataset [4]. Figure 9 illustrates the improvement in the average classification accuracy of our approach over five other state-of-the-art deep networks, with respect to the percentage of training data utilized in the optimization (left image) and active learning (right image) units. It can be seen that our approach performs favorably in both the optimization and active learning steps against the compared deep networks.
Our proposed framework utilized 78% of the labeled training data for the model optimization to reach a stable average accuracy of 73.9%. This shows that the presented model is feasible and can be optimized with a small labeled dataset. Except for AlexNet [45], the other deep networks compared in this experiment consumed more labeled training data for the model optimization. The proposed model also achieved a better accuracy in the optimization stage than the other state-of-the-art deep networks. After the optimization, in the incremental active learning phase, the average classification accuracy of our approach improved significantly and reached a peak average accuracy of 85.02% with only 50.5% of the unlabeled training data. It is clear that the incremental active learning phase significantly improved the average classification accuracy of compound emotions. In addition, our model recorded consistent accuracies (≥ 72%) for each emotion, as summarized in Table 2. As can be seen in the table, we compared the accuracies for each compound emotion with Du et al. [4] and five other state-of-the-art deep networks. For some emotional states, such as neutral, sadly fearful, fearfully angry, angrily surprised, angrily disgusted and hate, the best accuracy rates were not recorded by our model, because these emotion classes contain a considerable amount of intra-class variation and can easily be confused with other classes. However, the overall comparison shows that our model has a better classification ability for most of the compound emotions.
Further, the comparison of the overall results achieved on the compound emotion dataset is shown in Table 3. It can be seen that our proposed model achieved better overall results on this dataset. Notably, our model shows a considerably better F1-score than the other existing models, which indicates that our model is also effective with imbalanced datasets.

C. EVALUATION FOR PAIN INTENSITY ESTIMATION
Second, as described earlier, we evaluate the presented framework on the UNBC pain dataset [44] for pain intensity estimation. In this experiment, we perform 16-level pain intensity estimation using the presented model, where the pain intensity levels are as defined in the PSPI metric. Figure 10 compares the change in average accuracy against the percentage of the labeled data during the optimization (left) and incremental active learning (right) stages for our model and other deep networks on the pain dataset. The results demonstrate that our proposed framework achieved accuracies of 82.5% and 98.8% after the optimization and incremental active phases, respectively, which are better than those of the state-of-the-art deep networks compared here. In addition, our approach used 66% and 65% of the labeled samples in the respective stages, which is much lower than the other deep networks, except for AlexNet [45]. Further, we compare our proposed framework with existing pain intensity estimation methods and the state-of-the-art deep networks in Table 4. The comparison shows that our approach outperforms the existing pain intensity estimation benchmark methods and the state-of-the-art deep networks by a clear margin. Our method achieved the highest overall accuracy of 98.8%, with an MSE of 1.21 and a PCC of 0.79, under 10-fold cross-validation. The low MSE reported for our method demonstrates that the majority of the misclassified samples are confused with nearby pain intensity classes. Some very recent pain intensity estimation methods, such as [51], are not compared with our approach as they use a reduced set of pain intensity classes.

D. EVALUATION FOR MICRO EXPRESSIONS
Third, to demonstrate the feasibility and effectiveness of the model for recognizing subtle micro-expressions, we further evaluated our framework on four benchmark micro-expression datasets, namely CASME [43], CASME II [23], CAS(ME)² [62] and SAMM [22]. The accuracy changes for our approach and the other deep networks behave similarly to those observed on the compound emotion [4] and UNBC pain [44] datasets. Compared to the state-of-the-art deep networks, our approach obtained the best accuracies in both optimization and incremental active learning stages on all three micro expression datasets. Tables 5 and 6 present the comparison of recent benchmark micro-expression recognition methods with our proposed approach. It can be observed that our approach outperformed all the state-of-the-art deep networks with 10-fold cross-validation. For the LOSO cross-validation, our approach comprehensively outperformed all the benchmark micro-expression recognition approaches on CASME [43], CASME II [23] and CAS(ME)² [62]. In addition, our approach has shown a competitive performance for micro-expression recognition on the SAMM [22] dataset, although Thuseethan et al. [35] achieved a better performance on it. However, our approach utilized fewer labeled samples for training than [35]. We can also see that the classification accuracy recorded for [35] is much lower than ours on the other two micro-expression datasets.

TABLE 5. Benchmarking against existing micro-expression recognition approaches and the state-of-the-art deep networks on the CASME [43], CASME II [23] and SAMM [22] datasets with 10-fold cross-validation.
In order to evaluate the generalization ability of our proposed framework for micro-expression recognition, a cross-database evaluation has been carried out on selected micro-expression categories that are commonly available (e.g., disgust) in all three datasets. To perform this, we trained our framework on one dataset and tested it on the other two. The corresponding results are presented in Table 7. The cross-database evaluation reveals that CASME [43] and CASME II [23] generalize best to each other, and generalize less well to the SAMM [22] dataset. This is because the CASME II [23] dataset is an extension of CASME [43], and both contain a part of the same samples. An additional reason may be that both the CASME [43] and CASME II [23] datasets were collected under the same environment, unlike the SAMM [22] dataset, which was constructed under a completely different environment. The CAS(ME)² [62] dataset also achieved better accuracies on the CASME [43] and CASME II [23] datasets than on the SAMM [22] dataset. In summary, the classification accuracies obtained in the cross-database evaluation are satisfactory and affirm that our model is readily generalizable.

E. ABLATIVE STUDY
Our proposed framework incorporates an image pre-processing unit, as described in Section 3.1. To verify that the integrated pre-processing technique improves the performance of the presented framework, we have carried out an ablative study. To perform this, we compare the classification accuracies of two reduced variants, obtained by eliminating all or a few significant stages of our pre-processing phase, namely (a) no pre-processing and (b) no ROI selection and spatial domain normalization, with (c) our final framework, which includes all the pre-processing stages. The obtained accuracies of these variants on all complex emotion datasets are shown in Figure 11. The results clearly indicate that the integrated pre-processing unit considerably enhances the performance of our proposed complex emotion recognition framework. In particular, on the CASME II [23] dataset, the classification accuracy improved by about 34 percentage points (i.e., from 51.67% to 85.97%) from the variant without the pre-processing unit to the final framework. This substantial performance improvement confirms the significance of the pre-processing unit in our framework.

F. COMPUTATIONAL COMPLEXITY ANALYSIS
We further present a comprehensive comparison of the computational complexity between our approach and other state-of-the-art deep networks. The computing environment used to obtain the computational complexity results is an Intel(R) Xeon(R) 2.20 GHz processor accelerated using an NVIDIA GeForce GTX 1080 Titan GPU. Figure 12 presents the training and testing time consumed on each of the complex emotion datasets. To simplify the comparison, we first set the computational cost of our model to 1. Then, we represent the computational costs of the other state-of-the-art deep networks as multiples of the computational cost of our method. As can be seen, our method is computationally efficient compared to existing state-of-the-art deep networks in recognizing complex emotions. In the training phase, our approach is 33.6% and 185.4% more time-efficient than AlexNet and Inception-3, respectively. The advantage is even larger in the testing process, where our approach is 47.4% and 216.8% more time-efficient than AlexNet and Inception-3, respectively.
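The normalization used for this comparison amounts to dividing every measured time by the time of the reference model. A minimal sketch, with hypothetical network names and timings that are not taken from Figure 12:

```python
def relative_costs(times, baseline):
    """Express each measured time as a multiple of the baseline's time."""
    base = times[baseline]
    return {name: t / base for name, t in times.items()}


# Hypothetical wall-clock times in seconds, for illustration only.
measured_seconds = {"ours": 100.0, "network_a": 120.0, "network_b": 250.0}
relative = relative_costs(measured_seconds, baseline="ours")
```

The baseline maps to exactly 1, and a value such as 2.5 reads as "2.5 times the cost of our method", which keeps the comparison independent of the absolute speed of the hardware.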

G. DISCUSSION
In general, our extensive experiments on all three complex emotion scenarios show that our presented framework is promising compared to existing benchmark complex emotion methods and state-of-the-art deep networks. As a common pattern, it can be seen that AlexNet [45] progressed better in the optimization stage, and achieved an accuracy competitive with our approach using a small amount of labeled samples. However, AlexNet [45] failed to converge in the incremental active phase and did not outperform our method. In contrast to AlexNet [45], deeper networks such as ResNet-152 [20] and Inception-3 [47] showed slow progression in the optimization stage and converged to competitive classification accuracies in the incremental active phase. Yet, such deeper networks utilized more labeled and unlabeled samples in both stages to reach their maximum classification accuracy rates, in contrast to our proposed approach.
Because our proposed approach requires only limited labeled samples to train the model, it has potential applications in recognizing emerging emotions spontaneously. In addition, our approach substantially reduces the dependence on potentially inaccurate human annotation for complex emotion recognition. This helps in obtaining more accurate recognition of complex emotions in systems such as emotional robots and human behavior analysis. For example, recognizing complex emotional cues helps emotion-sentient robots to respond effectively to human behaviors, and hence enhances the social interaction between humans and machines. Further, intelligent personal assistants, such as Apple's Siri, Google Assistant, Amazon's Alexa and Microsoft's Cortana, can be improved to recognize complex emotions using our technique. In the future, they could provide personalized assistance to individuals, such as elderly people, which might help alleviate the loneliness faced by the elderly. Our approach can also be extended to profile complex emotions using video together with audio to provide personalized assistance.

V. CONCLUSION AND FUTURE WORK
In this paper, we proposed a novel incremental active learning-based end-to-end deep CNN framework to effectively perform complex emotion recognition using facial expressions. To the best of our knowledge, the proposed approach is the first that exploits an automatic incremental and active learning technique to accurately predict complex emotions from sparse training data. Besides the key contributions of our approach, an additional advantage of our method is that it requires no manual annotations during the active learning-based training process. The extensive experiments on benchmark complex emotion datasets show that our proposed framework outperformed existing state-of-the-art deep networks and current benchmark complex emotion recognition methods. In the future, we aim to incrementally learn new complex emotions using active learning-based approaches in in-the-wild environments. In addition, temporal, voice and textual features may also be considered to predict complex emotions accurately.