Improved Action Unit Detection Based on a Hybrid Model

Facial action unit detection and facial expression recognition are two closely intertwined problems in behavior analysis. This paper presents evidence that model architectures designed for facial expression recognition can be seamlessly adapted to the action unit detection task, taking advantage of the structural similarity between the two problems. As a sample case, we have adapted the Pyramid crOss-fuSion TransformER (POSTER) model for action unit detection by adjusting the architecture to handle a multilabel problem with one output per action unit. Then, we tuned the training parameters and retrained the model to achieve state-of-the-art performance on two widely used datasets: DISFA and BP4D. The results obtained with a standard 3-fold cross-validation setup show an average F1 score of 67.8% for DISFA and 65.5% for BP4D. These results outperform state-of-the-art models for AU detection, support the effectiveness of the approach, and suggest devoting more effort to adapting existing architectures to leverage the synergies between facial expression recognition and action unit detection.


I. INTRODUCTION
Facial expressions are a spontaneous and powerful form of nonverbal communication for humans. During communication, individuals can infer the emotions and mental states of others by interpreting their facial expressions. The intuitiveness and effectiveness of facial expressions can greatly improve machines' understanding of human emotions and psychological behavior patterns in human-computer interaction scenarios. Through automatic facial expression recognition, machines can better understand human intentions and provide more personalized, natural, and human-like interactions, which can be especially beneficial in fields such as customer service [12], healthcare [28], and education [37]. As a result, the capability of detecting facial expressions has propelled it to become a crucial component of human-computer interaction, and it has received increasing interest in areas such as computer vision and affective computing.
The associate editor coordinating the review of this manuscript and approving it for publication was Alessandro Floris.
Facial action units (AUs) are specific, measurable movements of the facial muscles that correspond to different facial expressions. These movements can be mapped to basic emotions such as happiness, sadness, anger, fear, disgust, and surprise [9] using the Facial Action Coding System (FACS) [10]. However, in the existing literature, facial expression recognition and AU detection have been treated as distinct problems, overlooking their inherent relationship. Consequently, separate architectures have been developed for each problem, disregarding the fact that they essentially address the same underlying challenge, and therefore architectures proposed for one problem can likely be also effective for the other.
Our main contribution in this paper is to demonstrate the significant potential of adapting architectures initially proposed for facial expression recognition to the AU detection problem. As a case study, we have adapted the architecture proposed in POSTER [42] to achieve state-of-the-art performance on the DISFA [26] and BP4D [41] datasets. In DISFA, we have reached an average F1 score of 67.8% across all action units, using a standard experimental setting widely adopted in the existing literature [24], [32]. The F1 score achieved in BP4D was 65.5%. These results surpass the performance of recent proposals that were specifically designed to tackle the AU detection problem, showing the potential of seamlessly reusing existing architectures initially designed for facial expression recognition.
The paper is organized as follows: Section II presents an overview of the current state of facial action unit (AU) detection methods. Next, the process followed to adapt the POSTER architecture is explained in Section III. Then, Section IV provides a detailed description of the datasets and experimental setup used to evaluate the approach. The results of the experiments are presented and discussed in Section V, together with an ablation study to analyze the contribution of the different components to the reported gains. Finally, conclusions are drawn in Section VI.

II. BACKGROUND AND STATE OF THE ART
Action unit detection has attracted significant attention from researchers over the years, and various methods have been developed to address this issue. First approaches to detecting AUs relied on using hand-crafted features for classification, such as appearance features (e.g. Histogram of Oriented Gradients (HOG) [1], Gabor filters [34], Local Binary Patterns (LBP) [16]), and/or geometric features based on facial landmark points [17], [23] (e.g. locations and shapes). However, these methods failed to capture relevant image information for classification. The emergence of deep learning techniques has revolutionized AU detection by enabling the computation of features directly from pixel-level image data, allowing for dynamic modeling of the extracted features and their correlation to the target task during training. These methods have boosted AU detection performance, achieving superior performance compared to traditional methods.
In the design of AU detection approaches, there are two key aspects that should be carefully considered. The first one is the extraction of local features that are related to the activation of each action unit. The second one is the recognition of the inter-dependencies between different action units, as they often appear together in a single facial expression, e.g., when a person is smiling, both AU6 (Cheek raiser) and AU12 (Lip corner puller) activate together.
With regard to local feature extraction, facial landmarks are commonly used to robustly locate regions of interest (ROIs) and key points related to action units, thus reducing distraction from less important facial areas. Li et al. [20] proposed the EAC-Net architecture, which used the landmarks provided in the dataset to manually locate the centers for the AUs and build a bounding box around these centers. Those allowed the construction of attention maps that were integrated into a CNN to enhance the feature map. Shao et al. [32] further improved this idea in JÂA-net, by also learning key regions that were shared to learn the landmarks, and refining an attention map that was used to predict the AUs. Niu et al. [29] introduced LP-Net, which used landmarks to learn local features, and also proposed a person-specific shape regularization module that captured person-specific relationships between facial landmarks. Ge et al. proposed LGR-Net [11], a method for extracting robust local features from ROIs identified by landmarks, using multiple branches to enhance feature robustness, and then fusing and refining the features to represent the whole face. Jointly, these works demonstrate the effectiveness of using landmarks to refine local feature representation.
In relation to AU inter-dependencies, traditional approaches that use Convolutional Neural Network (CNN) architectures usually learn them implicitly during training. However, some approaches have yielded improved results by explicitly modeling those relationships. SEV-Net [39] used an inter-AU encoder that compared the semantics generated for each AU, in order to exploit the relationship between AUs and improve the accuracy of AU detection. FAUDT [15] created a specific correlation module that extracted discriminative features for each AU and modeled their connection thanks to a transformer-based architecture. The latest developments in Graph Neural Networks (GNNs) have also enabled more explicit modeling of the correlations among AUs. In this direction, Li et al. [18] used a Gated Graph Neural Network (GGNN) integrated into a multi-scale CNN framework called SRERL to spread information through the graph and improve AU representation. Liu et al. [22] proposed AU-GCN to extract latent representations of related AU regions using an auto-encoder and subsequently fed them into a Graph Convolutional Network (GCN) as nodes. Luo et al. [24] explored advanced AUs relation modeling by using multi-dimensional edge features in the CNN-GCN-based method named ME-GraphAU. More recently, Yang et al. [40] proposed FAN-Trans, a hybrid network that combines convolutional and transformer blocks to learn the relationship between AUs. An online knowledge distillation was employed during training in this case to further improve the model's performance. In another work, Wei et al. [36] proposed ABRNet, which models AU relations in different crowds, using a relation learning module and a self-attention fusion module. Additionally, Chen et al. [5] developed CISNET to remove the subject variation effect in AUs detection using a causal intervention module.
We shall also remark on the strong influence of transformer-based architectures [35] in action unit detection models. They quickly became popular in NLP due to their ability to handle long-term dependencies, and demonstrated comparable performance to CNNs on diverse visual benchmarks for Computer Vision tasks such as Image Classification, Object Detection, and Image Segmentation [14].
(Figure caption fragment) ... [7] and the landmark detector is MobileFaceNet [3]. They generate the image features X_img and the landmark features X_lm.
In addition, transformer-based architectures have produced successful results in facial expression recognition [15], [38], [42]. The earliest transformer-based model targeted at Computer Vision tasks is known as Vision Transformer (ViT) [8], which uses a pure transformer to directly classify the complete image by processing sequences of image patches.
In AU detection, transformers were used to compare semantic descriptions of action units [39] and learn discriminative AU features [15]. In [18], it was shown that the transformer structure and self-attention mechanism can better learn the co-occurrence between regions of interest. More recently, Swin Transformers have even been used to replace the typical CNN backbone [24].
POSTER [42] applied some of the latest advancements to propose a novel architecture that used landmark features and implicitly considered inter-dependencies between different action units. This was done by using a two-stream architecture that comprised a landmark stream and an image stream. In addition, a Vision Transformer block was incorporated to facilitate mutual guidance between the two streams and enable global correlation across features through a self-attention mechanism.
The POSTER model is illustrated in FIGURE 1. It is composed of two backbones, namely IR-50 [7] and MobileFaceNet [3], [4]. IR-50 produces image features and MobileFaceNet focuses on the generation of 68 landmarks. Next, these features are processed by a pyramidal structure that creates small, medium, and large representations of the features output by the image and landmark backbones, enabling the extraction of information at various levels of detail. The resulting features are then embedded and analyzed jointly in a transformer represented in FIGURE 2. In the multi-head attention block, the query matrices of the two feature types are swapped, allowing for a refinement of both types of information towards one another. This cross-fusion technique combines global and local features and provides higher stability toward identity variations because landmarks provide higher robustness to age, skin tone, and gender. It also allows the model to address two intrinsic problems associated with action unit detection: inter-class similarities and intra-class variations. The last layer is fully connected and linearly projects the features to a space whose dimensionality is the number of emotions. The emotion class is decided based on the results of a softmax activation on the predicted values.

III. MODEL ADAPTATION
The first change required to adapt an architecture designed for facial expression recognition to deal with AU detection refers to the output. Facial expression recognition is a multi-class classification problem, in which only one label can be active at a time. On the contrary, AU detection is a multi-label classification problem, as several AUs may be simultaneously active. Hence, it is essential to transform the model's output into one neuron per action unit, where each neuron indicates whether the corresponding action unit is activated or not. This adjustment requires additional modifications. The transformation of the output results in a heavily unbalanced problem, with usually a significantly larger number of examples from the negative class. Consequently, the loss function needs to account for such an imbalanced scenario. Weighted loss functions or specialized approaches, such as the focal loss [21], are some of the most common choices that allow for effectively addressing this challenge.
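The difference between the two output formulations can be illustrated with a minimal sketch in plain Python. The logit values below are hypothetical and chosen only for illustration; the function names are not taken from the POSTER code base.

```python
import math

def softmax(logits):
    """Multi-class head: scores compete and sum to 1, so one class wins."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def multilabel_head(logits):
    """Multi-label head: one independent activation probability per AU."""
    return [sigmoid(z) for z in logits]

# Hypothetical logits for three output neurons (illustrative values only).
logits = [2.0, 1.5, -3.0]
emotion_probs = softmax(logits)      # mutually exclusive classes
au_probs = multilabel_head(logits)   # several AUs can be active at once
```

With a softmax head the probabilities are forced to compete, whereas the per-neuron sigmoid lets the first two outputs both exceed 0.5, which is the behavior needed when several AUs are active in the same frame.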
Another crucial aspect to consider is whether retraining the entire network is necessary. Due to the inherent structural similarity between the two tasks, it is reasonable to expect that the essential features extracted in the context of one problem remain valid and exhibit a similar nature in the other task. While it may initially appear sufficient to retrain only the last layers to leverage the precomputed weights, there are additional benefits to be gained by retraining the entire network.
Regarding the evaluation of the resulting model, accuracy is commonly used for facial expression recognition. However, when it comes to AU detection, accuracy can be misleading due to class imbalance, potentially leading to biases towards the majority class. To address this issue, the F1 score is a more reliable performance metric for AU detection, as it simultaneously considers both precision and recall.
VOLUME 11, 2023 77587 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
In this work, we present a practical case of this adaptation by applying it to the POSTER architecture, which surpassed state-of-the-art performance for emotion classification in RAF-DB [19], FERPlus [2] and AffectNet [27]. In particular, we have successfully tailored the POSTER model for AU detection by adjusting the training parameters and converting the output to a multilabel binary classification problem with one binary label per action unit (activated/non-activated).
As in POSTER [42], we utilized the IR50 [7] image backbone pre-trained on the Ms-Celeb-1M dataset [13], and MobileFaceNet [3] was chosen to produce landmark features. The image features X_img ∈ R^{P×D} and the landmark features X_lm ∈ R^{P×D} are fused along the P dimension to obtain fused features X ∈ R^{2P×D}. In this context, P represents the number of landmarks and D is the feature dimension. The fused features X are then utilized as inputs in the pyramid structure. In the feature pyramid structure, X was sampled into three different sizes: a large feature vector with an embedding dimension of 512 (D_L = 512), a medium feature vector with an embedding dimension of 256 (D_M = 256), and a small feature vector with an embedding dimension of 128 (D_S = 128). Then, eight cross-fusion transformer encoders were introduced. Each transformer encoder operated on these feature vectors. As previously mentioned, the queries of image features and landmark features are exchanged within the self-attention mechanism, as illustrated in FIGURE 2. The transformer encoders were configured with a Multilayer Perceptron (MLP) ratio of 2 and a drop path rate of 0.01. Finally, the large feature vector was utilized for the classification.
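The query exchange can be written compactly as follows. This is only a sketch that assumes standard scaled dot-product attention; the symbols Q, K, V (query, key, and value projections of each stream) and d (the per-head dimension) are notation introduced here for illustration and do not appear in the text above.

```latex
\[
\begin{aligned}
\mathrm{Attn}_{img} &= \mathrm{softmax}\!\left(\frac{Q_{lm}\, K_{img}^{\top}}{\sqrt{d}}\right) V_{img}, \\
\mathrm{Attn}_{lm}  &= \mathrm{softmax}\!\left(\frac{Q_{img}\, K_{lm}^{\top}}{\sqrt{d}}\right) V_{lm}.
\end{aligned}
\]
```

Under this reading, the image stream is attended using queries computed from the landmark stream and vice versa, so each stream is refined under the guidance of the other.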
With regard to training, all images were pre-processed by using RetinaFace [6], to crop the face region and filter out images that were not detected or had incomplete faces. The remaining images were aligned by taking the coordinates of the two eyes as a reference. They were then resized to 224×224 pixels to yield a format that was compatible with the model. The maximum number of training epochs was set to 15, as it was observed that the models began to overfit beyond that point.
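The eye-based alignment step can be sketched as follows. This is a geometric illustration only, not the authors' pre-processing code: the eye coordinates are hypothetical, and the helper names are introduced here. The idea is to measure the tilt of the line joining the two eyes and rotate by the opposite angle about the eye midpoint.

```python
import math

def eye_tilt_degrees(left_eye, right_eye):
    """Angle of the line joining the two eyes with respect to the x axis."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

def rotate_point(point, center, angle_deg):
    """Rotate a 2D point about a center by angle_deg (standard 2D rotation)."""
    a = math.radians(angle_deg)
    x, y = point[0] - center[0], point[1] - center[1]
    return (center[0] + x * math.cos(a) - y * math.sin(a),
            center[1] + x * math.sin(a) + y * math.cos(a))

# Hypothetical eye coordinates in pixels (illustrative values only).
left_eye, right_eye = (80.0, 100.0), (140.0, 112.0)
center = ((left_eye[0] + right_eye[0]) / 2, (left_eye[1] + right_eye[1]) / 2)
tilt = eye_tilt_degrees(left_eye, right_eye)

# Rotating by the opposite angle cancels the tilt and levels the eyes.
aligned_left = rotate_point(left_eye, center, -tilt)
aligned_right = rotate_point(right_eye, center, -tilt)
```

In a full pipeline the same rotation would be applied to every pixel of the cropped face before resizing it to 224×224.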
To alleviate the potential effect of class imbalance [25] in the training samples, we imported the weighted asymmetric loss function proposed in [24], which is defined as:

\[
L_{WA} = -\frac{1}{N} \sum_{i=1}^{N} w_i \left[ y_i \log(p_i) + (1 - y_i)\, p_i \log(1 - p_i) \right]
\]

where N is the number of AUs, y_i is the ground truth label for the i-th AU (0 for non-activated and 1 for activated), and p_i is the predicted score (a value in the interval [0, 1]). The weights w_i for the i-th AU are defined as

\[
w_i = \frac{N\,(1/r_i)}{\sum_{j=1}^{N} (1/r_j)}
\]

where r_i denotes the i-th AU's occurrence rate computed from the training set.
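A minimal sketch of this loss for a single frame is given below. It assumes the form used in [24], in which the term for non-activated AUs is additionally scaled by the predicted score p_i, and it takes N as the number of AUs. The occurrence rates and predictions are hypothetical, and the clamping constant eps is an implementation detail added here for numerical safety.

```python
import math

def au_weights(occurrence_rates):
    """w_i = N * (1 / r_i) / sum_j (1 / r_j): rarer AUs get larger weights."""
    n = len(occurrence_rates)
    inverse = [1.0 / r for r in occurrence_rates]
    total = sum(inverse)
    return [n * v / total for v in inverse]

def weighted_asymmetric_loss(y_true, p_pred, weights, eps=1e-7):
    """Assumed form: -(1/N) * sum_i w_i [y_i log p_i + (1 - y_i) p_i log(1 - p_i)]."""
    n = len(y_true)
    total = 0.0
    for y, p, w in zip(y_true, p_pred, weights):
        p = min(max(p, eps), 1.0 - eps)  # keep log() arguments strictly positive
        total += w * (y * math.log(p) + (1 - y) * p * math.log(1.0 - p))
    return -total / n

# Hypothetical occurrence rates for three AUs (illustrative values only).
rates = [0.05, 0.2, 0.4]
weights = au_weights(rates)

good_loss = weighted_asymmetric_loss([1, 0, 0], [0.9, 0.1, 0.1], weights)
bad_loss = weighted_asymmetric_loss([1, 0, 0], [0.1, 0.9, 0.9], weights)
```

The weights sum to N by construction, so the loss keeps the same overall scale as an unweighted average while shifting emphasis towards rarely activated AUs.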

A. DATASETS
All experiments were conducted on two common datasets widely used in the literature, namely DISFA [26] and BP4D [41]. The DISFA dataset [26] recorded the spontaneous facial expressions of 27 adult subjects, with 12 females and 15 males, as they watched a four-minute video in a laboratory environment.
Videos were captured by BumbleBee point grey stereo-vision system at 20 fps under uniform illumination using a high resolution of 1024 × 768 pixels. Each video comprised 4,845 frames. Each frame was manually labeled with the intensity of 8 Action Units on a 0 to 5 scale. The resulting dataset includes approximately 130 000 frames.
The BP4D dataset [41] contains 3D and 2D dynamic spontaneous facial expressions of 41 subjects, with 23 females and 18 males. The dataset was acquired in a controlled laboratory environment. Recordings were taken while subjects were doing 8 different tasks designed to elicit specific emotions (interview, video-clip viewing and discussion, startle probe, improvisation, threat, cold pressor, insult, and smell). A total of 328 videos were recorded using two grey-scale stereo cameras and one color video camera. The resolution of the 2D frames was 1040 × 1392 pixels. For each task, approximately 500 frames were manually annotated to indicate the presence or absence of 12 AUs and their corresponding intensity levels, coded on an ordinal scale from 0 to 5. This resulted in a dataset of around 140 000 valid frames.
The two datasets are heavily unbalanced. TABLES 1 and 2 show the number of positive and negative samples for each AU, along with the negatives-to-positives ratio, in DISFA and BP4D, respectively. All AUs in DISFA exhibit an imbalance in favor of the negative label. The most unbalanced case happens for AU9, with just one activated sample for every 22.9 non-activated entries. The most balanced AU in this dataset is AU25, with one activated sample for every 2.61 non-activated samples. The level of imbalance is lower in BP4D. Still, AU1, AU2, AU4, AU15, AU23 and AU24 contain over 3.5 times more negative samples than positive ones. However, in AU7, AU10 and AU12 the class imbalance is in favor of the activated class.
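To make these ratios more tangible, the short sketch below converts a negatives-to-positives ratio into the share of activated samples, and into the accuracy that a trivial classifier would reach by always predicting "non-activated". The two ratios are the ones quoted above for DISFA; the function names are introduced here for illustration.

```python
def positive_rate(neg_to_pos_ratio):
    """Share of activated samples when there are `ratio` negatives per positive."""
    return 1.0 / (1.0 + neg_to_pos_ratio)

def all_negative_accuracy(neg_to_pos_ratio):
    """Accuracy of a classifier that always predicts the negative class."""
    return neg_to_pos_ratio / (1.0 + neg_to_pos_ratio)

au9_rate = positive_rate(22.9)             # about 4.2% of frames are activated
au25_rate = positive_rate(2.61)            # about 27.7% of frames are activated
au9_trivial = all_negative_accuracy(22.9)  # about 95.8% accuracy with no detection
```

For AU9, a model that never detects the action unit would still be right on roughly 96 frames out of 100, which illustrates how severe the imbalance is.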
We followed the same protocol adopted in previous studies [24], [32], which consists of a 3-fold subject-independent cross-validation that evaluates all methods on exactly the same data partitions. In all compared methods, the outputs of the model in DISFA and BP4D were represented as 8-component (AU1, AU2, AU4, AU6, AU9, AU12, AU25, and AU26) and 12-component (AU1, AU2, AU4, AU6, AU7, AU10, AU12, AU14, AU15, AU17, AU23, and AU24) vectors, respectively. Each component of the vector indicated whether the corresponding AU was activated or not. The only action units that were shared between the two datasets were AU1, AU2, AU4, AU6, and AU12. TABLE 3 indicates the action units considered in each dataset, along with the facial muscles involved and a brief description of the movement [31].
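The key property of a subject-independent split is that no subject contributes frames to both the training and the test partition of a fold. The sketch below illustrates this property only: the round-robin assignment of subjects to folds is an assumption made here, whereas the experiments reuse the fixed partitions from [24] and [32]. The subject identifiers are hypothetical.

```python
def subject_independent_folds(subject_ids, n_folds=3):
    """Return (train_indices, test_indices) per fold, split by subject."""
    subjects = sorted(set(subject_ids))
    # Assumed assignment: deal subjects to folds in round-robin order.
    fold_subjects = [set(subjects[k::n_folds]) for k in range(n_folds)]
    splits = []
    for held_out in fold_subjects:
        train = [i for i, s in enumerate(subject_ids) if s not in held_out]
        test = [i for i, s in enumerate(subject_ids) if s in held_out]
        splits.append((train, test))
    return splits

# Hypothetical frame-to-subject mapping (one entry per frame).
frame_subjects = ["S1", "S1", "S2", "S3", "S3", "S4", "S5", "S6", "S6"]
splits = subject_independent_folds(frame_subjects)
```

Every frame appears in exactly one test partition, and all frames of a given subject always move together.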
In DISFA, thresholding was used to convert intensity values to binary form, following the procedure reported in [24] and [32]. Samples with an original label of 2 or greater were assigned the 'activated' label (1), while samples with a label lower than 2 were assigned a 'non-activated' state (0). In BP4D, the occurrence labels for each AU were used. With regard to the output, our particular setting produced as a result the probability of activation for each AU. These probabilities were converted into a binary prediction by using a threshold set to 0.5.
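Both thresholding steps can be sketched in a few lines. The text does not state whether a predicted probability exactly equal to 0.5 counts as activated, so treating the boundary as inclusive is an assumption made here; the sample values are hypothetical.

```python
def binarize_intensity(intensity, min_active=2):
    """DISFA labels: intensity of 2 or greater is 'activated' (1), else 0."""
    return 1 if intensity >= min_active else 0

def binarize_probability(prob, threshold=0.5):
    """Model output: probability to binary prediction (boundary assumed inclusive)."""
    return 1 if prob >= threshold else 0

# Hypothetical intensity labels on the 0 to 5 scale and predicted probabilities.
labels = [binarize_intensity(v) for v in [0, 1, 2, 3, 4, 5]]
predictions = [binarize_probability(p) for p in [0.05, 0.49, 0.51, 0.97]]
```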
All our experiments were conducted on a computer equipped with a 13th-generation i7 processor with 128 GB of RAM and a single NVIDIA RTX 3090 GPU with 24 GB of memory, running Ubuntu 20.04.4 LTS. The required model implementations used Python 3.9 with version 2.0 of the open-source PyTorch library [30].

C. EVALUATION METRICS
The highly unbalanced nature of the datasets makes accuracy a misleading metric, as a model could achieve high accuracy by simply predicting the majority class in most cases. On the contrary, the F1 score simultaneously considers true positive, false positive, and false negative rates and it is more appropriate in unbalanced settings. The F1 score is defined as the harmonic mean between precision (number of correct positive predictions divided by the total number of positive predictions) and recall (number of correct positive predictions divided by the number of positive samples), which can be mathematically expressed as:

\[
F1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]

In this work, we used the macro-averaged F1 score, which is computed by using the arithmetic mean of the F1 score for the positive and negative classes, regardless of their support values.

TABLE 5. The best result for each AU is highlighted in bold, and the second best is indicated by using square brackets.
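The F1 score and its macro average over the positive and negative classes can be sketched as follows, using the equivalent count-based form F1 = 2TP / (2TP + FP + FN). The labels below are hypothetical and model a classifier that never predicts the activated class; returning 0 when a class has no true or predicted samples is a convention assumed here.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 of one class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed convention for empty classes

def macro_f1_binary(y_true, y_pred):
    """Arithmetic mean of the F1 scores of the positive and negative classes."""
    return (f1_for_class(y_true, y_pred, 1) + f1_for_class(y_true, y_pred, 0)) / 2

# Hypothetical frame labels: one activated frame out of ten, never detected.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
positive_f1 = f1_for_class(y_true, y_pred, 1)
macro_f1 = macro_f1_binary(y_true, y_pred)
```

In this toy case the accuracy is 0.9 even though the activated class is never detected, while the F1 score of the positive class is 0 and the macro average falls below 0.5.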

A. COMPARISON TO STATE-OF-THE-ART METHODS
The proposed method performed better than the average in all AUs for both datasets, except for AU6 in BP4D, where the F1 score, although very close to the average, is slightly below. In DISFA, our model achieved an average F1 score of 67.8% for the eight AUs, outperforming all previous studies reported in the comparison. Our approach showed the best performance for AU1 (Inner Brow Raiser) and AU26 (Jaw Drop). For AU2 (Outer Brow Raiser) and AU12 (Lip Corner Puller), our results were the second-best. However, our model's performance in AU9 (Nose Wrinkler) was quite far from the best score of 80.5%, achieved by EAC-Net. In general, the proposed model exhibits its best relative performance on the AUs that are positioned close to the brows and mouth regions, and the lowest on the areas around the cheek and nose. TABLE 5 shows the results obtained in BP4D. It can be observed that our model achieved an average F1 score of 65.5%, which equals the performance reported for ME-GraphAU. In this case, our method demonstrated the best results of the methods in the comparison for AU4 (Brow Lowerer), AU17 (Chin Raiser), and AU23 (Lip Tightener), and scored second-best for AU12 (Lip Corner Puller), AU14 (Dimpler), and AU15 (Lip Corner Depressor). Once again, the algorithm shows its highest relative performance on the AUs located near the regions of the brows and mouth, while exhibiting lower performance on the nose and cheek.
When we analyze the F1 scores in absolute terms, we notice a significant difference in performance achieved for different AUs, which is consistent across all datasets. In DISFA, the activation of AU25 seems the easiest to predict, while AU2 seems far harder. Similarly, results reported for AU10 and AU12 in BP4D are consistently better than those obtained for AU2, for example. A careful study of these differences suggests that the performance of the models increases with the number of available positive samples. The boxplots in FIGURES 3 and 4 show how the F1 score varies with the ratio of negatives to positives. In these figures, a box has been built for each AU, using the results obtained from each method in the comparison. The lower and upper quartile F1 scores are marked by the edges of the box, and the vertical line that splits the box in two represents the median. The whiskers extend outward from the box, but no further than 1.5 times the interquartile range, to the smallest and largest data points. It can be observed that higher performance is generally associated with lower ratios of negatives to positives. In addition, boxes are considerably smaller in BP4D, showing a higher consistency in the performance obtained by all different methods considered in the evaluation. These observations suggest that the heavy class imbalance has a negative impact on the results and there is significant potential for improvement in AUs with fewer activated samples.
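The construction of one box can be sketched with the Python standard library. The F1 scores below are hypothetical and do not correspond to any method in the comparison, and the quartile interpolation rule (the "inclusive" method of `statistics.quantiles`) is an assumption, since plotting libraries differ in how they compute quartiles.

```python
import statistics

def box_stats(scores):
    """Quartiles and whisker ends for one AU, using the 1.5 x IQR rule."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers stop at the most extreme data points that lie within the fences.
    inside = [s for s in scores if low_fence <= s <= high_fence]
    return {"q1": q1, "median": median, "q3": q3,
            "whisker_low": min(inside), "whisker_high": max(inside)}

# Hypothetical F1 scores (in %) of six methods for a single AU.
stats = box_stats([40.0, 60.0, 62.0, 64.0, 66.0, 68.0])
```

In this example the score of 40.0 falls below the lower fence, so it would be drawn as an outlier and the lower whisker stops at 60.0.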
For completeness of this study, FIGURES 5 and 6 show the confusion matrices for each AU in DISFA and BP4D, respectively. The numbers refer to the sample counts, and the intensity of the grey shade increases proportionally with the number of samples. It can be observed that they are consistent with the class imbalance reported in TABLES 1 and 2, and the ratio between the predicted labels is generally close to the ratio in the training samples. Overall, the model achieves a high accuracy in predicting negative samples, with an error rate that is relatively higher in BP4D. For positive samples, the accuracy rate is far lower and below 50% for some AUs (AU2 in DISFA and AU24 in BP4D), showing that the model has greater difficulty classifying positive samples due to the lower number of samples in the training sets.
Although accuracy values are less relevant due to the existing class imbalance already reported in Section IV-A, we also observe superior results to those reported in the state-of-the-art. Among the competing methods, accuracy values are only reported for EAC-Net [20] and JÂA-Net [32]. TABLES 6 and 7 compare the accuracy values for all AUs in DISFA and BP4D, respectively. The best results for each AU are marked in bold, and the second best are indicated by using square brackets. Our results are considerably better than the ones reported for EAC-Net, both on average (80.6% and 75.2% in DISFA and BP4D, respectively) and for each individual AU. Moreover, our results are also better than, although somewhat closer to, the ones reported for JÂA-Net [32] (94.0% and 78.6% in DISFA and BP4D, respectively). The higher accuracy values in the DISFA dataset are consistent across all different methods, mainly due to the higher class imbalance already reported in Section IV-A.

B. ABLATION STUDY
To investigate the contribution and impact of the two different feature extraction components used in the architecture, we conducted an ablation study on the DISFA and BP4D datasets. In particular, we compared the results obtained when extracting features by using only the IR50 backbone, when using only landmark features that were extracted by using MobileFaceNet, and when using the full model.
The results are shown in TABLE 8. As can be observed, they are consistent across the two datasets. The features extracted by using the IR50 backbone are more effective than the landmark features extracted by using MobileFaceNet. Adding landmark features only yielded marginal improvements to the results. This suggests that the landmark extractor could be removed in scenarios with constrained inference times, e.g. IoT, without considerably compromising the general performance of the POSTER model.
Finally, we have studied the total number of parameters (Params) and floating-point operations (FLOPs) of each model to evaluate their computational and memory complexity. We only compare our proposed model with the methods in TABLES 4 and 5 that have made their implementation available, namely FAUDT [15], ME-GraphAU [24], and JÂA-Net [32]. As ME-GraphAU allows using different backbone models to extract image features, we considered the one that uses a Swin Transformer base. To determine the FLOPs, we used the PyTorch library ptflops [33]. TABLE 9 shows the number of parameters, FLOPs and F1-scores achieved in DISFA and BP4D, for the four models mentioned above. We do not provide the FLOPs value for FAUDT, as its implementation is based on TensorFlow and the calculation of FLOPs is unreliable. Overall, POSTER-AU exhibits the lowest value of FLOPs. Moreover, it demonstrates a reduction of over 20 million parameters compared to ME-GraphAU. Despite this, POSTER-AU achieves similar performance on BP4D and surpasses performance on the DISFA dataset. JÂA-Net has the lowest number of parameters, but it has 1.8 GigaFLOPs more than POSTER-AU.

VI. CONCLUSION
The research reported in this paper demonstrates the promising potential of utilizing architectures originally proposed for facial expression recognition in the context of action unit detection. In particular, when the proposed method is used to adapt the POSTER architecture to the action unit detection problem, our results outperform state-of-the-art techniques across a wide range of representative methods on the DISFA and BP4D datasets. These successful results suggest that further effort should be placed into studying possible adaptations of existing models for emotion detection, as both problems are closely related and base features performing well in one task are expected to perform well in the other. In addition, the reported results suggest that transformer-based architectures and positional attention mechanisms are highly appropriate for tackling the action unit detection problem. The nature of the transformer allows it to explore relationships between regions that are spatially distant [29]. At the same time, the implementation of guided attention directs the network to focus on crucial face regions related to the activation of action units, such as the eyebrows and mouth. These elements also benefit from crop and alignment operations that attempt to ensure that patches correspond to specific facial areas. These pre-processing operations facilitate the transformer's acquisition of positional knowledge regarding patches and enable the network to gather contextual information about each patch.
Globally, the proposed model performed better than the recent competing approaches described in the literature, both in terms of F1 score and accuracy. The proposed model performed particularly well on the AUs located around the eyebrows and the mouth while performing worse than other approaches in the region surrounding the cheeks and the nose. The ablation study has also revealed that the landmark features contribute only marginally to the F1 score. These results expand the potential applicability of the model to scenarios that demand lightweight components, reducing inference times at a minimal performance cost.
It has been observed that, as a general rule, all methods exhibit better behavior when classifying samples belonging to the class with the higher number of samples. This strengthens the argument that more comprehensive and balanced datasets could enhance classification outcomes, and more work is required in this direction. It has also been noticed that certain architectures behave particularly well for specific AUs, producing significantly better results than the average outcomes of all methods. Examples of this are EAC-Net [20] in AU9 or LGR-Net [11] in AU2/AU26, in the DISFA dataset; and also, AU1 and AU2 in SEV-Net [39] on BP4D. These differences are more prominent in datasets with a high level of class imbalance, indicating the potential benefits of hybrid models that are able to leverage the unique strengths of each architecture for detecting individual AUs.
Future work will be oriented towards automating the adaptation process, including decision-making about retraining. We also plan to explore more effective ways of improving the resulting models. This includes investigating improved methods for capturing spatio-temporal dependencies and integrating landmark features in a more productive manner. Another aspect that is worth considering relates to the relatively large performance differences of different models across all AUs. They suggest that classification results could be significantly improved by combining various existing models and leveraging their strengths in relation to specific AUs.