Across the Universe: Biasing Facial Representations Toward Non-Universal Emotions With the Face-STN

Facial expression recognition, as part of an affective computing system, usually relies on solid performance metrics to be successful. These metrics depend significantly on the affective context in which one evaluates this system. While presenting excellent performance on the dataset it was trained on, a facial expression recognition model might drastically fail when one assesses it in a different scenario. Such performance reduction occurs because most facial perception models rely on an extreme generalization concept, focusing on a universal emotion perception system. With the recent findings on the non-universality of emotional perception, generalization of facial encoders seems not to be the optimal path to take. Therefore, exploiting transfer learning toward adapting specific facial features to specific scenarios could address this problem. This paper proposes and investigates a Spatial Transformer Plugin (STN) to rearrange different facial encoders towards particular affective representations from different scenarios. We experiment with our model in eight different facial expression recognition datasets (AffectNet and the derived MaskedAffectNet, OMG-Emotion, FERPlus, ElderReact, EmoReact, FABO and JAFFE datasets) and obtain competitive performance with much less training effort than state-of-the-art models. Besides performance alone, we introduce the STN as a mechanism towards a non-universal emotional perception system and discuss how it rearranges learned perception features to address some specific characteristics of each investigated dataset.


I. INTRODUCTION
One of the key factors in understanding the lack of adaptability in current automatic facial expression recognition systems comes from the categorization of affect itself. For a long time, the concept of a universal understanding of emotions [26] guided the development of facial expression recognition (FER) systems. The notion that any person in the world can identify one out of six basic emotions independently of their cultural background made the task of labelling and categorizing facial expressions easier [72]. This led to a plethora of artificial systems trained and validated over these predefined concepts. Even affect categorization specializations, such as the popular dimensional arousal and valence models [46], continue to rely on generalizing affect to claim great performance on emotion expression recognition. This has become even more evident in a recent publication by Coen et al. [22] where researchers analysed the presence of predefined affective expressions over millions of YouTube videos from all over the world. However, as discussed by Lisa Feldman [29], Coen et al. trained and evaluated their automatic perception model based on a set of predefined and unchangeable emotional concepts, leading their neural network to classify what it was trained to do: sixteen known emotional expressions. The ability of such models to recognize affect in any given scenario is therefore restricted to scenarios similar to those with which the neural network was trained.
The associate editor coordinating the review of this manuscript and approving it for publication was Kah Phooi (Jasmine) Seng.
The problem of adaptability is more evident when one deploys these models in real-world scenarios, such as the recent applications in social robots [69]. Evaluating them in cross-dataset experiments (which simulate their application in different scenarios) usually decreases FER performance drastically unless it is accompanied by a computationally heavy fine-tuning or readaptation routine [58]. In most cases, this lack of adaptability is attributed to different input characteristics and pre-processing, or to the label distribution. In reality, the problem could lie in the task itself. Recent findings on affect categorization show that emotion perception might not be as universal as we have been led to believe [39], [40], [45].
These recent investigations discuss how interpreting affect comes directly from our understanding of the world; each person, based on their own expertise, has a certain way of expressing and recognizing affect [35]. In other words, every person sees and understands affect differently, and we adapt and converge towards a known interpretation while interacting. In this regard, each of us carries our own affective perception world: unique and constantly updating.
Translating this view into affective computing, specifically into the development of heavily supervised learning models, we hypothesize that the general understanding of how a facial expression can be categorized is represented by the labelling procedure. Whoever chooses the labels of a dataset gives that specific contextual scenario (i.e., all the data samples that compose the dataset) a unique understanding of affect. A model trained and evaluated on such a dataset can achieve impressive performance, but this performance must be interpreted within the dataset's own constrained characteristics, in particular its labelling decisions.
In this study, we address the problem of adapting facial expression recognition by proposing a spatial transformer network (Face-STN) plugin layer that is trained to specify high-level affective features of given facial encoders. Unlike traditional fine-tuning and transfer learning, our proposed model is a light-weight neural layer that can be trained with very little effort to improve facial expression recognition. Our Face-STN leverages the capability of visual transformers to learn specific characteristics of visual representations [31], [60], but focuses on learning the specific affective characteristics of each evaluated scenario.
We evaluate our Face-STN model on specifying the feature representations learned by three different encoders: the strongly supervised FaceChannel [6]; the semi-supervised PK, a Generative Adversarial Network [7] that learns facial representations based on self-supervised image reconstruction; and a Contrastive Predictive Coding Network [65] that learns representations by contrasting the specific facial characteristics present on images from the same affective category.
To represent different affective scenarios in the evaluation of our model, we use eight different datasets in all our experiments. First, we run a baseline scenario for transfer learning and fine-tuning, training the encoders on the one million images drawn from the internet that compose the AffectNet dataset [61]. We then evaluate how our model compares with traditional transfer learning in seven different scenarios: the monologues and individualized expressions from the OMG-Emotion dataset [5]; the internet-crawled images labelled with an affective distribution from the FERPlus dataset [10]; the ElderReact dataset [56], with recordings of elderly persons; the EmoReact dataset [64], with facial expressions from children; the images from the FABO dataset [33], which are composed of acted facial expressions; and the Japanese-woman-only expressions from the JAFFE dataset [54]. Finally, to investigate a constrained interaction scenario with partial facial expressions, we use MaskedAffectNet [9], a version of the AffectNet dataset that we recently presented, in which all the images have facial masks added.
To provide a complete understanding of the impact of our Face-STN plugin, we also run a feature analysis that compares the learned representations of each encoder in different transfer learning scenarios and when trained with the plugin. We discuss how our solution compares to existing state-of-the-art results in terms of FER performance for each of these datasets, and how our approach leverages the non-universal affective perception theory to provide a competitive FER solution in most of the evaluated scenarios. Our results show that, contrary to what is commonly believed, training these models with more data achieves better feature recombination and a better decision boundary for each specific task, but does not lead to better feature representation, which is directly related to the performance stagnation observed when applying full-network fine-tuning. Our dataset-specific models achieve competitive performance when compared with complex models, validating our investigation on biasing facial features towards specific tasks. We proceed with an in-depth discussion, through facial feature visualization, of how the differences in deep learned facial expressions depend mostly on the dataset chosen to train and evaluate the model, which represents a specific task. We conclude by connecting our observations with the non-universal expressions theory, exemplifying the impact that each affective scenario has on learning emotional representations, and arguing that an artificial system able to deal with different scenarios needs fast readaptation on a decision-making level.
Our paper is organized as follows: in Section II, we introduce our related work and situate the reader with respect to the facial expression adaptability problem; in Section III, we introduce the encoders, detail their implementation, and propose the Face-STN model with all its training and updating mechanisms; Section IV describes our evaluation and experimental effort, followed by Section V, which presents all our results. We discuss our findings in Section VI, and finally conclude the paper in Section VII.

II. RELATED WORK AND IMPORTANT ARGUMENTS
Separating facial representation from affect understanding is somewhat intuitive, and in fact has been addressed by several solutions in the past [13], [24]. Most of these have taken this path due to technological limitations. Once convolutional neural networks became a universal solution for data representation, researchers began giving most of their attention to end-to-end learning of facial expressions [38]. In this section, we present our view on how end-to-end affective perception goes against the concept of non-universality of emotion representation, and thus presents a limitation on automatic facial expression recognition. We also consider how the current solutions that claim adaptability and transfer learning do not address the problem properly: they provide a shallow layer as a solution instead of dealing with the root of the problem.
VOLUME 10, 2022

A. THE GENERAL END-TO-END AFFECTIVE PERCEPTION
Most of the current state-of-the-art solutions for automatic facial expression recognition (FER) claim to have addressed the problem of global FER by approaching maximum generalization [8], [49], [63]. The majority of these approaches deploy the computational power of artificial neural networks, boosted by data-driven deep learning of faces. The modus operandi of these solutions is to use millions of examples to tune these networks to extract specific facial features that represent and categorize affect. Unfortunately, the learned facial features are biased owing to the very specific scenarios represented by the datasets on which these models are trained and validated. In most of these models, the learned features are comparable with existing human-made modelling such as the Facial Action Units [21], [25], [27]. Coupled with a case-specific good performance, these Units are being perceived as good candidates for a general facial representation system.
The problem these models face when deployed in different or socially constrained scenarios appears when combining these representations into affective categories [51]. Most of these models, mostly for convenience and data availability, categorize affect using standard representations, whether by means of a strict set of categories or dimensional pleasure/arousal/dominance scales. They are not only limited to representing the facial features that are present in the training data, but are also sensitive to categorizing such features based on the given affective labels. Such labels are usually obtained using a transitive bias of giving instructions based on constrained options: the already specified set of categories, or the predefined boundaries for dimensional scales. The generalization of the trained model is thus bounded by the capability of the labelling procedures.

B. THE RELATION BETWEEN FACES AND CONVOLUTIONAL-BASED RECOGNITION MODELS
If we minimize the bias of a pretrained model for affective categorization by providing a computationally light and effective adaptation mechanism that focuses on the representation of facial structures, we could increase the adaptability of affective recognition models in different scenarios. Faces change little. The positions of eyes, mouth, and cheeks will always be relatively close to each other [78]. Their representations, unlike an affective category, are universal [28], [78]. A healthy person will detect different facial structures [75] even on non-facial images [42], which makes them easy to identify and adapt. This is the case for most facially constrained interactions, such as when participants use facial masks. Most current convolution-based emotion expression recognition solutions (the most common ones) already rely on a general facial representation [51], even if implicitly learned by strongly supervised end-to-end learning. Once we can present a soft separation between these facial representations and the affective categorization, it would be much easier to recombine their meaning into a unique world understanding of affective category.
FIGURE 1. Facial features are differently exposed when expressing affect while using a mask.

C. THE ARGUMENT OF ADAPTABILITY
When deployed in scenarios that are different from the ones for which they were tuned, most recent affective perception models have difficulty performing and even adapting, given that deep neural networks are known to be resource-intensive and data-hungry [80]. These models are thus extremely biased towards their application, and they are most often difficult to adapt to specific scenarios [73]. Scenarios without popular interest, and those that do not provide large amounts of available or labelled data, are underrepresented world views. One of these scenarios, now in strong evidence given the COVID-19 pandemic, is when social interactions are constrained by the use of personal protective equipment such as facial masks. As most of these neural networks learn how to recognize affect based on a collection of facial features, when some of these features are absent, which is the case when using a mask (illustrated by Figure 1), these models tend to fail [2]. This effect is also observable, albeit on a smaller scale, in humans. However, due to our capability of changing the way we recognize emotions when seeing a partially covered face [57], [76], we learn to compensate much better than any deep learning system.

D. THE CURRENT PROBLEMS ON PERSONALIZING AFFECT
The models that get closest to the concept of a strong separation between facial representation and affective understanding are the ones that claim to provide personalized perception. In these models, the affective concept usually is specialized to a single individual, or a group of individuals that share the same contextual background [67]. Such models generally rely on strong feature representation and on mechanisms to specify features away from the initial affective estimation [70]. Most current solutions focus more on an auditory representation of affect [18], [19], [47]. Although it may be easy to assume that this happens due to the availability of personal auditory information, the reality is different: convolutional neural networks have become experts on image representation, while still struggling to represent auditory features, and specifically speech [53]. Most convnets that deal with speech are extremely complex, and not easily usable without access to very specific and powerful hardware [1]. Representing speech, and auditory signalling in general, therefore typically occurs with traditional feature extractors that hinder an end-to-end learning approach and facilitate a strong separation between signal representation and affect categorization.
When applied to facial expression recognition, the few models that approach personalization focus on overspecifying the learned features to unique persons [7], [20]. Recently, an attempt was made to separate facial expression representation from affective understanding [4], but the proposed model relied on the unique world view of a specific dataset to accomplish both feature representation and affective understanding. Adapting it towards a universal feature representation would demand retraining the entire neural network, and thus adaptability and representation transfer are not feasible.

III. BIASING FACIAL EXPRESSION REPRESENTATION WITH THE FACE-STN
There exist many facial representation models, most of which are based on convolutional neural networks (ConvNets). Hierarchical representation of a ConvNet allows the representation of facial features to emerge within the network layers [14], [62]. The typical facial representation learned by these networks resembles human-made Facial Action Units, which measure different muscle movements to describe a facial expression [43], [55].
Most of these models rely on explicit supervision, coming in the form of a given label, to learn feature maps that represent faces. This process specifies the features towards that specific unique world, represented by the datasets and associated labels that the model is trained with. Other solutions focus on learning facial representations through implicit supervision, such as in the case of convolutional autoencoders [68], [79], and most recently Generative Adversarial Networks [15], [52], [77].
Most of these solutions bias facial expression representation models towards a unique affective world representation both in the data distribution and presentation, as well as in the labelling process. In this way, most of these models remain very difficult to adapt towards a novel scenario.
To perform a complete analysis of facial representation, we investigate how different learning schemes contribute to the emergence of facial representations. In this regard, we investigate the FaceChannel [6], a ConvNet trained with explicit labels; the Prior-Knowledge Generative Adversarial Network (GAN), part of the P-AffMemory model [7], that learns facial representations by identifying real and generated faces; and a novel facial representation based on a Contrastive Predictive Coding network [65], that learns to represent faces based on reconstructing the latent representation space itself. Each of these models implements convolutional layers to highlight facial features, and we are interested in investigating the similarities of such features and how we may reuse them in different affective world representations.

A. THE IMPACT OF EXPLICIT LABELS WITH THE FaceChannel
The FaceChannel is a recently proposed convolutional neural network with a light-weight architecture that implements inhibitory layers to improve facial expression representation. It has a total of 2 million parameters, allowing it to be trained from scratch while making it easily adaptable to other tasks. Our implementation of the FaceChannel has 10 convolutional layers, the last of which is implemented as a shunting inhibitory layer [30], and 4 pooling layers. An inhibitory neuron $S_{nc}^{xy}$, present at position $(x, y)$ of the $n$-th receptive field in the $c$-th layer, is defined as:
$$S_{nc}^{xy} = \frac{u_{nc}^{xy}}{a_{nc} + I_{nc}^{xy}}$$
where $u_{nc}^{xy}$ is the activation of the convolution unit, in our case ReLU, and $I_{nc}^{xy}$ is the activation of the inhibitory units. The passive decay term $a_{nc}$ is also updated during training and is shared among each inhibitory filter.
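The shunting inhibition described above can be illustrated with a minimal sketch in plain Python. It assumes that the inhibitory activations are non-negative and that the passive decay term is a positive scalar, so the denominator never reaches zero. The function names and the toy values are illustrative only and are not taken from the FaceChannel implementation.

```python
def relu(value):
    """Rectified linear unit applied to a single value."""
    return value if value > 0.0 else 0.0


def shunting_inhibition(conv_map, inhib_map, decay):
    """Divide each excitatory activation by (decay + inhibitory activation).

    conv_map  : 2-D list with the pre-activation responses of the convolution unit
    inhib_map : 2-D list with the activations of the inhibitory unit (same shape,
                assumed non-negative)
    decay     : passive decay term, one positive scalar shared by the whole filter
    """
    output = []
    for conv_row, inhib_row in zip(conv_map, inhib_map):
        output.append([relu(u) / (decay + i) for u, i in zip(conv_row, inhib_row)])
    return output


# Toy 2x2 feature map: strong inhibition scales a response down,
# and negative convolution responses are zeroed by the ReLU.
conv = [[2.0, -1.0], [0.5, 4.0]]
inhib = [[1.0, 1.0], [0.0, 3.0]]
inhibited = shunting_inhibition(conv, inhib, decay=1.0)
print(inhibited)
```

In a trainable layer the decay term would be a learned parameter; here it is fixed by hand to keep the example self-contained.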
For training, the FaceChannel adds, after the convolutional layers, a fully connected hidden layer with a ReLU activation function. This layer is followed by an output layer that implements the direct label decision of the network: a set of neurons with a SoftMax activation for categorical classification, or with a linear activation for a continuous and dimensional representation of affect.
The convolutional layers of the FaceChannel demonstrate the capability of learning different facial representations based on the dataset with which it was trained [6]. The changes are modulated directly from the output layer; the facial representation reflects the labelling distribution of the dataset with which the network is trained. In our investigations, we are interested in understanding the strength of this modulation, and how different the learned feature representation is when training this model with faces collected from different scenarios. Figure 2 illustrates the facial representation layers of the FaceChannel.

B. THE IMPACT OF A RECONSTRUCTION ERROR WITH THE PRIOR-KNOWLEDGE GAN
The Prior-Knowledge (PK) model is an autoencoder that learns facial representations by applying an adversarial training routine between real and generated faces. It implements a controllable term that allows the change of affective characteristics, in terms of continuous arousal and valence, to the decoded faces. Besides the encoder and decoder/generator architecture ($E$ and $G$ respectively), the PK also implements three discriminators: the arousal/valence enforcer ($D_{em}$), the discriminator that guides the latent representation to follow a uniform distribution ($D_{prior}$), and the adversarial discriminator that ensures the decoded image contains the desired affective information ($D_{real}$). The PK receives an image ($x$) and a continuous arousal/valence label ($y$) as an input and produces a facial latent representation ($z$) as well as an edited image expressing the chosen arousal/valence ($x_{gen}$).
The encoder architecture ($E$) is implemented as four convolutional layers and one fully connected output layer. The decoder ($G$) implements the same structure, though inverted. We do not apply pooling; we use strided convolutions with a stride of 2 to provide a dimensional reduction and reduce the network's number of trainable parameters. The encoder maps an RGB image into a latent representation ($z$), which is then fed, concatenated with the desired affective label ($y$), to the decoder. The entire autoencoder is trained with an image reconstruction loss ($L_{rec}$) using mean squared error.
The affective information discriminator ($D_{em}$) guides the encoder to learn facial representations. Recent experiments show that, without this discriminator, the network learned facial characteristics that carry no affective content, such as hair and eye colour [7]. It is implemented as two fully connected hidden layers followed by two linear neurons: one for arousal and one for valence. It is trained using a mean-squared error loss ($L_{em}$).
The uniform distribution discriminator ($D_{prior}$) enforces the latent facial representation ($z$) to be uniformly distributed. It proved important for increasing the generalization of the model, and for helping to impose the affective features within the latent representation [7]. It implements four fully connected layers, and it is trained using an adversarial loss ($\min_{E} \max_{D_{z}} L_{prior}$) between the original distribution of $z$ and an artificial uniform distribution $p_{prior}(z)$.
Finally, the last discriminator ($D_{real}$) imposes the photo-realistic characteristics and enforces that the affective labels ($y$) are present on the generated images. It implements four convolutional layers, which receive the generated image ($x_{gen}$), followed by two fully connected layers. Each of the convolutional layers also receives the desired affective information ($y$), to enforce that it is present on the generated image. This discriminator is trained using an adversarial loss that implements a mean-squared error on the original image and the generated one ($\min_{G} \max_{D_{img}} L_{img}$).
Previous experiments with the PK demonstrated that generated images carried affective information, but did not maintain the personal identity [7]. To solve this, we implemented an identity-preserving loss ($\min_{E,G} L_{iden}$) on the reconstructed image. This loss is computed between the original image and the generated one by using the mean-squared error from the last layer representations from a pretrained VGG face [16] encoder.
As is typical for GANs, the PK is quite a sensitive model to train, and the impact of each of these losses was defined based on a grid-search focused on minimizing a total loss. The coefficients $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ and $\lambda_5$ served as a balance between the individual loss terms. Figure 3 illustrates the final architecture of the PK with all the parameters.
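One plausible form of such a total loss is sketched below, assuming a plain weighted sum with one coefficient per loss term introduced in this section. Both the pairing of each coefficient with a specific loss and the use of a fifth coefficient for the identity-preserving loss are assumptions made here for illustration.

```latex
L_{total} = \lambda_1 L_{rec} + \lambda_2 L_{em} + \lambda_3 L_{prior} + \lambda_4 L_{img} + \lambda_5 L_{iden}
```

Under this reading, the grid-search tunes the coefficients so that no single discriminator dominates the updates of the encoder and the generator.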

C. THE IMPACT OF LATENT REPRESENTATION PREDICTION WITH CPC
Contrastive Predictive Coding (CPC) [65] is a recent self-supervised model that learns to predict the entangled representations of sequences of input stimuli using autoregression. It applies a contrastive InfoNCE loss [65] to enforce a data representation which maximizes the reconstruction of future stimuli. For that, it uses an encoder ($E$) to learn the representation of an image ($i$) from a sequence ($T$) of observed stimuli ($x_i$) and output a latent state ($z_x = E(x_i)$), and an autoregressive neural network ($A$) that integrates a sequence of latent representations ($w \leq T$) into a temporally contextual latent representation ($C_w = A(z_w)$). This context representation is then used to predict the next element in the sequence ($x_k$).
Unlike traditional generative models, which focus on learning a representation that is useful to generate or reconstruct the original stimuli, CPC focuses on encoding information that is present in the data sequence. For that, it predicts future stimuli by using a log-bilinear model of a density ratio ($f_k$) between the perceived stimuli sequence ($x_w$) and the contextual latent representation ($C_w$):
$$f_k(x_{t+k}, C_t) = \exp\left(z_{t+k}^{\top} W_k C_t\right)$$
where $W_k C_t$ is a linear transformation used for the prediction of the next element in the sequence. The entire network is trained to optimize $f_k$ by distinguishing the density ratio between positive and negative samples. Thus, the CPC model learns in a self-supervised manner, estimating the labels directly from the density ratio. As we are interested in learning affective information from the facial expressions, we create positive and negative examples directly from the training data distribution, by clustering samples from the same categorical, or dimensional, representation into positive samples. We implement our encoder as a series of convolutional layers, and the autoregressor as a GRU network. The entire architecture, with the detailed parameters, is illustrated in Figure 4. Because the learning of the representations in a CPC network is made directly on the latent representations themselves, it requires neither many training epochs nor many examples, as demonstrated by its recent applications in the representation of phonemes [36] and EEG signals [3].
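The contrastive objective can be made concrete with a minimal sketch in plain Python of an InfoNCE-style loss over one positive and several negative latent vectors. It assumes a log-bilinear score of the form z^T W c, as in the original CPC formulation [65]. The function names, the identity matrix used for W, and the toy vectors are illustrative assumptions and not the configuration used in our experiments.

```python
import math


def bilinear_score(z, W, c):
    """Log of the density ratio: z^T W c for latent z, matrix W and context c."""
    Wc = [sum(w_ij * c_j for w_ij, c_j in zip(row, c)) for row in W]
    return sum(z_i * wc_i for z_i, wc_i in zip(z, Wc))


def info_nce(z_pos, z_negs, W, c):
    """Negative log-probability of picking the positive among all candidates."""
    scores = [bilinear_score(z_pos, W, c)]
    scores += [bilinear_score(z_neg, W, c) for z_neg in z_negs]
    # Numerically stable log-sum-exp over the positive and negative scores.
    top = max(scores)
    log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum - scores[0]


# Toy example: the context points along the first latent dimension.
W = [[1.0, 0.0], [0.0, 1.0]]      # hypothetical prediction matrix
context = [1.0, 0.0]              # hypothetical context vector
positive = [2.0, 0.0]             # e.g. a sample from the same affective category
negatives = [[0.0, 1.0], [-1.0, 0.0]]

aligned_loss = info_nce(positive, negatives, W, context)
swapped_loss = info_nce(negatives[1], [positive, negatives[0]], W, context)
print(aligned_loss, swapped_loss)
```

The loss is low when the positive sample scores higher than the negatives under the current context, and grows when a negative sample is scored higher instead.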

D. THE FACE-STN
One of the most common strategies when adapting facial expression recognition towards different scenarios is to retrain, entirely or partially, a neural network such as any of the three models we introduced in the previous sections. This enforces that the affective information, from both facial representation and emotional categorization, is somehow depicted by both the convolutional channels and the decision-making layers. The problem when readapting this network towards a novel scenario is that both the facial representation and the decision-making layers carry an inductive bias from the dataset the model was originally trained with. So, if the new scenario carries any similarity with the originally trained dataset, the tendency is that the emotion recognition performance improves; however, when the scenario is very different, a new and expensive training scheme is usually necessary to achieve a good performance. In doing so, we also change the entire affective representation present in the network and make it less likely to cope with other scenarios.
Spatial Transformer Networks (STNs) have recently been used to learn specific facial characteristics that help with emotion expression recognition on specific datasets [31], [60]. An STN relies on a localization convolutional neural network that learns specific image transformation parameters, biased by the strongly supervised learning. For faces, making an STN learn different affine transformations can help it to identify specific geometrical characteristics of faces, which might be unique for each dataset.
Our Face-STN is composed of a set of convolutional layers, usually referred to as the localization network ($L$), and receives as input the feature maps ($F_i$) from the convolutional encoder. STNs often process the input image, but as faces do not change much, and the convolutional channels of the encoders are known to depict facial information, applying the Face-STN directly to the feature maps allows us to recombine the learned facial features to deal with the characteristics of a specific dataset. This also allows the training of the Face-STN with less data, as it has to learn useful transformations based on already processed input stimuli, and not on the raw image. The role of the Face-STN is to learn the parameters $T_{\theta}$ that perform the affine transformations on the feature maps. After the set of convolutional layers of the Face-STN, a grid generator ($G$) is used to apply the transformations to different patches of the feature maps. A bilinear sampling kernel ($S$) [41] is then used to select transformations from the grid generator and use them as an additional input to the decision-making of the encoder. Our Face-STN is then trained in a supervised manner, together with the decision-making of each of the encoders.
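The grid generator and the bilinear sampler can be illustrated with a minimal sketch in plain Python on a single-channel feature map. It assumes the usual STN convention of normalised coordinates in [-1, 1] and zero padding outside the map. The affine parameters are set by hand here, whereas in the Face-STN they would be predicted by the localization network. The helper names `affine_grid` and `bilinear_sample` are stand-alone functions written for this illustration.

```python
import math


def affine_grid(theta, height, width):
    """Grid generator: map every output position to a source coordinate.

    theta is a 2x3 affine matrix acting on normalised coordinates in [-1, 1].
    Both height and width are assumed to be at least 2.
    """
    grid = []
    for r in range(height):
        y_t = -1.0 + 2.0 * r / (height - 1)
        row = []
        for c in range(width):
            x_t = -1.0 + 2.0 * c / (width - 1)
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            row.append((x_s, y_s))
        grid.append(row)
    return grid


def bilinear_sample(fmap, grid):
    """Sampler: read the feature map at the grid positions with bilinear weights.

    Positions that fall outside the feature map contribute zero.
    """
    height, width = len(fmap), len(fmap[0])
    output = []
    for grid_row in grid:
        out_row = []
        for x_s, y_s in grid_row:
            # Convert normalised coordinates back to pixel indices.
            px = (x_s + 1.0) * (width - 1) / 2.0
            py = (y_s + 1.0) * (height - 1) / 2.0
            x0, y0 = math.floor(px), math.floor(py)
            value = 0.0
            for yy in (y0, y0 + 1):
                for xx in (x0, x0 + 1):
                    if 0 <= yy < height and 0 <= xx < width:
                        weight = (1.0 - abs(px - xx)) * (1.0 - abs(py - yy))
                        value += weight * fmap[yy][xx]
            out_row.append(value)
        output.append(out_row)
    return output


feature_map = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # leaves the map unchanged
shift = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]      # reads one column to the right

unchanged = bilinear_sample(feature_map, affine_grid(identity, 3, 3))
shifted = bilinear_sample(feature_map, affine_grid(shift, 3, 3))
print(unchanged)
print(shifted)
```

Because the sampling weights are piecewise linear in the affine parameters, gradients from the decision-making layers can flow back to the network that predicts them, which is what makes supervised training of such a plugin possible.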
As the Face-STN is applicable to any convolutional connection to the encoder network, we conducted an exploratory experiment to identify where we want to apply it. The results of this experiment show us that using the feature maps from the last channel of each encoder as input to the Face-STN allowed the best ratio between the number of training parameters and the performance of the network. As such, we illustrate the final Face-STN architecture in Figure 5.

IV. EVALUATING THE LEARNED FACIAL REPRESENTATIONS AND ADAPTATION MECHANISMS
In our evaluations, we want first to investigate the role of traditional fine-tuning and transfer learning mechanisms on learning affective information from faces. Second, we want to evaluate the impact of the Face-STN on biasing the deep visual representations towards affective information. And finally, we want to contrast all these approaches, not only in objective performance terms, but also by visualizing the learned representations.
We have divided our experiments into three settings: first, we run a baseline study to establish the best architectural design of each proposed encoder. We do this by training, evaluating, and fine-tuning them using the AffectNet dataset. Our second experimental setting consists of investigating the capability of each facial encoder to represent the unique characteristics of each dataset and to evaluate if the learned representations can be transferred from one affective scenario to another, with traditional transfer learning and fine-tuning methods. For that, we contrast four training routines: first, we train the entire encoder and the decision-making layer (All layers). Second, we train the last convolutional layer of the encoder and the decision-making layer (Last Conv-Layer). Third, we train only the decision-making layer (Decision-Making), and fourth, we train the entire network from scratch (Scratch).
We then attach the Face-STN plugin to each encoder, and train them, together with the decision-making layer, on all the datasets. This way we can compare the impact of the Face-STN alone with all the other fine-tuning and transfer learning routines. Besides our baseline investigations, we also compare our performance results with existing state-of-the-art models for each dataset. This allows us to evaluate the overall performance of our proposed model, and its impact on the field of facial expression recognition.
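The training routines contrasted above differ only in which parts of the network are updated and in whether the encoder starts from pretrained weights. The sketch below summarises this in plain Python. The parameter group names are hypothetical, and the assumption that the pretrained encoder stays frozen when the Face-STN is attached is our reading of the description, stated here explicitly.

```python
# Hypothetical parameter groups; the names are illustrative only.
GROUPS = ("encoder_early_convs", "encoder_last_conv", "face_stn", "decision_making")

# routine name -> (encoder starts from pretrained weights?, trainable groups)
ROUTINES = {
    "All layers": (True, {"encoder_early_convs", "encoder_last_conv", "decision_making"}),
    "Last Conv-Layer": (True, {"encoder_last_conv", "decision_making"}),
    "Decision-Making": (True, {"decision_making"}),
    "Scratch": (False, {"encoder_early_convs", "encoder_last_conv", "decision_making"}),
    # Assumption: with the plugin attached, the pretrained encoder is kept frozen.
    "Face-STN": (True, {"face_stn", "decision_making"}),
}


def trainable_flags(routine):
    """Return, for one routine, which parameter groups receive gradient updates."""
    _, trainable = ROUTINES[routine]
    return {group: group in trainable for group in GROUPS}


for name in ROUTINES:
    print(name, trainable_flags(name))
```

In a deep learning framework, these flags would typically be translated into freezing the corresponding layers before building the optimizer.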
For each setting, we propose and explain a series of experiments in the following writeup. We also present specific metrics for each dataset. Each of the used datasets has a unique characteristic, either regarding the image selection and processing, or the affect representation, or both.

A. UNIQUE AFFECTIVE DATASETS
Each of the datasets we use in our experiments (illustrated in Figure 6) has specific characteristics, which include image selection and processing, labelling strategy, and data distribution. We also derive a unique decision-making layer for each, illustrated in Figure 13. We individually optimized these as described in our Appendix A. The decision-making layers are connected to the encoder of each model to provide the best performance. The sections below present each dataset, its unique characteristics, and information about how it was evaluated.
AffectNet [61] is our main baseline and comparison point. It has over 1 million images drawn from the internet, with half of them manually annotated using Mechanical Turk. Each image has a single label based on continuous arousal and valence values. The dataset provides specific training and validation subsets, and we use the concordance correlation coefficient (CCC) [48] for arousal and valence between the models' predictions and the true labels as a performance metric. The AffectNet images are provided as centred, cropped faces. This forces the encoders to learn facial representations from a large variety of faces with a very predictable facial structure, which, together with a good data distribution, explains why AffectNet is mostly used to train facial expression encoders for other tasks [51]. The labels, although attached to images crawled from the internet, were collected based on given concepts of arousal and valence and thus follow a very specific rule, which makes the dataset suitable for benchmarking automatic facial expression recognition models. A simple decision-making layer, composed of fully connected units and two linear output heads, one for arousal and one for valence, provided the best results in our exploratory experiments.
The FER+ [10] dataset contains around 31,000 greyscale face images crawled from the internet. Each image has a small resolution of 48 × 48 and shows a centred and cropped face. To label the images, a crowd-sourced strategy was used, where each labeller chose one out of seven affective categories: Angry, Disgust, Fear, Happy, Sad, Surprise, and Neutral. The authors obtained 10 labellers per image and provide the final label as a distribution of the 10 votes. This means that each image is labelled using a composition of the given 7 concepts. The decision-making layer for the FER+ is also based on a fully connected layer, but followed by a single SoftMax output. We use the provided train, test, and validation separations in our experiments, and use the accuracy over all the classes as our main performance metric.
The JAFFE [54] dataset contains 213 images of 10 Japanese women performing facial expressions. Each person was asked to perform each of the seven desired expressions (Angry, Disgust, Fear, Happy, Sad, Surprise, and Neutral) three times, and a group of independent Japanese evaluators labelled the images to validate the expressions. However, each evaluator was given a set of adjectives to identify in each image, which heavily biased the categorization of the images. The images are centred and in greyscale. Given the small dataset size and the reduced number of training images, this dataset helps us to evaluate a very specific scenario. In our experiments, we follow the proposed evaluation scheme of leave-one-emotion-out and calculate the models' accuracy. The decision-making layer for this dataset is also composed of a fully connected layer followed by a SoftMax layer, to provide a one-hot-encoded classification.
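To make the FER+ labelling concrete, the sketch below turns the 10 votes of one image into a distribution over the seven categories listed above. The reduction of that distribution to a single class by majority vote is an assumption made here for illustration; the text only states that accuracy over all classes is used. The function names and the example votes are hypothetical.

```python
from collections import Counter

FER_PLUS_CLASSES = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]


def votes_to_distribution(votes):
    """Convert a list of category votes into a probability distribution."""
    unknown = set(votes) - set(FER_PLUS_CLASSES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    counts = Counter(votes)
    return [counts[name] / len(votes) for name in FER_PLUS_CLASSES]


def majority_label(distribution):
    """Assumed reduction: pick the category with the largest share of votes."""
    best = max(range(len(distribution)), key=distribution.__getitem__)
    return FER_PLUS_CLASSES[best]


# Hypothetical image: 7 labellers chose Happy, 2 Surprise, 1 Neutral.
votes = ["Happy"] * 7 + ["Surprise"] * 2 + ["Neutral"]
distribution = votes_to_distribution(votes)
print(distribution, majority_label(distribution))
```

A SoftMax output can be trained directly against such a distribution, which keeps the disagreement between labellers in the training signal.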
The MaskedAffectNet [9] dataset represents a constrained interaction scenario. It is composed of the same images of the AffectNet dataset, but with the artificial addition of a facial mask. The mask is added in a postprocessing scheme that finds the facial points of the mouth. It then uses a geometrical transformation on a standard face mask image and fixes the mask on top of the mouth. The results closely resemble mask-use in a real-world environment.
The OMG-Emotion [5] dataset contains around 10 hours of recordings of persons performing monologues. Each of the 675 videos shows a single person and is about one minute long. The videos were collected from the web and manually annotated by an internet crowd using an arousal and valence scale. This dataset contains a very specific world representation, as each video has a unique person expressing a continually changing emotional behaviour across a certain topic, so there exists a gradual transition of expressions. The labelling process, although carried out by different persons, is based on utterances. This means that each label refers to a sequence of frames instead of relying on a single facial expression. Benchmarking facial expression recognition models with this dataset is challenging, as the individual and unique components of how each person expresses emotions are present in the videos. The dataset is available in the form of video files, and we pre-process them by cropping faces using the OpenCV face localizer [12]. The authors propose a specific training and validation separation, and we use the CCC for arousal and valence as the main performance metric. The decision-making layer is composed of a GRU layer to process sequences, followed by a fully connected layer and two linear outputs, one for arousal and one for valence.
The ElderReact [56] dataset has 1,323 videos of 46 elderly individuals, all collected from the internet. Each video contains one person naturally expressing emotions. Each video is a few seconds long and is annotated with the presence or absence of seven affective states: Uncertainty, Excitement, Happiness, Surprise, Disgust, Fear, and Frustration. Each video has eight binary labels, one for each affective state. We process each video by cropping the face, using the OpenCV face localizer [12], and train the models using sequential decision-making. As for the OMG-Emotion, we use a GRU layer, followed by a fully connected layer and a SoftMax output layer for each affective state. The dataset authors provide a specific training/validation separation that we use in our experiments. We calculate the F1-score over all the affective states as our main performance metric.
The EmoReact [64] dataset is similar to the ElderReact in construction and labelling procedure but is composed of videos of children. A total of 1200 videos are available, all of them a few seconds long and annotated using the same categories present in the ElderReact. For this dataset, we use the same sequential decision-making, composed of one GRU, one fully connected, and one SoftMax layer per affective category. We use the available training/validation separation in our experiments.
The FABO [33] dataset is our last experimental scenario. It contains short videos of actors performing expressions by request. The dataset has a total of 284 videos, each containing 2 to 4 executions of the same expression. Each execution starts from a neutral position followed by the facial expression apex. There is then a return to the neutral position. Each video is labeled using one out of 9 expressions (Anger, Anxiety, Boredom, Disgust, Fear, Happiness, Puzzlement, Sadness, and Surprise) associated with the apex of each video. We process each video by extracting the face using the OpenCV face localizer [12], then feed the apex of each sequence to a sequential decision-making layer with the same structure as the EmoReact and ElderReact models. We use the given training and validation sets and calculate the accuracy as our main performance metric.

B. EVALUATION METRICS
The AffectNet, MaskedAffectNet, FER+, EmoReact, ElderReact, and OMG-Emotion datasets have a standard separation between training and validation samples, which we follow in all our experiments. The JAFFE evaluation follows a leave-one-emotion-out classification scheme, which is the most common evaluation protocol in the literature. The FABO dataset follows this as well.
The AffectNet, MaskedAffectNet, and OMG-Emotion datasets are evaluated in terms of the concordance correlation coefficient (CCC) [48] for both arousal and valence representations. The CCC is computed as:

$$\mathrm{CCC} = \frac{2\rho\,\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}$$

where $\mu_x$ and $\mu_y$ represent the means of the model predictions and the annotations, $\sigma_x^2$ and $\sigma_y^2$ are the corresponding variances, and $\rho$ is Pearson's correlation coefficient between the model predictions and the annotations.
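As an illustration of this metric, the sketch below computes the CCC from the means, variances, and correlation of predictions and annotations, using only the Python standard library. The function name `ccc` and the example values are hypothetical. The code uses population variances and the identity that Pearson's correlation multiplied by both standard deviations equals the covariance.

```python
from statistics import fmean


def ccc(predictions, annotations):
    """Concordance correlation coefficient between two equal-length sequences."""
    if len(predictions) != len(annotations) or len(predictions) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predictions)
    mu_x = fmean(predictions)
    mu_y = fmean(annotations)
    var_x = sum((x - mu_x) ** 2 for x in predictions) / n
    var_y = sum((y - mu_y) ** 2 for y in annotations) / n
    # rho * sigma_x * sigma_y is the covariance of predictions and annotations.
    cov_xy = sum((x - mu_x) * (y - mu_y)
                 for x, y in zip(predictions, annotations)) / n
    denominator = var_x + var_y + (mu_x - mu_y) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2 * cov_xy / denominator


# Hypothetical arousal predictions and annotations for five samples.
print(ccc([0.1, 0.4, 0.35, 0.8, 0.6], [0.2, 0.3, 0.4, 0.9, 0.5]))
```

Unlike Pearson's correlation alone, the CCC also penalizes a constant offset between predictions and annotations through the squared difference of the means in the denominator.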
The FER+, JAFFE, and FABO datasets use accuracy as the main performance metric, while the EmoReact and ElderReact datasets use the F1-Score averaged per emotional category.
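The two classification metrics can be sketched as follows, assuming one class index per sample for accuracy and one binary vector per sample (one entry per affective category) for the F1-score averaged per category. The function names and the toy labels are hypothetical; a category without any positive label or prediction is scored as 0 here, which is a convention chosen for this sketch.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the annotation."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def binary_f1(y_true, y_pred):
    """F1-score for one affective category with 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if 2 * tp + fp + fn == 0:
        return 0.0  # sketch convention for a category that never occurs
    return 2 * tp / (2 * tp + fp + fn)


def macro_f1(labels_true, labels_pred):
    """Average of the per-category F1-scores over all categories."""
    n_categories = len(labels_true[0])
    scores = [
        binary_f1([row[c] for row in labels_true],
                  [row[c] for row in labels_pred])
        for c in range(n_categories)
    ]
    return sum(scores) / n_categories


# Toy example: 4 single-label samples and 3 multi-label samples, 2 categories.
print(accuracy([0, 1, 2, 2], [0, 1, 1, 2]))
print(macro_f1([[1, 0], [1, 1], [0, 1]], [[1, 0], [0, 1], [0, 1]]))
```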
We ran each of our experiments 30 times and report the average performance. We pretrained each of the models with the AffectNet dataset, and the outputs of the facial expression encoders are then used as input to the decision-making layers. The final performance of each model is calculated using the combination of the facial encoder and the decision-making layer.

A. AffectNet BASELINE
Our first experimental setting calculates the performance of each model when fully trained with the AffectNet dataset. Figure 8 reports the final performance. The FaceChannel shows slightly better performance on valence, reaching a CCC of 0.46, while the CPC encoder achieves the best arousal with 0.63. In general terms, the performance of all three encoders was similar, showing that all three encoders do learn efficient facial expression representations.

B. FACIAL REPRESENTATION PERFORMANCE
The full experimental results for the MaskedAffectNet and OMG-Emotion, in terms of CCC, are reported in Figure 9. In general terms, the best results among the four training settings were achieved by fine-tuning all the layers of the network, which is somewhat expected, as both datasets have a large amount of data. In our recently published paper [9], we report a similar experiment with the MaskedAffectNet and the FaceChannel encoder and obtained the same results. Training only the decision-making layer presented the worst performance on the MaskedAffectNet dataset, which could indicate that the emotional representation learned from the AffectNet dataset was not enough. This is somewhat expected, given the presence of the masks covering much of the faces in this dataset. For the OMG-Emotion, training from scratch presented the worst results, which suggests that the dataset alone does not have enough data samples to train these encoders. In both cases, when the Face-STN is present, the results improve drastically, in some cases surpassing the full fine-tuning routine. Both datasets have their own labelling process and, thus, their own affective representation. The high performance achieved by the Face-STN is a clear indication that it can focus the general features learned by the encoders into very specific affective representations.
Similar behaviour can be found when evaluating the accuracy-based datasets (FABO, FER+, and JAFFE), reported in Figure 9. The JAFFE dataset is quite particular here because the PK encoder does not perform as well as the other two encoders, which probably indicates that the facial representations captured by this encoder are not sufficient for the very specific characteristics of the JAFFE dataset. Again, the presence of the Face-STN drastically improves the performance of all encoders. When evaluating the models on the ElderReact and EmoReact datasets, reported in Figure 11, we observe that full retraining obtains the best results, while exclusively training the decision-making layer achieves the worst results. In terms of encoders, all three models achieve similar results.
The presence of the Face-STN also has a positive impact on the encoders when evaluated with the ElderReact and EmoReact datasets, reported in Figure 11. These datasets show the least variance in the performance range between all the experiments, which indicates that their facial representation is not heavily affected by the facial features coming from the encoders.

C. STATE-OF-THE-ART COMPARISON
The Face-STN achieves competitive performance compared to the current state-of-the-art results on the OMG-Emotion dataset [23], [66], [81], as Table 1 exhibits. All the reported models use deep neural networks with strong pretraining and fine-tuning routines. Using attention mechanisms [81] to process the continuous expressions in the videos presented the best results of the challenge, achieving a CCC of 0.35 for arousal and 0.49 for valence. Temporal pooling, implemented as a bi-directional LSTM, achieved the second best, with a CCC of 0.24 for arousal and 0.43 for valence. Late fusion of facial expressions, speech signals, and text information reached the third-best result, with a CCC of 0.27 for arousal and 0.35 for valence. The complex attention-based network proposed by Huang et al. [37] achieved a CCC of 0.31 in arousal and 0.45 in valence, using only visual information. Our Face-STN achieves a maximum of 0.38 for arousal (with the PK encoder) and 0.44 for valence (with the CPC encoder) without needing to retrain the convolutional layers, reducing the fine-tuning effort.
On the FER+ dataset, the Face-STN outperforms the results reported by the dataset authors [10], who focus on a fine-tuned VGG13 encoder that updates all the convolutional layers. We also outperform the results that Miao et al. [59], Li et al. [50], and Siqueira et al. [71] reported, all of which employ different types of complex neural networks to learn facial expressions. On the FABO dataset, the Face-STN achieves higher results than those reported in the literature, including Chen et al. [17], who proposed a frame-based recognition and a bag-of-words-based model, and Gunes et al. [32], who used an SVM-based implementation.
When evaluated on the JAFFE dataset, the Face-STN attached to the FaceChannel achieves the best results when compared with the fine-tuning of the DeepEmotion [60] and the attention-based salient patch neural network [34].
The performance of the Face-STN on the EmoReact and ElderReact, reported in Table 3, is better than the models reported by the authors. On the EmoReact, the FaceChannel encoder achieves the best results, while on the ElderReact, the CPC encoder has the highest F1-Score.
Some of the models with which these datasets were evaluated seem to be outdated for other computer vision tasks. In our experiments, however, we evaluate the performance of three very recent deep neural networks (FaceChannel, PK, and CPC) on each of the datasets. The Face-STN complements these models and presents results that are competitive with them.

VI. DISCUSSIONS
Our experimental results confirm that the Face-STN can be used to adapt different facial encoders towards specific affective worlds. To obtain a holistic understanding of the impact of the plugins, we must disentangle their training effort from their final performance and contrast this information with all other training settings. Besides performance, understanding the impact of the Face-STN on the facial representation of each encoder is needed to ground its true contributions. We therefore perform feature formation analyses, especially on the representation of very specific affective worlds.
The main contribution of this paper regards the connection between the Face-STN and the theory of non-universal perception of affect. Our experimental setup gives a first indication of how we can continue this quest, and we further discuss the current advantages and limitations of this approach.

A. TRAINING EFFORT VS PERFORMANCE
When we analyse performance alone, our experiments show that retraining all the encoders, in a full training setting, drastically increases the performance on all the datasets. Although it is relatively easy to obtain computational power on demand to train large and complex models, the number of trainable parameters of a model continues to indicate the training effort needed to update it. When comparing the relative performance and number of parameters from the full-training setting and the Face-STNs for each encoder type in each dataset, displayed in Table 4, we observe that the Face-STNs outperform the full training in most datasets.
For most cases, we observe that the Face-STN has a similar performance when compared to the full-training, but even in the worst case, the relative performance achieved by the Face-STN is over 93% of the full-training performance. On the other hand, the training effort, represented by the number of updatable parameters, drastically reduces, especially for the FaceChannel. In the extreme case of the PK encoder with the JAFFE dataset, the Face-STN achieved almost double the performance.
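The two quantities contrasted in this comparison can be written as simple ratios. The sketch below uses hypothetical scores and parameter counts, not the values from Table 4, and the function names are illustrative.

```python
def relative_performance(stn_score: float, full_training_score: float) -> float:
    """Face-STN score expressed as a percentage of the full-training score."""
    return 100.0 * stn_score / full_training_score


def parameter_ratio(stn_trainable: int, full_trainable: int) -> float:
    """Fraction of the full-training parameters that the Face-STN updates."""
    return stn_trainable / full_trainable


# Hypothetical example: CCC of 0.42 with the Face-STN against 0.45 with full
# training, while updating 70,000 instead of 1,220,000 parameters.
print(round(relative_performance(0.42, 0.45), 1))    # about 93.3 percent
print(round(parameter_ratio(70_000, 1_220_000), 3))  # about 0.057
```

A relative performance above 100% means that the Face-STN surpassed full training for that encoder and dataset.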
Beyond the raw performance numbers, the Face-STN displayed an important behaviour that is lacking in most automatic affective perception models: fast adaptability. Based on prior perception models (the pretrained encoders), it could modify the affective representations embedded in the latent space of each encoder towards the specific characteristics of each dataset. By doing so, we could reuse the same encoder in every dataset without retraining or readapting it. When we did the same by retraining only the last convolutional layer, the performance dropped considerably, and we were modifying these encoders drastically, needing one set of encoders per dataset. When not readapting the encoders at all and only updating the decision-making layer, the performance dropped to the lowest levels, making this option the worst of our experiments.

B. HOW DOES THE FACE-STN ALLOW AFFECTIVE BIASING?
Our results show that the Face-STN networks are able to improve performance in most cases, or at least match the performance of full-retraining on each of the datasets. The main contribution of the Face-STN involves using a bottom-up training scheme to try to adapt the last convolutional layers towards the unique affective characteristics that each of the datasets possesses. That means the affective information coming from the labelling scheme of each dataset directly impacts the selection of specific features that each encoder can extract.
The Face-STN does not update the weights of the last convolutional layer, but it rearranges the features to highlight the most important ones for that specific dataset. If the original encoders, trained on the AffectNet dataset, already have a representation similar to the ones found in the images of a dataset, the impact of the Face-STN is reduced, as one can see in the case of the FABO dataset. When the facial representations learned by the encoders differ from the ones present in the dataset, which is the case for the JAFFE images, the Face-STN can repurpose the learned representations to fit the JAFFE requirements. To illustrate this behaviour better, Figure 12 displays the differences between the entangled representations of the JAFFE dataset for the three encoders under the All layers and Face-STN settings. The entangled representations are projected with t-SNE to obtain two components for visualization. For the PK and CPC encoders, training all layers does not produce distinguishable representations, while when the Face-STN is present, the representations are rearranged and more clearly distinguishable from each other, based on their original labels.
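The idea of rearranging features without changing the encoder weights can be illustrated with the generic resampling step of a spatial transformer: an affine matrix maps every output position to a source position in the feature map, and the output is read by bilinear interpolation. This is a minimal sketch of that general mechanism, not the Face-STN architecture itself. The affine matrices are set by hand here, whereas in a spatial transformer they would be predicted by a small trainable localisation network, and all function names are hypothetical.

```python
import math


def affine_grid(theta, height, width):
    """Source coordinates, in normalised [-1, 1] space, for each output cell."""
    grid = []
    for i in range(height):
        y = -1.0 + 2.0 * i / (height - 1) if height > 1 else 0.0
        row = []
        for j in range(width):
            x = -1.0 + 2.0 * j / (width - 1) if width > 1 else 0.0
            xs = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            ys = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            row.append((xs, ys))
        grid.append(row)
    return grid


def bilinear_sample(feature_map, grid):
    """Read the feature map at the grid positions; outside values count as 0."""
    h, w = len(feature_map), len(feature_map[0])
    output = []
    for row in grid:
        out_row = []
        for xs, ys in row:
            px = (xs + 1.0) * (w - 1) / 2.0   # back to pixel coordinates
            py = (ys + 1.0) * (h - 1) / 2.0
            x0, y0 = math.floor(px), math.floor(py)
            value = 0.0
            for yy in (y0, y0 + 1):
                for xx in (x0, x0 + 1):
                    if 0 <= yy < h and 0 <= xx < w:
                        weight = (1 - abs(px - xx)) * (1 - abs(py - yy))
                        value += weight * feature_map[yy][xx]
            out_row.append(value)
        output.append(out_row)
    return output


def spatial_transform(feature_map, theta):
    """Apply a 2x3 affine transform to one feature map without altering it."""
    grid = affine_grid(theta, len(feature_map), len(feature_map[0]))
    return bilinear_sample(feature_map, grid)


features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
mirror = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # horizontal flip
print(spatial_transform(features, identity))
print(spatial_transform(features, mirror))
```

With the identity matrix the feature map passes through unchanged, while other matrices move, scale, or mirror the same activations. The values produced by the encoder are reused in both cases; only their spatial arrangement changes.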

C. AND WHY ARE FACE-STNs NON-UNIVERSAL?
Independently of the affective representations with which we are dealing, faces do not change. The general physical structure and characteristics of a face endure, which is a good start for artificial facial expression recognition because they can focus on which features to adapt. Convolutional neural networks can depict facial characteristics quite well [44], [74], but because they learn it using a strongly supervised process, the given labels still bias the learned representation. This was the case in our experiments in terms of a performance drop, especially when evaluating the pretrained encoders on very specific affective world representations, such as the JAFFE, the EmoReact, and the ElderReact datasets.

TABLE 4. Relative performance and number of parameters when training and evaluating the Face-STNs, for all encoders, compared to retraining the full network, which usually achieved the best results, for all datasets.
Fitting our experiments within the concept of non-universality of emotional perception can seem contradictory, as our model focuses on rearranging pretrained perception towards affective worlds that are very well defined by strong labels. However, the adaptation that the Face-STNs achieve allows a pre-existing perception model to deal with unknown conditions from different datasets. Our experiments demonstrate that our model addresses the problem of learning facial representations by reorganizing the existing facial features. It does so by biasing the high-level representations towards the labels of each dataset, improving the overall model's performance. This demonstrates that, at least for a well-defined encoder, different scenarios can share the learned features. We also show that, in most cases, biasing these features is a more beneficial solution than retraining the encoders, even partially. Understanding this problem as a continual rearranging of a perception mechanism, based on the specific affective context given by each dataset, is how we address the non-universality of emotional perception.

VII. CONCLUSION AND FUTURE WORK
In this paper, we present a facial expression perception study where we investigate the readaptation of facial features as a mechanism for achieving non-universal affective perception. In this regard, we present a Spatial Transformer Network (Face-STN) that one may attach to any convolution-based encoder to rearrange learned features without the need of retraining the entire encoder. We perform a series of experiments with three different convolution-based encoders and with eight different datasets, representing different affective worlds. Our experiments demonstrate that when the Face-STNs are present, we reduce the training effort and maintain high performance, sometimes even surpassing the state-of-the-art performance on each of the evaluated datasets.
Besides performance, we discuss how our Face-STNs adapt the concept of non-universal emotional perception and put it into practice by understanding its impact on the different affective representations of each dataset. We establish and present our networks as one tool that will help us approach non-universal perception in affective computing, which will help develop truly adaptable emotional perception models.
Furthermore, the major contribution of this study regards discussing the impact and responsibility when developing facial expression research. We should consider the soft separation of face representation from affect understanding, following the recent trend on affective perception of humans, to provide reliable and adaptable facial expression recognition solutions. Focusing on adaptable affective recognition, instead of a general one, will allow us to be much more flexible when dealing with underrepresented scenarios.
Although we demonstrate the capability of the Face-STNs to adapt towards very specific affective worlds, we are still dealing with perception alone. All our experiments take for granted that the labels derived from the datasets are reliable and represent the truth of that affective scenario. In future work, we will continue our search for non-universal emotional modelling from the affective understanding perspective, primarily addressing the problems of emotional grounding in different scenarios. We will address this problem by adapting the Face-STN to consider other aspects of the scenario, such as using reinforcement learning to address the congruence of the affective responses of a person.

APPENDIX A DECISION-MAKING NETWORKS
For each of the datasets, we propose one decision-making network that is attached to each of the encoders. The final architecture and the topological and training parameters of these networks were found using a tree-structured Parzen estimator search [11] over the search space listed in Table 5. The final architecture of each decision-making network is reported in Table 6 and illustrated in Figure 13.
ALESSANDRA SCIUTTI (Member, IEEE) received the Ph.D. degree in humanoid technologies from the University of Genova, in 2010. She is a Tenure Track Researcher and the Head of the COgNiTive Architecture for Collaborative Technologies (CONTACT) Unit of the Italian Institute of Technology (IIT). After two research periods in USA and Japan, in 2018 she has been awarded the ERC Starting Grant wHiSPER (www.whisperproject.eu), focused on the investigation of joint perception between humans and robots. She published more than 80 papers in international journals and conferences and participated in the coordination of the CODEFROR European IRSES Project. She is currently an Associate Editor of several journals, among which the International Journal of Social Robotics, the IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS, and Cognitive Systems Research. The scientific aim of her research is to investigate the sensory and motor mechanisms underlying mutual understanding in human-human and human-robot interaction. More info at https://www.iit.it/people/alessandra-sciutti. Open Access funding provided by 'Istituto Italiano di Tecnologia' within the CRUI CARE Agreement VOLUME 10, 2022