Graph-based Facial Affect Analysis: A Review

As one of the most important affective signals, facial affect analysis (FAA) is essential for developing human-computer interaction systems. Early methods focus on extracting appearance and geometry features associated with human affects while ignoring the latent semantic information among individual facial changes, leading to limited performance and generalization. Recent work attempts to establish a graph-based representation to model these semantic relationships and develop frameworks to leverage them for various FAA tasks. This paper provides a comprehensive review of graph-based FAA, including the evolution of algorithms and their applications. First, the FAA background knowledge is introduced, especially on the role of the graph. We then discuss approaches widely used for graph-based affective representation in literature and show a trend towards graph construction. For the relational reasoning in graph-based FAA, existing studies are categorized according to their non-deep or deep learning methods, emphasizing the latest graph neural networks. Performance comparisons of the state-of-the-art graph-based FAA methods are also summarized. Finally, we discuss the challenges and potential directions. As far as we know, this is the first survey of graph-based FAA methods. Our findings can serve as a reference for future research in this field.

Historically, FAA methods have undergone a series of evolutions. Initial studies usually rely on hand-crafted design or classic machine learning to obtain useful affective features without structural information [7,19]. Psychological findings indicate that the human cognition of facial information is realized through a dual system composed of analytic processing and holistic processing [4]. The former acquires multi-dimensional cluster features by analyzing local areas, while the latter aims to generate a holistic representation to perceive the overall structure [4,20]. Such an analytic-holistic working system is similar to a topology-like structure, so it is reasonable for machine vision researchers to model it as a graph. Accordingly, many state-of-the-art studies have been dedicated to generating a facial graph with local-to-global affective features [21,22,23,24].
If the above evidence reveals the feasibility of using graph-based methods for FAA, research on how facial muscles participate in affective expression further indicates their necessity [25,26,27]. There are latent relationships among different facial areas and contexts, which are vital clues [28,29]. A few non-graph-based deep models have partly captured these relationships and improved performance [30,31,32]. The underlying assumption is that explicit mappings that reflect these relationships can be directly learned [33]. However, these mappings are not reliable enough in the real world because they differ from subject to subject and even from one condition to another [34,35]. Recently, graph-based methods have shown that they can represent facial anatomy and simultaneously fit latent relationships in facial affects [36,37,38]. Some pilot studies have also suggested that graph-based methods can go further and deal with challenging tasks such as analyzing occluded faces [39,40] and ambiguous facial affects [41,42].
We searched Google Scholar using the keyword 'graph' together with the Index Terms of this survey and counted the number of relevant papers published from 2010 to the present. As presented in Fig. 1, graph-based FAA has gained increasing attention, especially in the past five years (publications in 2021 increased by 600 year-on-year).
Based on theoretical support, outstanding performance and quantity of existing work, and potential for future development, it is necessary to review the state of graph-based FAA methods. Although many reviews have discussed FAA's historical evolutions [7,33,43] and recent advances [44,45,46], including some on specific problems like occluded expression [47], multi-modal affect [48] and micro-expression [49], this is the first systematic and in-depth survey for the graph-based FAA field as far as we know. We emphasize representative research proposed after 2010. The goal is to present a novel perspective on FAA and its latest trends. This review is organized as follows: Section 2 provides a brief background on FAA and discusses the unique role of graph-based methods in FAA research. Section 3 presents a taxonomy of mainstream graph-based methods for affective representation. Section 4 reviews classical and advanced approaches for graph relational reasoning and discusses their pros and cons in FAA tasks. Section 5 summarizes public databases, main FAA applications, and current challenges based on a detailed comparison of related literature. Finally, Section 6 concludes with a general discussion and identifies potential directions.

FACIAL AFFECT ANALYSIS

Affective Description Model
As early as the 1970s, Ekman and Friesen [50] proposed the definition of six basic affects, i.e., happiness, sadness, fear, anger, disgust, and surprise, based on an assumption of the universality of human affective display [25]. In addition, compound affects [51] defined by different combinations of basic affects (e.g., sadly surprised and happily surprised) are proposed to depict more complex affective situations [52]. Another well-known description, the Facial Action Coding System (FACS), is designed for a broader range of affects and consists of a set of atomic Action Units (AUs) [20,29]. Fig. 2 shows an example of six basic affects plus neutral and the activated AUs in each facial affect. Besides categorical models, a continuous affective model named the VAD Emotional State Model [53] is also suggested [54,55,56]. The VAD model has three dimensions, i.e., valence (how positive or negative an affect is), arousal (the activation intensity of an affect), and dominance (how submissive or in-control a person is in an affective display). Recent studies consider that the continuous model is more appropriate for describing dynamic changes of human affects in the real world [44,57]. Please refer to [7,43] for a more detailed discussion about this topic.

General Pipeline
A standard FAA method can be broken down into fundamental components: face preprocessing, affective representation, and task analysis. As a new branch of FAA, the graph-based method also follows this generic pipeline (see Fig. 3). Face detection and registration are two necessary pre-steps that first locate faces and normalize facial variations, sometimes also providing facial landmarks [59,60]. Fig. 4 presents an illustration of the preprocessing steps.
Early methods like Viola and Jones [61], Mixtures of Trees [62], and Active Appearance Model (AAM) [63] have been widely used for this purpose. Recently, cascaded deep approaches with real-time performance are popular, such as Multi-Task Cascaded Convolutional Network [64], Hyperface [65], and Supervision by Registration and Triangulation [66]. Please refer to [67,68,69] for more specific information.
Compared to other existing methods, the graph-based FAA pays more attention to representing facial affects with graphs and obtaining affective features from such representation by graph reasoning.
In mathematical terms, a graph can be denoted as G = (V, E). The node set V contains all the representations of the entities in the graph, and the edge set E contains all the structure information between two entities. Thus, when E is empty, G becomes an unstructured collection of entities (e.g., independent local facial areas [30]). Meanwhile, we could also define some initial graph structure ahead of the relational model, which is a general practice in many affective graph representations [24,39,71].
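As a minimal sketch of the notation G = (V, E), the snippet below stores nodes and undirected edges in plain dictionaries. The class name `FacialGraph`, the node names, and the feature values are illustrative assumptions and are not taken from any surveyed method.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FacialGraph:
    """G = (V, E): V maps node ids to feature vectors, E maps node pairs to attributes."""
    nodes: Dict[str, List[float]] = field(default_factory=dict)
    edges: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def add_edge(self, u: str, v: str, weight: float = 1.0) -> None:
        if u not in self.nodes or v not in self.nodes:
            raise KeyError("both endpoints must already be in V")
        # Undirected edge: store the pair in a canonical (sorted) order.
        self.edges[tuple(sorted((u, v)))] = weight

    def is_unstructured(self) -> bool:
        # With E empty, G is just a collection of independent entities.
        return not self.edges

    def neighbors(self, u: str) -> List[str]:
        return sorted(b if a == u else a for (a, b) in self.edges if u in (a, b))


# Hypothetical local facial areas with made-up 2-D features.
g = FacialGraph(nodes={"left_eye": [0.1, 0.4], "right_eye": [0.2, 0.3], "mouth": [0.9, 0.5]})
unstructured_before = g.is_unstructured()   # True: E is empty
g.add_edge("left_eye", "right_eye")         # initial structure defined ahead of reasoning
g.add_edge("mouth", "left_eye", weight=0.5)
```

Starting from an empty E and adding edges by hand mirrors the two situations described above: an unstructured collection of entities, and an initial structure defined before relational reasoning.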
Given this unstructured collection, performing relational reasoning requires the model to infer the structure of these entities before predicting the property or category of an object. Naturally, generic approaches need to be adjusted to the affective graph representation, or new graph-based approaches need to be proposed, to infer the latent relationships and extract the final affective feature.
The two components can be performed separately or arranged as an end-to-end framework. They are expected to exhibit better performance and generalization capability by manually or automatically providing richer information through prior knowledge. Hence, the advantages and limitations of different graph generation methods and their relational reasoning approaches are two main topics of this survey.

GRAPH-BASED AFFECTIVE REPRESENTATIONS
Affective representation is a crucial procedure for most graph-based FAA methods. Depending on the domain that an affective graph models, we categorize the strategies as Spatial graphs, Spatio-temporal graphs, AU-level graphs, and Sample-level graphs. Fig. 5 illustrates a detailed summary of the literature using different graph representations. Note that many graph-based representations contain pre-extracted geometric and/or appearance features. Whether hand-crafted or learned, these feature descriptors are not essentially different from those used in non-graph-based affective representations. Interested readers can refer to [7,43,45] for a systematic understanding of this topic.

Spatial Graph Representations
Non-graph-based spatial methods usually treat a facial affect as a whole representation or pay attention to variations among main face components or crucial facial parts [30,103,104]. For spatial affective graphs, facial changes are considered while their co-occurring relationships and affective semantics are represented as essential cues [21,76,82]. These approaches can be divided into landmark-level graphs and region-level graphs. Fig. 6 illustrates frameworks of different spatial graph representations.

Landmark-level graphs
Facial landmarks are among the most critical geometric cues, reflecting the shape of face components and the structure of facial anatomy [105]. Thus, it is natural to use facial landmarks as base nodes to generate a graph representation.
Limited by the detection performance, only a few landmarks that locate basic face components were applied in early graph representations [106]. Recently, graphs using more facial landmarks (e.g., 68 landmarks [107]) are proposed to depict fine-grained facial shapes. For example, in [35] and [73], the authors associated 68 landmarks with the AUs in FACS and made graph-based representations.
The difference is that the former additionally employed local appearance features extracted by Histograms of Oriented Gradients (HOG) [108] as node attributes; the latter proposed three landmark knowledge encoding strategies for enhanced geometric representations. Alternatively, [72] formulated a Latent Tree (LT) where 66 landmarks were set as parts of leaf nodes accompanied by several other leaf nodes of AU targets and hidden variables, which reflected the joint distribution of targets and features.
Furthermore, some current methods select landmarks with significant contributions to avoid redundant information [76,109]. Landmarks locating the external contour and nose are frequently discarded [39,74] (see Figs. 6a, b) because they are considered irrelevant to facial affects. [21] chose to remove the landmarks of the facial outline and applied a small window around each remaining landmark as one graph node, while the local features were extracted by Gabor filter [110]. Since these local areas were segmented to introduce facial appearance into the graph representation rather than as independent nodes, similar to [35], these methods are still classified as landmark-level graphs. On the other hand, extra landmarks have also been added to generate comprehensive graph representations [71,111], which could keep an appropriate dimension and represent sufficient affective information.
A fully connected graph is the most intuitive way to form edges [21,76]. However, the number of edges is n(n − 1)/2 for a complete graph with n nodes, which means the complexity of the spatial relationship will increase as the number of nodes increases. This positive correlation is not helpful because landmarks in a facial component mostly move in concert rather than arbitrarily when conveying facial affects [112]. Studies of point-light displays in emotion perception also show that more complex representations seem to be redundant [113]. To this end, work like [71,73,74,111] manually reduced edges based on muscle anatomy and FACS. Another type of approach is exploiting triangulation algorithms [39], such as Delaunay triangulation [35], to generate graph edges consistent with true facial muscle distribution and uniform for different subjects. Similarly, the landmark-level graph with triangulation is also utilized in generating a sparse or dense facial mesh for 3D FAA [77,78]. The Euclidean distance is the simplest and most dominant metric for edge attributes of the above fa-
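The edge count of a complete landmark graph and the use of Euclidean distance as an edge attribute can be illustrated with a short sketch. The landmark coordinates and the pruned edge list below are hypothetical. They stand in for a manually reduced structure and do not reproduce the anatomy-based rules of any cited work.

```python
import math
from itertools import combinations


def complete_edges(n):
    """All n(n - 1)/2 undirected edges of a complete graph on n nodes."""
    return list(combinations(range(n), 2))


def euclidean_edge_attributes(landmarks, edges):
    """Attach the Euclidean distance between the two endpoints to each edge."""
    return {(i, j): math.dist(landmarks[i], landmarks[j]) for i, j in edges}


# Five made-up 2-D landmark positions (not real facial coordinates).
landmarks = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0), (3.0, -4.0), (3.0, 0.0)]

full = complete_edges(len(landmarks))          # 5 * 4 / 2 = 10 edges
pruned = [(0, 1), (1, 2), (2, 3), (0, 3)]      # hypothetical hand-picked subset
weights = euclidean_edge_attributes(landmarks, pruned)

# With 68 landmarks, a complete graph already has 68 * 67 / 2 edges.
edges_68 = len(complete_edges(68))
```

For 68 landmarks the complete graph has 2278 edges, which gives a sense of why the works above prune edges or rely on triangulation.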

Region-level graphs
Like geometric information, appearance information, especially in local facial regions, can also contribute to FAA [114,115]. Using graph structures is an excellent choice to encode spatial relationships while representing texture changes in facial components [82,83]. There are two categories of region-level affective graphs: region of interest (ROI) graphs and non-prior information (NPI) graphs. ROI graphs partition a set of facial areas related to affective display as graph nodes. Coordinates of facial landmarks are commonly applied to locate and segment ROIs. Unlike a few landmark-level graphs that only use texture near all landmarks as supplementary information, ROI graphs explicitly select meaningful areas as graph nodes, and edges do not entirely depend on established landmark relationships. [23] employed a High-Resolution Network (HRN) [116] to regress ROI maps spotted by representative landmarks. Each spatial location in the extracted feature map was considered one graph node, while edges were induced among node pairs according to mappings between ROIs and AUs. Another example in [37] utilized feature maps of landmark-based ROIs output by ResNet50 [117] as nodes to construct a K-Nearest-Neighbor (KNN) graph. For each node, its pair-wise semantic similarities were calculated, and the nodes with the closest Euclidean distance were connected as initial edges. Similarly, [81] also employed landmark-based ROIs, but the KNN graph was generated in optical-flow space to encode the local manifold structure for a sparse representation [75]. Due to chained reactions among multiple AUs and the symmetrical structure of the human face, [80] proposed a parts-based graph that had manually linked edges by taking FACS and landmarks as references. The nodes were ROIs with Local Binary Pattern (LBP) [19] or deep features as attributes. In addition, methods for obtaining ROIs without relying on facial landmarks have also been studied [82] (see Fig. 6c).
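The KNN graph construction mentioned for ROI features can be sketched as follows, assuming each ROI is summarized by a small feature vector and neighbors are chosen by Euclidean distance. The function name `knn_graph` and the toy features are illustrative assumptions. The cited methods operate on deep feature maps and may differ in detail.

```python
import math


def knn_graph(features, k):
    """Return directed edges (i, j) linking each node i to its k nearest nodes j."""
    edges = set()
    for i, fi in enumerate(features):
        # Sort all other nodes by Euclidean distance to node i.
        dists = sorted(
            (math.dist(fi, fj), j) for j, fj in enumerate(features) if j != i
        )
        for _, j in dists[:k]:
            edges.add((i, j))
    return edges


# Four hypothetical ROI feature vectors forming two tight pairs.
roi_features = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
initial_edges = knn_graph(roi_features, k=1)
```

The resulting edges are directed because nearest-neighbor relations are not necessarily mutual. Symmetrizing them is a further design choice that this sketch leaves open.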
Unlike ROI graph representations, nodes in NPI graphs are evenly distributed in raw images or generated in a fully automatic manner without external knowledge.
[83] created a reference bunch graph by evenly overlaying a rectangular graph on object images (see Fig. 6d). Gabor filters were utilized for each graph node to compute a set of feature vectors for different facial instances. Recently, several methods have tried to introduce regions beyond facial parts or single face images as context nodes. [84] exploited the Region Proposal Network (RPN) [118] with VGG16 [119] to extract region-level nodes, including the target face and its contexts, while edges were affective relationships calculated based on feature vectors. [85] built two NPI graphs for cross-domain FAA. First, holistic and local features were extracted as nodes for source and target domains. Then, global-to-global, global-to-local, and local-to-local connections were computed according to the statistical feature distribution acquired by the K-means algorithm.

Spatio-Temporal Graph Representations
Spatio-temporal representations deal with a sequence of frames within a temporal window and describe the dynamic evolution of facial variations [120]. In particular, introducing temporal information allows nodes to interact with each other at different times and generates a more complex affective graph. Fig. 7 presents frameworks of various spatio-temporal graph representations.
Extending spatial graphs to the spatio-temporal domain is currently the main route. [89] exploited weighted compass masks to obtain 2D directional number responses and 3D space-time directional edge responses corresponding to each of the symmetry planes of a cube. The two masks of given local neighborhoods were nodes in a spatio-temporal Directional Number Transitional Graph (DNG), which could represent salient facial changes and the statistical frequency of affective behaviors over time (see Fig. 7a).
Several representations have been proposed to define temporal connections between landmarks, which can be seen as landmark-level spatio-temporal graphs. [88] developed a context-aware facial multi-graph where intra-face edges were initialized based on morphological and muscular relationships, and inter-frame edges were created by linking the same node between consecutive frames. Similar landmark-based edge initialization in the temporal domain was also utilized in [86,87]. In [40], the authors introduced a connectivity inference block that could automatically generate dynamic edges for a spatio-temporal situational graph of part-occluded affective faces (see Fig. 7b).
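The edge initialization described for landmark-level spatio-temporal graphs, with intra-face edges inside each frame and inter-frame edges linking the same landmark in consecutive frames, can be sketched as below. The function name and the tiny intra-face edge list are assumptions for illustration only.

```python
def spatio_temporal_edges(num_frames, num_nodes, intra_edges):
    """Build edges over nodes (t, i), where t is the frame index and i the landmark index."""
    edges = []
    for t in range(num_frames):
        # Spatial edges: replicate the intra-face structure in every frame.
        edges += [((t, i), (t, j)) for i, j in intra_edges]
        # Temporal edges: link each landmark to itself in the next frame.
        if t + 1 < num_frames:
            edges += [((t, i), (t + 1, i)) for i in range(num_nodes)]
    return edges


# Hypothetical setting: 3 frames, 4 landmarks, 3 intra-face edges.
intra = [(0, 1), (1, 2), (2, 3)]
st_edges = spatio_temporal_edges(num_frames=3, num_nodes=4, intra_edges=intra)
```

With T frames, N landmarks, and S intra-face edges, this scheme yields T·S spatial edges and (T − 1)·N temporal edges.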
Unlike landmark-level graphs, [90] first extracted a holistic feature of each frame and set them as individual nodes to establish a fully connected graph (see Fig. 7c), which could be seen as a frame-level spatio-temporal graph. Similar work includes [91], which took Discrete Cosine Transform (DCT) features as node attributes. Edge connections of these methods would be established by learning the long-term dependency of nodes in time series (discussed in Sec. 4).

AU-level Graph Representations
Apart from using knowledge of AUs and FACS in the above two types of affective graphs, many graph-based representations have been proposed to model affective information from the perspective of AUs themselves. We divide these approaches into two categories: AU-label graphs and AU-map graphs. Fig. 8 shows frameworks of different AU-level graph representations.

AU-label graphs
Unlike spatial and spatio-temporal graphs, AU-label graphs are built from the label distribution of training data [36,93]. [92] computed the co-occurrence and co-absence dependency between every AU pair from the existing database (see Fig. 8a). Since the dependency is not always symmetric, these AU label relationships were used as edges to construct a Directed Acyclic Graph (DAG). In [94], an AU-label graph was built with a data-driven asymmetrical adjacency matrix that denoted the conditional probability of co-occurring AU pairs. AU labels were transformed into high-dimensional node vectors as node attributes [121]. On the other hand, [41] established a DAG where object-level labels (affect categories) and property-level labels (AUs) were regarded as parent nodes and child nodes, respectively. The conditional probability distribution of each node given its parents was measured to obtain graph edges for correcting existing labels and generating unknown labels. A similar idea was realized in [42] to boost affective feature learning in large-scale FAA databases (see Fig. 8b).
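A data-driven, asymmetric adjacency matrix of the kind described above can be estimated from binary AU annotations by counting co-occurrences. The sketch below assumes A[i][j] approximates P(AU_j = 1 | AU_i = 1). The label matrix is made up, and any thresholding or smoothing used by the cited methods is not reproduced.

```python
def au_conditional_adjacency(labels):
    """Estimate A[i][j] = P(AU_j active | AU_i active) from binary label rows."""
    n = len(labels[0])
    occur = [sum(row[i] for row in labels) for i in range(n)]
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and occur[i] > 0:
                co = sum(1 for row in labels if row[i] and row[j])
                adj[i][j] = co / occur[i]
    return adj


# Hypothetical annotations: 4 samples, 3 AUs (1 = active).
labels = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]
A = au_conditional_adjacency(labels)
```

In this toy example the third AU only appears together with the first, so A[2][0] is 1.0 while A[0][2] is 1/3, which shows why such a matrix is generally not symmetric.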

AU-map graphs
AU-map graphs are intuitively close to region-level spatial graphs, especially ROI graphs, because they both employ local feature maps as graph nodes. [22] is an example in between. Twelve AU features were learned through landmark-based ROI features cropped from a multi-scale global appearance feature [119]. These AU features and the AU relationships gathered from training data and manually pre-defined edge connections [122] were combined to construct a knowledge graph (see Fig. 8c). However, the significant difference is that AUs define a set of facial muscle actions, which means there might be multiple AUs in the same ROI. As in Fig. 2, AU12 and AU15 co-occur at lip corners but refer to 'puller' and 'depressor', respectively. Therefore, for many AU graphs, their definition of nodes is independent of those in ROI graphs, even though they are similar in feature map extraction. For instance, graph nodes in [38] were AU features directly obtained by ResNet without defining ROIs. A similar protocol was also followed in [98].
Some special AU-map graphs have been proposed to introduce structure learning for more complex FAA tasks. For AU intensity estimation, [99] trained a Convolutional Neural Network (CNN) to learn deep AU features from multiple databases jointly. The copula functions [123] were applied to model pair-wise AU dependencies in a CRF graph. In addition, Bayesian networks (BNs) are also used to capture the inherent AU dependencies for this task [24,98]. To account for indistinguishable affective faces, [95] designed a VGG-like patch prediction module plus a fusion module to predict the probability of each AU. Prior knowledge taken from the given databases and a mutual gating strategy were used simultaneously to generate initial edge connections. To model uncertain samples in real-world databases, [97] established an uncertain graph, in which a weighted probabilistic mask that followed a Gaussian distribution was imposed on each AU feature map. By doing this, the importance of edges and the underlying uncertain information could be encoded in the graph representation. Another attempt in [96] boosted semi-supervised AU recognition for labeled and unlabeled face images. The parameters of two AU classifiers were used as graph nodes to share the latent relationships among AUs.

Sample-level Graph Representations
Recently, several graph representations beyond a single sample have been proposed, which indicates that this is still an open research field. In [102], a correlation graph with word-embedded affective labels as nodes was built for distribution learning. Its edges could be generated either by a psychologically normalized Gaussian function or by conditional probabilities. To combine signals from multiple corpora, [100] proposed a dual-branch framework, in which the visual semantic features were extracted in source and target sets. These features were then retrieved with correlation coefficients to generate positive edge connections for a learnable visual semantic graph (see Fig. 8d). Besides, [101] constructed a KNN graph with edges of binary weights to preserve the intrinsic geometrical structure of source and target data, which can seek more latent common information to reduce the distribution difference and make representations more discriminative.

Discussion
As a significant part of the graph-based FAA method, different affective graph representations have their merits, shortcomings, and requirements (see Table 1).
Spatial graph representations: Conceptually, landmark-level graphs model the facial shape variations of fiducial points and easily generate the internal structural relationships of different affective displays. However, most methods are sensitive to facial landmarks' detection errors, thereby failing in uncontrolled conditions. On the other hand, the selection of landmarks and the connection of edges have not yet formed a standard rule. Their effects on the graph
Spatio-temporal graph representations: With extra dynamic affective information, spatio-temporal graphs can help aggregate evolution features in continuous time. For landmark-level methods, the current initialization strategy of edges is to link the facial landmark with the same index frame by frame. Unfortunately, no research has been reported to learn the interaction of landmarks with different indexes in the temporal dimension. Besides, in addition to Euclidean distance and Hop distance, other measures for edge attributes should also be explored to model the semantic context both spatially and temporally. For the frame-level methods, embedding domain knowledge related to affective behaviours, like muscular activity, through the graph structure is not explicitly considered in recent work. Therefore, building a hybrid spatio-temporal graph is a practical way to simultaneously encode the two levels of affective information.
AU-level graph representations: As a distinctive type, AU-level graphs provide certain semantics of facial affects by representing each AU and its co-occurrence dependency. The measurement criteria of AU correlations are versatile but not general. Most AU-label graphs rely on the label distributions of one or multiple given databases. Nevertheless, AU labelling requires annotators with professional certificates and is time-consuming, so existing databases with AU annotations are usually small-scale. Therefore, the distribution from limited samples may not reflect the true dependencies of individual AUs, and its impact on FAA still needs to be assessed.
Sample-level graph representations: Sample-level graphs are an appealing field that introduces latent relationships in data distributions. This characteristic makes them convenient to integrate with existing FAA methods. However, it also puts forward higher requirements for the diversity and balance of samples. On the other hand, to the best of our knowledge, there is no work combining sample-level and other in-face graphs to construct a joint representation, which we think is a promising topic.

AFFECTIVE GRAPH RELATIONAL REASONING
Generally, graph relational reasoning can be considered a two-step process, i.e., understanding the structure from a certain group of entities and making inferences of the system as a whole or the property within [124].However, things are slightly different in the case of graph-based FAA.
Depending on what kind of affective graph representation is exploited, the contribution of graph relational reasoning can either be merged with other affective features before the decision level or be realized collaboratively at the level of feature learning.
In this Section, we review relational reasoning methods designed for affective graph representations in four categories: Dynamic BNs (DBNs), classical deep models, Graph Neural Networks (GNNs) and non-deep machine learning techniques.

Dynamic Bayesian Networks
DBNs are often used to reason about relationships among facial displays like AUs [125] and, of course, for AU-label graph representations. The BN is a DAG that reflects a joint probability distribution among a set of variables. In the work of [36,92], a DAG was manually initialized according to prior knowledge, and then large databases were used to perform structure learning to find the optimal probability graph structure. After that, the probabilities of different AUs were inferred by learning the DBN. Following this idea, [93] additionally integrated the DBN into a multi-task feature learning framework and made the AU inference by calculating the joint probability of each category node. Sometimes the DBN is also combined with statistical methods, such as Hidden Markov Models [126], to explore different graph structures [24,98]. Another advanced DBN study is [41], which modeled the inherent relationships between category labels and property labels. Its parameters were utilized to denote the conditional probability distribution of each AU given the facial affect. The wrong labels could be corrected by leveraging the dependencies after the structure optimization.
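For reference, the joint distribution encoded by a BN factorizes over the DAG, and a DBN extends this across time slices. The notation below (variables X_i, parent sets Pa(X_i), slice index t) is introduced here for illustration, and the second line assumes the usual first-order Markov transition model rather than the specific formulation of any cited work.

```latex
% Static BN over variables X_1, ..., X_n, with parent sets Pa(X_i) given by the DAG
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)

% DBN over T time slices, assuming a first-order Markov transition model
P\bigl(X^{1:T}\bigr) = P\bigl(X^{1}\bigr) \prod_{t=2}^{T} P\bigl(X^{t} \mid X^{t-1}\bigr)
```

Structure learning then amounts to choosing the parent sets, while parameter learning estimates the conditional distributions on the right-hand side.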

Adjustments of Classical Deep Models
Before GNNs were widely employed, many studies adopted conventional Deep Neural Networks (DNNs) to process affective representations with the graph structure. These deep models are not explicitly designed for graphs but can conduct standard operations on structural graph data by adjusting the internal architecture or applying an additional transformation to the input graph representation. Fig. 9 shows examples of classical deep models for graph relational reasoning.

Recurrent neural networks
Variants of Recurrent Neural Networks (RNNs) are among the model types that have been successfully extended to handle graph-structured inputs. Similar to a random walk, [21] applied a Bidirectional RNN to deal with its landmark-level spatial graph representation in a rigid order. The Gabor features of each graph node were updated by multiplying them with the average of the connected edges to incorporate the structural information. Subsequently, the nodes were iterated by the RNN with learnable parameters in the forward and backward directions. In [95], the authors built a structure inference module to capture AU relationships from an AU-map graph representation. Based on a collection of interconnected recurrent structure inference units and a parameter-sharing RNN, the mutual relationship between two nodes could be updated by replicating an iterative message passing mechanism under the control of a gating strategy (see Fig. 9a). Following the sequential idea of RNNs, [22] exploited a Gated Graph Neural Network (GGNN) [127] that calculated the hidden state of the next time-step by jointly considering the current hidden state of each node and adjacent nodes. The relational reasoning could be done through the iterative update of GGNN over its AU-map graph.

Convolutional neural networks
Unlike the sequential networks, [35] utilized a variant CNN to process the landmark-level spatial affective graph. Compared to standard convolution architectures, the convolution layer in this study convolved over the diagonal of a particular adjacency matrix to aggregate the information from multiple nodes. Then a list of the diagonal convolution outputs was further processed by three 1D sequential convolution layers. The corresponding pooling processes were performed after the convolution operations to integrate feature sets (see Fig. 9b). Another attempt for landmark-level spatial graph representations is the Graph Temporal Convolutional Network (Graph-TCN) [76]. It followed the idea of TCNs, which consist of residual convolution, dilated causal convolution, and weight normalization [128]. By using different dilation factors, TCNs were applied to convolve the elements inside one node sequence and from multiple node sequences. Thus, the TCN for a node and the TCN for an edge could be trained separately to extract the node feature and the edge feature simultaneously. Besides, [37] exploited a Semantic Correspondence Convolution module to model the correlation among its region-level spatial graph. Based on the assumption that the channels of co-occurring AUs might be activated simultaneously, the Dynamic Graph CNN (DG-CNN) [129] was applied on the edges of the constructed KNN graph to connect feature maps sharing similar visual patterns. After the aggregation function, affective features were obtained to estimate AU intensities.
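The dilated causal convolution that TCNs build on can be sketched for a single 1D sequence as follows. This is a generic illustration with zero padding on the left. The function name and inputs are assumptions, and the residual connections and weight normalization of a full TCN are omitted.

```python
def dilated_causal_conv1d(x, kernel, dilation=1):
    """Compute y[t] = sum_k kernel[k] * x[t - k * dilation], treating x[<0] as zero."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # causal: only the current and past elements contribute
                acc += w * x[idx]
        y.append(acc)
    return y


# Hypothetical node sequence and a 2-tap kernel with dilation 2.
sequence = [1, 2, 3, 4, 5]
out = dilated_causal_conv1d(sequence, kernel=[1, 1], dilation=2)
```

Increasing the dilation factor widens the receptive field without adding kernel weights, which is what allows such layers to cover elements that are far apart in a sequence.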

Multilayer perceptron networks
As a vanilla architecture, Multilayer Perceptrons (MLPs) have also been explored. [39] employed a hierarchical Auto-Encoder (AE) based on MLPs to capture relationships from a landmark-level spatial graph. Specifically, the first stage learned the texture variations from the extracted HOG features for each node. The second stage then accumulated features of multiple nodes whose appearance changes were closely related and computed the confidence scores as the triangle-wise weights over edges. Finally, a Random Forest (RF) was used for facial affect classification and AU detection simultaneously. In [98], a hybrid graph network composed of different dynamic MLPs performed multiple types of message passing, which provided more complementary information for reasoning about the positive and negative dependencies among AU nodes (see Fig. 9c).

Graph Neural Networks
Unlike conventional deep learning frameworks mentioned in Sec. 4.2, GNNs are proposed to extend the 'depth' from 2D image to graph structure and establish an end-to-end learning framework instead of additional architecture adjustment or data transformation [130]. Several types of GNNs have successfully addressed the relational reasoning of affective graph representations in FAA methods. Fig. 10 illustrates several GNN architectures for graph relational reasoning.

Graph convolutional networks
Graph Convolutional Networks (GCNs), especially the spatial GCN [131], are the most popular GNNs in graph-based FAA research. Practically, GCNs can be set as an auxiliary module [73,94] or as part of the collaborative feature learning framework [24].
Fig. 10. [...] Spectral GCN [80]; (e) GAT [88]. Zoom in for better view.
For the auxiliary module, GCNs are applied immediately after the graph representation. However, the outputs of relational reasoning are not directly used for facial affect classification or AU detection but are later combined with other deep features as a weighting factor (see Fig. 10a).
[96] employed a two-layer GCN for message passing among different nodes in its AU-level graph. The dependencies of both positive and negative samples were considered and used to infer a link condition between any two nodes. The output of the GCN was formulated as a weight matrix of the pretrained AU classifiers. Besides, GCNs can also be utilized in the above manner to execute relational reasoning on atypical graph representations, such as the multi-target graph [84], distribution graph [102], and cross-domain graph [85].
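As an illustration of the message passing performed by a GCN layer, the sketch below assumes the widely used propagation rule H' = ReLU(D^{-1/2}(A + I)D^{-1/2} H W), where A is the adjacency matrix, I adds self-loops, and D is the resulting degree matrix. The surveyed methods may use different variants; the three-node graph, the feature values, and the weight matrix are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj, features, weight):
    """One propagation step: ReLU(A_norm @ X @ W)."""
    a_norm = normalize_adjacency(adj)
    hidden = matmul(matmul(a_norm, features), weight)
    return [[max(0.0, v) for v in row] for row in hidden]


# Hypothetical 3-node graph (a chain 0 - 1 - 2), e.g. three AU nodes.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one 2-D feature per node
weight = [[1.0, -1.0], [0.5, 1.0]]                # illustrative; learned in practice

for row in gcn_layer(adj, features, weight):
    print([round(v, 3) for v in row])
```

Stacking two such layers lets each node receive information from its two-hop neighborhood, which is the sense in which a two-layer GCN passes messages among nodes.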
For the collaborative framework, GCNs usually inherit the previous node feature learning model progressively (see Fig. 10b). For example, in [79], a GCN-based multi-label encoder was proposed to update the features of each node over a region-level spatial graph representation. The reasoning process was the same as that in the auxiliary framework. Similar studies include [23] and [100]. In addition, to incorporate the dynamics in spatio-temporal graphs, [90] set GCNs as an imitation of an attention or weighting mechanism to share the most contributing features and explore the dependencies among frames. After training, the structure helped nodes update features based on messages from the peak frame and emphasize the concerned facial region. A more feasible way is to apply the Spatial Temporal GCN (STGCN) [132] on spatio-temporal graphs [40,86,87,88] (see Fig. 10c). In their relational reasoning, the features of each node were generated from its neighbor nodes in the current frame and in consecutive frames by using spatial graph convolution and temporal convolution, respectively.
Alternatively, the spectral GCN approach [133] has also been studied [109]. [80] devised a lightweight GCN following the Message Passing Neural Network [134]. A learnable adjacency matrix was adapted to infer the spatial dependencies of ROI nodes in different facial affects. [91] extended the Inception idea from standard CNNs to spectral GCNs that captured emotion dynamics at multiple temporal scales. The yielded embeddings of different dimensions were jointly learned over a classification loss and a graph learning loss for the optimal graph structure.
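The two-step aggregation described for spatio-temporal graphs (neighbors within a frame first, then consecutive frames) can be sketched as follows. This is a deliberately simplified stand-in and not the STGCN of [132]: uniform averaging replaces the learned spatial graph convolution and temporal convolution kernels, each node carries a single scalar feature, and all function names and values are hypothetical.

```python
def spatial_step(frame_feats, neighbors):
    """Average each node's feature with those of its neighbors within one frame."""
    out = []
    for node, value in enumerate(frame_feats):
        group = [value] + [frame_feats[j] for j in neighbors[node]]
        out.append(sum(group) / len(group))
    return out


def temporal_step(sequence, window=3):
    """Average each node's feature over a centered window of consecutive frames."""
    half = window // 2
    num_frames = len(sequence)
    num_nodes = len(sequence[0])
    out = []
    for t in range(num_frames):
        frames = sequence[max(0, t - half):min(num_frames, t + half + 1)]
        out.append([sum(f[n] for f in frames) / len(frames) for n in range(num_nodes)])
    return out


def st_block(sequence, neighbors, window=3):
    """Spatial aggregation in every frame, followed by temporal aggregation."""
    return temporal_step([spatial_step(f, neighbors) for f in sequence], window)


# Hypothetical clip: 3 frames, 2 connected nodes with one scalar feature each.
clip = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]]
neighbors = {0: [1], 1: [0]}
print(st_block(clip, neighbors))  # [[2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
```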

Graph attention networks
Graph Attention Networks (GATs) aim to strengthen the node connections with high contribution and offer a more flexible way to process the graph structure [135]. [97] introduced an uncertain GNN with GAT as the backbone. The goal is to select valuable edges, suppress noisy edges, and learn AU dependencies on its AU-map graph. In addition, the underlying uncertainties were considered in a probabilistic way, close to the idea of Bayesian methods in GNNs [136], to alleviate the data imbalance by weighting the loss function. On the other hand, GAT worked collaboratively with GCN in [71] to deal with two-stream graph inputs. Instead of applying GAT directly, [38] proposed a GNN that added a self-attention graph pooling layer after three sequential GCN layers. A similar block was built in [111], which revised the GCN block with channel and node attention. It improved the reasoning process on graph representations because only important nodes would be aggregated, including affective information and facial topology. To make nodes interact more dynamically instead of using a constant graph structure, [88] applied a set of learnable edge attention masks to the STGCN for subtle adjustments of the defined spatio-temporal graph representation (see Fig. 10e).
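How attention strengthens some node connections over others can be illustrated with the standard single-head GAT scoring, e_ij = LeakyReLU(a^T [h_i || h_j]), normalized by a softmax over the neighborhood of node i. The sketch below assumes the node features have already been linearly transformed, omits the output non-linearity and multi-head averaging, and uses hypothetical features and a hypothetical attention vector; it is not the architecture of any specific surveyed method.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_coefficients(h, neighbors, attn_vec, node):
    """Softmax-normalized attention of `node` over its neighbors.

    h: list of node feature vectors (assumed already linearly transformed).
    attn_vec: attention vector applied to the concatenation [h_i || h_j].
    """
    scores = []
    for j in neighbors[node]:
        concat = h[node] + h[j]  # list concatenation [h_i || h_j]
        scores.append(leaky_relu(sum(a * x for a, x in zip(attn_vec, concat))))
    max_score = max(scores)      # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return {j: e / total for j, e in zip(neighbors[node], exps)}


def gat_update(h, neighbors, attn_vec, node):
    """Attention-weighted sum of neighbor features for one node."""
    alpha = attention_coefficients(h, neighbors, attn_vec, node)
    return [sum(alpha[j] * h[j][d] for j in alpha) for d in range(len(h[node]))]


# Hypothetical 3-node graph; each neighborhood includes the node itself.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [0, 1, 2], 1: [0, 1], 2: [0, 2]}
attn_vec = [0.1, 0.2, 0.3, 0.4]  # illustrative; learned in practice

print(attention_coefficients(h, neighbors, attn_vec, 0))
print(gat_update(h, neighbors, attn_vec, 0))
```

Because the coefficients are computed per edge, two neighbors of the same node can contribute very differently, in contrast to the fixed normalization used by a plain GCN layer.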

Non-deep Machine Learning Methods
Although refining deep features extracted by parameterized neural networks and gradient-based methods is the mainstream, these models require numerous training samples for effective learning. Due to insufficient data in the early years or the need for efficient computation, many non-deep machine learning techniques have been applied to affective graph relational reasoning. Graph structure learning is one of the widely used approaches. In [72], the reasoning of its spatial graph representation was conducted by LT learning. Parameter updates and graph edits of the LT structure were performed iteratively to maximize the marginal log-likelihood of a set of training data. [99] employed a CRF to infer AU dependencies in an AU-map graph. The use of copula functions allowed it to model non-linear dependencies among nodes easily. At the same time, an iterative balanced batch learning strategy was introduced to optimize the most representative graph structure by updating each set of parameters with batches. Approaches of graph feature selection are also exploited in this part, such as Graph Sparse Coding (GSC) [81,101] and Elastic Graph Matching (EGM) [83]. These methods have provided more diverse concepts for graph relational reasoning.

Discussion
Although all the methods above can achieve affective graph relational reasoning, the choice has a causal relationship with the type of graph representation (see Table 2).
Dynamic Bayesian network: Nearly half of AU-label graph representations employ DBNs as their relational reasoning model. However, the representation quality relies heavily on the available training data, which need a balanced label distribution across positive-negative samples and categories. This strong assumption limits the effectiveness of node dependencies learned by DBNs. Another problem is that DBNs can only be combined with facial features as a relatively independent module and are hard to integrate into an end-to-end learning framework.
Classical deep model: Standard deep models, including CNNs, RNNs, and MLPs, were explored for graph relational reasoning before the emergence of GNNs. Even though they suit more graph representations than DBNs, these grid models focus more on local features. The additional adjustments in input format and/or network architecture cause losses of node information or let node messages pass and update only in a specific sequence, which suppresses the global property represented by the graph. Thus, we think that specifically designed networks like GNNs will become dominant in this part.
Graph neural network: GNNs are developing techniques that take full advantage of graphs. Architectures with different focuses have been proposed, but each has its flaws. For instance, GCNs cannot handle directed edges well (e.g., AU-level graphs), while GATs only use the node links without considering edge attributes (e.g., spatial graphs). Besides, due to the low dimension of the nodes in affective graphs, GNNs that are too deep may be counterproductive. In addition, whether a GNN serves as an auxiliary block or as part of the whole framework will influence its construction. Therefore, managing graph representation and relational reasoning using GNNs still needs to be explored.
Non-deep methods: Non-deep machine learning has a place in early studies and is even applied in recent work because no training is required. These methods partly inspire advanced techniques like DBNs and GNNs. Nevertheless, one of the reasons they have been replaced is that they need to be designed separately to cope with different graph representations, similar to hand-crafted feature extraction. Hence, it is not easy to form a general framework. On the other hand, more training data and richer computing resources allow deep models to perform more effective and higher-level relational reasoning on affective graphs.

APPLICATIONS AND PERFORMANCE
According to different description models of facial affects, FAA can be subdivided into multiple applications. The typical output of FAA systems is the label of a basic facial affect or AUs. Recent research also extends the goal to predicting micro-expressions, affective intensity labels, or continuous affects. This section compares and discusses graph-based FAA methods from four main application categories: facial expression recognition, AU detection, micro-expression recognition, and a few special applications. Due to page limitations, we select the most relevant and representative papers according to the following criteria: published in well-known venues in the past five years, or belonging to distinct branches of graph representation and reasoning for diversity consideration.

Databases
Most FAA studies use public databases of facial affect as validation material. A comprehensive overview is presented in Table 3. The characteristics of these databases are listed from four aspects: samples, attributes, graph-related properties, and certain contents. Fig. 11 exhibits several examples of facial affects under different conditions. In addition, to better interpret graph-based FAA, we summarize the corresponding elements (e.g., landmark coordinates, AU labels) carried by the databases themselves, which are rarely considered in previous related surveys.
Another type of database is for micro-expressions. Participants are required to keep a neutral face while watching videos associated with the induction of specific affects [147]. Following this setting, the Spontaneous Micro Facial Expression Database (SMIC) [148], the Improved Chinese Academy of Sciences Micro-Expression Database (CASME II) [149], the Spontaneous Micro-Facial Movement Database (SAMM) [150], and the Chinese Academy of Sciences Macro-Expression and Micro-Expression Database (CAS(ME)²) [151] have been released. However, it is hard to collect and annotate large-scale micro-expression data in uncontrolled scenarios due to the subtle, rapid, and involuntary nature of micro-expressions.
Concerning graph-based FAA, it is possible to find and select suitable databases with corresponding metadata, such as landmarks, AU labels, and dynamics, for different graph representation purposes. However, existing databases also have some shortcomings. First, in-the-wild databases do not provide enough accurate AU annotations, limiting the role of AUs in FAA. Second, there is still a gap in the field of dynamic large-scale affective databases, so it is hard to use temporal information to generate affective graph representations. Finally, databases of natural and spontaneous facial affects in a continuous domain, rather than in discrete categories, need more attention.

Facial Expression Recognition
Facial expression recognition (FER), or macro-expression recognition, focuses on classifying basic facial affects. An inevitable trend in FER is that the research focus has shifted from the early posed facial affects in controlled conditions to the recent spontaneous facial affects in real scenarios. In other words, recognizing the former is considered a solved problem for FAA methods, including graph-based FER, which is corroborated by the results in Table 4. For example, the performance on the CK+ database is very close to 100% [78,89,90,111].
Although many graph-based studies have shown improvements in predicting facial affects, FER still has some open topics. One is that the goal of existing methods remains classifying basic facial affects. No graph-based study has been reported that recognizes compound affects (or mixture affects), whose labels are provided by recent databases like RAF-DB and EmotioNet. One possible solution is introducing AU-level graph representations that can describe fine-grained macro-expressions with closer inter-class distances. The other topic is practical graph-based representations, given the big gap between the performance of current methods and an acceptable result in practice when analyzing in-the-wild facial affects. In addition, since existing databases lack sufficient dynamic annotated samples, the evaluation of spatio-temporal graphs in large-scale conditions remains unexplored.

Action Unit Detection
AU detection (AUD) facilitates a comprehensive analysis of the facial affect and is typically formulated as a multi-task problem that learns a two-class classification model for each AU. It can expand the recognition categories of macro-expressions through AU combinations [72] and can be used as a pre-step to enhance the recognition of micro-expressions [94]. Compared with graph-based FER, the use of graph structures in AUD has a longer history, dating back to [33], and has played a more dominant role. Table 5 summarizes graph-based AUD methods, including the performance comparison.
Specifically, spatial graphs and AU-level graphs are equally popular in the representation part of AUD. Interestingly, whether landmark-level or region-level, all the spatial graphs constructed in the listed AUD methods employed facial landmarks [23,35,37,39,72,79], even for the spatio-temporal graph [88]. The possible reason is that the landmark information is helpful and practical for locating the facial areas where AUs may occur. In this setting, their node representations were close to those in the spatial graphs of FER methods, which usually combined geometric coordinates with appearance features (e.g., HOG [35,39]). Although some AUD methods using AU-level graphs also exploited traditional features (e.g., Gabor [92,93], LBP [41]) or deep features (e.g., VGG [95]) to introduce appearance information, their graph representations were initialized from the AU label distribution of the training set. Thus, the DBN model has become popular in the relational reasoning stage [41,92,93]. Another trend similar to graph-based FER is that GNNs have been widely utilized to learn the latent dependency among individual AUs in recent studies, such as GCN [23,79,88,96], GAT [97], GGNN [22], and DG-CNN [37]. The difference is that fully-connected (FC) layers [22,35,79,96] or regression models [23,37,99] are often applied for predicting labels instead of a softmax classifier [88,97].
A particular line of AUD research analyzes facial affects by estimating AU intensities, which could have greater information value in understanding complex affective states [161]. Even though a few attempts at estimating AU intensities based on graph structures exist [37,72,95], no study using the latest spatio-temporal graph representations and GNNs has been reported. Another big challenge in AUD is few and imbalanced samples. Recent graph-based methods using transfer learning [41,96] or uncertainty learning [97] were proposed to address this problem. They showed an advantage of graph-based methods in this topic and are helpful for implementing AUD on large-scale unlabeled data.
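One way to picture the initialization of an AU-level graph from the AU label distribution of the training set is to estimate conditional co-occurrence probabilities and keep the strong ones as edges. The sketch below is an assumed, generic strategy rather than the procedure of any cited method: the function `cooccurrence_graph`, the threshold of 0.5, and the toy label matrix are all hypothetical.

```python
def cooccurrence_graph(labels, threshold=0.5):
    """Estimate P(AU_j = 1 | AU_i = 1) from binary labels and threshold into edges.

    labels: list of samples, each a list of 0/1 AU activations.
    Returns the probability matrix and the directed edges (i, j) kept.
    """
    num_aus = len(labels[0])
    count = [sum(row[i] for row in labels) for i in range(num_aus)]
    prob = [[0.0] * num_aus for _ in range(num_aus)]
    for i in range(num_aus):
        for j in range(num_aus):
            if i != j and count[i] > 0:
                joint = sum(1 for row in labels if row[i] and row[j])
                prob[i][j] = joint / count[i]
    edges = [(i, j) for i in range(num_aus) for j in range(num_aus)
             if prob[i][j] >= threshold]
    return prob, edges


# Hypothetical training labels: 4 samples, 3 AUs.
train_labels = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
]
prob, edges = cooccurrence_graph(train_labels)
print(edges)  # [(0, 1), (1, 0), (2, 0)]
```

Since P(AU_j | AU_i) generally differs from P(AU_i | AU_j), the resulting graph is directed, and its quality depends directly on how balanced the training labels are.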

Micro-Expression Recognition
Micro-expressions are fleeting and involuntary facial affects that people usually exhibit in high-stakes situations when attempting to conceal or mask their true feelings [147]. The earliest well-known studies came from [162] as well as [163]. Generally, a micro-expression lasts only 1/25 to 1/2 of a second and is too subtle and fleeting for an untrained person to perceive. Therefore, developing an automatic micro-expression recognition (MER) system is valuable for reading hidden human affective states. Besides the short duration, the low intensity and localized nature of micro-expressions also make the task challenging.
To this end, graph-based MER methods have been designed to address the above challenges and have become appealing in the past two years [81], especially in 2020 [38,76]. Table 6 lists the reported performance of a few representative recent studies of graph-based MER. These methods fall into the landmark-level spatial graph [76,81] and the AU-level graph [38] in terms of representation types. For the former, the idea is to use landmarks to locate and analyze specific facial areas to deal with the local response and the subtleness of micro-expressions. The latter aims to infer the AU relationship to improve the final performance. The difference in processing ideas is also reflected in the reasoning procedure. Approaches like GSC [81] and variant CNNs [76] are exploited in the landmark-level graph to integrate the individual node feature representations. In comparison, GCNs are employed to learn an optimal graph structure of the AU dependency knowledge from training data and make predictions. Nevertheless, one thing in common is that all the methods consider the local appearance in a spatio-temporal way by using optical flow or DNNs.
A problem in graph-based MER is the lack of large-scale in-the-wild data. The small sample size limits the AU-level graph representation, which relies on initializing the AU relationship from the AU label distribution of the training set. The lab-controlled data make it difficult to follow the trend in FER studies, which generalizes graph-based FAA methods to real-world scenarios. However, the analysis of uncontrolled micro-expressions is fundamental because micro-expressions and macro-expressions can co-occur in many real cases. For example, a slight and quick furrowing of the forehead while smiling indicates the true feeling [163]. Since the evolutionary appearance information is crucial for micro-expression analysis, building a spatio-temporal graph representation that can model the duration and the dynamics of micro-expressions is also a helpful but unexplored topic.
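The practical consequence of the 1/25 to 1/2 second duration can be seen from a short calculation of how many video frames a micro-expression spans. The frame rates below are hypothetical examples chosen for illustration and are not taken from any particular database; exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction


def frames_spanned(duration_s, fps):
    """Number of whole frames captured within a duration at a given frame rate."""
    return int(duration_s * fps)


shortest = Fraction(1, 25)  # seconds
longest = Fraction(1, 2)    # seconds

for fps in (25, 100, 200):  # hypothetical camera frame rates
    print(f"{fps:>3} fps: {frames_spanned(shortest, fps)} to "
          f"{frames_spanned(longest, fps)} frames")
```

At 25 frames per second the shortest micro-expression occupies a single frame, which leaves almost no temporal signal for a spatio-temporal representation, whereas higher frame rates provide several frames per event.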

Special Tasks
Graph-based methods also play a vital role in several special FAA tasks, such as pain detection [72], non-basic affect recognition [84,102], occluded FER [39,40], and multi-modal affect recognition [87,100]. Table 7 summarizes the latest graph-based FAA methods for special tasks. Their node representations and edge initialization strategies for graph construction are similar to those in graph-based FER, MER, and AUD methods. For the reasoning step, GCN is the most common choice. This observation implies that the framework of the graph-based method discussed in this paper can be easily extended to many other FAA tasks and promote performance improvement.

OPEN DIRECTIONS
Graph-based FAA methods have been dissected into fundamental components for elaboration and discussion in this review. When encoding facial affect into graphs, strategies vary according to node and edge elements. Relational reasoning approaches infer latent relationships or inherent dependencies of graph nodes in terms of space, time, and semantics. The category of graph representations will affect the technique choice of relational reasoning to a certain extent.
Despite significant advances and numerous works, graph-based FAA is still an appealing field with many open directions. Due to their advantages in modeling and reasoning about latent relationships of facial affects, graph-based methods may provide complementary information to help solve some challenges that non-graph-based approaches face. Also, the graph-based method has natural advantages or unexplored research potential in other topics.

In-the-wild Scenarios
Although many efforts have been made for graph-based FAA in natural conditions [35,38,39,40,41,42,82,96], even the state-of-the-art performance is far from what actual applications require. Factors like illumination, head pose, and partial occlusion make it challenging to construct an effective graph representation. For one thing, significant illumination changes and head pose variations will impair the accuracy of face detection and registration, which is vital for establishing landmark-level graphs. ROI graphs without landmarks or NPI graphs [82,85] are a possible direction to avoid this problem. Also, missing face parts resulting from camera view or context occlusion make it challenging to encode enough facial information and obtain meaningful connections in an affective graph. Pilot work [39,40] has tried to exploit a sub-graph without masked facial parts or generate adaptive edge links to alleviate the influence. Unfortunately, there is still a considerable performance decrease compared to normal conditions. More effective spatio-temporal graphs may account for these problems based on evolutional affective information.
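The idea of exploiting a sub-graph without masked facial parts can be sketched as taking the induced sub-graph over the visible nodes. This is a generic illustration under assumed inputs, not the specific procedure of [39,40]: the node names, the edge list, and the visibility set are hypothetical, and in practice visibility would come from an occlusion detector.

```python
def visible_subgraph(edges, visible):
    """Keep only edges whose two endpoints are both visible (induced sub-graph)."""
    visible = set(visible)
    return [(u, v) for u, v in edges if u in visible and v in visible]


# Hypothetical region-level facial graph.
face_edges = [
    ("left_eye", "right_eye"),
    ("left_eye", "nose"),
    ("right_eye", "nose"),
    ("nose", "mouth"),
]

# Assume the mouth region is occluded (e.g., by a hand or a mask).
visible_nodes = ["left_eye", "right_eye", "nose"]
print(visible_subgraph(face_edges, visible_nodes))
```

Dropping nodes in this way also removes every relation that involved the occluded region, which is one reason a performance decrease under occlusion is to be expected.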

3D and 4D Facial Affects
Using 3D and 4D face images might be another good topic because the 3D face shape provides additional depth information and dynamically contains subtle facial deformations. Such data are intuitively insensitive to pose and light changes. Some studies have transformed 3D faces into 2D images and generated graph representations [22,37,39,95], but they have not taken full advantage of the 3D data. Alternatively, non-graph-based [170] and graph-based methods [77] have been explored to conduct FAA directly on 3D or 4D faces. Since the 3D face mesh structure is naturally close to the graph structure, employing graph representation and reasoning to handle 3D face images will promote the improvement of in-the-wild FAA. Besides, using 3D and 4D data with graph-based methods, especially landmark-level graphs and GNNs, is also a potential topic in micro-expression recognition.

Valence and Arousal
Estimating continuous dimensions is a rising topic in FAA. Unlike discrete labels, Valence-Arousal (V-A) annotations describe a wider range of facial affects that are consistent with those in the real world. Large-scale FAA databases (Aff-Wild I [14], II [171]) containing V-A annotations have been released to support continuous FAA. Existing graph-based methods mainly perform the V-A measurement [37,72,99] on lab-controlled databases, except for a few studies like [84]. Recent graph-based methods have studied multi-label learning according to intrinsic mappings between facial affect categories and other annotations [22,41]. Such underlying assumptions can also be extended to the V-A measurement task, where AU-level graphs and DBNs, as well as sample-level graphs, are potential directions.

Context and Multi-modality
Most current FAA methods only consider a single face in one image or sequence. However, in real cases, people usually show affective behaviors that include facial expressions, body gestures, and emotional speech [172]. These facial affective displays are highly associated with the surrounding context, which includes but is not limited to the affective behavior of other people in social interactions or inanimate objects. Existing studies like [84] and [85] have employed graph reasoning to infer relationships between the target face and other objects in the same image. Facial affects and other helpful contexts, such as gesture [173,174], can be combined in a graph representation to perform the analysis on a fuller scope. Another valuable topic is to introduce additional data channels, that is, multi-modality. Sample-level and spatio-temporal graphs have also been successfully extended to process multi-modal affect analysis tasks with audio [87] and physiological signals [100], respectively, which shows a good research prospect.

Cross-database and Transfer Learning
Insufficient annotations and imbalanced labels are two problems that limit the development of FAA research.One possible solution is to use graph-based transfer learning.
Efforts like [41,96,97] have exploited the graph structure to address this challenge through semi-supervision, label correction, generation, or uncertainty measurement. On the other hand, the performance of affective features extracted using graph-based representation and reasoning has been demonstrated through cross-database validation in FER [42,85], AUD [41,92], MER [38], and cross-corpus analysis [100,101]. Specifically, the strength of distribution modeling of AU-level and sample-level graphs is valuable in improving the generalization capability of affective features.

Fig. 1. The growth trend of papers related to graph-based FAA.

Fig. 8. AU-level and Sample-level graph representations. (a) AU-label graph with edges generated from training data [92]; (b) AU-map graph with FACS-based edges [22]; (c) Auxiliary graphs of AUs and landmarks [42]; (d) Sample-level multi-modal graph of visual and physiological signals [100]. Zoom in for better view.

Fig. 11. Facial affect databases. (a) Oulu-CASIA contains posed facial affects; (b) SFEW 2.0 has facial affects under in-the-wild scenarios; (c) BP4D provides 3D affective face images; (d) EMOTIC contains multiple faces per image with VAD annotations; (e) SMIC collects images of spontaneous micro facial affects in visible light and near-infrared light; (f) DISFA offers frame-level AU intensity labels.

TABLE 1
An Overview of Affective Graph Representations

[...] graphs. Since most NPI graphs utilize a region searching strategy, the problem is how to avoid the loss of target face and how to exclude invalid regions. Spatio-temporal graph representations: [...]

TABLE 2
Causal Relationships between Graphs and Reasoning methods

TABLE 4
Performance summary of representative graph-based FER methods