Adaptive Multilayer Perceptual Attention Network for Facial Expression Recognition

In complex real-world situations, problems such as illumination changes, facial occlusion, and variant poses make facial expression recognition (FER) a challenging task. To address this robustness problem, this paper proposes an adaptive multilayer perceptual attention network (AMP-Net) that is inspired by facial attributes and the facial perception mechanism of the human visual system. AMP-Net extracts global, local, and salient facial emotional features at different levels of granularity to learn the underlying diversity and key information of facial emotions. Different from existing methods, AMP-Net can adaptively guide the network to focus on multiple finer and distinguishable local patches with robustness to occlusion and variant poses, improving the effectiveness of learning potentially diverse facial information. In addition, the proposed global perception module can learn features with different receptive fields in the global perception domain, and AMP-Net also supplements salient facial region features with high emotion correlation based on prior knowledge to capture key texture details and avoid the loss of important information. Extensive experiments show that AMP-Net achieves good generalisability and state-of-the-art results on several real-world datasets, including RAF-DB, AffectNet-7, AffectNet-8, SFEW 2.0, FER-2013, and FED-RO, with accuracies of 89.25%, 64.54%, 61.74%, 61.17%, 74.48%, and 71.75%, respectively. All code and training logs are publicly available at https://github.com/liuhw01/AMP-Net.

As an explicit feature, facial expressions can convey information about emotions and intentions due to their adaptability and communicability [3]. With advances in computer vision, facial expression recognition (FER) can capture the emotions of target objects and has been widely used in human-computer interaction (HCI) [4], medical diagnosis [5], and other fields.
Currently, FER achieves excellent recognition results on data collected in controlled laboratory environments, such as CK+ [6], JAFFE [7], and MMI [8]. However, the complexity and variability of real-world scenarios, such as illumination variation, face occlusion, pose variation, and other uncontrollable factors, increase recognition difficulty. Although researchers are working to increase the diversity of real-world datasets [9], [10] to improve model versatility, occlusion and variant poses markedly change facial visual appearances, resulting in inaccurate feature location, imprecise face alignment, or inefficient feature extraction, which keep FER challenging. Traditional methods consider a face as a whole [11] and solve the FER problem by optimising a loss function [12], [13] or synthesising facial expressions [14], [15] to improve generalisability. However, these methods pay less attention to the potentially diverse emotional information provided by facial details, and irregular faces caused by occlusion and variant poses also strongly affect the model's ability to extract features.
Recent studies have shown that different facial areas display diverse emotional information [16], and extracting different fine-grained features of global and local faces can mine potential key information [17] and effectively deal with the information loss caused by occlusion and variant poses. Therefore, current research focuses on solving the FER problem in real-world situations using global and local patch methods [18], [19], [20], [21]. These studies primarily include landmark-based patches [19], [20] and image-based patches [18], [21]. Landmark-based methods can better locate the facial muscle movement subregions related to emotional expression. Li et al. [20] proposed perceiving facially occluded areas based on the patches obtained from the regions of interest of 24 facial landmarks. Wang et al. [19] constructed a weighted mask based on 68 facial landmarks to capture global and local facial information. However, the excessive demand for facial landmarks relies heavily on reliable and accurate face detection and landmark tracking, and occlusion may lead to the incorrect positioning of certain landmarks, as shown in Fig. 1(a). Image-based patches segment images into different regions at the image level to mine potential attributes. Zhao et al. [18] proposed dividing the feature map into four nonoverlapping local regions to eliminate the instability of multilandmark-based methods. Li et al. [21] proposed focusing on the identifiable area using a sliding window. However, image-based methods lack adaptability to variant poses, and the same face region may be assigned to different patches under variant poses, as shown in Fig. 1(b), which reduces the model's ability to learn details of irregular faces. These robustness problems limit the performance of FER in real-world situations.
Cognitive science and psychological research have shown that the human facial perception mechanism proceeds from coarse to fine. Under partial occlusion, facial symmetry makes it possible to capture similar emotional information for the corresponding occluded area during emotional expression [22]. The eye and mouth regions convey more emotional information due to marked local muscle changes [23], and the perceiver's visual system also focuses more on the eye and mouth areas [24], [25]. Therefore, to improve the robustness and performance of FER in real-world situations, we propose an adaptive multilayer perceptual attention network (AMP-Net) that is inspired by facial attributes and the facial perception mechanism of the human visual system. AMP-Net extracts different fine-grained emotional features from global, local, and salient facial regions to learn the diversity and key information of facial emotions in real-world scenarios (see Fig. 2). AMP-Net has three modules with different perception domains. The proposed local perception module (LP module) is robust to occlusion and variant poses and can thus guide the network to focus on multiple finer and distinguishable local patches based on facial attributes and learn diverse potential information through the adaptive local region method and attention blocks. The LP module achieves a reasonable distribution of patches under variant poses, and the obtained local patches also exhibit facial symmetry, which can provide similar information for occluded parts. Meanwhile, in the global perception module (GP module), the proposed gate one-shot aggregation (gate-OSA) block enhances features with different receptive fields in the global perceptual domain. In addition, to avoid information loss caused by inaccurate positioning of key regions by the model, we use an attention perception module (AP module) to supplement the key texture details of the eye and mouth regions, which have high emotional correlation based on prior knowledge, to learn the differences in facial expressions. Therefore, the robustness and effectiveness against occlusion and variant poses can be increased through different levels of perceptual fields.
The contributions of this study include the following:
• We propose an adaptive multilayer perceptual attention network (AMP-Net) based on facial attributes and the facial perception mechanism that can adaptively capture diversity and key information from global, local, and salient facial regions to improve the robustness of FER in real-world situations.
• We design a local perception module that is robust to occlusion and variant poses and can effectively extract potential information from different facial regions.
• A global perception module is designed to obtain features with different receptive fields, and an attention perception module supplements salient emotional features based on prior knowledge.
• Extensive experimental results on real-world datasets (i.e., RAF-DB, AffectNet, SFEW 2.0, FER-2013, and FED-RO) demonstrate that AMP-Net achieves state-of-the-art expression recognition performance. In particular, we also perform experiments on occlusion and variant-pose datasets to show the improved robustness of the proposed method.

II. RELATED WORK
In this section, we mainly present related work on FER from the perspectives of human vision and computer vision, aiming to inspire new FER methods by understanding the human visual perception mechanism under normal, occluded, and variant-pose conditions before reviewing computer vision techniques for the same situations.

A. Human Vision FER
The human visual system can quickly capture others' emotions even in complex environments. Therefore, researching new recognition methods based on human visual perception mechanisms is an important way to improve FER performance under occlusion and variant poses.
For visual perception under occlusion, Halliday [26] conducted an experiment on determining emotions in static facial images with four different occluded areas (forehead and eyebrows, nose and cheeks, eyes, and mouth), which showed that subjects could accurately identify emotions from limited information; additionally, the mouth and eyes were the two most critical areas for identifying real emotions. Wang et al. [25] compared the effects of different masks, such as simulated sunglasses, on facial expression decoding and found that the ability to decode facial emotions considerably decreased when the eyes and mouth were occluded by the masks. Yan et al. [27] occluded the upper and lower facial areas and found that anger, fear, and sadness were easier to recognise from the upper facial area, while disgust and happiness were easier to recognise from the lower facial area.
For visual perception under variant poses, Busin et al. [28] observed an asymmetry in human emotion recognition when perceivers viewed the 45° left and right sides of the expresser's face: the right face required more fixations than the left face. In addition, related studies [29], [30] have shown that emotional expression on the left face is more active than that on the right face.
For visual perception under normal faces, Roberson et al. [22] quantified the visual system with a bubbles-based experiment and found that individual differences were primarily evident in the prediction of the eye area, and that only the vertical information of an open mouth was shown to be effective. Jack et al. [31] studied the eye movements of observers from different cultural backgrounds in the FER task and found that, although human perception and judgement of facial expressions differ due to experience and environmental factors, observers were more inclined to focus on the eye and mouth regions. Studies [32], [33] have also shown correlations of upper facial features with fear, sadness, and anger, and of lower facial features with surprise, disgust, happiness, and neutrality. When recognising sadness, primarily the eyes, eyebrows, and mouth provide useful information [34]. For fear recognition, people primarily fixate on the eyes, and the mouth region can provide additional information [35]. Therefore, based on this prior knowledge of human visual FER, we propose the AMP-Net method to acquire different fine-grained facial features of global, local, and salient regions to improve the effectiveness of FER in real-world situations.

B. Computer Vision FER
Occlusion and variant poses are two key FER issues in real-world scenarios. A face is likely to be occluded by sunglasses, hats, scarves, and other objects, which markedly changes the facial visual appearance, and variant poses lead to partial information loss and inaccurate positioning. Previous methods for dealing with occlusion and variant poses can be categorised into two groups: holistic-based methods and patch-based methods.
Holistic-based methods treat the face as a whole and typically solve occlusion and variant-pose issues through feature reconstruction of geometry [36] or texture [37], improvement of the loss function [12], [38], or synthesis of facial expressions [14], [15]. Zhang et al. [36] combined the iterative closest point (ICP) algorithm and fuzzy C-means to construct a facial point detector and reconstructed 54 facial points under occlusion and variant poses. Xie et al. [12] proposed a new triplet loss based on class-pair margins and multistage outlier suppression to enhance the interclass separability and intraclass compactness of network features. Zhang et al. [14] performed facial expression recognition and facial image synthesis simultaneously based on a generative adversarial network (GAN) to ease the overfitting problem in the FER task. However, these methods pay less attention to the potentially diverse emotional information provided by facial details, and irregular facial images caused by occlusion and variant poses also markedly affect the performance of global facial feature extraction.
Patch-based methods extract facial subregions as regions of interest and assign them different attention weights; they primarily include landmark-based [19], [20], [39], [40] and image-based [18], [21], [41] methods. Zhang et al. [40] proposed a Gabor-based finite element template for FER analysis under occlusion of the eyes, mouth, and glasses, as well as randomly placed blocks. In a recent study, Li et al. [20] perceived occlusion regions through a convolutional neural network (CNN) with an attention mechanism: gACNN focused on the global facial representation, and pACNN handled the occlusion problem in the regions of interest of 24 facial landmarks through an occlusion attention mechanism. Wang et al. [39] proposed a region attention network (RAN) that uses fixed-position cropping, random cropping, and landmark-based cropping to capture important facial patches to address occlusion and variant poses. Zhao et al. [18] proposed a global multiscale and local attention network (MA-Net) for visual information processing in real-world environments and evenly divided facial images into four blocks to guide the network to focus on local salient features. However, excessive facial landmark requirements may lead to inaccurate landmark detection, and image-based methods lack adaptability to variant poses, which limits their ability to mine facial details. Different from these methods, the proposed AMP-Net is robust to occlusion and variant poses and requires fewer facial landmarks; it can adaptively acquire finer and distinguishable local regions under variant poses and supplement global and key region information to obtain potential features.

III. PROPOSED METHOD

A. Overview
We propose an adaptive multilayer perceptual attention network (AMP-Net) to address occlusion and variant-pose issues. As shown in Fig. 2, AMP-Net consists of three components: the GP module, the LP module, and the AP module. The network takes a facial image as input. First, the conv1 to conv3 layers of ResNet-34 [42] serve as the feature pre-extractor and output 128 × 28 × 28 feature maps. Then, the feature maps are fed into the three branch modules to extract features in different perceptual fields, and the FER results are obtained by fusing the features at the feature level and the decision level.
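To make the overall data flow concrete, the following minimal PyTorch sketch reproduces the three-branch layout and tensor shapes described above. The GP, LP, and AP branches are replaced here by simple placeholder heads (the real modules are detailed in the following subsections), so this is a structural illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class AMPNetSketch(nn.Module):
    def __init__(self, num_classes=7):
        super().__init__()
        resnet = models.resnet34(weights=None)
        # conv1-conv3 of ResNet-34 as the shared pre-extractor (128 x 28 x 28 for a 224 x 224 input)
        self.pre_extractor = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2)
        # Placeholder branches: each maps the shared feature map to a 512-d vector.
        def branch():
            return nn.Sequential(nn.Conv2d(128, 512, 1), nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.gp, self.lp, self.ap = branch(), branch(), branch()
        self.fc_la = nn.Linear(512 + 512, num_classes)  # feature-level fusion of LP and AP
        self.fc_g = nn.Linear(512, num_classes)         # GP-branch classifier

    def forward(self, x):
        f = self.pre_extractor(x)                                   # B x 128 x 28 x 28
        z_g = self.fc_g(self.gp(f))                                 # global-branch logits
        z_la = self.fc_la(torch.cat([self.lp(f), self.ap(f)], 1))   # fused LP + AP logits
        return z_g, z_la                                            # combined at the decision level

x = torch.randn(2, 3, 224, 224)
print([t.shape for t in AMPNetSketch()(x)])  # [torch.Size([2, 7]), torch.Size([2, 7])]
```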

B. GP Module
The GP module aims to learn deeper global facial features in different receptive fields within the global perceptual domain. The one-shot aggregation (OSA) block [43], as a variant of DenseNet [44], aggregates all previous layers into the last layer in a relatively sparse manner and obtains rich receptive field information through feature reuse, which can effectively reduce feature redundancy caused by heavy dense connections in the DenseNet network.
To enhance available features, we design the gate-OSA block, as shown in Fig. 3(a). Each 3 × 3 convolutional layer in the gate-OSA block is connected to a gating mechanism (gate) to learn the channel correlation, as shown in Fig. 3(b). The gating mechanism compresses the output of the convolutional layer along the spatial dimensions through an Avgpool layer, and the result is transformed by a fully connected (FC) layer with a sigmoid activation function α to derive the weight of each channel. Multiplying the output of the convolutional layer by these channel weights gives channels with higher correlation a higher weight and suppresses channels with lower correlation. The gate layer can be formulated as

F_G = x_g ⊗ α(FC(AvgPool(x_g))),

where x_g is the input of the gate layer, F_G is the output of the gate layer, and ⊗ denotes element-wise multiplication. Each gate layer in the gate-OSA block has two types of connections. The first obtains feature information with a larger receptive field through the alternating series connection of n 3 × 3 convolutional layers with gate layers; the output of each convolution and gate layer is a C_1 × W × H feature map, where C_1 is the number of channels. The second connects the output of each gate layer to the last output layer to obtain a C_2 × W × H feature map, where C_2 = 128 + C_1 × n, thereby aggregating feature information from different receptive fields. A 1 × 1 convolution then reduces the dimension of the C_2 × W × H feature map to C_3 × W × H so that the model can train a deeper network, and an SE block [45] is added to further enhance the features. Finally, the input of the gate-OSA block is added to the output through a downsample layer to form a short-circuit connection, and information loss is reduced through the feature multiplexing of multilayer receptive field information.
In the proposed network, each gate-OSA block uses a serial connection of n = 3 gate layers. The GP module has three gate-OSA blocks connected in series, with C_1 ∈ {128, 144, 160} and C_3 ∈ {256, 384, 512}. The GP module receives a 128 × 28 × 28 feature map as input and outputs a 512 × 7 × 7 feature map through the three gate-OSA blocks, where each gate-OSA block halves the feature map size. Finally, a global average pooling (GAP) layer is connected to obtain a 512-dimensional global perception feature vector. The GP module obtains global information with different receptive fields through the feature multiplexing of multiple convolutional layers and improves the performance of FER feature extraction in the global scope. A comparison of these settings is given in the ablation experiments in Section IV-C.
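Under the settings above, a gate layer and gate-OSA block could be sketched as follows in PyTorch. The normalisation layers, the exact positions of the 2× spatial downsamplings, and the form of the residual projection are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class GateLayer(nn.Module):
    """Channel gate: global average pooling -> FC -> sigmoid, then reweight the input channels."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, x):
        w = torch.sigmoid(self.fc(x.mean(dim=(2, 3))))   # per-channel weights alpha(FC(AvgPool(x)))
        return x * w[:, :, None, None]                   # element-wise channel reweighting

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation block applied after the 1x1 reduction."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x.mean(dim=(2, 3)))[:, :, None, None]

class GateOSABlock(nn.Module):
    """Sketch of a gate-OSA block: n serial (3x3 conv + gate) stages whose outputs are aggregated
    with the block input (C2 = in_ch + n * mid_ch), reduced by a 1x1 conv to out_ch, enhanced by SE,
    and added to a 1x1-projected shortcut of the input."""
    def __init__(self, in_ch, mid_ch, out_ch, n=3):
        super().__init__()
        self.stages = nn.ModuleList()
        ch = in_ch
        for _ in range(n):
            self.stages.append(nn.Sequential(
                nn.Conv2d(ch, mid_ch, 3, padding=1, bias=False),
                nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
                GateLayer(mid_ch)))
            ch = mid_ch
        self.reduce = nn.Conv2d(in_ch + n * mid_ch, out_ch, 1)
        self.se = SEBlock(out_ch)
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1)       # projection for the short-circuit connection

    def forward(self, x):
        feats, y = [x], x
        for stage in self.stages:
            y = stage(y)
            feats.append(y)                               # keep the output of every gate layer
        out = self.se(self.reduce(torch.cat(feats, dim=1)))
        return out + self.shortcut(x)

# GP module sketch with the settings from the text (n = 3, C1 in {128, 144, 160}, C3 in {256, 384, 512}).
# Where the 28 -> 7 spatial downsampling happens is an assumption.
gp_module = nn.Sequential(
    GateOSABlock(128, 128, 256), nn.MaxPool2d(2),
    GateOSABlock(256, 144, 384), nn.MaxPool2d(2),
    GateOSABlock(384, 160, 512),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())               # 512-d global perception feature

print(gp_module(torch.randn(2, 128, 28, 28)).shape)      # torch.Size([2, 512])
```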

C. LP Module
Local facial information can provide more robust emotional features under occlusion and variant poses, and the selection of local patches strongly affects the model's ability to mine potential features. We propose an LP module based on facial attributes that is robust to occlusion and variant poses and can adaptively guide the network to focus on multiple finer and distinguishable local patches, improving the ability to learn potentially diverse facial emotions while eliminating the incorrect positioning that may be caused by relying on many landmarks and the low adaptability of image-based patch methods.
This design is based on the following knowledge: the upper and lower facial parts convey different emotional information [23], and facial symmetry can provide similar feature information for an occluded area [22]. Therefore, in this study, the LP module first uses pose-based division (PBD) and location-based padding (LBP) methods to divide the face into four subregions with facial symmetry: upper left, upper right, lower left, and lower right. The facial organs, such as the eyes and mouth, are then allocated to the corresponding subregions to ensure the effectiveness of local region allocation under occlusion and variant poses. Local patches with facial symmetry can also provide similar emotional information for occluded parts, reducing the impact of a partially missing face. In addition, the PBD can adaptively identify the effective facial range of different patches and eliminate the interference of redundant parts of the image, and the potentially diverse information of different facial subregions is then mined through the attention blocks. The details of the PBD and LBP methods are as follows:

1) Pose-Based Division (PBD): We first use the RetinaFace [46] facial landmark detector, which is robust to occlusion, to extract five key points of the eyes, nose, and mouth from the facial map R ∈ (r, r) as simple pose information, as shown in Fig. 4(a), denoted (x_eye1, y_eye1), (x_eye2, y_eye2), (x_nose, y_nose), (x_mouth1, y_mouth1), and (x_mouth2, y_mouth2), where r represents the maximum pixel coordinate of the image.
According to the position of the nose on the Y axis, the face is divided into top and bottom parts, and the left and right parts are divided according to the positions of the eyes and the mouth on the X axis, respectively. A total of four facial subregions are thus obtained from two division points, P_top = (x_top, y_center) and P_bottom = (x_bottom, y_center), as shown in Fig. 4. In addition, to reduce the interference of useless facial information under different facial poses, we define the left and right subregions at the top and bottom, based on facial symmetry, as square regions of the same size with side lengths l_top and l_bottom, respectively, so as to extract useful facial information. l_top and l_bottom are defined as the minimum distance between the corresponding division point and the upper or lower image boundary, respectively. The final four facial subregions, R_left-top, R_right-top, R_left-bottom, and R_right-bottom, are then obtained from the division points and side lengths. Using this method, a reasonable allocation of local facial regions is ensured under different poses. As shown in Fig. 4(b), the left eye is in R_left-top, the right eye is in R_right-top, the left lip is in R_left-bottom, and the right lip is in R_right-bottom.

The following two special cases can occur: 1) When l < r/3, where l = l_top or l_bottom, a small local region may cause the loss of some important information. Therefore, the side length l_1 of the subregion symmetric to a subregion of length l is set to the maximum distance, not exceeding r/2, between the division point and the image boundary in the direction of the symmetric subregion. As shown in Case 1 of Fig. 5, the length of the black subregion is l < r/3; therefore, the length of the symmetric subregion is l_1 = r/2. 2) When a subregion does not contain the corresponding eye or mouth key point, the subregion is redefined as a rectangle with width l and length l_2, where l_2 is the maximum distance, not exceeding r/2, between the division point and the image boundary along the direction perpendicular to l. The symmetric subregion is a square with side length l_2, as shown in Case 2 of Fig. 5, where the black subregion does not contain the left lip, while the modified red subregion does.

Fig. 5 shows the pose-based division results under normal, occlusion, and pose-variation conditions. The proposed method adaptively allocates facial organs, such as the eyes and mouth, to their corresponding subregions under different facial conditions, ensuring a reasonable distribution of local facial regions and eliminating the influence of invalid regions, such as hair, on feature extraction. Finally, the four facial subregions are mapped to the 128 × 28 × 28 feature map, as shown in Fig. 4(c).
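A simplified sketch of the pose-based division logic is given below. The choice of the division x-coordinates (eye and mouth-corner midpoints), the exact definition of l_top and l_bottom, and the omission of the two special cases are assumptions for illustration only.

```python
import numpy as np

def pose_based_division(landmarks, r):
    """Simplified sketch of pose-based division (PBD).

    landmarks: dict with (x, y) pixel coordinates for 'left_eye', 'right_eye',
               'nose', 'left_mouth', 'right_mouth' (e.g. from RetinaFace).
    r:         side length of the square r x r face crop.
    Returns four square subregions as (x0, y0, x1, y1) boxes.

    Assumptions (not the authors' exact formulation): the division y-coordinate is the
    nose y-coordinate, the top/bottom division x-coordinates are the eye/mouth midpoints,
    the side lengths are the clipped distances to the top/bottom boundaries, and the two
    special cases described in the text are not handled.
    """
    y_center = landmarks['nose'][1]
    x_top = (landmarks['left_eye'][0] + landmarks['right_eye'][0]) / 2
    x_bottom = (landmarks['left_mouth'][0] + landmarks['right_mouth'][0]) / 2

    l_top = min(y_center, r / 2)          # square side for the two top subregions
    l_bottom = min(r - y_center, r / 2)   # square side for the two bottom subregions

    clip = lambda v: float(np.clip(v, 0, r))
    return {
        'left_top':     (clip(x_top - l_top), clip(y_center - l_top), clip(x_top), clip(y_center)),
        'right_top':    (clip(x_top), clip(y_center - l_top), clip(x_top + l_top), clip(y_center)),
        'left_bottom':  (clip(x_bottom - l_bottom), clip(y_center), clip(x_bottom), clip(y_center + l_bottom)),
        'right_bottom': (clip(x_bottom), clip(y_center), clip(x_bottom + l_bottom), clip(y_center + l_bottom)),
    }

# Example: a roughly frontal face in a 224 x 224 crop (illustrative landmark positions only).
lm = {'left_eye': (80, 95), 'right_eye': (145, 95), 'nose': (112, 130),
      'left_mouth': (88, 165), 'right_mouth': (138, 165)}
for name, box in pose_based_division(lm, 224).items():
    print(name, box)
```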

2) Location-Based Padding (LBP):
The output of the pose-based division method is a set of local region feature maps with different sizes, as shown in Fig. 4(c). Because the inputs of the convolutional layer should be feature maps of the same size, a location-based padding method is proposed to retain the integrity of the feature information and ensure an effective input to the convolutional layer, as shown in Fig. 4(d). Based on these steps, four local region feature maps of size 128 × 14 × 14 are obtained and input into the attention blocks in parallel. The attention block is shown in Fig. 3(c); it contains two 3 × 3 convolution layers and a lightweight convolutional block attention module (CBAM) [47] to weight both the channel and spatial dimensions, making the model pay more attention to emotion-related regions and feature channels. The channel attention M_c and spatial attention M_s modules in the CBAM are connected sequentially. Finally, the input feature map of the CBAM is added to its output feature map to enhance the features through feature multiplexing based on a short-circuit connection. The attention block can be formulated as

F_A = f(x) + CBAM(f(x)),

where x is the input of the attention block, f represents the two 3 × 3 convolution layers, CBAM(·) applies M_c and M_s sequentially, and F_A is the output. In the LP module, two attention blocks are used to extract attention features and output four 128 × 7 × 7 feature maps, where the first attention block reduces the 14 × 14 feature maps to 7 × 7 through downsampling in its first 3 × 3 convolution layer. Then, GAP layers are connected to obtain 4 × 512 feature vectors, which are spliced into a 2048-dimensional feature vector. After dimensionality reduction to 512 dimensions by an FC layer, the final facial local perception emotion feature is obtained. The LP module extracts finer subregion features based on simple pose information, reduces the interference of invalid regions, and improves FER performance under occlusion and pose variation by learning potentially diverse facial emotions.
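A compact version of such an attention block, using a standard CBAM with sequential channel and spatial attention and a residual connection, might look like the following sketch; the channel widths, reduction ratio, and normalisation layers are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3)))               # average-pooled channel descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))                # max-pooled channel descriptor
        return torch.sigmoid(avg + mx)[:, :, None, None]

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        m = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(m))

class AttentionBlock(nn.Module):
    """Sketch of the attention block: two 3x3 convolutions followed by sequential CBAM
    (channel then spatial attention) and a residual connection, F_A = f(x) + CBAM(f(x)).
    The first convolution optionally halves the spatial size, as in the first LP attention block."""
    def __init__(self, in_ch, out_ch, downsample=False):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2 if downsample else 1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.mc = ChannelAttention(out_ch)
        self.ms = SpatialAttention()

    def forward(self, x):
        y = self.f(x)
        y_c = y * self.mc(y)          # channel attention M_c
        y_s = y_c * self.ms(y_c)      # spatial attention M_s
        return y + y_s                # short-circuit feature multiplexing

print(AttentionBlock(128, 128, downsample=True)(torch.randn(2, 128, 14, 14)).shape)  # 2 x 128 x 7 x 7
```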

D. AP Module
Based on the prior knowledge that the eyes and mouth show a substantial emotional correlation in both emotion expression and recognition perception, as shown in Fig. 6(a), the AP module is designed to extract the key texture details of the salient eye and mouth areas through the attention network, paying more attention to small-scale areas with important emotional features. This output is used as supplementary information to compensate for features that might otherwise be missed when certain important areas are overlooked.
The AP module obtains five subregions related to the eyes and mouth based on the facial key points. As shown in Fig. 6(b), the centre points P_A of each region are derived from the facial key points, where P_A^1, P_A^2, and P_A^3 are the centres of the left eye, right eye, and eyebrow regions, respectively, and P_A^4 and P_A^5 are the centres of the left lip and right lip, respectively. Then, the centre points of the five subregions are mapped to the 128 × 28 × 28 feature map to obtain five feature map regions of size 128 × L × L.
The AP module inputs the five feature maps into parallel attention blocks to obtain 128 × L/2 × L/2 feature maps with different channel and regional attention weights. Then, GAP layers are connected to obtain five 256-dimensional feature vectors, which are spliced into a 1280-dimensional vector and finally reduced to 512 dimensions by an FC layer as the final attention perception emotion feature. The AP module captures the features of the eye, eyebrow, and mouth regions, which carry substantial emotional information and are used as supplementary information for the network to ensure robust feature extraction.
In the experiments of this study, we set L = 10 as the optimal value.
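The region-extraction step of the AP module can be illustrated with the following sketch, which crops five L × L windows from the shared feature map around the mapped centre points; the coordinate mapping and boundary handling are simplified assumptions.

```python
import torch

def extract_salient_regions(feature_map, centers, L=10):
    """Sketch of the AP module's region extraction: crop five L x L windows from the
    128 x 28 x 28 feature map around the mapped centre points of the eyes, eyebrow
    centre, and lip corners. Clamping windows to the map border is an assumption."""
    _, _, h, w = feature_map.shape
    half = L // 2
    regions = []
    for (cx, cy) in centers:                      # centre points in feature-map coordinates
        x0 = int(min(max(cx - half, 0), w - L))   # clamp so the window stays inside the map
        y0 = int(min(max(cy - half, 0), h - L))
        regions.append(feature_map[:, :, y0:y0 + L, x0:x0 + L])
    return regions                                # five B x 128 x L x L crops

fmap = torch.randn(2, 128, 28, 28)
centers = [(9, 11), (19, 11), (14, 10), (10, 20), (18, 20)]  # illustrative positions only
print([r.shape for r in extract_salient_regions(fmap, centers)])
```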

E. Fusion Strategy
In this study, the LP module and AP module are fused at the feature level to guide the model to pay more attention to unoccluded salient regions. The fusion of the GP module with the feature-level fusion result at the decision level then combines features from different perception domains to ensure that the model performs robustly under facial-occlusion and pose-variation conditions.
For the specific implementation, the 512-dimensional features provided by the LP module and AP module are spliced into 1024-dimensional emotional features, and an FC layer is connected to output a c-dimensional vector z_LA = {z_1, z_2, . . . , z_c} to achieve feature-level fusion, where c is the number of emotional categories. Then, we combine z_LA with the c-dimensional vector z_G output by the GP module under its FC layer and train the model through the loss function

L = λ L_GP + (1 − λ) L_L_AP,

where L_GP is the output loss of the GP module, L_L_AP is the output loss of the feature-fusion result, and λ is a hyperparameter used to balance L_GP and L_L_AP. In the experiments, L_GP and L_L_AP are calculated by minimising the cross-entropy loss, which can be formulated as

L_M = −(1/N) Σ_{j=1}^{N} y_j log ŷ_j,

where N is the number of samples, ŷ_j is the predicted result, y_j is the true result, and M ∈ {GP, L_AP} denotes the two inputs of the decision level.
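Assuming the weighted combination of the two cross-entropy terms given above, the fusion and loss computation can be sketched as follows; the classifier layer names and shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def amp_net_loss(z_g, z_la, targets, lam=0.5):
    """Sketch of the decision-level fusion loss, assuming L = lam * L_GP + (1 - lam) * L_L_AP
    with both terms being standard cross-entropy over the c emotion classes."""
    loss_gp = F.cross_entropy(z_g, targets)     # loss of the GP-branch logits z_G
    loss_lap = F.cross_entropy(z_la, targets)   # loss of the feature-level fused logits z_LA
    return lam * loss_gp + (1.0 - lam) * loss_lap

# Feature-level fusion of the 512-d LP and AP features, followed by an FC classifier.
num_classes = 7
fc_la = nn.Linear(1024, num_classes)
lp_feat, ap_feat = torch.randn(4, 512), torch.randn(4, 512)
z_la = fc_la(torch.cat([lp_feat, ap_feat], dim=1))
z_g = torch.randn(4, num_classes)               # stand-in for the GP-branch output
print(amp_net_loss(z_g, z_la, torch.tensor([0, 1, 2, 3])))
```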

IV. EXPERIMENTS

A. Datasets
To make fair comparisons with previous studies, we perform experiments on five popular real-world facial expression datasets (RAF-DB [9], AffectNet [10], SFEW 2.0 [48], FER-2013 [49], and FED-RO [20]), as well as the occlusion and variant-pose test sets of [39].
RAF-DB: contains real-world facial images annotated with basic and compound expressions. In these experiments, the images labelled with the seven basic emotion categories (happiness, neutral, surprise, fear, anger, disgust, and sadness) are used.
AffectNet: is currently the largest facial emotion dataset, with more than 400,000 images manually annotated with 11 discrete emotion categories as well as valence and arousal dimensional emotions. In these experiments, the images with 7 and 8 emotion classes are used, and data-balance processing is applied to address the quantitative differences between emotion categories in the training samples. The 7-class setting (6 basic emotions plus neutral) includes 70,181 training images and 3,500 test images; the 8-class setting adds contempt and includes 73,931 training images and 4,000 test images.
SFEW 2.0: is created by selecting static key frames from the AFEW database and contains 7 emotion labels, 958 training images, 436 validation images, and 372 test images. In these experiments, because the emotion labels of the test set could not be obtained, the validation set is used to evaluate FER.
FER-2013: contains approximately 36,000 greyscale images with a size of 48 × 48 and seven emotion categories. We use the 28,709 training images and the 3,589 public test images to evaluate recognition performance.
FED-RO: is a facial-occlusion dataset collected through the Bing and Google search engines. Images duplicated in the RAF-DB and AffectNet datasets are removed, resulting in a total of 400 face images with seven emotion categories.
Occlusion-RAF-DB, Occlusion-AffectNet, and Occlusion-FERPlus: contain images with facial occlusion collected from the validation set of AffectNet, the test set of RAF-DB, and the test set of FERPlus, comprising 683, 735, and 605 images, respectively.
Pose-RAF-DB, Pose-AffectNet, and Pose-FERPlus: contain images with facial-pose changes collected from the validation set of AffectNet, the test set of RAF-DB, and the test set of FERPlus, with 1,949, 1,248, and 1,171 images containing facial poses greater than 30°, and 958, 558, and 634 images containing facial poses greater than 45°, respectively.

B. Implementation Details
For all facial images, RetinaFace [46] is used to extract the five facial key points of the eyes, nose, and mouth and to crop the facial area to a size of 224 × 224 pixels. Random flipping and translation, as well as random changes in brightness, contrast, and saturation, are used for data augmentation. To make fair comparisons with previous studies, we use ResNet-34 [42] as the backbone of the proposed method. For the RAF-DB, AffectNet, and FER-2013 datasets, the model is first pretrained on the large-scale face recognition dataset VGGFace2 [50] and then fine-tuned. For the SFEW 2.0 dataset, the model is pretrained on the RAF-DB dataset and then fine-tuned; FED-RO, Occlusion-AffectNet, Occlusion-RAF-DB, Occlusion-FERPlus, Pose-AffectNet, Pose-RAF-DB, and Pose-FERPlus use the same settings as RAN [39]. By default, the region size L is set to 10, and the hyperparameter is set to λ = 0.5. The proposed method is implemented on a GeForce RTX 3090 Ti platform using the PyTorch toolbox [51]. The minibatch size is set to 350 with a momentum of 0.9 and a weight decay of 0.0001.
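The preprocessing and optimiser settings above can be summarised in the following sketch; the augmentation magnitudes and the learning rate are not specified in the text and are assumptions.

```python
import torch
from torchvision import transforms

# Sketch of the training preprocessing and optimiser configuration described above.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),                                          # 224 x 224 face crops from RetinaFace
    transforms.RandomHorizontalFlip(),                                      # random flipping
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05)),             # random translation (magnitude assumed)
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),   # photometric jitter (magnitudes assumed)
    transforms.ToTensor(),
])

model = torch.nn.Linear(10, 7)   # stand-in for AMP-Net with a ResNet-34 backbone
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,                    # learning rate is an assumption
                            momentum=0.9, weight_decay=1e-4)                # momentum and weight decay from the text
# minibatch size: 350 (from the text)
```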

C. Ablation Experiments
To verify the effectiveness of AMP-Net, ablation experiments are performed on the GP module, the LP module, the AP module, the region size L, and the hyperparameter λ. The experiments are conducted on the representative real-world datasets RAF-DB, AffectNet-7 (7 classes), and FED-RO, and all ablation results are obtained without pretraining.
1) GP Module: We first analyse the FER performance of the GP module and the gate-OSA block; the recognition results of each module are shown in Table I. The FER results of the GP module with the gate-OSA block are 2.29%, 0.83%, and 3.75% higher than those of ResNet-32 on the RAF-DB, AffectNet-7, and FED-RO datasets, respectively. These results show that the GP module can learn global facial features more effectively through multilayer feature reuse. To verify the performance of the gate-OSA block, the gate-OSA blocks in the GP module are replaced with the original OSA blocks. The results show that the gate-OSA block improves accuracy by 0.24%, 0.37%, and 0.5% compared with the OSA block on the RAF-DB, AffectNet-7, and FED-RO datasets. These experimental results show that the gate-OSA block can learn the emotional characteristics of important channels more effectively and thus improve FER performance.
2) LP Module: We then evaluate the performance of the LP module. The single-module FER results are shown in Table I; the LP module achieves excellent FER performance by extracting finer facial subregion features, with results 1.99%, 1.68%, and 3.25% higher than those of ResNet-32 on the RAF-DB, AffectNet-7, and FED-RO datasets, respectively. The LP module also achieves the highest single-module FER result (62.5%) on AffectNet-7. In addition, the decision-level fusion of the LP module and the GP module considerably improves recognition performance, by 1.21%, 0.29%, and 2.75% and by 0.91%, 1.14%, and 2.25% on RAF-DB, AffectNet-7, and FED-RO, respectively, compared with the two single modules, as shown in Table II.
To explore the impact of the local feature selection strategy on the LP module, we design four different schemes for FER performance comparison, as shown in Fig. 7. In scheme 1, pose-based division is first performed to obtain the local regions, and ROI Align [52] is used to pool local regions of different sizes into four 128 × 14 × 14 feature maps of uniform size, which are used as the input of the attention blocks (see Fig. 7(a)). ROI Align uses bilinear interpolation to convert the feature aggregation process into a continuous operation and can pool feature maps of different sizes into the same size. In scheme 2, we divide the output feature map of the backbone into four nonoverlapping 128 × 14 × 14 feature maps as the input of the attention blocks, which is the same setting as MA-Net [18], as shown in Fig. 7(b). In scheme 3, we modify the pose-based division strategy by changing the subregion length l to the shortest distance between the division point and the image boundary in the corresponding subregion direction; if l is greater than r/2, then we let l = r/2. Four 128 × 14 × 14 feature maps are then obtained through mapping and location-based padding as the input of the attention blocks, as shown in Fig. 7(c). In scheme 4, the proposed method is used to obtain the regional features as the input of the attention blocks, as shown in Fig. 7(d).
The FER results of the different schemes under the single LP module are shown in Table III. Scheme 1 obtains the lowest recognition result of 79.35%. We believe that although ROI Align uses bilinear interpolation to pool feature maps to the same size, this process changes the feature values and may cause the loss of important information. Scheme 2 obtains a recognition result of 84.65%, which is 0.62% lower than that of the proposed scheme 4.

3) AP Module:
We also evaluate the performance of the AP module. Its single-module performance is not outstanding, reaching only 0.3%, 0.5%, and 0.5% higher than ResNet-32 on the RAF-DB, AffectNet-7, and FED-RO datasets, as shown in Table I. However, when the AP module is combined with the GP and LP modules as auxiliary information, FER performance is effectively improved by 0.45%, 0.68%, and 1.5% on the RAF-DB, AffectNet-7, and FED-RO datasets, respectively, as shown in Table II. These experimental results show that the AP module serves as effective supplementary information for the GP and LP modules because it focuses on small facial regions with significant emotional correlation obtained from prior knowledge. This avoids the loss of important information caused by inaccurate positioning of facial emotion expression regions and effectively improves the robustness of salient feature extraction.

4) Region Size L:
The AP module extracts facial feature maps of the eye and mouth regions with size 128 × L × L as auxiliary information. To explore the influence of the region size L, we evaluate the FER performance with L set to 6, 8, 10, 12, and 14 under the combination of all modules, as shown in Fig. 8(a). The results show that when L = 10, AMP-Net obtains the highest recognition result (88.06%). When L < 10, we believe the reduced region size misses some important features; when L > 10, although the larger region size improves the integrity of the auxiliary information, the additional information overloads the network and degrades recognition performance.

5) Weight λ:
To explore the influence of the loss function weight λ on AMP-Net, different values of λ from 0 to 1 are selected. The FER results are shown in Fig. 8(b) and indicate that AMP-Net achieves the highest FER performance when λ = 0.5.

D. Comparison With the State-of-the-Art Methods
In this section, we compare the proposed method's best results with several state-of-the-art methods on the RAF-DB, AffectNet, SFEW 2.0, FER-2013, and FED-RO real-world datasets, as well as the Occlusion-AffectNet, Occlusion-RAF-DB, Occlusion-FERPlus, Pose-AffectNet, Pose-RAF-DB, and Pose-FERPlus occlusion and variant-pose datasets. All benchmark results are those reported in the literature.

a) Comparison with RAF-DB: Table IV compares the proposed method and the state-of-the-art methods on RAF-DB with seven emotion categories: happiness, neutral, surprise, fear, anger, disgust, and sadness. AMP-Net achieves the highest FER result (89.25%) on the RAF-DB dataset under pretraining on the VGGFace2 face dataset. In the confusion matrix shown in Fig. 9(a), fear and disgust remain difficult to recognise, and 18% of the fear samples are incorrectly identified as surprise. The reason may be that fear and surprise exhibit high confusion both in recognition determination and in facial muscle movement [53], [54]. In addition, for a fair comparison with previous studies, we also test AMP-Net with ResNet-18 [42] as the backbone network, and the results also outperform existing methods.

c) Comparison with SFEW 2.0: MA-Net [18] produces the highest previously reported recognition accuracy (59.40%) by performing multiscale FER through the global region and evenly distributed feature maps of the local region. The proposed method achieves 61.17% on SFEW 2.0, which highlights the FER robustness of AMP-Net. Fig. 9(b) shows the recognition confusion matrix. Fear and disgust achieve low recognition results, primarily due to the scarcity of these emotion images in SFEW 2.0 and the ease of confusing fear and disgust.

d) Comparison with FER-2013: The FER-2013 dataset is a set of greyscale images of average quality that were collected via network search, and problems such as blurry images and missing labels make recognition difficult. In this experiment, we do not apply the random data-enhancement methods, and AMP-Net achieves an accuracy of 74.48% on FER-2013.

e) Comparison with FED-RO: The FED-RO dataset contains specially collected face images with occlusion. In the experiment, using the same settings as RAN [39], we train on the training sets of the AffectNet-7 and RAF-DB datasets and test on the FED-RO dataset; the results are shown in Table VIII. The proposed method achieves the highest recognition result (71.75%) among known FER methods. These experimental results show that AMP-Net adapts to facial occlusion more effectively and has higher generalisability. In addition, Fig. 9(c) shows the confusion matrix of the proposed method on FED-RO. AMP-Net achieves better FER performance for most emotion categories under occlusion, and the incorrect recognitions on FED-RO are primarily due to the high confusion of fear, surprise, and disgust.

Tables IX and X compare the proposed method with the state-of-the-art results on the Occlusion-RAF-DB and Pose-RAF-DB datasets and on the Occlusion-AffectNet and Pose-AffectNet datasets, respectively, and Table XI provides a further comparison with the state-of-the-art results. In addition, to investigate the performance of AMP-Net in more detail, we use gradient-weighted class activation mapping (Grad-CAM) [54] to visualise the attention maps of AMP-Net under occlusion and variant poses. As shown in Fig. 10, the attention of the GP module, LP module, and AP module to different facial regions is displayed, and dark red indicates areas of high concern. Face occlusion and pose variation markedly change facial visual appearances, as shown in Fig. 10.
For face occlusion, the GP module can focus on unoccluded facial areas from a global perspective, which indicates high robustness to occlusion (see Fig. 10(a)). The LP module adaptively divides the face into four finer parts based on the head pose, as shown in Fig. 10(a); it eliminates ineffective occluded subregions, such as hands, masks, and sunglasses, and pays more attention to the unoccluded eye and mouth areas. Because glasses occlude only a small area, the LP module can adaptively focus on the unobstructed eye region to obtain robust emotional features when people wear glasses. As a supplementary module, the AP module pays more attention to small salient areas of the eyes and mouth with high emotional correlation, and Fig. 10(a) shows its adaptability under facial occlusion. Under variant poses, the LP module can allocate facial organs such as the eyes and mouth to more refined subregions and focus on the eye and mouth regions, similar to the human visual attention mechanism; Fig. 10(b) also shows the efficient adaptability of the proposed method under variant poses. Therefore, these results demonstrate the high robustness of AMP-Net to facial occlusion and variant poses.

Fig. 11 shows example images with occlusion and variant poses from the FED-RO and AffectNet datasets for which AMP-Net failed to predict the correct expression category (blue denotes true labels and red denotes predicted labels), which tend to cause recognition errors. As shown in Fig. 11, under mouth occlusion and poses ≥ 30°, surprise is wrongly identified as fear, and anger is wrongly identified as disgust, because surprise and fear share similar inner brow raisers and upper lid raisers (AU1, AU5), and anger, disgust, and sadness share similar brow lowerers and lip corner depressors (AU4, AU15). A solution to these problems is to supplement multimodal explicit behaviour information, such as body or language information, to enhance emotional differences.

V. CONCLUSION
In this paper, we propose an adaptive multilayer perceptual attention network (AMP-Net) that is inspired by facial attributes and the human visual perception mechanism to acquire multilevel facial emotional features from coarse to fine and improve robustness under occlusion and variant poses. We design three modules to obtain facial information from different perception domains, ensuring that the model pays more attention to facial regions with substantial emotional correlation and remains robust on real-world facial emotion data. The ablation experiments and comparisons with existing methods show that the proposed method can effectively eliminate invalid information under occlusion and exhibits high robustness to facial occlusion and variant poses. In future work, we plan to investigate the construction of a multimodal emotion recognition model based on federated learning to improve generalisability and recognition accuracy in real-world scenarios while ensuring user privacy.