Weakly Supervised Local-Global Attention Network for Facial Expression Recognition

Combining global and local features is an essential way to improve discriminative performance in facial expression recognition. The limitation of existing methods is that they cannot extract crucial local features and ignore the complementary effects of local and global features. To address these problems, this paper proposes a Weakly Supervised Local-Global Attention Network (WS-LGAN), which uses the attention mechanism to handle part location and feature fusion. Firstly, an Attention Map Generator is designed to produce a set of attention maps under weak supervision. It mimics the attention mechanism of the human brain and quickly finds the local regions-of-interest. Secondly, bilinear attention pooling is employed to generate and refine local features based on the attention maps. Thirdly, a building block called the Selective Feature Unit is designed, which allows adaptive weighted fusion of global and local features before classification. In WS-LGAN, global and local features represent expressions from different aspects; compared with methods relying on a single type of feature, it benefits from their complementary advantages. Additionally, a contrastive loss is introduced for both local and global features to increase inter-class dispersion and intra-class compactness at different granularities. Experiments on three popular facial expression datasets, including two lab-controlled datasets and one real-world dataset, show that WS-LGAN achieves state-of-the-art performance, which demonstrates its superiority in facial expression recognition.


I. INTRODUCTION
Facial expression is a fundamental means of conveying human emotions and plays a significant part in our daily communication. Facial expression recognition is a complex but interesting problem, and finds extensive applications in fatigue surveillance [1], human-machine interaction [2], patient care [3], neuromarketing [4], interactive games [5], etc. Thus, facial expression recognition has received substantial attention from researchers in the computer vision, affective computing and human-computer interaction fields.
Although great success has been achieved in recent years [6]-[8], accurate facial expression recognition is still challenging, mainly due to the complexity and variability of facial expressions. We summarize the obstacles as follows:
(a) High intra-class variance. Expressions of the same class may vary from one person to another, influenced by factors such as age, gender, race, cultural background and other person-specific characteristics. (b) Low inter-class variance. Expressions belonging to different categories may be similar except for some minor differences. For example, sadness and anger appear nearly identical across most facial regions; the main difference lies only in the corners of the mouth. Existing methods for overcoming these two obstacles can be divided into two categories. One category is non-part-based methods, which focus on learning global representations, while the other is part-based methods, which pay more attention to extracting partial discriminative features.
For the first category, several works propose novel loss layers to replace or assist the supervision of the softmax loss for more discriminative features. Inspired by the center loss [9] and the triplet loss [10], variations such as the island loss [11], locality-preserving loss [12] and (N+M)-tuples cluster loss [13] are designed for facial expression recognition. They project the features into another space in which inter-class discrimination and intra-class similarity are enhanced. However, these methods usually extract features from the holistic facial image, which ignores fine-grained information in local facial regions.
For the second category, an essential prerequisite of learning discriminative part features is that parts should be accurately located. Since facial expression datasets usually have only image-level expression labels, extra part annotations, such as pixel-level segmentation labels or bounding boxes, are unavailable. Some part-based methods crop the facial expression image into patches and try to learn local representations from them. For instance, [14]-[16] divide the facial image into non-overlapping or overlapping patches. Features extracted from selected patches, or from all patches, can highlight details of local facial regions. Although their results are encouraging, they still have shortcomings. Firstly, the selected facial patches may vary with the training data, so it is difficult to conceive a generic system. Secondly, a large number of candidate patches makes training the model time-consuming and computationally intensive. Thirdly, manually defined patches may not be optimal for the final classification task: some patches have no influence on expression, or even have a negative impact on facial expression recognition.
Moreover, we may lose complementary information if we concentrate only on local features. Attributes such as age, gender, race and other person-specific characteristics, conveyed by the holistic facial image, can also affect expressions significantly. Thus, methods based on either local or global features alone ignore their joint benefits and mutual complementary effects.
In fact, when humans conduct an object recognition process, we first obtain a global description. Then our attention orients rapidly toward salient regions where the details are filled in [17]. Besides, studies in [18]-[20] provide the prior knowledge that much of the expressional information comes from salient facial regions, such as the regions around the mouth and eyes, whereas other parts, such as the ears and hair, have little impact on facial expression recognition. Motivated by this process and this prior knowledge, we propose a Weakly Supervised Local-Global Attention Network (WS-LGAN) to learn global representations and, at the same time, learn local features around the eyes and mouth to facilitate local-enhanced facial expression recognition. The pipeline of our proposed method is illustrated in Figure 2. Different from previous part-based methods, we mimic the way humans recognize facial expressions. An attention mechanism is introduced to guide our network to perceive crucial local regions autonomously. The location of each crucial facial region is designated by attention maps generated by the Attention Map Generator (AMG). Attention map learning is weakly supervised by image-level labels only; therefore, the lack of part annotations in facial expression datasets is well addressed. We show some samples generated by the AMG in Figure 1. Inspired by [28], we propose bilinear attention pooling: based on the attention maps, local features are extracted and refined by combining the attention maps with the global feature maps. At the same time, we integrate global features to introduce the complementary information conveyed by the holistic facial image. Local and global features are fused with adaptive weights through a Selective Feature Unit (SFU). Furthermore, we develop a similarity metric for local features and combine it with the classification errors.
These metrics pull together local features extracted from the same category and push apart local features extracted from different categories. As a result, the intra-class variations of local features are reduced, while their inter-class differences are increased. The same similarity metric is also applied to global features.
In summary, the main contributions of this work are as follows: (1) Weakly supervised model for local feature extraction: We propose a local feature extraction method that explicitly considers and extracts region-specific local features, including features around the eyes and mouth. Most previous models obtain local representations by relying on segmented expressional image patches [14]-[16]. A large number of candidate facial image patches makes such models inefficient, while expression-independent patches hinder training. In addition, the position and scale of the crucial local regions change with the input image, so it is difficult to determine the patch size. Different from these methods, we handle local feature extraction by directly locating the crucial regions and extracting the corresponding features. Specifically, our method trains the AMG under weak supervision to generate attention maps that strongly indicate the locations of the eyes and mouth. Based on these attention maps, bilinear attention pooling is proposed to generate and refine local features. This local feature extraction method has two advantages compared with previous methods. Firstly, we markedly reduce the number of candidate local regions, and expression-irrelevant local regions are discarded. Secondly, changes in the position and size of the mouth and eyes regions do not affect the settings of the model; there is no need to consider the size of the patches. Besides, weak supervision allows us to overcome the lack of part annotations in facial expression datasets, resulting in concise feature extraction.
(2) Adaptive weighted local-global fusion: We formulate an SFU to fuse local and global features. The local features focus on extracting region-specific fine-grained information, while the global features concentrate on representing the integrity of the expression. Compared with methods relying on either local or global representations alone, the joint use of both allows the model to benefit from their complementary advantages. A previous method [14] aggregates the global feature and all local features by concatenation. However, such a fusion treats all features uniformly, whereas our proposed SFU enables an adaptive weighted fusion. Specifically, the SFU has two effects on the features. On one hand, it weights the relative importance of features extracted from the region around the eyes, the region around the mouth, and the holistic face, respectively. On the other hand, it weights different semantic information and finds the most meaningful feature within each region.
(3) Multi-granular similarity metrics: We extend metric learning to both local and global features to increase inter-class differences as well as decrease intra-class variations at different granularities. Previous methods [11]-[13], [22] employ similarity metrics only on global representations, and thus fine-grained features are not well learned. Part-based methods [14]-[16], which extract local features through patches, find it difficult to apply metric learning to local features because the facial pose changes in each image: among multiple patches, patches at the same position may correspond to different facial parts. In our method, we locate the eyes and mouth regions in each image, which is equivalent to ''aligning'' the mouth and eyes across images. Under this ''alignment'', we can perform metric learning on mouth-related and eyes-related features, respectively. Therefore, we propose a local-sensitive contrastive loss for eyes-related and mouth-related features, and are able to exploit the local-sensitive and global-sensitive contrastive losses simultaneously.
(4) Competitive experimental results: To demonstrate the superiority of our proposed method, we employ experiments on lab-controlled facial expression datasets (e.g., CK+, Oulu-CASIA) and real-world facial expression dataset (namely, RAF-DB). Our facial expression recognition solution achieves state-of-the-art results on CK+, Oulu-CASIA and RAF-DB with accuracies of 98.06%, 88.26% and 85.07%, respectively.
The remainder of this paper is organized as follows. Section 2 reviews related work on facial expression recognition. Section 3 details the proposed Weakly Supervised Local-Global Attention Network. In Section 4, we present the experimental results and evaluate the performance of WS-LGAN. In Section 5, we conclude the paper and give some remarks.

II. RELATED WORK
Researchers have long acknowledged that facial expression recognition struggles with inter-subject variations. The visual differences among categories or instances are subtle, and are easily overwhelmed by other factors. Existing methods that address these problems can be classified into two categories: non-part-based methods and part-based methods.

A. NON PART-BASED METHODS
An immensely popular recent approach is to enhance the discriminative power of the deeply learned features by proposing new loss functions. These methods aim to obtain representations with compact intra-class variations and separable inter-class differences. Cai et al. [11] propose an island loss, which penalizes the distance between deep features and their corresponding class centers while simultaneously increasing the pairwise distances between different class centers. Li et al. [12] propose a deep locality-preserving CNN, which preserves locality by minimizing the distance to the K-nearest neighbors within the same class.
Besides, some works attempt to disentangle identity and expression by either performing multi-signal supervision or using Generative Adversarial Networks. Meng et al. [22] propose a model that contains two identical sub-CNNs: one stream learns expression-discriminative features, and the other learns identity-related features for identity-invariant expression recognition. Liu et al. [13] propose the (N+M)-tuples cluster loss, supervised by identity and expression labels, to alleviate the difficulty of anchor selection and threshold validation in the triplet loss for identity-invariant facial expression recognition. Hui et al. [23] learn facial expressions by extracting the expressive component through a de-expression procedure. Given a facial image with an arbitrary expression, its corresponding neutral expression is generated by a trained generative model. Through this procedure, the identity information of a subject remains unchanged while the expressive component is removed; the expressive component is then used for facial expression recognition.

B. PART-BASED METHODS
Studies in psychology show that most of the descriptive facial features of expressions are located in several crucial regions. Therefore, extracting local features from facial images has attracted the attention of researchers, and several part-based methods have been proposed. Happy and Routray [15] propose a framework using appearance features of selected facial patches; they select different facial patches as salient for different expressions. Liu et al. [16] model a system named boosted deep belief network to classify different expressions. They divide expressional images into patches; patches with high discriminative power are selected and combined to train a strong classifier. 3DCNN-DAP [24] incorporates a deformable parts learning component into a 3D CNN framework, which detects specific facial action parts under structured spatial constraints and obtains a discriminative part-based representation simultaneously.
Recently, some methods have demonstrated that integrating local and global features can improve the performance of facial expression recognition. For instance, Xie and Hu [14] propose a convolutional neural network with two branches. One branch extracts local features from uniform image patches while the other extracts global features from the holistic expressional image. Global features and all local features are concatenated for final expression classification.

III. PROPOSED METHOD

A. OVERVIEW
Our method aims to mine discriminative parts of the face via object localization and to extract local features from the mined parts. Besides, we fuse local and global features to jointly exploit their complementary advantages, coping with the loss of local detail while emphasizing global integrity. Since facial expression datasets do not have labeled part locations, we formulate the part localization problem in a weakly supervised manner by introducing a facial attributes dataset. We decompose our pipeline into two stages, as shown in Figure 2. In the first stage, we train the AMG on the facial attributes dataset. The well-trained AMG is then transferred to the facial expression datasets with its weights fixed. In the second stage, local and global features are jointly learned through a deep CNN framework. The second stage consists of two identical CNN streams with shared weights. Each CNN stream contains four sub-parts: AMG, Feature Extractor (FE), SFU and Classifier. The AMG generates attention maps by computing weighted feature maps in a binary classification network; the attention maps highlight the regions around the eyes and mouth. The FE extracts features directly from the holistic facial image; without any additional processing, the feature extracted by the FE is the global feature. Local features are extracted and refined by combining the attention maps with the global feature maps. The backbone of each branch in the AMG is a variant of DenseNet [25]. It consists of 3 dense blocks and 2 transition layers; the dense blocks contain 6, 12 and 24 dense layers, respectively. Due to the limited number of images in facial expression datasets, we use this backbone as the FE after reducing the number of dense layers to 6 in each dense block. The SFU fuses the global features with the local features, and the Classifier is a softmax classifier for the final expression classification. During training, the model takes a pair of facial expression images as input.
We optimize the parameters by simultaneously minimizing the classification errors, local-sensitive contrastive loss and global-sensitive contrastive loss. During testing, an image is fed into one CNN stream, and predictions are generated based on both the local and the global features.

B. ATTENTION MAP GENERATOR
An attention map is a weight map in which crucial regions have higher values; we use it to locate crucial facial regions. In general, the direct way of locating a region is to use an image and its pixel-wise segmentation as input and target, respectively, as in facial parsing. However, this requires label maps with pixel-wise annotations, which are expensive to collect, and pixel-level image processing can be time-consuming and computationally expensive. More importantly, facial expressions are generated by contracting facial muscles around facial organs; the result of pixel segmentation is too fine to focus on the areas around these organs, which contain abundant appearance features. An alternative approach is weakly supervised object localization. Zhou et al. [26] enable a classification network to have remarkable localization ability despite being trained on image-level labels, which can be applied to a variety of computer vision tasks for fast and accurate localization. Inspired by them, we design our AMG.
Facial expression datasets usually have only expression labels, while each image in the CelebA dataset [27] is labeled with 40 facial attributes, some of which can guide AMG training to locate crucial regions. Since we only focus on the regions related to facial expressions, eyes- and mouth-related facial attributes are selected for each image and divided into two groups based on their respective facial parts. For instance, bushy eyebrows, arched eyebrows, narrow eyes and eyeglasses are grouped together, as all of them are related to the eyes. The grouped attributes are summarized in Table 1. The AMG consists of two branches that locate the regions around the eyes and mouth, respectively. Figure 3 takes the eyes-related branch as an example. If an image does not contain any eyes-related attribute listed in Table 1, it is taken as a negative example; otherwise, it is a positive example for the eyes. We use these examples to train the eyes-related branch. Fully convolutional layers and global average pooling (GAP) are used to generate features for classification; the predicted class score is mapped back to the previous convolutional layer to generate the attention maps. As illustrated in Figure 3, GAP outputs the spatial average of the feature map of each unit at the last convolutional layer, and a weighted sum of these values is used to generate the final output. We project the weights of the output layer back onto the convolutional feature maps and compute a weighted sum of the feature maps to obtain our attention maps. Finally, we normalize each attention map so that all values fall in the range [0, 1].
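The attention-map computation described above (output-layer weights projected back onto the last convolutional feature maps, then normalized to [0, 1]) can be sketched as follows. This is a minimal NumPy illustration; the function name and array shapes are ours for exposition, not the exact implementation:

```python
import numpy as np

def attention_map(feature_maps, fc_weights, class_idx):
    """CAM-style attention map: weighted sum of the last conv feature maps,
    normalized to [0, 1].

    feature_maps: (C, H, W) activations of the last convolutional layer.
    fc_weights:   (num_classes, C) output-layer weights after GAP.
    class_idx:    index of the predicted class (e.g. the positive
                  "has eyes-related attribute" class).
    """
    w = fc_weights[class_idx]                    # (C,) weights for this class
    cam = np.tensordot(w, feature_maps, axes=1)  # (H, W) weighted sum over channels
    cam -= cam.min()                             # shift minimum to 0
    if cam.max() > 0:
        cam /= cam.max()                         # scale maximum to 1
    return cam
```

Here `fc_weights[class_idx]` corresponds to the weights connecting the GAP output to the positive class of the binary (eyes- or mouth-related) classifier.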
Weakly supervised part location allows us to overcome the lack of part annotations in facial expression datasets. Figure 1 illustrates the attention maps output by the AMG: the regions around the eyes and mouth are highlighted. After training the AMG module on the CelebA dataset, we transfer it to the facial expression datasets; in the second stage, the AMG is frozen.

C. LOCAL FEATURE REFINEMENT
Bilinear pooling has been proven effective in extracting fine-grained features [28]. It was developed to localize distinct object parts and model appearance conditioned on their detected locations, but it cannot capture the details of specific regions. We propose bilinear attention pooling, which naturally incorporates the attention maps to solve this problem. Local feature refinement for the two crucial local regions is implemented through bilinear attention pooling. Besides, a contrastive loss is applied to each local feature to learn a similarity metric for image pairs: local features extracted from samples of the same expression have similar representations, while those of different expressions are far apart in feature space. As illustrated in Figure 4, we take the refinement of eyes-related features as an example; the refinement of mouth-related features proceeds in the same way.

1) BILINEAR ATTENTION POOLING
Firstly, the well-trained AMG with fixed weights is used to generate the eyes-related attention map A_e ∈ R^(1×H×W) and the mouth-related attention map A_m ∈ R^(1×H×W). Note that A_e and A_m have the same spatial size H × W; 1 is the number of channels, and R is the set of real numbers. Then, we element-wise multiply each channel of the feature maps F_g ∈ R^(C×H×W) (C is the number of channels) by the attention maps A_e and A_m, as shown in Eq. 1:

F_e^i = A_e ⊙ F_g^i,  F_m^i = A_m ⊙ F_g^i,  i = 1, 2, ..., C,  (1)

where ⊙ indicates element-wise multiplication of two tensors. The feature maps F_g are extracted by the FE from the holistic facial image, and F_g^i ∈ R^(1×H×W) is the ith channel of F_g. After the element-wise multiplication, F_e^i and F_m^i represent the ith feature map of the eyes and mouth, respectively. We concatenate the F_e^i across channels to obtain F_e, and the F_m^i across channels to obtain F_m; F_e ∈ R^(C×H×W) and F_m ∈ R^(C×H×W) reflect the feature maps of the eyes and mouth, respectively. Bilinear attention pooling explicitly defines two streams to locate regions and extract features, respectively. We treat the AMG branch as the dorsal stream of the human visual cortex, which processes an object's spatial location, and the FE branch as the ventral stream, which performs object recognition. Bilinear attention pooling bridges the appearance model and the part locating model, providing a solution for local feature extraction.
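Eq. 1 amounts to broadcasting the single-channel attention map over all C channels of F_g. A minimal NumPy sketch (names are illustrative):

```python
import numpy as np

def bilinear_attention_pool(F_g, A):
    """Eq. 1: element-wise multiply every channel of the global feature
    maps by a single-channel attention map.

    F_g: (C, H, W) global feature maps from the FE.
    A:   (1, H, W) attention map from the AMG (eyes- or mouth-related).
    Returns (C, H, W) attended feature maps (F_e or F_m).
    """
    return F_g * A  # NumPy broadcasts A over the channel axis
```

The same call is made twice per image, once with A_e and once with A_m.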

2) LOCAL-SENSITIVE CONTRASTIVE LOSS
To reduce the intra-class variations and enlarge the inter-class differences at a finer granularity, local-sensitive contrastive losses L_C^e and L_C^m are designed for the eyes-related and mouth-related features, respectively. As illustrated in Figure 4, we introduce an auxiliary fully connected (FC) layer to represent the eyes-related features. L_C^e pulls the eyes-related features extracted from samples of the same expression towards each other, while pushing the eyes-related features extracted from samples of different expressions away from each other. We adopt a loss function based on the squared Euclidean distance:

L_C^e = θ_ij ||f_e(x_i) − f_e(x_j)||_2^2 + (1 − θ_ij) max(0, δ_e − ||f_e(x_i) − f_e(x_j)||_2)^2,  (2)

where x_i and x_j are a pair of training images, and f_e(x_i) and f_e(x_j) are their corresponding eyes-related feature vectors. When x_i and x_j have the same expression label, θ_ij = 1; otherwise, θ_ij = 0. δ_e is the margin, which determines how much dissimilar pairs contribute to the loss function; in our experiments, δ_e is set to 10 empirically. The contrastive loss L_C^m of the mouth-related features is defined analogously to L_C^e.
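For a single pair, the squared-Euclidean contrastive loss described above can be sketched as follows (a NumPy illustration with our own function name; in the model it is applied to the FC-layer feature vectors):

```python
import numpy as np

def contrastive_loss(f_i, f_j, same_label, margin=10.0):
    """Contrastive loss for one pair of feature vectors.

    f_i, f_j:   feature vectors f(x_i), f(x_j) of the image pair.
    same_label: theta_ij — 1 if both images share the expression label, else 0.
    margin:     delta — how far dissimilar pairs must be before they
                stop contributing (set to 10 in the paper).
    """
    d = np.linalg.norm(f_i - f_j)          # Euclidean distance
    if same_label:
        return d ** 2                      # pull same-class pairs together
    return max(0.0, margin - d) ** 2       # push different-class pairs apart
```

Identical same-class features incur zero loss, as do different-class pairs already separated by more than the margin.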

D. LOCAL-GLOBAL FUSION
We define a global-sensitive contrastive loss to extract more discriminative global features. Besides, to enable WS-LGAN to infer image categories from both local details and global context cues concurrently, we propose an SFU to fuse local and global features.

1) GLOBAL-SENSITIVE CONTRASTIVE LOSS
The global-sensitive contrastive loss L_C^g is designed to reduce the intra-class variations and enlarge the inter-class differences globally. Global feature maps F_g ∈ R^(C×H×W) are extracted by the FE directly from the holistic facial image without any additional processing. The global feature vector used to calculate the loss is obtained by feeding F_g into an FC layer. Similar to the local-sensitive contrastive loss, L_C^g is defined as follows:

L_C^g = θ_ij ||f_g(x_i) − f_g(x_j)||_2^2 + (1 − θ_ij) max(0, δ_g − ||f_g(x_i) − f_g(x_j)||_2)^2,  (3)

where f_g(x_i) and f_g(x_j) are the global feature vectors of a pair of training samples. When x_i and x_j have the same expression label, θ_ij = 1; otherwise, θ_ij = 0. δ_g is the margin, which determines how much dissimilar pairs contribute to the loss function; it is set to 10 empirically.

2) SELECTIVE FEATURE UNIT
The SFU is inspired by Selective Kernel (SK) convolution [29], which introduces a dynamic selection mechanism in CNNs that allows each neuron to adaptively adjust its receptive field size based on multiple scales of input information. The main difference between the SFU and SK convolution lies in the motivation: the SFU is designed to learn an adaptive weighted fusion of features, modeling the complementarity between local and global features, while SK convolution aims at adaptively changing the receptive field size. Specifically, the SFU consists of two key steps, fuse and select, as illustrated in Figure 5.

a: FUSE
To compute the adaptive weights for F_e, F_m and F_g, we use gates to control the information flows from multiple branches, each carrying features extracted from a different region, into the next layer. The gates integrate information from all branches. We first obtain the hybrid representation U = F_e + F_m + F_g (U ∈ R^(C×H×W)) from the three branches via element-wise summation. Then, we adopt average pooling to squeeze the spatial dimensions of the hybrid representation and generate channel-wise statistics s ∈ R^C. Further, s is fed to a fully connected layer with a ReLU function and Batch Normalization to reduce the dimensionality, generating a compact feature z ∈ R^(d×1) that guides the precise and adaptive selection.
Here, d is the length of the vector z, determined by the number of output neurons of the fully connected layer that takes s as input. In our experiments, d = C/16.
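The fuse step (element-wise summation, spatial average pooling, then an FC layer reducing C to d = C/16) can be sketched as below; Batch Normalization is omitted for brevity, and the weight matrix `W_fc` is an illustrative stand-in for the learned FC parameters:

```python
import numpy as np

def sfu_fuse(F_e, F_m, F_g, W_fc):
    """Fuse step of the SFU.

    F_e, F_m, F_g: (C, H, W) eyes-, mouth- and global feature maps.
    W_fc:          (d, C) weights of the dimensionality-reduction FC layer,
                   with d = C // 16 as in the paper.
    Returns the compact guiding feature z of length d.
    """
    U = F_e + F_m + F_g              # (C, H, W) hybrid representation
    s = U.mean(axis=(1, 2))          # (C,) channel-wise statistics (avg pooling)
    z = np.maximum(W_fc @ s, 0.0)    # (d,) FC layer + ReLU (BatchNorm omitted)
    return z
```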

b: SELECT
A soft attention across channels, guided by the compact feature descriptor z, is used to adaptively select among the three spatial feature descriptors F_e, F_m and F_g. Specifically, a softmax operator is applied to the channel-wise logits:

e_c = exp(E_c z) / (exp(E_c z) + exp(M_c z) + exp(G_c z)),
m_c = exp(M_c z) / (exp(E_c z) + exp(M_c z) + exp(G_c z)),
g_c = exp(G_c z) / (exp(E_c z) + exp(M_c z) + exp(G_c z)),  (4)

where E, M, G ∈ R^(C×d) are three matrices learned by the model. Each row of E, M and G is used to calculate one element of e, m and g, respectively; e, m and g denote the soft attention vectors for F_e, F_m and F_g. E_c ∈ R^(1×d) is the cth row of E and e_c is the cth element of e, and likewise for M_c and m_c, G_c and g_c; note that e_c + m_c + g_c = 1. The final representation V is obtained by applying the attention weights to the three spatial feature descriptors:

V^c = e_c · F_e^c + m_c · F_m^c + g_c · F_g^c,  c = 1, 2, ..., C.  (5)

The SFU plays two different roles in our method. Firstly, the spatial descriptors (F_e, F_m and F_g) make different contributions to the facial expression recognition task. Across the spatial descriptors, the SFU weights their relative importance: it estimates the weight of each descriptor adaptively and makes a reasonable trade-off and selection among them. Besides, the adaptive weights prevent the model from being overly sensitive to the performance of the AMG. Secondly, as each channel of a feature map can be considered a feature detector, each channel represents features with different semantic information. Within each spatial feature descriptor, the SFU weights the different semantics and finds the most meaningful features.
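The select step of Eqs. 4 and 5 (a per-channel softmax over the three branches followed by a weighted sum of the spatial descriptors) can be sketched as follows; the matrix names follow the text, while the function name is ours:

```python
import numpy as np

def sfu_select(F_e, F_m, F_g, z, E, M, G):
    """Select step of the SFU (Eqs. 4 and 5).

    F_e, F_m, F_g: (C, H, W) spatial feature descriptors.
    z:             (d,) compact guiding feature from the fuse step.
    E, M, G:       (C, d) learnable matrices producing channel-wise logits.
    Returns the fused representation V of shape (C, H, W).
    """
    logits = np.stack([E @ z, M @ z, G @ z])     # (3, C) one logit per branch/channel
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    att = np.exp(logits)
    att /= att.sum(axis=0, keepdims=True)        # softmax: e_c + m_c + g_c = 1
    e, m, g = att                                # (C,) attention vectors
    V = (e[:, None, None] * F_e
         + m[:, None, None] * F_m
         + g[:, None, None] * F_g)               # Eq. 5, per channel
    return V
```

As a sanity check on Eq. 4, if E = M = G every channel receives equal weights of 1/3, and V reduces to the plain average of the three descriptors.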

E. TOTAL LOSS
After the SFU, the final fused representation is fed into the Classifier, a softmax classifier. A softmax loss that calculates the classification error is applied at the end of each component network to ensure that the learned features are meaningful for expression recognition. The total loss of the proposed WS-LGAN is:

L = λ_1 L_C^e + λ_2 L_C^m + λ_3 L_C^g + λ_4 L_S^1 + λ_5 L_S^2,  (6)

where {λ_1, ..., λ_5} are the weights of the individual losses. L_C^e and L_C^m are the local-sensitive contrastive losses for the eyes-related and mouth-related features, respectively; L_C^g is the global-sensitive contrastive loss; and L_S^1 and L_S^2 are the classification errors of the two streams.
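The total objective is a plain weighted sum of the five loss terms; a trivial sketch (the default weights here are illustrative placeholders, not the tuned values reported in the experiments):

```python
def total_loss(L_e, L_m, L_g, L_s1, L_s2, lambdas=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the two local-sensitive contrastive losses,
    the global-sensitive contrastive loss, and the two softmax
    classification errors (one per Siamese stream)."""
    l1, l2, l3, l4, l5 = lambdas
    return l1 * L_e + l2 * L_m + l3 * L_g + l4 * L_s1 + l5 * L_s2
```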

IV. EXPERIMENTS
In this section, we conduct comprehensive experiments to verify the effectiveness of WS-LGAN. We compare our method with state-of-the-art methods on three popular facial expression datasets. To demonstrate the effectiveness of our proposed components, we also perform a series of ablation studies. To illustrate the behavior of the AMG on facial expression datasets, we visualize the attention maps.

A. DATASETS

1) CELEBA DATASET
The CelebA dataset [27] is an additional dataset that we introduce; note that we use it only to train the AMG. The dataset contains 202,599 web-based images, each labeled with 40 facial attributes. We select 7 facial attributes for each image and divide them into two groups based on their respective facial parts, as shown in Table 1. We randomly select 30,000 images (with a 1:1 ratio of positive to negative samples) to train the eyes-related branch and 3,000 images for validation. The mouth-related branch is trained with the same configuration.

2) CK+ DATASET
The CK+ [30] dataset is a lab-controlled dataset consisting of 593 facial expression sequences collected from 123 different subjects. It is an extended version of the Cohn-Kanade (CK) dataset [31]. Its subjects range from 18 to 30 years old, and most are female. Each sequence starts with a neutral facial expression and ends with a peak facial expression. Among these sequences, only 327 sequences from 118 subjects are annotated with one of seven expressions, i.e., Anger (An), Disgust (Di), Fear (Fe), Happiness (Ha), Sadness (Sa), Surprise (Su) and Contempt (Co). Following the general procedure [11], [13], [14], [16], [21]-[23], [46], the last three frames of each sequence are used for training and testing. Thus, CK+ contributes 981 images to our experiments.

3) OULU-CASIA DATASET
The Oulu-CASIA [33] dataset is a lab-controlled dataset containing 480 facial expression sequences collected from 80 different subjects aged between 23 and 58. Similar to CK+, each sequence begins with a neutral expression and ends with a peak expression. All sequences are labeled with one of six expressions: anger, disgust, fear, happiness, sadness or surprise. The last three frames of each sequence are selected for training and testing [11], [21], [23], [32]; hence, 1,440 images are used in total.

4) RAF-DB DATASET
The Real-world Affective Face Database (RAF-DB) [12] is a real-world dataset that contains 29,672 highly diverse facial images downloaded from the Internet. With manual crowd-sourced annotation and reliable estimation, seven basic and eleven compound expression labels are provided for the samples. In our experiments, only images with basic expressions (surprise, fear, disgust, happiness, sadness, anger and neutral) are used, including 12,271 images for training and 3,068 images for testing.

B. EVALUATION PROTOCOL
Because CK+ and Oulu-CASIA do not provide specified training and test sets, we adopt the widely used 10-fold cross-validation strategy, as in previous methods [11], [14], [21], [23], [32], [37], [38], [41], [46]. To ensure the generalization of our model, each dataset is split into ten groups with no subject overlap between the groups. In each run, nine groups are used for training and the remaining one for testing; the reported result is the average over the 10 runs. For the experiments on the RAF-DB dataset, we use the official split for training and testing.
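The subject-independent 10-fold split can be sketched as below. This is a simplified illustration under the assumption that each sample carries a subject identifier; the round-robin assignment is one possible way to balance fold sizes, not necessarily the exact procedure used here.

```python
def subject_folds(samples, n_folds=10):
    """Split (subject_id, image) samples into folds with no subject overlap."""
    subjects = sorted({s for s, _ in samples})
    folds = [[] for _ in range(n_folds)]
    # Round-robin over subjects keeps the folds roughly balanced in size.
    for i, subj in enumerate(subjects):
        folds[i % n_folds].extend(x for x in samples if x[0] == subj)
    return folds
```

Each run then trains on nine folds and tests on the held-out one, guaranteeing that no subject appears in both sets.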

C. IMPLEMENTATION DETAILS
The training of WS-LGAN contains two stages. In the first stage, we train the AMG on CelebA. The initial learning rate is set to 0.1, which is decreased by 0.1 after every 20 epochs. After we obtain the well-trained AMG, we freeze it and transfer it to facial expression datasets. In the second stage, we use the frozen AMG to generate attention maps, and train the remaining part of WS-LGAN jointly.
Following previous work, before training on the target expression datasets, we pre-train WS-LGAN on the FER2013 dataset [35] and then fine-tune it on the target expression datasets. The initial learning rates for pre-training and fine-tuning are set to 0.1 and 0.01, respectively, and are divided by 10 at 50% and 75% of the total training epochs. We optimize the model using Stochastic Gradient Descent (SGD) with a batch size of 100, momentum of 0.9 and weight decay of 0.0005 for all stages. In Eq. 6, λ3 is set to 2, 5 and 2 for CK+, Oulu-CASIA and RAF-DB, respectively, while the other parameters are set to 1 empirically.
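The step schedule above (divide by 10 at 50% and 75% of training) can be written as a small helper. This is a sketch of the stated schedule, not the authors' code.

```python
def learning_rate(epoch, total_epochs, base_lr):
    """Step schedule: the learning rate is divided by 10 at 50% and 75% of training."""
    lr = base_lr
    if epoch >= total_epochs * 0.5:
        lr /= 10
    if epoch >= total_epochs * 0.75:
        lr /= 10
    return lr
```

For example, pre-training with `base_lr=0.1` over 100 epochs runs at 0.1, then 0.01 from epoch 50, then 0.001 from epoch 75.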

D. EXPRESSION RECOGNITION RESULTS
To evaluate the performance of WS-LGAN, we compare it with other competitive approaches using two indicators, namely the feature type (dynamic or static) and the average recognition accuracy.

TABLE 2.
Comparison with state-of-the-art methods on the CK+ dataset [30] in terms of the average recognition accuracy.

TABLE 3.
Confusion matrix of the proposed method evaluated on the CK+ dataset [30]. The ground truth and the predicted labels are given by the first column and the first row, respectively.

Table 2 shows the results of the comparative studies in terms of average accuracy on CK+. Table 3 is the confusion matrix, which illustrates the detailed classification results of all seven expressions; its diagonal entries represent the recognition accuracy for each expression. From Table 2 and Table 3, we draw the following conclusions.

1) RESULTS ON CK+ DATASET.
(1) Our method achieves an average recognition accuracy of 98.06% on CK+. Among the methods that utilize only static images, our result is state-of-the-art. Our performance is slightly lower than that of PHRNN-MSCNN [21], but PHRNN-MSCNN utilizes partial-whole, geometry-appearance and dynamic-still features: motion information is added to its model and its inputs are complex. In contrast, our method needs only static appearance features, which is more favorable for online applications or snapshots where per-frame labels are preferred.
(2) WS-LGAN performs well on disgust, fear and happiness, but its performance on contempt and sadness is poor. The low accuracy on contempt is mainly due to the lack of data: contempt accounts for only 18 of the 327 sequences, far fewer than the other expressions. Besides, sadness and anger are confused in some samples. A reasonable explanation is that sadness and anger share similar actions in local facial regions [42]: in the Facial Action Coding System (FACS), these two expressions share the Action Units (AUs) AU4 (Brow Lowerer) and AU17 (Chin Raiser) [43], [44].
(3) WS-LGAN obtains better recognition accuracy than IACNN [22] and 2B(N+M)Softmax [13], which aim to obtain more discriminative representations of the holistic facial image by metric learning. The results show that paying more attention to local regions facilitates expression classification.

TABLE 4.
Comparison with state-of-the-art methods on the Oulu-CASIA dataset [33] in terms of the average accuracy.

TABLE 5.
Confusion matrix of the proposed method evaluated on the Oulu-CASIA dataset [33]. The ground truth and the predicted labels are given by the first column and the first row, respectively.

2) RESULTS ON OULU-CASIA DATASET
For the Oulu-CASIA dataset, the results are reported in Table 4, and details of the classification results are shown in the confusion matrix in Table 5. From Table 4 and Table 5, we draw the following observations.
(1) Our method achieves an average recognition accuracy of 88.26% on Oulu-CASIA. The performance of WS-LGAN is better than all the state-of-the-art methods, including methods that use dynamic or static features. Note that, Oulu-CASIA is a more challenging dataset, which includes changes in facial attributes, such as with glasses on. WS-LGAN can still correctly classify most expressions, which demonstrates the robustness of the proposed method.
(2) WS-LGAN performs well when recognizing happiness and surprise, reaching accuracies of 95.8% and 96.7%, respectively. However, disgust and anger, as well as sadness and anger, are seriously confused. The main reason is that these pairs act similarly in some facial action units in FACS, such as AU10 (Upper Lip Raiser), AU17 (Chin Raiser), AU25 (Lips Part) and AU26 (Jaw Drop) shared by disgust and anger, and AU4 (Brow Lowerer) and AU17 (Chin Raiser) shared by sadness and anger [42]-[44].

TABLE 6.
Comparison with state-of-the-art methods on the RAF-DB [12] dataset in terms of the average accuracy. Some papers report performance as the average of the diagonal values of the confusion matrix; we convert these to standard accuracy for a fair comparison.
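The AU-overlap argument used in the confusion analyses above can be made concrete with set intersections. The AU sets below are illustrative subsets containing only the AUs named in the text, not complete FACS descriptions of each expression.

```python
# Partial, illustrative AU sets built from the AUs cited in the discussion.
AUS = {
    "anger":   {4, 10, 17, 25, 26},
    "sadness": {4, 17},
    "disgust": {10, 17, 25, 26},
}

def shared_aus(expr_a, expr_b, aus=AUS):
    """Return the Action Units shared by two expressions."""
    return aus[expr_a] & aus[expr_b]
```

The larger the shared set, the more local facial movements the two expressions have in common, which is consistent with the confusions observed in the tables.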

TABLE 7.
Confusion matrix of the proposed method evaluated on the RAF-DB dataset [12]. The ground truth and the predicted labels are given by the first column and the first row, respectively.

3) RESULTS ON RAF-DB DATASET
For the RAF-DB dataset, the comparison results are reported in Table 6, and details of the classification results are shown in the confusion matrix in Table 7. From Table 6 and Table 7, we draw the following observations.
(1) The proposed WS-LGAN achieves an average recognition accuracy of 85.07% on RAF-DB, which is closer to natural scenes. The performance of WS-LGAN surpasses all compared methods, which proves that our method is robust to both lab-controlled and real-world facial expression datasets.
(2) The highest accuracy is obtained when recognizing happiness, reaching 93.8%. However, the performance on anger, disgust and fear is poor. This is mainly due to the lack of data: in RAF-DB, the samples of anger, disgust and fear are far fewer than those of the other expressions.

E. ABLATION STUDIES
The performance of the network is mainly determined by the following four components: global features, local features, SFU and contrastive loss. To assess these four components, we conduct some ablation experiments on the CK+ dataset to evaluate their effect on recognition.

1) THE EFFECTS OF FEATURE FUSION
We construct two additional models for this evaluation. The model that utilizes only global features for classification is denoted GFNet, and the model that recognizes expressions with only local features is denoted LFNet. The recognition performances of these two models are listed in Table 8.
From Table 8, we can observe that the recognition accuracy of WS-LGAN is much higher than that of GFNet and LFNet, which means that facial expression recognition benefits from feature fusion. This is reasonable, as global features and local features each represent expressional information from a specific aspect: the global feature captures the integrity of the expression, while the local features focus on the subtle traits of local regions. The improvement in recognition accuracy brought by fusion indicates that these two types of features are complementary to each other.

2) THE EFFECTS OF THE SFU
In our model, feature fusion is crucial to the final recognition performance. In addition to the SFU, we also explore the properties of sum fusion and concatenation fusion. Sum fusion computes the sum of all feature maps at the same spatial location and feature channel. Since the channel numbering is arbitrary, sum fusion defines an arbitrary combination between feature channels. The model with sum fusion is denoted as WS-LGAN-Sum. Concatenation fusion stacks the two feature maps at the same spatial location across the feature channels. All features are treated with the same confidence. The model with concatenation fusion is denoted as WS-LGAN-Concat. Experimental results on the CK+ dataset are summarized in Table 9. Our WS-LGAN achieves the highest accuracy by fusing features through the SFU. The excellent performance of the SFU can be attributed to the adaptive weighted mechanism among different features.
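The three fusion schemes compared above can be sketched on plain feature vectors. This is a minimal illustration: in the actual network the SFU weights are learned, whereas here the per-branch scores are passed in as arguments.

```python
import math

def sum_fusion(g, l):
    """Element-wise sum: every channel pair is combined with fixed, equal weight."""
    return [a + b for a, b in zip(g, l)]

def concat_fusion(g, l):
    """Concatenation: both feature vectors are kept, treated with the same confidence."""
    return g + l

def sfu_fusion(g, l, score_g, score_l):
    """SFU-style adaptive fusion: a softmax over branch scores gives the weights."""
    eg, el = math.exp(score_g), math.exp(score_l)
    wg, wl = eg / (eg + el), el / (eg + el)
    return [wg * a + wl * b for a, b in zip(g, l)]
```

With equal scores, `sfu_fusion` reduces to averaging; learned scores let the network emphasize whichever branch is more reliable for a given sample.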

3) THE EFFECTS OF CONTRASTIVE LOSS
In this experiment, the model without contrastive loss is denoted WS-LGAN-WCL; in other words, WS-LGAN-WCL uses only the softmax loss as the supervision signal to optimize the parameters. We compare the performance of WS-LGAN-WCL with the proposed model. The recognition results are shown in Table 10.
From Table 10, we can see that the proposed model performs better than WS-LGAN-WCL. This is reasonable: the softmax loss forces the features of different expressions apart, but it does not impose a strong constraint to reduce the variations within identical expressions. The two local contrastive losses and the global contrastive loss, corresponding to the local and global representations, work together to push our model to focus on expression details at different granularities. With the joint supervision of the softmax loss, the local contrastive losses and the global contrastive loss, not only are the inter-class feature differences enlarged, but the intra-class feature variations are also reduced. Hence the discriminative power of the learned features is greatly enhanced. The improvement in recognition accuracy demonstrates the effectiveness of the contrastive loss.
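The behavior described above can be illustrated with the standard pairwise contrastive loss. This is a sketch of the generic formulation; the margin value and feature layout are placeholders, not the paper's exact settings.

```python
import math

def contrastive_loss(f1, f2, same_class, margin=1.0):
    """Pulls same-class features together and pushes different-class features
    at least `margin` apart in Euclidean distance."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))
    if same_class:
        return d ** 2                      # intra-class compactness term
    return max(0.0, margin - d) ** 2       # inter-class dispersion term
```

Applying this loss separately to the local and global representations is what enforces compactness and dispersion at the two granularities.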

F. VISUALIZATION OF ATTENTION MAPS
In Figure 6, we visualize the attention maps generated by transferring the AMG to the CK+ and RAF-DB datasets to demonstrate the effectiveness of weakly supervised attention learning. Rectangular boxes of different colors contain the visualized results of different expressions. Within each box, the first column shows the original images, the second column the eye-related attention maps, and the last column the mouth-related attention maps. We can see that, regardless of the person or expression in the image, our model always accurately locates the eye and mouth regions. This provides efficient and accurate guidance for the extraction of local features, and it avoids introducing many unrelated factors compared with using all face patches.
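For reference, attention maps are typically min-max normalized to [0, 1] before being rendered as overlays; the sketch below is our assumption about the visualization step, treating a map as a flat list of activations.

```python
def normalize_map(att):
    """Min-max normalize an attention map to [0, 1] for visualization."""
    lo, hi = min(att), max(att)
    if hi == lo:                      # constant map: nothing to highlight
        return [0.0 for _ in att]
    return [(v - lo) / (hi - lo) for v in att]
```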

V. CONCLUSION
In this paper, we propose a Weakly Supervised Local-Global Attention Network that performs facial expression recognition with joint use of local and global features. Our approach shows how to directly locate the crucial regions and extract the corresponding local features under weak supervision. The Selective Feature Unit is designed to fuse local and global features in an adaptive manner, enabling the two types of features to complement each other and boost recognition performance. Besides, contrastive loss is introduced for both local and global features to increase inter-class differences and decrease intra-class variations at different granularities. Experimental results on three databases demonstrate that our proposed method achieves state-of-the-art performance.
Furthermore, the approach of perceiving crucial local regions proposed in this work has potential application value for other face-related tasks, such as face detection, face alignment and face attribute manipulation.