Self-Attention-Masking Semantic Decomposition and Segmentation for Facial Attribute Manipulation

Many face attribute manipulation methods can only provide global attribute manipulation according to the attribute labels. In this paper, we propose a self-attention-masking semantic decomposition method that learns an attribute attention mask for each attribute. Users can adjust the strength and color of each attribute smoothly and more freely. We decouple the attention of different attributes and overcome the overlap between different attribute attention masks with an attention weighting module. Thanks to the attribute attention masks, our method allows facial attributes to be manipulated without the generator after a single generation pass. Moreover, we can perform facial semantic segmentation without pixel-level semantic labels. Experiments show that our method simultaneously improves the freedom of attribute manipulation and the authenticity of the synthetic faces. The mean intersection over union of semantic segmentation is over 65% for hair and skin. Our code is available at github.com/flyfeatherok/SAMSD.


I. INTRODUCTION
Face attribute manipulation is an interesting but challenging task with many real-world vision applications. It has improved significantly since the introduction of generative adversarial networks (GAN) [1], which enable functions such as changing facial expressions, adding eyeglasses, and transferring styles (e.g., hair color, beautification/de-beautification).
As mentioned in [2], facial attribute manipulation tasks can be roughly categorized into two types: semantic-level manipulation [3]-[6] and geometry-level manipulation [7], [8]. Early approaches such as StarGAN [3] and AttGAN [9] provide a basic generator architecture and training strategy for semantic-level manipulation. However, they can only provide global attribute manipulation according to the attribute label and cannot be customized freely (e.g., you cannot adjust the hair color to green because there is no such attribute label). In contrast, geometry-level manipulation methods offer a higher degree of user freedom: the user can guide the system to fix the image when the result is not as expected. However, most of them rely on expensive pixel-level semantic labels for training.
The associate editor coordinating the review of this manuscript and approving it for publication was Yongjie Li.
It is highly desirable to adjust both the strength and the color of each face attribute smoothly at the same time, but few solutions can do so. SCDFM [10] provides one solution: it divides a high-level attribute edit into multiple semantic components, each of which works on one semantic region of a human face. It is the first attempt to learn semantic components from high-level attributes. However, SCDFM is difficult to train on multiple attributes and needs a pretrained VGG network as the encoder.
In this paper, we propose a GAN-based self-attention-masking semantic decomposition method which, unlike SCDFM, generates an attribute attention mask (AAM) for each attribute. The AAMs are fused into a general attention mask (GAM) for all attributes by an attention weighting module (AWM). Hence our method can manipulate the color and strength of a single attribute more freely, such as hair color, eyeglasses color, and gender swap strength, without interfering with the effect of other attribute manipulations. Moreover, the attention mask of a single attribute gives us the opportunity to segment facial regions automatically, without supervision by semantic segmentation labels. Our method not only allows the color and strength of different attributes to be adjusted, but also allows them to be manipulated freely without the generator after a single generation pass. Figure 1 demonstrates some attribute manipulation examples produced by our method. The generator outputs a single attention mask for each attribute, shown in the first row, regardless of whether the attribute is changed. We can then adjust the color of the attribute area arbitrarily using only the AAM, as shown in the second row. Note that the AAM allows us to adjust not only the color but also the strength of the attribute without the generator, as described in a later section.
To summarize, our contributions are as follows:
1. We propose a self-attention-masking framework for face attribute manipulation, which is able to learn an attribute attention mask for each attribute semantic.
2. Our method can edit the color and strength of a single attribute more freely, benefiting from the attention masks obtained by semantic decomposition among different attributes. It is able to edit quickly without the generator.
3. Our method can automatically perform simple semantic segmentation of some facial areas, such as hair and skin, without semantic segmentation labels or any location labels. To the best of our knowledge, it is the first attempt at facial semantic segmentation using only image-level attribute labels.

II. RELATED WORK

A. GAN BASED FACIAL ATTRIBUTE MANIPULATION
Since the success of GANs for image-to-image (I2I) translation [11], several methods have used GANs to build general face attribute manipulation frameworks. For unpaired I2I translation tasks, CycleGAN [12] and its variants provide a way to evaluate image semantic consistency using only the images themselves. This makes face attribute manipulation easier, without attribute disentangling in a deep space. Typical approaches such as StarGAN [3] and AttGAN [9] confirmed that a single generator-discriminator pair is sufficient for face attribute manipulation and can achieve remarkable translation effects. However, attribute labels alone are not enough for accurate face attribute manipulation, and there is still considerable room for improvement.
Residual learning [6], [13] enables the network to learn the changing parts of the image while retaining other areas, which inspired the study of attention guidance. Many researchers have since noticed that the accuracy and freedom of attribute editing depend on the generator's attention, guided by the input information. With a training strategy and network structure similar to [13], GANimation [14] uses an attention mask to automatically locate the key areas for efficient attribute manipulation without affecting irrelevant areas. STGAN [15] uses attribute vectors to guide the generator's attention. The generator in STGAN only reconstructs the image when the attribute vector is zero, so it can learn to distinguish between key areas and backgrounds. On the other hand, supervised learning with semantic segmentation labels provides the capability of precise geometry-level manipulation. Attribute manipulation becomes very efficient since the generator can attend to the edited area directly through the semantic mask, as in SC-FEGAN [8] and MaskGAN [2]. However, their training is complex and the semantic annotation is expensive.

B. ATTRIBUTE SEMANTIC DECOMPOSITION
Although GANimation provides an attention mask for key attribute areas, the mask cannot be decomposed among attributes. Meanwhile, deep feature interpolation (also called latent space interpolation) [6], [16] has been employed for face attribute manipulation. By shifting the deep features of the query image with the attribute vectors in latent space, the semantic facial attributes can be updated accordingly. ELEGANT [6] even decouples the attributes in the latent space, but has to manipulate the attribute using target images. Based on this, Facelet [17] and SCDFM [10] provide two deep feature interpolation solutions without adversarial learning and paired data. Facelet proposes a Facelet-Bank framework that models face effects with respective middle-level convolutional layers. SCDFM divides a high-level attribute edit into multiple semantic components, each of which works on one semantic region of a human face, so users can make finer adjustments. It allows the edit strength of different components to be adjusted and the edit effect on each component to be manipulated. It is the first attempt to learn semantic components from high-level attributes. However, SCDFM is difficult to train on multiple attributes since it has no control over the number of decompositions or their correspondence to attributes. Both Facelet and SCDFM need a pretrained VGG network for training, which may restrict their scope of application.

C. WEAKLY SUPERVISED SEMANTIC SEGMENTATION
The attention mask of a single attribute gives us the opportunity to segment facial regions using only image-level labels. In contrast, most semantic segmentation methods rely on pixel-level annotations, which require extremely expensive labeling effort.
After FCN [18] and U-net [19] established the basic semantic segmentation network architecture under supervised learning, researchers have also strived to leverage weak supervision instead, such as multiple instance learning [20], the EM algorithm [21], and constrained CNN [22], or semi-supervision by additionally using a few pixel-wise segmentation labels [23], [24]. Similar to this paper, some weakly supervised methods [25], [26] use attention masks and classification tags, and achieve an excellent level of semantic segmentation. However, the semantics of human face attribute labels overlap with each other on the face (e.g., ''gender'' and ''age'' share almost the same facial area), so it is difficult to obtain an accurate attention mask simply by applying a classification loss in the I2I translation task.

III. SELF-ATTENTION-MASKING SEMANTIC DECOMPOSITION
Given an original image $I_o \in \mathbb{R}^{h \times w \times 3}$ and the corresponding attribute label $s_o \in \mathbb{R}^{1 \times c}$, where $h \times w$ is the size of $I_o$ and $c$ is the number of attribute categories, we expect our model to generate a group of attribute attention masks $M \in \mathbb{R}^{h \times w \times c}$ that can be used to control the strength of each attribute change, and to freely synthesize an image $I_t$ with target attributes $s_t$ from $M$ and a color mask $I_c \in \mathbb{R}^{h \times w \times 3}$.

A. OVERALL FRAMEWORK
For the same reasons mentioned in [14], we define the difference attribute vector $v_s$, which is fed into the generator, as the difference between the target and source attribute labels:
$$v_s = s_t - s_o, \tag{1}$$
where $s_t$ is the target attribute label and $s_o$ is the source attribute label. Only the attributes to be changed should be considered, to prevent faulty manipulation. In GANimation, the attention mask changes with the attribute labels if the attention mask and color mask share the same generator (e.g., the attention mask over the hair area is zeroed if the hair attributes are unchanged). However, for our purpose, the scope and intensity of attention must be decoupled in the generator in order to stabilize the semantic segmentation results. Hence, as shown in Figure 2, the color mask $I_c$ and the AAMs $M$ are generated by a color mask generator $G_c$ and an attention mask generator $G_a$, respectively. Our generator $G$ consists of these two parts.
As shown in Figure 2, the color mask generator $G_c$ consists of several strided convolutional layers that down-sample the input, six adaptive residual blocks [27], and several convolutional layers for up-sampling. We equip the adaptive blocks with AdaIN [28] layers:
$$\mathrm{AdaIN}(\omega, \gamma, \beta) = \gamma \, \frac{\omega - \mu(\omega)}{\sigma(\omega)} + \beta, \tag{2}$$
where $\omega$ is the activation produced by the previous convolutional layer, $\mu$ and $\sigma$ are the channel-wise mean and standard deviation, and $\gamma$ and $\beta$ are parameters generated from the attribute vector $v_s$ by a 4-layer multilayer perceptron (MLP). The attention mask generator $G_a$ follows a basic U-net structure: several strided convolutional layers for down-sampling, several convolutional layers for up-sampling, and several skip connections between them. The two generators share the same down-sampling path to save parameters.
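To make the AdaIN operation concrete, the following is a minimal sketch in plain Python that normalizes one channel and then applies a scale and shift. It assumes the standard AdaIN form; the function name `adain`, the flat-list layout, and the stabilizing constant `eps` are illustrative assumptions, and in the actual network the scale and shift would come from the MLP rather than being passed by hand.

```python
import math

def adain(channel, gamma, beta, eps=1e-5):
    """Apply AdaIN to one channel given as a flat list of activations.

    gamma and beta are passed in directly here (assumption for illustration);
    eps is a small constant added for numerical stability.
    """
    n = len(channel)
    mean = sum(channel) / n
    var = sum((x - mean) ** 2 for x in channel) / n
    std = math.sqrt(var + eps)
    # Normalize to zero mean and unit variance, then rescale and shift.
    return [gamma * (x - mean) / std + beta for x in channel]

omega = [0.5, 1.5, 2.5, 3.5]
out = adain(omega, gamma=2.0, beta=1.0)
```

After the call, the channel has a mean close to `beta` and a standard deviation close to `gamma`, which is how the attribute vector can steer the statistics of the color mask features.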
Note that $G_a$ does not depend on $v_s$, so the attention of different attributes on a face remains stable. However, $G_a$ cannot automatically generate masks that correspond one-to-one to attributes without any guidance. Hence an attention weighting module (AWM) is proposed to guide the generation of the AAMs and the synthesis of the GAM $I_m$, which will be introduced in the next section.
Finally, $I_t$ is synthesized from $I_o$, $I_c$, and $I_m$ with element-wise products:
$$I_t = I_m \cdot I_c + (1 - I_m) \cdot I_o. \tag{3}$$
In this way, the generator can focus exclusively on the pixels defining the facial attribute changes, leading to more realistic synthetic images. Meanwhile, it retains the attention mask for each attribute, regardless of whether the attribute has changed. Similar to StarGAN, a discriminator $D$ containing an attribute classifier is used to distinguish the real image $I_o$ from the fake image $I_t$. The attribute classifier outputs the attribute estimate $\hat{s}$ and ensures that $I_t$ has the specified attributes $s_t$. The specific parameters of our network structure are detailed in the appendix.
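The synthesis step can be illustrated with a small per-pixel blend. This sketch assumes the blend takes the form I_t = I_m * I_c + (1 - I_m) * I_o, which is consistent with the property that the output equals the input wherever the attention is zero. The function name `synthesize` and the list-of-tuples image layout are hypothetical choices for illustration.

```python
def synthesize(i_o, i_c, i_m):
    """Blend original pixels and color-mask pixels using the attention mask.

    i_o, i_c: lists of (r, g, b) tuples with values in [0, 1].
    i_m: list of attention values in [0, 1], one per pixel.
    Assumed form: I_t = I_m * I_c + (1 - I_m) * I_o.
    """
    return [
        tuple(m * c + (1.0 - m) * o for o, c in zip(px_o, px_c))
        for px_o, px_c, m in zip(i_o, i_c, i_m)
    ]

i_o = [(0.2, 0.4, 0.6), (0.8, 0.8, 0.8)]
i_c = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
i_m = [0.0, 0.5]          # first pixel untouched, second pixel half blended
i_t = synthesize(i_o, i_c, i_m)
```

With zero attention the first pixel is copied from the original image unchanged, which is the behavior that keeps the background intact.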
GANimation reports that attention masks can easily saturate to 1 without ''total variation regularization''. We found that this problem can be solved easily by adding a self-attention module [29] to the discriminator. This may be because the self-attention module in the discriminator passes the key information about the attribute region to the generator more efficiently.

B. ATTENTION WEIGHTING MODULE
The synthesis of $I_m$ and the generation of the AAMs pose two difficulties:
1. Generating a mask for each attribute and decoupling the masks.
2. Synthesizing $I_m$ without affecting the overlapping regions of the AAMs.
For the former difficulty, we use the absolute value of $v_s$ as the attribute strength indicator to update the AAMs, scaling each AAM by the corresponding entry of $|v_s|$. According to (1), the value of a changed attribute in $v_s$ lies in $[-1, 0) \cup (0, 1]$, and the value of an unchanged attribute in $v_s$ is 0. Therefore, $|v_s|$ determines which masks are activated and the strength of activation. This forces $G_a$ to decompose the attention of different attributes into different masks. For example, the AAM corresponding to the hair color attribute must attend only to the hair area, because only this AAM is activated when only the hair color attribute changes, as shown in Figure 3. However, the mask value also decreases for small $v_s$, lowering the strength of manipulation. This problem is mitigated next by the attention weighting module.
The values in $I_m$ must lie between 0 and 1, hence $I_m$ cannot simply be the sum of the AAMs. One plausible option is to take the maximum value over all AAMs at each pixel location. However, the maximization operation does not propagate gradients effectively in back-propagation. Therefore, we use an attention weighting module to resolve the latter difficulty. In this module, $|v_s|_i$ is the $i$-th value of $|v_s|$, $M_i(m, n)$ is the value of the $i$-th AAM at pixel location $(m, n)$, $\alpha \in [1, 2]$ is a scalar that controls the attention weight, and $\varepsilon$ is a small constant that prevents division by 0. The reason for $\alpha \in [1, 2]$ is as follows. On the one hand, the values of $I_m$ must be between 0 and 1, which is guaranteed for $\alpha \geq 1$. On the other hand, in the extreme case where there are only some small values in the $i$-th AAM and 0 in the rest, $I_m(m, n) = |v_s|_i \, M_i(m, n)^{\alpha - 1}$. Here $\alpha$ must be smaller than 2; otherwise the value of $I_m$ becomes smaller than the mask value, i.e., the attention becomes weaker. Conversely, if $\alpha < 2$, even weak attention areas are enhanced, which ensures the attention strength. With this attention weighting module, the overlapping areas of different AAMs can be properly fused together, while non-overlapping areas are less affected if $\alpha$ is not too large. A large $\alpha$ exaggerates the attention differences among AAMs and makes the weak weaker. We found that $\alpha = 1.6$ and $\varepsilon = 0.01$ are appropriate in our experiments.
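The two stated properties of the weighting (output bounded in [0, 1], and reduction to |v_s|_i M_i^(alpha-1) when only one mask is nonzero at a pixel) can be explored with a small sketch. The specific formula used below, sum_i |v_s|_i M_i^alpha / (sum_i M_i + eps), is an assumption chosen only because it satisfies those two properties; it should not be read as the exact definition of the module. The function name `attention_weighting` and the flat-list mask layout are also hypothetical.

```python
def attention_weighting(masks, v_abs, alpha=1.6, eps=0.01):
    """Fuse per-attribute masks into one general mask, pixel by pixel.

    masks: list of c masks, each a flat list of values in [0, 1].
    v_abs: list of c activation strengths |v_s|_i in [0, 1].
    ASSUMED form (illustration only):
        I_m = sum_i |v_s|_i * M_i**alpha / (sum_i M_i + eps)
    """
    n_pixels = len(masks[0])
    fused = []
    for p in range(n_pixels):
        num = sum(v * m[p] ** alpha for v, m in zip(v_abs, masks))
        den = sum(m[p] for m in masks) + eps
        fused.append(num / den)
    return fused

masks = [[0.9, 0.2, 0.0],   # e.g. a strongly activated region, a weak one, background
         [0.1, 0.0, 0.0]]
fused = attention_weighting(masks, v_abs=[1.0, 1.0])
```

Under this assumed form, the second pixel (a single weak mask value of 0.2) is raised above 0.2, matching the remark that weak attention is enhanced when alpha is below 2, while the first pixel stays close to the larger of the two overlapping values.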
To summarize, the AWM acts as an attribute switch, forcing the specified attention channel to generate the corresponding attribute attention mask. Meanwhile, the GAM output by the AWM still contains the full attention of all the changed attributes. Apart from that, the conduction of attention in the generator follows the same path as in GANimation, which is why our framework works.

C. LOSS FUNCTIONS
1) CYCLE CONSISTENCY LOSS
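The cycle consistency loss guarantees that translated images preserve the content of the input images. In this paper, it is defined as
$$\mathcal{L}_{cyc} = \mathbb{E}\left[ \, \| I_o - G(G(I_o, v_s), -v_s) \|_1 \right],$$
where $\| \cdot \|_1$ denotes the L1 norm and $G$ is the generator that contains $G_c$ and $G_a$. According to (3), $G(I_o, v_s)$ can be written in more detail as
$$G(I_o, v_s) = I_m \cdot G_c(I_o, v_s) + (1 - I_m) \cdot I_o,$$
where $I_m$ is the GAM obtained by the AWM from the AAMs $G_a(I_o)$ and $|v_s|$.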

2) ATTRIBUTE CLASSIFICATION LOSS
This objective has two terms: a loss on real images used to optimize $D$, and a loss on fake images used to optimize $G$. The former is defined as
$$\mathcal{L}_{cls}^{r} = \mathbb{E}\left[ -\log D_{cls}(s_o \mid I_o) \right],$$
where $D_{cls}(s_o \mid I_o)$ represents a probability distribution over attribute labels computed by $D$. $D$ learns how to classify facial attributes through this loss.
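On the other hand, $G$ tries to generate images that can be classified as having the target attributes $s_t$. Hence the loss is defined as
$$\mathcal{L}_{cls}^{f} = \mathbb{E}\left[ -\log D_{cls}(s_t \mid G(I_o, v_s)) \right].$$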

3) ADVERSARIAL LOSS
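We adopt the Wasserstein GAN adversarial loss with gradient penalty [30], [31] to address the problem of mode collapse. It is defined as
$$\mathcal{L}_{adv} = \mathbb{E}\left[ D_{adv}(I_o) \right] - \mathbb{E}\left[ D_{adv}(G(I_o, v_s)) \right] - \lambda_{gp} \, \mathbb{E}\left[ \left( \| \nabla_{\hat{I}} D_{adv}(\hat{I}) \|_2 - 1 \right)^2 \right],$$
where $\hat{I}$ is sampled uniformly along a straight line between a pair of real and generated images. $G$ generates a fake image $I_t$, while $D_{adv}$ tries to distinguish between real and fake images through this loss. $\lambda_{gp}$ is a hyper-parameter; we use $\lambda_{gp} = 10$ for all experiments.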

4) FULL OBJECTIVE
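The objectives for optimizing $D$ and $G$ are
$$\mathcal{L}_D = -\mathcal{L}_{adv} + \lambda_{cls} \, \mathcal{L}_{cls}^{r},$$
$$\mathcal{L}_G = \mathcal{L}_{adv} + \lambda_{cls} \, \mathcal{L}_{cls}^{f} + \lambda_{cyc} \, \mathcal{L}_{cyc},$$
where $\mathcal{L}_{adv}$ is the adversarial loss, $\mathcal{L}_{cls}^{r}$ and $\mathcal{L}_{cls}^{f}$ are the attribute classification losses on real and fake images, $\mathcal{L}_{cyc}$ is the cycle consistency loss, and $\lambda_{cls}$ and $\lambda_{cyc}$ are hyper-parameters. We use $\lambda_{cls} = 10$ and $\lambda_{cyc} = 10$ in all of our experiments.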

IV. EXPERIMENTS
According to the characteristics of our method, the experiments are divided into the following three parts:
1. Face attribute manipulation with generator. This section performs standard I2I translation tasks.
2. Face attribute manipulation without generator. Since our method can generate the AAM corresponding to each face attribute, we can further adjust the strength of the attribute transformation and the color of the attribute area using the AAM without the generator. This part of the experiments therefore demonstrates the flexibility and effectiveness of our method.
3. Facial semantic segmentation. The AAMs and prior knowledge can be used for rough semantic segmentation of the face area. This section shows the results and performance evaluation of the facial semantic segmentation.

A. IMPLEMENTATION DETAILS
1) BASELINE MODELS
We choose the state-of-the-art StarGAN, GANimation, and STGAN as our baselines. The performance of other existing methods for two-domain I2I translation, such as DIAT [32] and CycleGAN, and for facial attribute transfer, such as IcGAN [33] and [13], has been discussed in detail in [3] and [9]. StarGAN and AttGAN surpass them by significant margins. Therefore, we omit them to save space.
In StarGAN, the attribute labels are combined with the image by depth-wise concatenation, and the cycle consistency loss is used to preserve domain-unrelated contents. The generator of GANimation provides an attention mask to better preserve domain-unrelated contents. StarGAN and GANimation use almost the same loss function. STGAN follows a basic U-net structure with selective transfer units and an attribute vector. STGAN reduces the error of image reconstruction but almost doubles the number of parameters and the training time.

2) DATASETS
CelebA [34]. The CelebA dataset contains 202,599 face images of celebrities with 40 binary attributes. We use the 5-point landmarks to align all face images, then crop and resize them to 128×128 and 256×256. As in StarGAN, we randomly select 2,000 images as the test set and use the remaining images as training data. We use the following attributes: gender (male/female), skin color (pale/not pale), hair color (black, blond, brown, gray), eyeglasses (with/without), smiling (with/without), and age (young/old).

3) TRAINING DETAILS
Our model is trained using Adam [36] with $\beta_1 = 0.5$ and $\beta_2 = 0.999$. The batch size is set to 16 for the CelebA dataset. We flip the images horizontally with a probability of 0.5 for data augmentation. We perform one generator update after every five discriminator updates. We train our model with an initial learning rate of 0.0001 for the first 10 epochs and linearly decay the learning rate to 0 over the next 10 epochs (one epoch is 10,000 iterations). We use only one AAM for the four hair color attributes since their corresponding areas are the same. Therefore, the four hair color entries of $|v_s|$ are merged into one by summing and clipping after entering the AWM. Training takes about 17 hours on a single NVIDIA RTX 2080Ti GPU.
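The learning rate schedule described above (constant for 10 epochs, then linear decay to 0 over the next 10 epochs, with 10,000 iterations per epoch) can be written as a small function. This is a sketch only; the function name `learning_rate` and its parameter names are illustrative and do not come from the released code.

```python
def learning_rate(iteration, base_lr=1e-4, iters_per_epoch=10000,
                  constant_epochs=10, decay_epochs=10):
    """Return the learning rate at a given training iteration."""
    constant_iters = constant_epochs * iters_per_epoch
    decay_iters = decay_epochs * iters_per_epoch
    if iteration < constant_iters:
        return base_lr
    # Fraction of the decay phase completed, capped at 1.
    progress = min(iteration - constant_iters, decay_iters) / decay_iters
    return base_lr * (1.0 - progress)

lr_start = learning_rate(0)
lr_mid_decay = learning_rate(150000)
lr_end = learning_rate(200000)
```

Halfway through the decay phase (iteration 150,000) the rate is half of the initial value, and it reaches 0 at iteration 200,000.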

B. FACIAL ATTRIBUTE MANIPULATION WITH GENERATOR
An ablation study of semantic decomposition is carried out first. We trained a network without $G_a$, with everything else unchanged. Like GANimation, it outputs only one color mask and only one attention mask from a single generator, without semantic decomposition. We train both networks with the same parameters and observe the difference between their attention masks under the same interpolation. Some ablation study results are shown in Fig. 4. Although it is difficult to tell the generated images apart by eye for the same interpolation, their attention masks are different, and attention is more stable with our method. For example, with semantic decomposition, the attention in the red box changes only in strength as the interpolation changes (greater interpolation, stronger attention). In contrast, without semantic decomposition, the attention in the green box is unstable: the region of attention changes with the interpolation (e.g., there is no attention in the chin area when the interpolation is 0.5). The ablation study shows that our method contributes to attention stability.
Secondly, we compare our method with StarGAN, GANimation, and STGAN. We retrain all of them for a fair comparison. In the original paper, GANimation is trained with action units (AUs); we do not use AUs, for fairness. The qualitative results are shown in Fig. 5. It can be observed that some of the StarGAN results show a certain level of blur and artifacts, and StarGAN cannot accurately reconstruct the details and colors of the background. GANimation, STGAN, and our method produce much better results. The results of STGAN look real, but the background is still inevitably affected. By contrast, the background remains intact with GANimation and our method. However, some results of GANimation may lose details, as with StarGAN (e.g., the mole on the old man's face disappears).
However, our method still shows a certain level of artifacts for attributes such as gender, pale skin, and eyeglasses. We speculate that this may be because $I_c$ lacks constraints in training, which is supported to some extent by the better performance of STGAN.
To quantify the performance of the different methods, we recruited six volunteers (5 male and 1 female) for a user study, as shown in Table 1. Each volunteer was asked to evaluate 50 × 4 × 5 generated faces of size 128 × 128 from 50 persons (half of these faces come from our own collection). Every person has five transformations: gender swap (G), hair color (H), eyeglasses adding (E), age swap (A), and smiling swap (S). Volunteers were asked which image is more realistic (the images are randomly shuffled). We can draw some conclusions from Table 1: our method is better than StarGAN and GANimation, but worse than STGAN in gender swap, eyeglasses adding, and smiling swap. In general, our method performs close to STGAN in the I2I translation task.
In particular, our method has a great advantage in image reconstruction since $I_t = I_o$ when $v_s = 0$. The peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) of the reconstructed images are 22.80/0.819 for StarGAN and 31.67/0.948 for STGAN, as reported in [15]. By contrast, the PSNR/SSIM of the reconstructed image of our method is ∞/1.
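The infinite PSNR follows directly from the definition: when the reconstruction is identical to the input, the mean squared error is zero. A minimal sketch using the standard PSNR definition is shown below; the helper name `psnr`, the flat-list image layout, and the peak value of 1.0 are assumptions for illustration.

```python
import math

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat images."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        # Identical images: no reconstruction error at all.
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)

image = [0.1, 0.5, 0.9, 0.3]
psnr_identical = psnr(image, image)
psnr_noisy = psnr([0.0, 0.0], [0.1, 0.1])
```

Here `psnr_identical` is infinite, while a uniform error of 0.1 on a unit-range image gives about 20 dB.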
However, as mentioned earlier, our goal is facial semantic decomposition and segmentation rather than better image-to-image translation. Hence the performance evaluation of our method for I2I translation is not the focus of this paper. The main advantages of our method are described in detail in the next two sections.

C. FACIAL ATTRIBUTE MANIPULATION WITHOUT GENERATOR
The AAMs remain stable since $G_a$ does not depend on $v_s$, and each AAM covers the area of the corresponding attribute. Hence, we can manipulate the color and strength of a single attribute without the generator once we have the AAMs. Color manipulation can be achieved simply by adjusting the pixel values with the AAM, where $C$ is the color adjustment value, $M_c$ is the AAM corresponding to the attribute to be manipulated, and $[\cdot]$ denotes clipping the value to the effective color range. Figure 6 illustrates the results of arbitrary manipulation of hair color. Hair color can be controlled at will through the hair mask and is no longer restricted by the attribute labels.
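One simple way to realize such a mask-guided color adjustment is to add the color adjustment value to each pixel, weighted by the mask, and then clip. The additive form used in this sketch is an assumption, as are the function name `shift_color`, the list-of-tuples image layout, and the [0, 1] color range.

```python
def shift_color(image, mask, shift, lo=0.0, hi=1.0):
    """Shift pixel colors inside a masked region and clip to the valid range.

    image: list of (r, g, b) tuples; mask: list of values in [0, 1];
    shift: (dr, dg, db) color adjustment value.
    ASSUMED form (illustration only): clip(pixel + shift * mask).
    """
    return [
        tuple(min(hi, max(lo, ch + d * m)) for ch, d in zip(px, shift))
        for px, m in zip(image, mask)
    ]

image = [(0.2, 0.2, 0.2), (0.5, 0.5, 0.5)]
hair_mask = [1.0, 0.0]                    # only the first pixel belongs to the hair
greener = shift_color(image, hair_mask, shift=(-0.5, 0.6, -0.5))
```

Pixels outside the mask are left unchanged, which is why the edit needs no further generator pass.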
On the other hand, according to (3),
$$I_o = \frac{I_t - I_m \cdot I_c}{1 - I_m}, \tag{15}$$
where $I_t$, $I_c$, and $M$ (and hence $I_m$) are already known after the generation of $I_t$. Hence, we can reconstruct $I_o$ without the generator.
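The generator-free reconstruction amounts to inverting the blending step. The sketch below assumes the blend has the form I_t = I_m * I_c + (1 - I_m) * I_o and solves it for I_o; the function names and the small guard `eps` for pixels where the attention is close to 1 are illustrative assumptions.

```python
def blend(i_o, i_c, i_m):
    """Forward synthesis, assumed form: I_t = I_m * I_c + (1 - I_m) * I_o."""
    return [m * c + (1.0 - m) * o for o, c, m in zip(i_o, i_c, i_m)]

def recover_original(i_t, i_c, i_m, eps=1e-6):
    """Invert the blend for I_o; eps guards pixels where I_m is close to 1."""
    return [(t - m * c) / max(1.0 - m, eps) for t, c, m in zip(i_t, i_c, i_m)]

# Flattened single-channel values, for illustration only.
i_o = [0.2, 0.8, 0.5]
i_c = [1.0, 0.0, 0.3]
i_m = [0.0, 0.5, 0.9]
i_t = blend(i_o, i_c, i_m)
restored = recover_original(i_t, i_c, i_m)
```

The round trip returns the original values up to floating-point error. Where the attention equals exactly 1, the original pixel contributes nothing to the output and cannot be recovered this way.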
What's more, we can adjust the reconstruction strength of any attribute with a strength factor $\rho \in [0, 1]^{1 \times c}$, whose values determine the reconstruction strength of the attributes. For example, (16) is equivalent to (15) if all the values in $\rho$ are 1, but only the hair color turns back to the color in $I_o$ if only the hair color entry of $\rho$ is 1. Therefore, we can adjust the strength of the attribute changes even without the generator. This process can be called fading because the result $I_f$ fades from the translated face $I_t$. Figure 7 shows a qualitative comparison between interpolation and fading for different attribute strength manipulations. Whether the generator is used or not makes only a small difference in the effect of attribute strength manipulation. $I_c$ changes with $v_s$ when manipulating with the generator, but it is an invariant tensor in (16). Therefore, fading provides more linear changes for attribute strength manipulation: the attribute change is more obvious when $\rho = 0.2$ and $0.4$. However, fading may not be suitable for geometry-level manipulation because of ghosting. For example, compared to the first row, the girl in the second row has a more pronounced double chin when $\rho = 0.4$ and $0.6$. Figure 8 demonstrates the process of attribute strength fading between age and gender. The overlapping areas of their masks prevent the two attributes from being adjusted independently without affecting one another. Interestingly, we can find out which areas are more important for which attributes. Eyebrows, for example, are more important to gender than to age.
To quantify the difference between interpolation and fading, we recruited eight volunteers (5 male and 3 female) for a user study. Each volunteer was asked to evaluate 70 × 2 × 4 generated faces of size 128 × 128 from 70 persons. Every person has four transformations: gender swap, paler skin, age swap, and smiling swap. Each transformation is done in two ways: 0.5 interpolation, and half fading from the completely transformed face (0.5 for $|v_s|$ and $\rho$, respectively). Volunteers were asked two questions about the interpolated face and the faded face: which transformation is more obvious, and which image is more realistic (the images are randomly shuffled).
36160 VOLUME 8, 2020
We can draw the following conclusions from Table 2:
1) In general, the faded image has much more obvious attribute changes than the interpolated image. This supports the claim that fading provides more linear changes.
2) However, people tend to think that the images with small changes are more realistic. On the one hand, this is due to the lack of realism of the fake images; on the other hand, it may also be because people can infer the result from hairstyles and other cues (e.g., people tend to doubt the realism of men with long hair).
3) It is difficult for people to distinguish the obvious and realistic skin color, which shows that there is not much difference between the two methods in the result of color transformation.

D. FACIAL SEMANTIC SEGMENTATION
Theoretically, the attention mask of a single attribute gives us the opportunity to segment facial regions automatically, without supervision by semantic segmentation labels. For example, the area corresponding to the hair color attribute can be used to segment the hair, and the skin can be segmented in the same way using the skin color attribute. However, the corresponding region of some attributes may contain unexpected regions (e.g., the AAM of hair includes the eyebrows because they have the same color). On the other hand, some facial features, such as the mouth and eyes, have no corresponding mask. Therefore, the semantic segmentation results need to be processed with prior knowledge. The logical rules based on prior knowledge that our method uses to calculate the face areas are:
1. Skin = Skin
2. Gender = Gender
3. Eyes = Eyeglasses - Skin
4. Hair = Hair - Skin - Gender - Eyes
5. Mouth = Smiling - Skin - Hair - Eyes
We first binarize each mask with a threshold of 25 and then apply the above rules to obtain the semantic segmentation of each face area. Figure 9 shows some semantic segmentation results of the model trained on CelebA-HQ at 256 × 256 size. Although there are still many holes in the image, our method produces the correct semantic segmentation.
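The five rules above are set differences between binarized masks, so they can be sketched with Python sets of pixel indices. This sketch assumes mask values on a 0 to 255 scale (the scale is not stated above) and a strict "greater than" comparison at the threshold; the function names and dictionary keys are hypothetical.

```python
def binarize(mask, threshold=25):
    """Return the set of pixel indices whose mask value exceeds the threshold.

    Assumes mask values on a 0-255 scale (assumption for illustration).
    """
    return {i for i, v in enumerate(mask) if v > threshold}

def segment(aams, threshold=25):
    """Apply the prior-knowledge rules to binarized attribute attention masks."""
    b = {name: binarize(m, threshold) for name, m in aams.items()}
    skin = b["skin"]                              # rule 1
    gender = b["gender"]                          # rule 2
    eyes = b["eyeglasses"] - skin                 # rule 3
    hair = b["hair"] - skin - gender - eyes       # rule 4
    mouth = b["smiling"] - skin - hair - eyes     # rule 5
    return {"skin": skin, "gender": gender, "eyes": eyes,
            "hair": hair, "mouth": mouth}

# Six toy pixels; each list is one flattened attribute attention mask.
aams = {
    "skin":       [200, 180,   0,   0,   0,  10],
    "gender":     [150,   0,   0, 120,   0,   0],
    "eyeglasses": [  0,  90, 220,   0,   0,   0],
    "hair":       [  0,   0,  60, 210, 230,   0],
    "smiling":    [  0, 100,   0,   0,   0, 240],
}
regions = segment(aams)
```

In this toy example the hair mask initially covers pixels 2, 3 and 4, but pixel 2 is claimed by the eyes and pixel 3 by the gender mask, so only pixel 4 remains labeled as hair.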
We use the semantic labels of CelebAMask-HQ as the benchmark to calculate the mIoU (mean intersection over union) at two sizes. The mIoU of each area is shown in Table 3. The ''*'' in Table 3 means the model is trained on CelebA-HQ; otherwise it is trained on CelebA. We use deeper $G_a$ and $D$ for training at 256 × 256 size. Some areas, such as the ears and neck, are labeled separately in CelebAMask-HQ, but our method identifies them all as skin. Hence we uniformly label them as skin in the ground truth.
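For reference, intersection over union for one region, and its mean over several images, can be computed as in the sketch below. Regions are represented as sets of pixel indices; the function names and the choice to skip images where both prediction and ground truth are empty are assumptions for illustration, not details taken from the evaluation protocol.

```python
def iou(pred, truth):
    """Intersection over union of two pixel-index sets, or None if both are empty."""
    union = pred | truth
    return len(pred & truth) / len(union) if union else None

def mean_iou(pairs):
    """Average IoU over (prediction, ground truth) pairs, skipping empty pairs."""
    scores = [s for s in (iou(p, t) for p, t in pairs) if s is not None]
    return sum(scores) / len(scores)

single = iou({1, 2, 3}, {2, 3, 4})
average = mean_iou([({1, 2, 3}, {2, 3, 4}), ({1}, {1})])
```

The first pair shares 2 of 4 pixels, giving an IoU of 0.5, and averaging with a perfect match gives 0.75.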
As mentioned earlier, our method is the first attempt at facial semantic segmentation using only attribute labels, so no other weakly supervised facial semantic segmentation method is available for performance comparison. Existing weakly supervised semantic segmentation methods such as [37] and [38] are based on class labels rather than attribute labels. In these methods, the label indicates the existence of the object. For example, in training, [37] and [38] output the semantic segmentation of the horse if ''horse'' is labeled as 1 for the image, and do not output a horse segmentation if ''horse'' is labeled as 0.
However, an attribute label does not indicate the existence of an object but the strength of an attribute; e.g., ''pale skin'' being 0 does not mean there is no skin in the image. Therefore, when applied directly, the methods in [37] and [38] do not output a skin segmentation when ''pale skin'' is 0; they can only output the skin segmentation when the skin is pale. In other words, existing weakly supervised semantic segmentation methods can only output segmentation results when the ground-truth class label is 1. For the same reason, they cannot output a semantic segmentation of the eyeglasses region for people who do not wear glasses. In contrast, our method outputs the eyeglasses mask even when there are no eyeglasses. Hence, we can find the eye segmentation by ''Eyeglasses - Skin''.
It is unfair to compare a weakly supervised method with supervised ones. Nevertheless, we trained a U-net [19] on CelebAMask-HQ for comparison (20,000 images as the training set and 10,000 images as the test set). We believe this helps reveal the gap between supervised learning and weakly supervised learning.
We note that the segmentation of small areas such as the eyes and mouth is difficult for our method at all image sizes, whereas U-net has much higher accuracy. Therefore, there is still a large performance gap between weakly supervised methods and supervised ones. Interestingly, we found that the mIoU of the skin decreased when the ear was added to the ground truth, possibly because CelebAMask-HQ marks the ear area completely even when it is partially covered, as in the first image in Figure 8. This indicates that the semantic labels of CelebAMask-HQ may have potential defects.

V. CONCLUSION
In this paper, we propose a self-attention-masking semantic decomposition method that learns an attribute attention mask for each attribute. We decouple the attention of different attributes and overcome the overlap between different attribute attention masks with an attention weighting module. Our method allows facial attributes to be manipulated without the generator after a single generation pass. The user study shows that the fading result is more obvious than the interpolation result (over 80% for gender swap, age swap, and smiling swap). Moreover, the attention mask of a single attribute can be used for facial semantic segmentation without pixel-level semantic labels, with an mIoU over 65% for hair and skin.
Through the attention mask, we can segment the facial image semantically. At the same time, the attention mask determines the authenticity of I2I translation. Therefore, the accuracy of this weakly supervised semantic segmentation may also determine the performance of I2I translation. Our future work will focus on improving the accuracy of the semantic segmentation. On the other hand, we did not train a model for 512 × 512 and 1024 × 1024 sizes since a single GPU does not have enough memory for these sizes. We hope that in the future we will be able to achieve a more streamlined network structure and process larger images.

APPENDIX
Tables 4 and 5 show details about the network architecture. We use instance normalization (IN) [39] in all layers of $G_a$ except the last output layer. In $G_c$, we use IN in all down-sampling layers except the weight-sharing layers, and layer normalization (LN) [40] in all up-sampling layers except the output layer. We use nearest-neighbor sampling before the convolution for up-sampling. For the discriminator network, we use Leaky ReLU with a negative slope of 0.02. A standard self-attention module is applied in the middle of the discriminator. In the tables, N is the number of output channels, K is the kernel size, S is the stride size, P is the padding size, and $L_v$ is the size of the attribute vector.

CHENGGUANG ZHU received the bachelor's degree from Shenyang Ligong University, Shenyang, China, in 2010, and the master's degree from the Institute of Seismology, China Earthquake Administration, Wuhan, China, in 2013. He is currently pursuing the Ph.D. degree with Shanghai Jiao Tong University. His current interests include error analysis, image processing, visual navigation, and relative pose estimation.