Generative Adversarial Networks with Attention Mechanisms at Every Scale

Existing works in image synthesis have shown the effectiveness of applying attention mechanisms for generating natural-looking images. Despite their demonstrated benefit, current works utilize such mechanisms only at a single scale of the generative and discriminative networks. Intuitively, more extensive use of attention should lead to better performance. However, due to memory constraints, even moving a single attention mechanism to a higher scale of the network is infeasible. Motivated by the importance of attention in image generation, we tackle this limitation by proposing a generative adversarial network-based framework that readily incorporates attention mechanisms at every scale of its networks. The straightforward structure of our attention mechanism enables direct plugging in at every scale and joint training with the adversarial networks. As a result, the networks are forced to focus on relevant regions of the feature maps learned at every scale, thus improving their image representation power. In addition, we exploit multiscale attention features as a complementary feature set in discriminator training. We demonstrate qualitatively and quantitatively that the introduction of scale-wise attention mechanisms benefits both competing networks, improving performance compared with current works.


I. INTRODUCTION
Generative adversarial networks (GANs) [1] have attained tremendous improvements in generating realistic, sharp images. Various findings in terms of architectural composition [2]-[7] and training stability [8]-[12] have contributed substantially to this progress. Moreover, the application of attention mechanisms [3], [13] has helped GANs focus on spatial dependencies. Specifically, attention complementary to the ordinary convolutions of adversarial networks facilitates capturing long-range dependencies across image regions. As a result, by adopting the above findings together with attention mechanisms, existing methods such as SAGAN [13] and BigGAN [3] have shown impressive performance, generating images that convey fine details. However, the usage of attention mechanisms is limited by memory constraints [3], [14], compounded by the intensive computation of correlations between feature maps. Thus, attention is applied at only a certain scale (layer) of the generator and discriminator networks. For instance, BigGAN [3] introduces an attention mechanism only at the scale of 64 × 64 when generating a final image of 128 × 128, while moving the attention mechanism to a different scale (256 × 256) of the discriminator is prevented by the same constraints.
In such circumstances, benefiting from attention at higher resolutions is quite challenging, and using attention at all scales is considered impracticable. However, as the generator and discriminator learn image composition and decomposition in a coarse-to-fine manner and vice versa, respectively, drawing attention to the features of every scale should increase the data representation power of the networks. Although attention mechanisms facilitate a close look at certain information, a further problem may lie in the discriminator. Specifically, its structure is designed to learn a data distribution by describing it with the most distinctive characteristics in compact form. Hence, it often leans more on global or local information and does not maintain sufficient representation power to capture both. In effect, this causes the network to overlook useful features that could be valuable for performing more accurate real/fake image classification.
To alleviate the aforementioned problems in a single framework, we propose a GAN-based approach that utilizes scale-wise information in a more optimal manner. Unlike existing GANs that apply attention mechanisms at a single layer, we first propose a framework incorporating attention mechanisms at every scale of the generator and discriminator. Because drawing attention should not noticeably increase the demand on computational resources, we construct a straightforward mechanism that can be readily incorporated into our generative model. As attention is drawn in a scale-wise manner, it holds the coarse-to-fine details necessary for capturing global and local variations. In turn, this encourages the generator to synthesize semantically and structurally coherent images and the discriminator to capture reliable characteristics of the data distribution. Thus, by employing scale-wise attention, we help both competing networks focus on where to take a close look, making the game between them more adversarial.
Second, inspired by ideas from multi-scale (hierarchical) feature classification as well as the advanced strategy of the U-Net based discriminator [7], we change the structure of the U-Net based discriminator to compose an additional feature set from the attention-weighted features learned at every scale. As in the U-Net based discriminator, we maintain deep features for per-sample classification, and we utilize the composed feature set to ensure that information from every scale mitigates the problem of overlooked features while keeping per-pixel predictions. This architectural change increases the representational power of the discriminator and contributes to distinguishing real images from synthesized ones more reliably. As shown in Fig. 1, we can utilize scale-wise attention to point to the most important regions of the feature maps in the generator and discriminator while training them in an adversarial manner.
In summary, our main contributions are as follows:
• An efficient implementation of the spatial attention mechanism to incorporate at every scale of the generator and discriminator.
• An attention feedback that provides the ability to amplify the importance of features in critical regions for refining image quality in the generator and to capture the most distinguishable features in the discriminator.
• A structural modification in the U-Net based discriminator to increase its data representational power.
• An experimental comparison revealing the better performance of our proposal on well-known datasets qualitatively and quantitatively.

II. RELATED WORKS

A. GENERATIVE ADVERSARIAL NETWORKS
GANs [1] and their conditional variant (cGANs) [15] showed the first possibilities of generating images within an adversarial game, giving rise to GAN-based methods, especially in the image synthesis task. Since then, various techniques regarding advanced training strategies and improved objective functions [8], [9], [16]-[18] have been proposed. Additionally, the adversarial game between the generator and discriminator has been empowered by recent architectural modifications [2], [3], [7]. Such improvements have facilitated impressive results in which generated images depict greater realism [2]-[7], [9], [19]. A recently proposed framework, U-Net GAN [7], improved image synthesis performance by modifying the discriminator to act as a U-shaped network. The method added a decoder network to the original discriminator structure and made the overall network produce both per-sample and per-pixel predictions. Exploiting such outputs provides global and local feedback to the generator, enabling the generation of images with finer details.

B. ATTENTION MECHANISM
In a general view, attention can be exploited as a guide that weights the most informative regions of feature maps. There are two categories of attention mechanisms: post-hoc network-based analysis [20]-[23] and learnable attention modules [13], [24]-[28]. Works in the first category have mostly been utilized in tasks like object recognition to understand and explain network behavior. In turn, learnable attention mechanisms in the second category have been used in various tasks because they can be trained in an end-to-end manner. The aforementioned advances in GANs have also assisted in constructing reliable attention-based GANs. One of the representative learnable mechanisms built on such advances is the self-attention GAN [13]. This method forces the generator to consider long-range dependencies in the feature maps to produce globally coherent images, and the importance and performance gain of using attention mechanisms in image synthesis has been shown in the literature [13]. BigGAN [3] further applied such a mechanism in its architecture and generated realistic images. Unlike BigGAN, which used attention in the novel image synthesis task, GANimation [29] found an application of attention in anatomically-aware facial animation (expressions). By embedding an attention mechanism in the generator network, this method regressed an attention mask specifying the importance of each RGB pixel for synthesizing a novel expression. Hence, GANimation could focus only on the pixels necessary for facial movements. U-GAT-IT [30] used attention in challenging selfie-to-anime translation, where anime drawings have unique properties. To handle the task, U-GAT-IT learned weights for an auxiliary classifier and exploited these weights with the feature maps of the image to calculate a set of domain-specific attention maps for further use in the generator. The work of [31] proposed a Masked Spatial-Channel Attention (MSCA) module, a cross-attention mechanism that models the semantic correspondence between two scenes for example-guided image synthesis.

III. PROPOSED METHOD

A. OVERVIEW
A GAN comprises two networks, a generator G and a discriminator D, where one attempts to fool the other in a min-max game. The aim of G is to map a latent variable z drawn from a random distribution p(Z) to an image space such that the synthesized image looks realistic and depicts the characteristics of the real data distribution p(X). Simultaneously, D aims to differentiate a real image x ∼ p(X) from a synthesized one G(z). Generally, G and D are constructed as upsampling and downsampling deep convolutional neural networks, respectively. We follow this standard approach but include recent advances in GANs [3], [7], [9]. In the generator and discriminator, we incorporate an attention mechanism at every scale in order to unveil the significant feature information learned at the particular layer. Additionally, on top of the U-Net based discriminator, we use a composition of multi-scale attentions as complementary information for discriminator training. We provide an overview of our proposed method in Fig. 2.

B. ATTENTION MODULE
Our ultimate goal is to enhance the quality of generated images through the strength of attention information. To this purpose, we construct a straightforward attention mechanism (A). As shown in Fig. 3(a), we have two 1 × 1 convolutional layers learning q and k, which we denote as query and key following conventional terminology, respectively. These convolutions are intended to decide where and how much importance the network should give to the value v_i (i.e., the feature maps of the i-th scale). Note that we consider the feature maps as the value information and do not add another convolutional layer for it. We believe the feature maps already convey valuable information and should therefore be directly weighted by attention. In this way, subsequent layers instantly obtain knowledge of the important regions. Besides, this design also simplifies the structure and reduces computation. To draw attention, we apply channel pooling on the Hadamard product of q and k. Here, channel pooling is similar to the operation defined in [24], which applies max pooling and average pooling across the channels. We perform this operation to obtain spatial attention, which highlights the informative regions within the feature map that are essential to learning. In contrast to [24], we add the max- and average-pooled outputs to highlight informative regions further. We convert the resultant matrix into a probability distribution through an activation function and obtain the attention-weighted feature map of a particular scale by multiplying v_i with the spatial attention.
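For concreteness, the following is a minimal PyTorch sketch of this module under our reading of Fig. 3(a); the class name ScaleAttention and the choice of sigmoid as the activation are illustrative assumptions, not the paper's published code.

```python
import torch
import torch.nn as nn

class ScaleAttention(nn.Module):
    """Scale-wise spatial attention: query/key 1x1 convolutions, channel
    pooling, and direct weighting of the input feature maps (the value)."""

    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, kernel_size=1)
        self.key = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: feature maps of the i-th scale, shape (B, C, H, W), used
        # directly as the value (no extra value convolution).
        qk = self.query(v) * self.key(v)            # Hadamard product of q and k
        # Channel pooling: max- and average-pool across channels, then add
        # the two maps (unlike [24], which concatenates them).
        max_pool, _ = qk.max(dim=1, keepdim=True)   # (B, 1, H, W)
        avg_pool = qk.mean(dim=1, keepdim=True)     # (B, 1, H, W)
        attn = torch.sigmoid(max_pool + avg_pool)   # spatial attention map (sigmoid assumed)
        return v * attn                             # attention-weighted feature map
```

Because the module takes and returns a tensor of the same shape, it can be dropped in after any convolutional block of either network without changing the surrounding layers.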
Incorporating this structure at every scale of the generator and discriminator allows benefiting from scale-wise spatial attention maps. The main advantage of these attentions is that they force the generator and discriminator to take a close look at the attended regions, as they highlight the most informative and crucial parts of the feature maps. We visualize attention maps drawn in the discriminator and generator in Fig. 4 and Fig. 5, respectively. As can be seen, the learned attentions contain scale-specific information. For instance, as the generator moves through its upsampling layers, attention starts by unveiling the importance of global information (at the lower layers), gradually shifts to highlight the significance of more local features (i.e., eyes, mouth, skin color, hair style, etc.), and reaches finer details at the resolution near the final layer.

C. ARCHITECTURE
To construct the base network for our framework, we adopt the generator from BigGAN [3] and the U-Net based discriminator [7]. Although we opt for these architectures, our attention mechanism can be readily incorporated into different ones. While it is straightforward to introduce attention mechanisms (A) at every scale of the generator, the discriminator requires several considerations. The U-Net based discriminator is constructed as an encoder-decoder: the encoder performs per-sample predictions, whereas the decoder makes per-pixel predictions in which each pixel is classified as real or fake. To benefit from the per-pixel predictions without affecting their computation, we add our attention mechanism at every scale of the encoder and keep the decoder unchanged. We provide architectural structures and layouts for the generator and discriminator networks in Tables 1 and 2 and a detailed description in Section IV-A3.
Nevertheless, we consider that the encoder part of the discriminator may overlook some useful information. To address this, we modify the encoder of the discriminator according to the structure shown in Fig. 3(b). Specifically, we add global average pooling (GAP) and global max pooling (GMP) layers at each scale of the encoder to obtain compact representations of the attention-weighted features at every scale. We concatenate these multi-scale features to form a complementary representation encompassing global and local characteristics of the image. The U-Net based discriminator already has a two-headed function; we add an additional head at the bottleneck of this discriminator, which makes per-sample predictions through an additional linear layer. For training, we use the optimization and regularization functions defined in [7] and adapt them to compute over all heads. For more details on the objectives, we refer to [7]. Here, we define the main objective functions needed to optimize the generator and discriminator networks. As the discriminator of U-Net GAN contains encoder and decoder parts, we update them by

$$\mathcal{L}_{D_{enc}} = -\mathbb{E}_{x}\big[\log D_{enc}(x)\big] - \mathbb{E}_{z}\big[\log\big(1 - D_{enc}(G(z))\big)\big], \qquad (1)$$

$$\mathcal{L}_{D_{dec}} = -\mathbb{E}_{x}\Big[\sum_{i,j}\log\,[D_{dec}(x)]_{i,j}\Big] - \mathbb{E}_{z}\Big[\sum_{i,j}\log\big(1 - [D_{dec}(G(z))]_{i,j}\big)\Big], \qquad (2)$$

where $D_{enc}$ provides the per-sample output of the discriminator through the encoder, and $[D_{dec}(\cdot)]_{i,j}$ represents the decision made by the decoder part of the discriminator at pixel position (i, j). In order to incorporate the multi-scale (ms) features into the optimization process described above, we utilize the following function, which performs a similar operation to (1):

$$\mathcal{L}_{D_{ms}} = -\mathbb{E}_{x}\big[\log D_{ms}(x)\big] - \mathbb{E}_{z}\big[\log\big(1 - D_{ms}(G(z))\big)\big]. \qquad (3)$$

Hence, the new discriminator loss is computed as a combination of the above functions:

$$\mathcal{L}_{D} = \mathcal{L}_{D_{enc}} + \mathcal{L}_{D_{dec}} + \mathcal{L}_{D_{ms}}. \qquad (4)$$

Note that our proposed additions providing multi-scale attention-weighted features in the discriminator are intended to serve as complementary information, helping to capture and understand meaningful properties of the data distribution in GAN training. In order not to disrupt the training process of the U-Net based discriminator with its encoder and decoder objectives, we follow the strategy of [7] for weighting the different terms of Eq. (4). We do not put any special emphasis on $\mathcal{L}_{D_{ms}}$ and use it in the same way as the other loss functions, treating them all equally.
Correspondingly, the generator optimization can be performed by

$$\mathcal{L}_{G} = -\mathbb{E}_{z}\Big[\log D_{enc}(G(z)) + \sum_{i,j}\log\,[D_{dec}(G(z))]_{i,j} + \log D_{ms}(G(z))\Big]. \qquad (5)$$
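To make the objectives concrete, below is a minimal PyTorch sketch of Eqs. (1)-(5) under the non-saturating binary cross-entropy formulation of [7]; the argument names (d_enc_*, d_dec_*, d_ms_*) and logit shapes are our own illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def d_loss(d_enc_real, d_dec_real, d_ms_real,
           d_enc_fake, d_dec_fake, d_ms_fake):
    """Discriminator loss L_D = L_D_enc + L_D_dec + L_D_ms (Eq. 4).

    d_enc_*, d_ms_*: per-sample logits, shape (B,).
    d_dec_*: per-pixel logits, shape (B, 1, H, W).
    """
    bce = F.binary_cross_entropy_with_logits
    ones, zeros = torch.ones_like, torch.zeros_like
    l_enc = bce(d_enc_real, ones(d_enc_real)) + bce(d_enc_fake, zeros(d_enc_fake))  # Eq. (1)
    l_dec = bce(d_dec_real, ones(d_dec_real)) + bce(d_dec_fake, zeros(d_dec_fake))  # Eq. (2)
    l_ms  = bce(d_ms_real,  ones(d_ms_real))  + bce(d_ms_fake,  zeros(d_ms_fake))   # Eq. (3)
    return l_enc + l_dec + l_ms                                                     # Eq. (4)

def g_loss(d_enc_fake, d_dec_fake, d_ms_fake):
    """Generator loss (Eq. 5): push all heads toward 'real' on fakes."""
    bce = F.binary_cross_entropy_with_logits
    return (bce(d_enc_fake, torch.ones_like(d_enc_fake))
            + bce(d_dec_fake, torch.ones_like(d_dec_fake))
            + bce(d_ms_fake, torch.ones_like(d_ms_fake)))
```

As in the text, no extra weighting is applied to the ms term; all three heads contribute equally.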

IV. EXPERIMENTS

A. EXPERIMENTAL SETTINGS

1) Datasets
In our evaluations, we consider two well-known face datasets, CelebA [32] and FFHQ [4]. CelebA contains 200k face images of 10k different celebrities (identities), depicting different expressions in various facial poses. FFHQ is a relatively new dataset providing 70k high-quality face images. Compared with CelebA, this dataset is more challenging, as it contains images with high diversity in terms of age, ethnicity, and viewpoint. For additional demonstration purposes, we consider the CelebA-HQ [2] and newly released AFHQ [33] datasets. CelebA-HQ is a high-quality version of the CelebA dataset but consists of only 30k images. AFHQ also contains 30k animal images classified into cat, dog, and wild categories; the dataset comes at a 256 × 256 resolution, which makes it suitable for synthesizing more detailed animal images. As CelebA was released with lower-resolution images, we perform its experiments at a resolution of 128 × 128, whereas we maintain 256 × 256 for the others.

2) Training details
To train our framework, we follow the parameter settings of [3], [7]. We employ the Adam optimizer [34] with momentum parameters β₁ = 0 and β₂ = 0.999. The learning rates of the generator and discriminator are set to 1 × 10⁻⁴ and 5 × 10⁻⁴, respectively. We initialize the weights of both networks using orthogonal initialization [35]. For training stability, we use spectral normalization [9]. We sample the input to the generator from a uniform distribution, z ∈ [−1, 1]. Similar to [7], we set a batch size of 20 for FFHQ, CelebA-HQ, and AFHQ, and 50 for CelebA. At evaluation, to generate new images, we utilize an exponential moving average over the parameters of the generator network with a decay of 0.9999. We perform the whole task on a Tesla V100 GPU, with our implementation in PyTorch [36].
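As a concrete reference, a sketch of this optimization setup is shown below; build_generator and build_discriminator are hypothetical constructors, and the latent dimensionality of 128 is an assumption, not a value stated in this section.

```python
import copy
import torch

G, D = build_generator(), build_discriminator()  # hypothetical constructors
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=5e-4, betas=(0.0, 0.999))

# Exponential moving average of the generator weights, used at evaluation.
ema_G = copy.deepcopy(G)

@torch.no_grad()
def update_ema(decay: float = 0.9999):
    for p_ema, p in zip(ema_G.parameters(), G.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

# Latent input sampled from a uniform distribution, z in [-1, 1]
# (batch size 20, as for FFHQ/CelebA-HQ/AFHQ; latent dim 128 assumed).
z = torch.empty(20, 128).uniform_(-1.0, 1.0)
```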

3) Implementation
To implement our framework, we use the publicly available implementations of the generator 1 and discriminator 2 networks from BigGAN [3] and U-Net GAN [7], respectively. Note that adversarial training of these networks may require a week or more depending on the training configuration. We adopt the BigGAN [3] generator as the base for constructing our generative model (Table 1). To enhance the images produced by this generator, we incorporate our proposed attention mechanism at every scale of the network. All blocks are in the form of residual units (ResBlock up) [37], as shown in Fig. 6(a), where our mechanism does not interrupt the feature learning process. Similar to existing GANs, we use batch normalization [38] and ReLU activations within these blocks. Upsampling is achieved by nearest-neighbor interpolation.
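The sketch below illustrates, under our assumptions, how the attention module (the ScaleAttention sketch from Section III-B) plugs in after each upsampling residual block of a BigGAN-style generator trunk; ResBlockUp is a simplified stand-in and the channel widths are placeholders, not the exact published configuration.

```python
import torch.nn as nn

class ResBlockUp(nn.Module):
    """Simplified residual upsampling unit: BN-ReLU-upsample-conv (x2) plus skip."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.main = nn.Sequential(
            nn.BatchNorm2d(c_in), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="nearest"),  # nearest-neighbor upsampling
            nn.Conv2d(c_in, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1))
        self.skip = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(c_in, c_out, 1))

    def forward(self, x):
        return self.main(x) + self.skip(x)

class AttnGenerator(nn.Module):
    """Generator trunk with one ScaleAttention after every upsampling block."""
    def __init__(self, z_dim: int = 128, widths=(512, 256, 128, 64)):
        super().__init__()
        self.fc = nn.Linear(z_dim, widths[0] * 4 * 4)
        self.blocks = nn.ModuleList(
            ResBlockUp(c_in, c_out) for c_in, c_out in zip(widths[:-1], widths[1:]))
        self.attns = nn.ModuleList(ScaleAttention(c) for c in widths[1:])
        self.to_rgb = nn.Sequential(
            nn.BatchNorm2d(widths[-1]), nn.ReLU(),
            nn.Conv2d(widths[-1], 3, 3, padding=1), nn.Tanh())

    def forward(self, z):
        h = self.fc(z).view(z.size(0), -1, 4, 4)
        for block, attn in zip(self.blocks, self.attns):
            h = attn(block(h))  # weight every scale's features by its attention
        return self.to_rgb(h)
```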
For the discriminator, we make use of the U-Net based discriminator (Table 2) as the competitor to our generator, as discussed earlier. Note that this architecture follows the discriminator and generator setups from BigGAN [3]. The network is built on residual units (ResBlock up and down), as shown in Fig. 6. We introduce an attention mechanism at every scale of the encoder sub-network of the discriminator and keep the decoder part unchanged for per-pixel predictions. To provide complementary information for the discriminator, we introduce multi-scale features composed using the attention mechanism. To make a compact representation for the additional adversarial training of the discriminator, we add the ms block shown in Fig. 3(b), which consists of global average and global max pooling layers (see Section III-C). For these features, we add an additional linear layer (Linear_ms) complementary to the encoder's original layer (Linear_e). The decoder part of the discriminator produces a single-channel output at the same resolution as the input.
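A sketch of this ms block, under our naming assumptions (MultiScaleHead, linear_ms), is given below: GAP and GMP compress the attention-weighted features of each encoder scale, the pooled vectors are concatenated, and Linear_ms produces the additional per-sample logit alongside the encoder's original head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleHead(nn.Module):
    """ms block of Fig. 3(b): GAP + GMP per scale, concatenation, Linear_ms."""
    def __init__(self, scale_channels=(64, 128, 256, 512)):  # placeholder widths
        super().__init__()
        # GAP and GMP each contribute C_i values per scale.
        self.linear_ms = nn.Linear(2 * sum(scale_channels), 1)

    def forward(self, feats):
        # feats: list of attention-weighted maps, one per encoder scale,
        # each of shape (B, C_i, H_i, W_i).
        pooled = []
        for f in feats:
            gap = F.adaptive_avg_pool2d(f, 1).flatten(1)  # (B, C_i)
            gmp = F.adaptive_max_pool2d(f, 1).flatten(1)  # (B, C_i)
            pooled.extend([gap, gmp])
        return self.linear_ms(torch.cat(pooled, dim=1))   # per-sample logit D_ms
```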

4) Evaluation metrics
To evaluate the performance of our work quantitatively, we use two widely applied metrics: Fréchet Inception Distance (FID) [39] and Inception Score (IS) [40]. These metrics have been reported [39], [40] to correlate with human evaluation in assessing image properties (fidelity and diversity). Both measure the quality of generated images, providing a score that reflects how realistic a GAN's outputs are. The measurement involves a pretrained Inception network [41]: FID measures the difference between two distributions (i.e., real and fake) using feature vectors from a specific layer of the Inception network, whereas IS considers the KL-divergence between the conditional and marginal label distributions. Following the works of [2], [4], [7], we compute FID and IS on 50k generated images and report the best FID and IS scores achieved over the training process.
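For reference, one way to compute FID in PyTorch is via the third-party torchmetrics package, as sketched below; this is not necessarily the evaluation code used in the paper, and the random tensors are stand-ins for real and generated image batches.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # 2048-d Inception-v3 pool features

# Stand-in batches; in practice these would be real images and (e.g., 50k)
# samples from the EMA generator, as uint8 tensors of shape (B, 3, H, W).
real_batch = torch.randint(0, 256, (10, 3, 256, 256), dtype=torch.uint8)
fake_batch = torch.randint(0, 256, (10, 3, 256, 256), dtype=torch.uint8)

fid.update(real_batch, real=True)
fid.update(fake_batch, real=False)
print(float(fid.compute()))  # lower is better
```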

B. RESULTS

1) Ablation Study
We provide an ablation study that depicts the importance of adding our changes to the GAN. In Table 3, we present the performance of each proposal added to the baseline GAN. As the base network, we select a BigGAN model with the U-Net based discriminator. To demonstrate the effect of each proposal quantitatively, we use FID as the performance estimator. The baseline model applies an attention mechanism at a single scale of the generator and discriminator. First, we improve the baseline GAN by removing this mechanism and adding our attentions at every scale of both networks. As a result, the FID score drops from 7.69 to 7.10. Unlike a single attention mechanism, ours forces the networks to take a close look at the relevant regions of the features at each scale. Next, we consider adding the multi-scale features learned in the discriminator as complementary information in adversarial learning. Consequently, adding the multi-scale features improves this score further, reaching an FID of 6.95. Exploiting such features in combination with attentions induces the discriminator to classify images accurately, thereby pushing the generator to refine its image composition. Overall, the results demonstrate the efficiency of our proposals and their role in improving GAN performance. In addition, we present a study demonstrating the behavior of our framework during training. To quantify performance at particular training steps, we use the FID metric, and to explore the quality of generations, we visualize one exemplar image generated at each of these steps using the same noise as input to our framework. Fig. 7 presents the behavior of the framework from early to final training steps on the FFHQ dataset. Observing the FID curve, we can verify that our framework smoothly learns the image composition task and achieves a lower (better) FID score. The quantitative results are also supported by the improvements in the quality of the generated images.

2) Qualitative Results
We start presenting qualitative results by demonstrating the influence of attention information on the feature maps. For ease of visualization and analysis, we examine this on the feature maps at the 256 × 256 scale. In Fig. 8, we show images generated with and without applying the learned attentions. As shown, attention provides useful information that encourages the generator to focus on the highlighted locations of the feature maps. Comparing the generated images in this manner shows that many fine details, e.g., in the eyes, teeth, and hairstyle, are introduced by the attention information. This example shows that the generator attempts to obtain maximum benefit from attention to improve the overall image quality.
Additionally, Fig. 9 presents more images generated with and without using attention information. By amplifying the feature maps in the most crucial regions, attention facilitates the generation of higher-quality images. As can be seen, the generated images are diverse in terms of background, pose, expression, etc.
We present the main qualitative results on the FFHQ and CelebA datasets in Fig. 10 and Fig. 11. The generated images depict the data distribution characteristics (e.g., facial pose, age, ethnicity, and expression) of FFHQ and CelebA. Thus, the generated images are diverse and of high quality; noticeably, the improved quality comes with finer details.
Along with our results, we provide results from the U-Net based discriminator [7] on the FFHQ dataset in Fig. 10(a), as this method is the closest to ours. As can be observed, the results from both the U-Net based approach and ours depict highly realistic faces. In addition to 256 × 256 images, we present samples generated by our method at 512 × 512 resolution in Fig. 10(c). These exemplar faces demonstrate the capability of our method to generate images at higher resolutions.
Additionally, we present generated images for the CelebA-HQ and AFHQ datasets in Fig. 11(b) and (c). For AFHQ, we conduct an experiment with the cat group, as we are interested in how well our framework can learn to mimic and generate images with such "fluffiness" features. As shown, the generated images are natural-looking and diverse, especially the cat images. Observing such favorable results, we verify that our proposals help the GAN perform better in generating novel images.
(Fig. 10: (b) 256 × 256 images generated by our proposed method; (c) 512 × 512 images generated by our proposed method.)

3) Quantitative Results
In Table 4, we compare our method with state-of-the-art approaches on CelebA. As shown, our proposed method competes favorably with the recent state-of-the-art methods; notably, we obtain better performance than U-Net GAN. In Table 5, we provide a comparison of our method against BigGAN and U-Net GAN on FFHQ. As on CelebA, our method improves the current performance on this dataset. It is worth mentioning that our framework achieves much better results than BigGAN. Through these experiments, we also demonstrate the better performance of our proposed attentions compared with the self-attentions [13] applied at specified layers of the generator and discriminator in the original U-Net GAN. Overall, the quantitative and qualitative results demonstrate that our proposals force the generator and discriminator to pay attention to the feature maps at every scale and exploit them more effectively.

V. CONCLUSION
In this work, we addressed the issue of using attention mechanisms in GANs, specifically the constraints that limit their application. To alleviate this issue, we reconsidered the structure of the attention mechanism. In our proposed framework, we introduced a straightforward attention calculation mechanism at every scale of the generator and discriminator. By applying such a scale-wise approach, we demonstrated the possibility of drawing attention to the important regions of the feature maps at every scale without running into memory constraints. Additionally, by leveraging this attention information, we strengthened the discriminator to learn data characteristics more accurately. Overall, exploiting attention in a scale-wise manner demonstrated better performance and improvements in the quality of the generated images.