LiWGAN: A Light Method to Improve the Performance of Generative Adversarial Network

Generative adversarial networks (GANs) have grown tremendously owing to their potency and efficiency in producing realistic samples. This study proposes a lightweight GAN (LiWGAN) to learn non-image synthesis with minimal computational time for low-power computing. The LiWGAN method enhances a new skip-layer channel-wise excitation module (SLE) and a self-supervised discriminator design for non-image synthesis performance using a facemask dataset. The facemask is one of the preventive strategies popularized by the COVID-19 pandemic. LiWGAN performs non-image synthesis of facemasks, which could help researchers identify individuals on lower-power devices, address occlusion challenges in face recognition, and alleviate accuracy problems caused by limited datasets. The study evaluates processing time across batch sizes and image resolutions on the facemask dataset. The Fréchet inception distance (FID) was also measured on the facemask images to evaluate the quality of the images augmented by LiWGAN. For 3000 generated images, LiWGAN achieved a nearly equivalent FID score of 220.43, against 219.97 for StyleGAN, with significantly less processing time per iteration, at 1.03 s. An additional experiment on the CelebA dataset compared LiWGAN with GL-GAN and DRAGAN, demonstrating that LiWGAN is appropriate for other datasets: it outperformed GL-GAN and DRAGAN with a 91.31 FID score and 3.50 s of processing time per iteration. Future work could push the FID score closer to zero with less processing time by using different datasets.

the quality of the specific image in terms of the Fréchet inception distance (FID) score against state-of-the-art methods. For comparison with StyleGAN [23], we repeated the experiment using the same facemask dataset to identify the challenges of data augmentation in increasing the number of images and their quality. The main contributions of this study are as follows:
1) We built the skip-layer channel-wise excitation (SLE) module, which recalibrates the channel responses of a high-scale feature map via low-scale activations. SLE allows a robust gradient flow across the model weights for faster training. Programmatically, it helps to disentangle style and content, as in StyleGAN.
2) We propose a self-supervised discriminator, D, as a feature encoder with an additional decoder. We force D to learn a more descriptive feature map covering more regions of an input image, yielding more detailed signals for training a generator, G.
3) We validate our proposed method against other benchmarks using the CelebA dataset and demonstrate that LiWGAN significantly reduces processing time in applications requiring less computing power.

A new approach called BicycleGAN was created from a conditional variational autoencoder GAN and a conditional latent regressor GAN [30]. The authors developed a method that simultaneously models the relationship between latent encoding and output to boost decoder performance without enforcing a hard decision. Moreover, BicycleGAN can yield diverse and visually pleasing outcomes in several image-to-image translation problems.

Kodali et al. [26] proposed a novel gradient-penalty scheme called DRAGAN to mitigate poor local equilibria in non-convex games. They proved that asymptotic convergence can be achieved without placing strong requirements on the discriminator.

Some authors have developed GAN-based semi-supervised training schemes for chest anomaly classification, patch-based retinal vessel classification, and cardiac diagnosis [32], [33], [34]. They found that these methods can perform better than conventional supervised CNNs. A softmax GAN [27] substitutes the classification loss with cross-entropy losses for the generator and discriminator over a single batch of images. Produced data that contains inaccurate sentiment information can be removed using the proposed data screening method.

Face Augmentation Generative Adversarial Network (FA-GAN) was proposed in [41] to minimize the effect of deformation attribute distribution imbalances in the CASIA-WebFace dataset. Besides improving face recognition accuracy, FA-GAN uses disentangled identity representations to manipulate various characteristics of an individual's face. Research on face recognition and synthesis tasks shows that the proposed network preserves identity well on restricted datasets.

An adaptive global and local bilevel optimization model (GL-GAN) proposed in [42] optimizes the image from both local and global aspects. The local bilevel optimization model is based on the discriminator's output feature matrix, in which each element evaluates the quality of an image receptive field and determines the areas with low quality. However, GL-GAN requires further study for edge computing and mobile devices due to a limitation whereby the method selects a low-quality rectangular receptive field that causes overlapping images.

The authors in [43] proposed a DCGAN-based adversarial-learning data augmentation method (Ada) to produce additional malicious users. The DCGAN-based data augmentation approach can produce better user embeddings than simple data augmentation methods, making it better at detecting malicious users in sparse-sample situations. However, although DCGAN can closely imitate the distribution of malicious users, there is a constraint on the generated fake users: hostile users injected into the system are crafted to resemble genuine users in order to evade detection, so the generated fake users are less likely to exhibit attack characteristics.

A combination of a GAN and a re-identification (re-id) model, called Jot-GAN, was proposed to train the generator and the re-id model concurrently, obtaining their respective optima using a shared discriminator [44]. Furthermore, the adversarial training and the produced samples enhance the re-id model's ability to trick the discriminator, thereby boosting its performance. Findings showed that Jot-GAN surpassed existing methods on both the identification loss and the triplet loss.

In 2018, the NVIDIA team introduced a style-based GAN model (StyleGAN) [23]. The normalization of the generator was restructured and regularized to facilitate a good mapping from latent codes to images. As a result, the authors enhanced training performance for superior-quality images. The simplified data flow produced the most significant performance gains, owing to weight demodulation, lazy regularization, and algorithm optimization.

The advantages of StyleGAN are that it is easy to attribute a generated image to its source and to improve its quality. However, training requires more computational time. Therefore, we target this weakness of StyleGAN to improve the performance of our proposed model, specifically by minimizing the computational time.

Style generation uses an intermediate vector at each level of the synthesis network, which may cause the network to learn correlations between different levels. Therefore, the model randomly selects two input vectors, z_1 and z_2, and generates intermediate vectors, w_1 and w_2, to reduce the correlation. It then trains some of the levels using the first vector and switches (at a random split point) to the other vector to train the remaining levels. The random split point ensures that the network does not learn the correlation effectively.
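As a minimal sketch, the following code illustrates how such mixing regularization can be implemented in PyTorch; the `mapping` and `synthesis` callables and the 512-dimensional latent size are assumptions standing in for the two halves of a style-based generator, not the paper's actual code.

```python
import torch

def mixing_forward(mapping, synthesis, batch_size, num_levels, device="cpu"):
    """Style mixing: two latent codes are mapped to intermediate vectors,
    and a random split point decides which synthesis levels receive which."""
    z_1 = torch.randn(batch_size, 512, device=device)  # first input vector
    z_2 = torch.randn(batch_size, 512, device=device)  # second input vector
    w_1, w_2 = mapping(z_1), mapping(z_2)              # intermediate vectors

    # Levels before the random split point use w_1, the rest use w_2, which
    # discourages the network from learning correlations across levels.
    split = torch.randint(1, num_levels, ()).item()
    ws = [w_1 if level < split else w_2 for level in range(num_levels)]
    return synthesis(ws)  # one intermediate vector per synthesis level
```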

Finally, the generator offers a direct mechanism for creating stochastic detail by explicitly incorporating noise inputs. Each layer of the synthesis network receives a separate single-channel image of uncorrelated Gaussian noise. The noise image is broadcast to all feature maps using learned per-feature scaling factors, as illustrated in Fig. 2b, and is then added to the result of the corresponding convolutional filtering operation.
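A minimal PyTorch sketch of this noise-injection step is shown below; the module name and the zero initialization of the scaling factors are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """Adds a single-channel Gaussian noise image to every feature map,
    scaled by a learned per-feature (per-channel) factor."""
    def __init__(self, channels):
        super().__init__()
        # One learned scaling factor per feature map.
        self.scale = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        b, _, h, w = x.shape
        noise = torch.randn(b, 1, h, w, device=x.device)  # uncorrelated Gaussian noise
        return x + self.scale * noise  # broadcast across all channels
```

In a synthesis network, such a module would typically be applied to the output of each convolution, giving the generator direct control over stochastic detail.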

This study compared our proposed model, LiWGAN, with StyleGAN. We adopted StyleGAN as in [23], [45], and [46], including the model configuration and differentiable data augmentation, for the best training on few-sample datasets. We compared our proposed model with StyleGAN on computing time because StyleGAN requires much more computing time to train; for non-image synthesis quality, however, we compared them regardless of computing time. We developed our proposed model using two techniques, namely a skip-layer excitation module and a self-supervised discriminator.

LiWGAN requires a generator, G, that can learn fast and a discriminator, D, that continuously provides valuable signals to train G, as illustrated in Fig. 3a. It utilizes a single convolution layer at each resolution in G and applies three input channels, at 8², 16², and 32² resolution, together with three output channels at higher resolutions (128², 256², and 512²).

Skip connections were used at a similar resolution in previous GAN works. In this study, however, skip connections were performed between resolutions over a more extended range, as an equal spatial dimension is no longer necessary. The ResBlock provides a shortcut gradient flow without additional computational cost, ensuring that the SLE succeeds. The computation of SLE is as follows:

y = F(x_l, {W_i}) · x_h,

where x and y are the input and output feature maps of the SLE component, the function F comprises the operations on x_l, and W_i specifies the learned module weights. In the SLE component, x_l and x_h are the feature maps at resolutions of 8² and 128², as shown in Fig. 3b.

The self-supervised training of D costs less than other regularization methods. Here, f_1 and f_2 denote intermediate feature maps produced by D. The feature map f_1 was randomly cropped to 1/8 of its height and width, and the genuine image was cropped at the same portion to obtain the cropped image, I_c. Next, the genuine image was resized to obtain a down-sampled image, I. The decoders then produced I′_c from the cropped f_1 and I′ from f_2. Finally, D and the decoders were trained to reduce the loss matching I′_c to I_c and I′ to I. The simple decoder has three convolution layers with nearest-neighbor upsampling and uses the gated linear unit (GLU) to alleviate the loss of sample images.
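The following is a minimal sketch of an SLE block consistent with the computation above and with similar skip-layer excitation designs in the GAN literature; the pooling size, channel counts, and activation choice are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SkipLayerExcitation(nn.Module):
    """Skip-layer channel-wise excitation, y = F(x_l, {W_i}) * x_h:
    the low-resolution map x_l gates the channels of the high-resolution
    map x_h, so no equal spatial dimension is needed."""
    def __init__(self, ch_low, ch_high):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(4),               # pool x_l down to 4x4
            nn.Conv2d(ch_low, ch_high, 4, 1, 0),   # 4x4 conv -> 1x1 spatial map
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(ch_high, ch_high, 1, 1, 0),  # 1x1 conv mixes channels
            nn.Sigmoid(),                          # channel weights in (0, 1)
        )

    def forward(self, x_low, x_high):
        # Per-channel weights from the low-scale activation, broadcast
        # multiplied onto the high-scale feature map.
        return x_high * self.gate(x_low)

# Example: gate a 128x128 feature map with an 8x8 one, as in the text.
x_l = torch.randn(1, 512, 8, 8)
x_h = torch.randn(1, 64, 128, 128)
y = SkipLayerExcitation(ch_low=512, ch_high=64)(x_l, x_h)  # (1, 64, 128, 128)
```

Because the gate collapses x_l to a 1×1 map before multiplying, the skip connection adds almost no computational cost while still letting gradients flow directly to the low-resolution layers.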

D extracts a more comprehensive representation from the inputs, covering the overall compositions (from f_2) and detailed textures (from f_1). Note that the processing in G and T is not limited to cropping; more operations remain to be explored for better performance. The auto-encoding approach is a typical method for self-supervised learning, which improves model robustness and generalization ability.

The augmentation techniques used in this study are listed in Table 1. These four were executed due to their significant effect on data augmentation in a low-data setting.

A facemask detection dataset was utilized in this study to perform data augmentation and classification of face images with and without a mask. The facemask dataset was taken from Kaggle [56]. It consists of 7553 RGB images in two separate folders, with and without masks: 3725 images of faces with masks and 3828 images of faces without masks. The facemask images were trained for 10,000 iterations at 256² resolution.
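Because the dataset is organized as one folder per class, it can be loaded directly with torchvision; the sketch below is illustrative, and the dataset path and folder layout are assumptions based on the description above.

```python
import torch
from torchvision import datasets, transforms

# The Kaggle facemask dataset ships as two class folders (with/without mask),
# so ImageFolder can read it directly; the path below is a placeholder.
transform = transforms.Compose([
    transforms.Resize((256, 256)),               # the study trains at 256x256
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),  # scale RGB to [-1, 1] for GAN training
])

dataset = datasets.ImageFolder("data/face-mask-dataset", transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=True, num_workers=2)
```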

Google Colaboratory, better known as Google Colab, was used in this study as the development platform to run our proposed model via a Colab notebook. Google Colab is a research project for running machine learning models on powerful hardware, namely graphics processing units (GPUs) and tensor processing units (TPUs) [57]. Google Colab offers core machine learning and artificial intelligence libraries, such as TensorFlow, Matplotlib, and Keras, with either Python 2 or 3 runtimes pre-configured [58]. We used Google Colab Pro in this study for faster hardware and longer runtimes. An NVIDIA Tesla P100 GPU was utilized with high-memory virtual machines (VMs). We needed a faster GPU and more RAM because our proposed model generates many images, requiring runtimes of more than 24 hours with fewer disconnections.

The Fréchet inception distance (FID) computes the overall semantic realism of synthesized images. The FID, created by Martin Heusel et al. [59], compares generated images with genuine images and improves on the earlier inception score (IS). The IS is obtained by combining the conditional class predictions for synthetic images with the marginal probability of the predicted classes. However, the IS does not capture how synthetic images compare with genuine images.

The purpose of the FID score is to measure synthetic images against the results of a set of genuine images in the target domain [60]. A lower FID therefore indicates better-quality images, whereas a high FID indicates low-quality images, with an approximately linear relationship between the score and image quality.

In this study, we let the generator G produce 5000 images and measured the FID between the synthesized images and the entire training set for datasets with more than 1000 images. We therefore used 1000, 2000, 3000, and 5000 images to compute the FID. Given the significant performance difference between our proposed model and the comparable models, the FID is likely to be consistent with other metrics; thus, it was unnecessary to implement them. The FID, d², provided in [56] is computed as follows:

d²(x, g) = ‖μ_x − μ_g‖² + Tr(Σ_x + Σ_g − 2(Σ_x Σ_g)^(1/2)),

where x denotes the genuine images, g denotes the generated images, and μ and Σ are the mean and covariance of their respective feature distributions.

The results are tabulated in Table 3 and Table 4. The 1024² resolution for StyleGAN is also limited, given the increased training time. We compared LiWGAN and StyleGAN at 256² and 1024² resolutions. While StyleGAN struggles to converge on the high-resolution datasets, LiWGAN successfully learns the style representations along the channel dimension on the ''excited'' layers (i.e., the feature maps at 256² and 512² resolution).
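A short sketch of the d² computation above is given below using NumPy and SciPy; in practice the means μ and covariances Σ are computed from Inception-network activations, which this toy example replaces with random stand-in features.

```python
import numpy as np
from scipy import linalg

def fid(mu_x, sigma_x, mu_g, sigma_g):
    """Squared Frechet distance between two Gaussians fitted to the feature
    distributions of genuine (x) and generated (g) images."""
    diff = mu_x - mu_g
    covmean = linalg.sqrtm(sigma_x @ sigma_g)  # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean)

# Toy usage with random 64-dimensional stand-in features.
feats_x = np.random.randn(1000, 64)
feats_g = np.random.randn(1000, 64)
score = fid(feats_x.mean(0), np.cov(feats_x, rowvar=False),
            feats_g.mean(0), np.cov(feats_g, rowvar=False))
```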

Collecting large-scale image datasets is expensive for a particular character, genre, or subject. On such few-shot datasets, a data-efficient model is valuable for image generation. The computational cost comparison is tabulated in Table 3, which presents the normalized models for the with-mask and without-mask data on an NVIDIA Tesla P100 GPU with high-memory virtual machines (VMs), implemented using PyTorch in Google Colab Pro. The comparison evaluated the training time per 10,000 iterations and the GPU memory used during training at 256² resolution for multiple batch sizes (8, 16, and 32) over a total of 1000 images. We ran StyleGAN on the same dataset to obtain a fair comparison with our proposed model.
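As a rough illustration of how per-iteration training time can be measured on a GPU, the sketch below times a generic training step; `train_step` is an assumed callable running one generator/discriminator update and is not part of the paper's code.

```python
import itertools
import time

import torch

def time_per_iteration(train_step, loader, warmup=10, measure=100):
    """Average seconds per training iteration, synchronizing around the
    timed region so pending asynchronous CUDA kernels are counted."""
    batches = itertools.cycle(loader)  # avoid exhausting a short loader
    for _ in range(warmup):            # warm-up lets cuDNN autotuning settle
        train_step(next(batches))
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(measure):
        train_step(next(batches))
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / measure
```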

The results were compared with StyleGAN to determine the most effective and efficient training time and training GPU memory. According to the findings, StyleGAN took five hours of training time and 7.62 GB of training GPU memory for batch size 8, which is a longer training time and a larger GPU footprint compared with our proposed model.

We also tested our model on the facemask dataset with sufficient training samples for a more thorough evaluation. We trained the full StyleGAN for approximately four to five days on the facemask dataset with a batch size of 8 on two Tesla P100 GPUs. In contrast, we trained our model for only 24 hours, with a batch size of eight, on a single GPU. The standard method for calculating FID is to generate 50,000 images and use the entire training set as the reference distribution. We computed the FID for 1000, 2000, 3000, and 5000 images at 1024² resolution. Compared with StyleGAN, as tabulated in Table 6, the results show that LiWGAN can work with many datasets on a minimal computing budget. In addition, the FID results show that LiWGAN consistently improves distance performance compared with StyleGAN.

We experimented with the two proposed modules in Table 5, where both SLE and decoding-on-D (decode) separately boost model performance. The results show that the two modules are orthogonal to each other in improving model performance, with the self-supervised D making the most significant contribution. Significantly, StyleGAN diverges rapidly after longer training, whereas our model is less likely to suffer mode collapse on the tested datasets. Furthermore, our model maintains good synthesis quality. In contrast, a classification task does not guarantee that D covers the entire image; instead, it drives D to focus only on small regions, because the model can find class-discriminative cues there.

The comparison with state-of-the-art methods is tabulated in Table 6. These methods utilized various datasets with image synthesis and non-image synthesis. We compared the FID to analyze the methods based on the image quality relative to the original images. An FID is acceptable when it is close to zero; however, this depends on the dataset used to compute it. Therefore, our study analyzed the FID scores for the datasets that consist of non-image synthesis to produce a fair comparison with the state-of-the-art methods.

The findings show that LiWGAN generated a 142.78 FID score on a non-image synthesis dataset, which is considered high compared with StyleGAN, which achieved the lowest FID on an image synthesis dataset. StyleGAN performed an image synthesis task using the LSUN datasets as its training images. Image synthesis involves higher-resolution images than non-image synthesis; non-image synthesis embeds the training images in a low-dimensional space and thus yields poorer synthesis quality.

LiWGAN produced a slightly lower FID score than CycleGAN on Zebra-to-Horse, MSGAN on Maps, and DRAGAN on CelebA. In this case, the Zebra-to-Horse dataset involves image synthesis, while the other datasets are non-image synthesis datasets. However, LiWGAN's processing time per iteration is a few seconds longer than CycleGAN's when generating images, while the authors of MSGAN and DRAGAN did not report specific processing times.

On the other hand, the processing time of GL-GAN, at 12.24 s on the CelebA dataset (non-image synthesis with a good FID score), is longer than that of the other state-of-the-art methods, including LiWGAN. Furthermore, the authors optimized the local bilevel model for poor image quality; hence, GL-GAN enhances poor-quality images into high-quality non-image synthesis, which helps reduce the FID score toward zero when generating images.

We validated LiWGAN on the CelebA dataset for non-image synthesis and compared the FID score and processing time per iteration. The results tabulated in Table 7 show that LiWGAN produced a better FID score, at 91.31, than the GL-GAN and DRAGAN methods. The processing time per iteration increased considerably, from 1.03 s for the 142.78 FID to 3.50 s for the 91.31 FID. Despite this, LiWGAN showed a significant improvement in FID score and processing time compared with GL-GAN.