MSFSA-GAN: Multi-Scale Fusion Self Attention Generative Adversarial Network for Single Image Deraining

Bad weather such as rain seriously degrades image quality and the accuracy of visual processing algorithms. To improve image deraining quality, a multi-scale fusion self-attention generative adversarial network (MSFSA-GAN) is proposed. The network extracts rain line features from the input at different scales. First, Gaussian pyramid rain maps at different scales are generated with a Gaussian kernel. Then, to extract rain line features at different scales, a coarse fusion module and a fine fusion module are designed. Next, the extracted features are fused across scales. In this process, a self-attention mechanism is introduced so that the network focuses on the features extracted at each scale. Before the fusion, a rain pattern reconstruction operation is also carried out, so that the network can reproduce the rain information in the input image more completely. Finally, the generated result is fed into a discriminator network with dense blocks to obtain the image with the rain lines removed. We used the R100H and R100L datasets to train and test our network. The results show that our method reaches 27.79 in PSNR and 0.94 in UQI, which is superior to the existing methods. We also compared the time cost: our network needs only 0.02 s.


I. INTRODUCTION
Rain always degrades image quality, making images blurred, distorted, and poorly visible. This affects outdoor computer vision systems in tasks such as vision-based environment perception for unmanned vehicles, pedestrian and road sign detection, and tracking. Clean, clear, and visible images are very important: reliable images help people complete visual tasks more accurately and efficiently, which can reduce or even avoid unnecessary accidents. However, in bad weather, especially on rainy days, raindrops seriously affect the visual quality of the image and bring great difficulties to various visual tasks. Therefore, research on image deraining is necessary. The task is challenging. Firstly, due to the fog brought by rain, the image becomes blurred, which makes it difficult to remove the rain stripes while retaining enough background details.
The associate editor coordinating the review of this manuscript and approving it for publication was R. K. Tripathy.
For a single image, it is difficult to eliminate raindrops. Many scholars have proposed methods for removing raindrops from a single image, which are mainly divided into model-based methods and deep learning methods. Model-based methods include guided filtering [1], the low-rank appearance model [2], nonlocal mean filtering [3], the Gaussian mixture model [4], contrast enhancement [5], and the scattering repair method [6]. For rain removal from a single image, Kim et al. [3] first detected the position of the rain lines in the image and then eliminated them with a nonlocal mean filter. The disadvantage is that rain lines remain in the image after rain removal and the image becomes blurred. The scattering restoration method in [6] uses the pixel values near the scattering line to fill the image area affected by raindrops. However, this method requires manual setting of the area to be scanned. Luo et al. [7] used discriminative sparse coding to distinguish and separate the rain lines from the background, which can remove most of the rain lines. However, when the background and the rain lines have the same gradient, the background is classified as rain lines, so that the image loses detail information and becomes blurred. He et al. [8] classified the rain line density and then selected the appropriate rain removal method according to the corresponding density label. Fu et al. [9] used deep learning to extract features from the high-frequency part of the image to improve the visual quality of the restored image, but rain marks remain on the derained image, so the rain removal effect is unsatisfactory. The rain removal algorithm that decomposes the rain image into high-frequency and low-frequency parts does not need preprocessing and has a wide range of applications.
Deep-learning-based single-image rain removal makes full use of the feature information in the image, achieves a better rain removal effect, and is widely recognized. This paper further studies and improves the rain removal algorithm based on image decomposition and deep learning.
In addition to the methods above, there are learning-based methods such as dictionary learning with sparse coding [10], convolutional neural networks (CNN), recurrent neural networks (RNN), and generative adversarial networks (GAN) [11]. In recent years, data-driven methods have been used to remove rain, and data-driven and deep learning approaches have been combined, but problems such as instability and convergence difficulties remain.
To address these problems, the detailed feature information of the image background should be retained as much as possible while the rain lines are removed. Inspired by [12]-[17], we explore the multi-scale representation and the neural network representation of the input image in a unified framework, and propose a self-attention generative adversarial network based on multi-scale fusion. Specifically, we first use a Gaussian kernel to successively downsample the original image and generate a Gaussian pyramid of rain images. The coarse fusion module (CFM) then obtains global information from the multi-scale rain images through recurrent calculation. The fine fusion module uses the results of the CFM to further extract features and sends them to the multi-scale fusion module to obtain the output of the generator.
The output of the generator and the real rain-free image are used as the input of the discriminator. They pass through the dense block (DB) to the global pooling layer, where the feature values of the two images are compared, and the result is output after a fully connected (FC) layer and a sigmoid [14]. The contributions of our paper are as follows.
The first contribution is an improved generator for the generative adversarial network. We use a Gaussian function to build a Gaussian pyramid and thus obtain inputs of three sizes. The generator extracts the features of each size and finally fuses the features of the three sizes, so as to extract as much raindrop information as possible.
The second contribution is an improved discriminator network. Compared with the traditional GAN, we introduce dense blocks that make full use of the output of each convolution layer. This effectively suppresses gradient explosion and makes the network converge faster. The features are then globally pooled.
The third contribution is that we propose a multi-scale fusion self-attention generative adversarial network (MSFSA-GAN) and compare it with other networks in terms of peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), universal image quality index (UQI), and mean square error (MSE) in the same experimental environment. Our algorithm is found to perform better in PSNR and UQI.

II. RELATED WORK
Scholars have proposed many methods to deal with raindrops in images. Considering the popularity of deep learning methods, these methods are divided into two categories according to whether deep learning is used: data-driven solutions and model-based solutions.

A. MODEL-BASED SOLUTION
Li et al. [4] imposed constraints on the background layer and the rain layer, and learned priors for both layers based on the Gaussian mixture model, so as to adapt to multi-directional and multi-scale rain line information. Elad et al. [18] developed a sparse coding method that represents the input vector as a sparse linear combination of basis vectors, which can be used to reconstruct the image after rain removal. Lin et al. [19] used bilateral filters to decompose the image into low-frequency and high-frequency parts. By using dictionary learning and sparse coding, the rain components can be removed from the image while most of the detailed features of the original image are retained. In follow-up work, Luo et al. [7] strengthened the sparsity of the rain model and used highly discriminative sparse coding over a learned dictionary with strong mutual exclusivity. In this way, the rain layer and the derained image layer are separated more accurately. To further improve the visibility of rain images, Zhu et al. [20] constructed a global sparse model containing three sparse terms to remove rain stripes from a single rain image, and adopted the alternating direction method of multipliers to ensure that the global optimal solution can be obtained.

B. DATA-BASED SOLUTION
Since 2017, deep learning has developed rapidly. Yang et al. [21] constructed a recurrent rain stripe detection and removal network, which can remove the rain stripes in the binary rain stripe map and can also complete rain removal for overlapping rain stripes, such as in heavy rain. At the same time, inspired by the deep residual network, Fu et al. [22] constructed the deep detail network. By focusing on the high-frequency details and using a priori knowledge to eliminate the interference of the background, the model can focus on the structure of the rain in the image. The network is convenient for removing background information, but it cannot deal with sharp raindrop components.
VOLUME 10, 2022
After these articles were published, many scholars began to apply CNNs to image rain removal, such as [23]-[26]. These methods build more advanced network architectures, which not only improve the accuracy of rain removal over their predecessors but also provide new ideas for later innovation in neural networks. However, because these networks rely on supervised learning and usually need synthesized rain images, their results on real rain images that do not appear in the training set are not as good as on the images they were trained on.
After that, the GAN [27] was proposed and became popular. Its generator-discriminator structure is used for image rain removal and reduces the difference between the generated results and the real image. The typical network structure includes two parts: a generator and a discriminator. The generator produces the derained image from the input. The discriminator compares the generator output with the real rain-free image and passes the feedback back to the generator, which makes the accuracy of GAN for image rain removal and denoising higher than that of other neural networks. Zhang et al. [28] extended the GAN into a conditional generative adversarial network (CGAN) and used it for image rain removal. The network takes the derained image and its corresponding real rain-free image as an additional constraint, and achieves excellent visual performance. However, if the rain images used in testing differ from those used in training, the result is still not ideal, although it generalizes better than traditional neural networks such as CNN.
Figure 1 shows the overall structure of MSFSA-GAN, which removes rain from images by using feature information at different scales. The following sections give the specific details of each structure and of the loss functions.

A. GENERATOR
1) MULTI-SCALE COARSE FUSION
For a given rain image, the first step is to downsample it at different scales with a Gaussian kernel to generate a Gaussian pyramid of rain maps. The network takes the pyramid rain maps of different scales as input and extracts shallow features through a convolution layer, which makes the extracted feature information richer and helps to improve the rain removal ability. Starting from the initial features at different scales, the coarse fusion module (CFM) extracts deep features and fuses rain images of different scales through multiple parallel residual recursive units (RRU), as shown in Fig. 2. The input of an RRU is the feature maps of different sizes. These feature maps are coarsely fused by its core Conv-LSTM unit, and the number of parameters is reduced through skip connections. Finally, the coarsely fused feature map is output. The coarse fusion module is designed mainly for two purposes:
a) Exploiting the repetition of rain bands at the same scale, recursive calculation and residual learning are used to extract global texture information. Specifically, Conv-LSTM is introduced to model the spatial texture context. Because a plain convolution network cannot always remember information across different dimensions, we add an LSTM, which has a memory function, on top of a separate convolution layer.
b) Multi-scale fusion greatly enlarges the receptive field, which lets the network pay more attention to the content of rain images.
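The pyramid construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the 5-tap binomial kernel, the replicated borders, the factor-2 subsampling, and the function names are all assumptions made for illustration, and the image is a plain grayscale list of rows.

```python
# Hypothetical sketch of a 3-level Gaussian pyramid (kernel and names are assumptions).
KERNEL = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]  # 5-tap binomial approximation of a Gaussian


def blur_1d(values):
    """Blur one row or column with KERNEL, replicating the border pixels."""
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(KERNEL):
            j = min(max(i + k - 2, 0), n - 1)  # clamp the index at the borders
            acc += weight * values[j]
        out.append(acc)
    return out


def gaussian_blur(img):
    """Separable blur: rows first, then columns."""
    rows = [blur_1d(row) for row in img]
    cols = [blur_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def downsample(img):
    """Blur, then keep every second pixel in both directions."""
    blurred = gaussian_blur(img)
    return [row[::2] for row in blurred[::2]]


def gaussian_pyramid(img, levels=3):
    """Return [full size, 1/2 size, 1/4 size] for levels=3."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid
```

Each level of the returned list would then be fed to its own branch of the generator; blurring before subsampling is what distinguishes a Gaussian pyramid from plain decimation.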

2) MULTI-SCALE FINE FUSION
The input of the fine fusion module is the output of the coarse fusion module. As shown in Fig. 1, for convenience, coarse fusion and fine fusion are set up as similar multi-scale structures. The difference is that the fine fusion module introduces a self-attention mechanism, which improves the network's ability to distinguish rain bands from non-rain regions and thus its ability to learn features. We also use strided convolution to reduce the spatial dimension and lower the computational burden of the model. As shown in Figure 3, URAB is composed of several CAU modules with skip connections, and long skip connections are also adopted between fine fusion modules to realize the progressive fusion of multi-scale rain information. F3C64S2 denotes a convolution with a filter size of 3, 64 channels, and a stride of 2. CAU is the introduced channel attention unit, shown as the part below the pink arrow, which is composed of four convolution layers and a global pooling layer.
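To make the layer naming and the effect of the stride concrete, the following small sketch parses a name such as F3C64S2 and applies the standard convolution output-size formula. The helper names and the padding value of 1 are assumptions; the paper does not state the padding.

```python
import re


def parse_conv_spec(spec):
    """Split a name such as 'F3C64S2' into filter size, channels and stride."""
    match = re.fullmatch(r"F(\d+)C(\d+)S(\d+)", spec)
    if match is None:
        raise ValueError(f"unrecognised layer name: {spec}")
    kernel, channels, stride = (int(g) for g in match.groups())
    return {"kernel": kernel, "channels": channels, "stride": stride}


def conv_output_size(size, kernel, stride, padding):
    """Standard formula: floor((size + 2 * padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


spec = parse_conv_spec("F3C64S2")
# With an assumed padding of 1, a stride of 2 halves a 250-pixel side.
half = conv_output_size(250, spec["kernel"], spec["stride"], padding=1)
```

Halving each spatial side cuts the number of positions a later layer has to process to roughly a quarter, which is the computational saving the strided convolution is used for.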

3) RAIN STREAK RECONSTRUCTION
In order to integrate the features extracted by the fine fusion module, we introduce a reconstruction module (RU), as shown in Figure 1. This module enables our generator to present the rain pattern information in the input image more completely, because we find that the coarse fusion module and fine fusion module alone cannot extract all of the rain pattern information. Specifically, the output of the fine fusion module is connected with the input of the fine fusion module, and a CNN is then used to compare the correlation of the two. Similarly, the rain information at the different pyramid levels is iteratively sampled and fused to estimate the rain image. The module contains three convolution layers, which extract the features of the three scales respectively, and a reconstruction layer that integrates and outputs these features.

B. DISCRIMINATOR
The discriminator takes as input either the image produced by the generator or the real rain-free image. It is composed of a dense block (DB), two convolution blocks, a global average pooling layer, and FC + Sigmoid. The specific structure of the dense block is shown in Fig. 4. It is composed of four convolution layers and three filter concatenation layers, and mainly extracts the features of the input image. The result is sent to the convolution blocks for further feature extraction, and the feature maps are then flattened by the global average pooling layer. Finally, the FC + Sigmoid layer is used to judge the authenticity of the image.
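The last two stages of the discriminator can be sketched as follows. This is only an illustration of global average pooling followed by a single fully connected unit and a sigmoid; the function names, the weights, and the single-output FC layer are assumptions, and the dense block and convolution blocks are left out.

```python
import math


def global_average_pool(feature_maps):
    """Reduce each channel (a 2D list) to its mean, giving one value per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]


def fc_sigmoid(features, weights, bias):
    """One fully connected unit followed by a sigmoid: a score in (0, 1)."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Two hypothetical 2x2 feature maps standing in for the convolution block output.
maps = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 0.0]]]
pooled = global_average_pool(maps)           # one mean per channel
score = fc_sigmoid(pooled, [0.0, 1.0], 0.0)  # closer to 1 means "judged real"
```

Because the pooling removes the spatial dimensions, the FC layer sees a vector whose length equals the number of channels regardless of the input image size.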

C. LOSS FUNCTION
1) GENERATOR
Usually, the loss function uses the mean square error (MSE), but this produces blurred and over-smoothed results, and some information is lost because of the squared term. Therefore, inspired by [29], [30], [32], we approach the real rain band step by step, which improves the tolerance of small errors and gives good convergence during training. The function is expressed in formula (1). In formula (1), $I^{*}_{PR}$ represents the predicted rain image, the predicted rain-free image $I_{Derain}$ is generated by subtracting $I_{GR}$ from the rain image, and $\varepsilon$ is the penalty factor, which is set to $10^{-2}$.
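Formula (1) itself is not reproduced in this text. A penalty that tolerates small errors and contains a small constant $\varepsilon$ is commonly written in the Charbonnier form $\sqrt{(a - b)^2 + \varepsilon^2}$, so the sketch below assumes that form; this is a guess at the intended shape, not the authors' stated formula, and the function name is hypothetical.

```python
import math


def charbonnier_loss(pred, target, eps=1e-2):
    """Mean of sqrt((p - t)^2 + eps^2) over all pixels (assumed form of the loss)."""
    total = 0.0
    count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += math.sqrt((p - t) ** 2 + eps ** 2)
            count += 1
    return total / count
```

For large differences the penalty grows roughly linearly, so outliers are weighted less heavily than under MSE, while $\varepsilon$ keeps the gradient well defined at zero difference.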
In order to preserve the information of other parts of the image while removing raindrops, we use an edge loss function that constrains the high-frequency components of the ground truth $I_{Clean}$ and the predicted rain-free image $I_{Derain}$, which is defined in formula (2). $L(I_{Clean})$ and $L(I_{Derain})$ in formula (2) are the edge maps extracted from $I_{Clean}$ and $I_{Derain}$ by the Laplace operator [31]. The total loss function is given in formula (3), where $\lambda$ is the weight parameter that balances the loss terms; it is set to 0.05 following [17].
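As a rough illustration of an edge loss built on Laplacian edge maps and of the weighted total loss, consider the sketch below. Formulas (2) and (3) are not reproduced in this text, so the 4-neighbour Laplacian kernel, the $\sqrt{(a - b)^2 + \varepsilon^2}$ penalty, the additive form `content + lam * edge`, and all function names are assumptions.

```python
import math

# Common 4-neighbour discrete Laplacian (assumed; the paper only names the operator).
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]


def laplacian_edges(img):
    """Apply the 3x3 Laplacian to interior pixels; border pixels are dropped."""
    height, width = len(img), len(img[0])
    return [
        [
            sum(
                LAPLACIAN[a][b] * img[y + a - 1][x + b - 1]
                for a in range(3)
                for b in range(3)
            )
            for x in range(1, width - 1)
        ]
        for y in range(1, height - 1)
    ]


def edge_loss(clean, derained, eps=1e-2):
    """Mean robust distance between the two edge maps (assumed penalty form)."""
    edges_a = laplacian_edges(clean)
    edges_b = laplacian_edges(derained)
    terms = [
        math.sqrt((p - q) ** 2 + eps ** 2)
        for row_a, row_b in zip(edges_a, edges_b)
        for p, q in zip(row_a, row_b)
    ]
    return sum(terms) / len(terms)


def total_generator_loss(content, edge, lam=0.05):
    """Weighted sum of the two terms, with lam = 0.05 as stated in the text."""
    return content + lam * edge
```

With $\lambda = 0.05$ the edge term acts as a light regulariser: it nudges the generator to keep high-frequency structure without dominating the main reconstruction term.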

2) DISCRIMINATOR
The loss function of the discriminator includes the cross loss and the feature mapping loss. The feature mapping loss is described by formula (4), in which x is the original rain map, G(x) is the rain-free map produced by the generator network, and D(G(x)) is the output of the discriminator. The cross loss is given by equation (5), where y corresponds to a clean rain-free image and x is the original rain image. The overall loss of the discriminator is given by formula (6):
$$L_{Discriminator} = L_{cl} + \beta \times L_{fm} \qquad (6)$$
where $\beta$ is the weight of the feature mapping loss, which is equal to 10.
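The combination in formula (6) can be sketched numerically as follows. Formulas (4) and (5) are not reproduced in this text, so the sketch assumes that the cross loss is a binary cross-entropy over the discriminator scores for the real image y and the generated image G(x), and it takes the feature mapping loss as a precomputed number. Both choices and all names are assumptions.

```python
import math


def binary_cross_entropy(prob, label):
    """Cross-entropy for one sigmoid output, clipped to avoid log(0)."""
    prob = min(max(prob, 1e-7), 1.0 - 1e-7)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))


def discriminator_loss(d_real, d_fake, feature_map_loss, beta=10.0):
    """L_cl + beta * L_fm, with beta = 10 as stated in the text.

    d_real: discriminator output D(y) for the clean image.
    d_fake: discriminator output D(G(x)) for the generated image.
    feature_map_loss: value of L_fm, computed elsewhere (its form is not assumed here).
    """
    l_cl = binary_cross_entropy(d_real, 1.0) + binary_cross_entropy(d_fake, 0.0)
    return l_cl + beta * feature_map_loss
```

A weight of 10 means that even a small feature mapping discrepancy contributes noticeably compared with the classification term.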

A. DATASET
Datasets are indispensable for deep learning. A large amount of data is often required for training and testing, especially paired rain and clean images. Many scholars are now trying to synthesize rain images, but most of the synthesized images are limited in rain density, scale, and other aspects. In order to solve these problems, we manually select the data used for synthesis from some existing datasets and websites. Considering the density and realizability of the network, we use the R100H and R100L datasets to train and test the network. The data set contains 1000 real rain-free images and 14000 simulated rainy images, which are composed by adding rain lines of different directions and sizes to these rain-free images. For training, 5000 images with rain lines of different directions and sizes are selected as the training set, and 300 images are randomly selected from the remaining images for testing.
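The split described above could be reproduced along the following lines. The shuffling of the training candidates, the fixed seed, and the function name are assumptions; the text only states the set sizes and that the test images are drawn at random from the remainder.

```python
import random


def split_dataset(image_ids, n_train=5000, n_test=300, seed=0):
    """Pick n_train training images, then n_test test images from the rest."""
    rng = random.Random(seed)  # fixed seed (assumed) so the split is repeatable
    shuffled = list(image_ids)
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    test = rng.sample(shuffled[n_train:], n_test)
    return train, test


# 14000 hypothetical image indices standing in for the simulated rainy images.
train_ids, test_ids = split_dataset(range(14000))
```

Drawing the test images only from the remainder guarantees that no test image was seen during training.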

B. IMPLEMENTATION DETAILS
The method in this paper is implemented in the PyTorch framework. The hardware used in the experiments is an AMD Ryzen 5 3600 6-core processor at 3.59 GHz and an NVIDIA GTX 2060 with 6 GB of memory. We use the Adam optimizer with piecewise constant learning rate decay for both the generator and the discriminator, so the learning rate takes different values in different iteration intervals. For the generator, we start training with a learning rate of $10^{-4}$, and after 50 epochs the rate is $8 \times 10^{-5}$. The judgment of the discriminator guides the generator at the beginning of training and therefore strongly affects the direction of training. We set the learning rate of the discriminator to $4 \times 10^{-4}$ before 45 epochs and to $10^{-4}$ afterwards. Both the generator and the discriminator are trained for 200 epochs.
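The two schedules can be written as a small piecewise constant function. The zero-based epoch counting and the choice that the new rate applies from the boundary epoch onwards are assumptions, as is the function name.

```python
def piecewise_constant(epoch, boundaries, values):
    """Return values[i] for the first boundary the epoch is below, else the last value."""
    for boundary, value in zip(boundaries, values):
        if epoch < boundary:
            return value
    return values[-1]


def generator_lr(epoch):
    # 1e-4 for the first 50 epochs, then 8e-5
    return piecewise_constant(epoch, [50], [1e-4, 8e-5])


def discriminator_lr(epoch):
    # 4e-4 before epoch 45, then 1e-4
    return piecewise_constant(epoch, [45], [4e-4, 1e-4])


schedule = [(epoch, generator_lr(epoch), discriminator_lr(epoch)) for epoch in range(200)]
```

In a training loop, the value for the current epoch would be written into the optimizer's learning rate at the start of each epoch.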

C. COMPARISON ON SYNTHETIC IMAGES
We compare our method with several strong existing methods on synthetic images, both qualitatively and quantitatively. The first method [13] uses ground-truth supervision to guide the training of generators at different levels. The second method [22] designs a multi-stream dense network using features at different scales. The third method [31] focuses on high-frequency details, simplifies the training process, and improves the rain removal effect. The last method [32] improves the GAN by introducing LSTM and a dense residual network (DRN) into the generator to improve the accuracy of removing rain bands.
In order to verify the rain removal effect of the proposed algorithm, it is compared with the methods proposed in [13], [22], and [32]. The test image is a simulated rain image. The experimental results are shown in Figure 5. Fig. 5 (a) is the original rain image. Fig. 5 (b) is the result of the method of [13]: the rain lines are not removed completely, and the derained image also becomes blurred. Fig. 5 (c) is the result of the method of [22]: the rain removal effect is significantly improved, but rain marks remain on the image. Fig. 5 (d) is the result of the method of [32]: the clarity of the restored image is improved, but the rain lines are still not completely removed. Fig. 5 (e) shows the result of the MSFSA-GAN proposed in this paper. Our method removes the rain lines more thoroughly, the derained image is clear, and the texture and details of the image are well retained, which gives a good visual effect.
Figure 5: (a) is the original rain image. (b) is literature [13]. (c) is literature [22]. (d) is literature [32]. (e) is MSFSA-GAN (ours).
During the test, we selected four typical and commonly used image quality evaluation indexes: a) peak signal-to-noise ratio (PSNR), usually used to measure the quality of processed images; b) structural similarity index measure (SSIM), usually used to measure the similarity of two images; c) universal image quality index (UQI), usually used to measure the quality of a single image; d) mean square error (MSE), used to measure the degree of image change. The corresponding formulas of the four evaluation indexes are given in formulas (7)-(10). In formula (7), $MAX_i$ represents the maximum possible pixel value of the image. If each sample is represented by 8 bits, it is 255. The larger the PSNR, the better the image quality. In equations (8) and (9), suppose that the two input images are X and Y; $\mu_x$ and $\mu_y$ represent the means of X and Y, $\delta_x$ and $\delta_y$ represent the variances of X and Y, $\delta_{xy}$ represents the covariance of X and Y, and $C_1$ and $C_2$ are constants that prevent the denominator from being 0. Equation (10) is defined for two m × n monochrome images I and K.
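Formulas (7)-(10) are not reproduced in this text. The sketch below uses the standard textbook definitions of the four indexes, which are presumably the ones intended. SSIM and UQI are computed here over the whole image as a single window, whereas practical implementations usually average over local windows, and the default values of `c1` and `c2` are the conventional choices rather than values given in the paper. All function names are hypothetical.

```python
import math


def _flatten(img):
    return [float(v) for row in img for v in row]


def mse(a, b):
    """Mean of the squared pixel differences between two equally sized images."""
    x, y = _flatten(a), _flatten(b)
    return sum((p - q) ** 2 for p, q in zip(x, y)) / len(x)


def psnr(a, b, max_val=255.0):
    """10 * log10(MAX^2 / MSE); infinite when the images are identical."""
    err = mse(a, b)
    return math.inf if err == 0 else 10.0 * math.log10(max_val ** 2 / err)


def _stats(a, b):
    """Means, variances and covariance of the two images."""
    x, y = _flatten(a), _flatten(b)
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((p - mean_x) ** 2 for p in x) / n
    var_y = sum((q - mean_y) ** 2 for q in y) / n
    cov_xy = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y)) / n
    return mean_x, mean_y, var_x, var_y, cov_xy


def ssim_global(a, b, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Single-window SSIM; c1 and c2 are the conventional stabilising constants."""
    mx, my, vx, vy, cxy = _stats(a, b)
    return ((2 * mx * my + c1) * (2 * cxy + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))


def uqi_global(a, b):
    """Single-window UQI: the SSIM expression without the constants."""
    mx, my, vx, vy, cxy = _stats(a, b)
    return 4 * cxy * mx * my / ((vx + vy) * (mx ** 2 + my ** 2))
```

Higher PSNR, SSIM, and UQI indicate a result closer to the reference image, while a lower MSE does.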
It can be seen from the experimental results in Table 1 that, because our method extracts image information containing more detailed rain bands from the multi-scale feature maps, the rain bands are removed more thoroughly in the images produced by the generator, and the probability that these images are judged to be real by the discriminator of the adversarial network increases. In other words, the derained image produced by the generator is more realistic. Extracting multi-scale feature maps from rainy images also benefits the subsequent estimation of the nonlinear mapping between rainy and rain-free images, which improves the rain removal ability of the network model.

D. COMPARISON ON SYNTHETIC IMAGES
For the application of image rain removal model, it is very important to remove the rain band on the image in a short time and restore the image to a rain free image. Therefore, this paper also compares the time of removing the rain belt in the same picture. The results are shown in Table 2. All tests were conducted on the GPU. This time, due to limited conditions, only the rain removal time of pictures with the size of 250 × 250 was tested. For images with larger resolution, no test was conducted.

V. CONCLUSION
In this paper, a multi-scale fusion self-attention generative adversarial network (MSFSA-GAN) is proposed, which can remove rain quickly and effectively. Firstly, we enhance the original rain image, use a Gaussian algorithm to transform it into a three-scale Gaussian pyramid, and send it to the generator. In the generator, we add a coarse fusion module, a fine fusion module, and a feature fusion module. In the discriminator, we introduce dense blocks to further extract features and reduce time consumption. We compute four indexes: PSNR, SSIM, UQI, and MSE. The rain removal effect and the time consumption on single images are compared. The results show that our MSFSA-GAN is better in PSNR and UQI. However, our tests were carried out indoors. If conditions permit, we will carry out test verification outdoors in the future.