RASWNet: An Algorithm That Can Remove All Severe Weather Features from a Degraded Image

The advanced driving assistant system (ADAS) is an important vehicle safety technology that can effectively reduce traffic accidents. This system can perceive information about the surrounding environment through in-vehicle cameras. However, these cameras are easily affected by severe weather conditions, such as those involving fog, rain, and snow. The quality of the images acquired by the system is degraded, and the function of the ADAS is thus weakened. In response to this problem, we propose a comprehensive imaging model that can represent the features of fog, rain streaks, raindrops and snowflakes in an image. Subsequently, an algorithm called RASWNet is proposed, which can remove all severe weather features from a degraded image. Based on the generative adversarial network, RASWNet combines the focus capture ability of a visual attention mechanism, the memory ability of the recurrent neural network and the feature extraction ability of the dense blocks approach. We verify the network structure through several ablation studies and use various synthetic and real images to test it. The results of these experiments show that our algorithm is not only better than the commonly used algorithms in terms of its clarity enhancement capacity but is also suitable for all severe weather conditions.


I. INTRODUCTION
With the continuous increase in car ownership, traffic safety is becoming an increasingly serious concern. To improve driving safety, the ADAS market has been growing rapidly [1]-[4]. The four types of ADAS sensors are LIDAR, radar, cameras and ultrasonic sensors [1]. These sensors detect the surrounding environment and obtain the information needed by the system. However, these sensors are costly, and they require continuous maintenance and complex synchronization to fuse the different sources of data [1]. Just as vision is the most important human sense, the camera is the most important perception component of an ADAS. Thus, we use images collected by a low-cost in-vehicle monocular camera as our research object to remedy these limitations.
The associate editor coordinating the review of this manuscript and approving it for publication was Thomas Canhao Xu.
However, the ability of in-vehicle cameras to detect the surrounding environment is easily affected by severe weather conditions, such as fog, rain, and snow. For example, the presence of fog can considerably reduce the visibility and contrast of the images collected by the cameras in addition to blurring the details. Raindrops or snowflakes move and fall rapidly in the air, which will lead to partial occlusion or blurring of the images. More specifically, the raindrops that adhere to the windshield or camera lens will reflect light from other areas, thereby degrading the images [5]. Consequently, the degradation of images caused by fog, rain and snow will not only reduce the driver's response speed but also weaken the functioning of the ADAS.
Currently, there are many clarity enhancement algorithms for a single image degraded by one severe weather condition, such as dehazing algorithms [6]-[9], rain streak removal algorithms [10]-[12], raindrop removal algorithms [5], [13] and snow removal algorithms [14], [41]. There are also some clarity enhancement algorithms for two types of severe weather conditions, such as deraining and desnowing algorithms [15]-[18] and rain streak and mist removal algorithms [33], [40]. If these algorithms are used in an ADAS, there are two challenges. The first challenge is how to recognize the current weather condition before choosing the algorithm. Weather recognition requires expensive equipment, which will substantially increase the cost of the car. The second challenge is that a weather recognition error will lead to a failure in the clarity enhancement. For example, if a rainy image is recognized as a foggy image, the defogging algorithm cannot remove the rain streaks and raindrops, and vice versa. In summary, we need an algorithm that can remove all of the severe weather features from a degraded image. Such an algorithm could be used not only in ADAS and driverless vehicles but also in intelligent monitoring, unmanned aerial vehicles (UAVs) and other fields.
In addition, for the images collected in severe weather conditions, researchers have proposed a variety of imaging models. Although these models are effective, there are still two problems: the first is the lack of uniform standards in addressing the masks of fog, rain streaks, raindrops and snowflakes; the second is the lack of a comprehensive severe weather imaging model. In summary, when we combine all types of degraded images collected in severe weather conditions together for processing, we must build a comprehensive imaging model. Hence, the contributions of our paper are as follows.
The first contribution is that we propose an algorithm called RASWNet, which can remove all severe weather features from a degraded image. Based on the generative adversarial network (GAN), it can use the dense blocks to extract the severe weather features, the visual attention mechanism to capture the regions of the features, and the recurrent neural network (RNN) to remember these regions. It can automatically locate and remove the fog, rain streaks, raindrops and snowflakes in an image, and it has excellent clarity enhancement results.
The second contribution is that we propose a comprehensive imaging model that reflects all types of severe weather features. The model unifies the masks of fog, rain streaks, raindrops and snowflakes, and it conforms to the imaging situation of both single and multiple severe weather conditions.
This paper is organized as follows. Section I describes the research background and significance. Section II briefly reviews the related work. Section III describes the comprehensive imaging model. Section IV proposes RASWNet. Section V describes the datasets used. Section VI describes the experimental research, and Section VII provides the conclusions.

II. RELATED WORK
The research content of this paper involves image defogging, image deraining and desnowing, and severe weather imaging models. This section introduces the relevant research studies for these three aspects. The traditional defogging algorithm mainly uses image restoration technology based on a variety of prior studies [6]- [9], while the rain and snow removal algorithms are mainly based on image decomposition technology [19]- [21]. The results of these algorithms are generally not as good as those based on deep learning (DL). Consequently, the algorithms introduced in this section are based on DL.
A. IMAGE DEFOGGING
Image defogging algorithms based on DL involve the process of training a deep neural network to make the defogging image continuously approach the ground truth image. According to the different network structures, they can be divided into two types: defogging by using a convolutional neural network (CNN) and defogging by using a GAN.
The first is defogging by using a CNN. In 2016, Cai et al. [22] designed an end-to-end CNN model (DehazeNet). It took a hazy image as input and outputted its medium transmission map, which was subsequently used to recover a haze-free image via an atmospheric scattering model. In 2017, Li et al. [23] proposed an all-in-one network (AOD-Net) based on a CNN, which was a lightweight CNN model and could be easily embedded in an object detection algorithm. In 2018, Ren et al. [24] proposed an end-to-end gated fusion network (GFN), which was composed of an encoder and a decoder. The experimental results were better than those of other mainstream algorithms, but it could not remove thick fog. Ancuti et al. [25] collected two hazy image benchmark datasets for related research. The I-HAZE dataset contained 35 scenes corresponding to indoor domestic environments, with objects with different colors and specularities. O-HAZE contained 45 different outdoor scenes depicting the same visual content recorded in haze-free and hazy conditions under the same illumination parameters. Song et al. [26] proposed a novel Ranking-CNN. In this network, a novel ranking layer was proposed to extend the structure of CNN such that the statistical and structural attributes of hazy images could be simultaneously captured. In 2019, Yeh et al. [27] proposed a deep learning-based architecture (denoted by MSRL-DehazeNet) for single-image haze removal relying on multiscale residual learning (MSRL) and image decomposition. They reformulated the dehazing problem as restoration of the image base component.
The second is defogging by using a GAN. In 2018, Zhang et al. [28] proposed a densely connected pyramid dehazing network (DCPDN) that used a new edge-preserving densely connected encoder-decoder structure with a multilevel pyramid pooling module for estimating the transmission map. In 2019, Dudhane et al. [29] proposed a dehazing network by using a cycle-consistent GAN (CDNet), which consisted of an encoder-decoder architecture that was used to estimate the transmission map and restore the haze-free scene. Qu et al. [30] proposed an enhanced pix2pix dehazing network (EPDN). First, the discriminator guided the generator to create a pseudo realistic image on a coarse scale, and then the enhancer following the generator was required to produce a realistic dehazing image on a fine scale.
VOLUME 8, 2020
In general, the defogging results of these algorithms are good. However, they can only be used to remove fog or haze from a single image, not rain or snow.

B. IMAGE DERAINING AND DESNOWING
Similar to the image defogging algorithms, the image deraining and desnowing algorithms can also be divided into two types: deraining and desnowing by using a CNN and deraining and desnowing by using a GAN.
The first type is deraining and desnowing by using a CNN. In 2017, Fu et al. [31] introduced a deep network architecture called DerainNet for removing rain streaks from an image. It had two characteristics: the network layers were not deep, and it was trained on a detailed (high-pass) layer. The experimental results showed that the algorithm was effective and fast. In 2018, Li et al. [32] proposed a nonlocally enhanced encoder-decoder network for single-image deraining, which was composed of a series of nonlocally enhanced dense blocks. It could not only remove rain streaks of various densities but also effectively preserve similar linear details. Liu et al. [14] proposed a context-aware deep network called DesnowNet to remove translucent and opaque snow particles. These researchers also differentiated the snow attributes of translucency and chromatic aberration for accurate estimation. In 2019, Yang et al. [33] proposed a joint rain detection and removal algorithm based on a CNN, which could remove a large number of rain streaks and mist via a contextual dilated network. The experimental results showed that the algorithm had a good effect on heavy rain images. Pei et al. [34] proposed a novel network architecture named multiweather network (MWNet), which could improve the performance of the on-board object detection system under extreme weather conditions. However, it could only recognize good weather and bad weather, and it could not enhance the clarity of the image. Ren et al. [35] proposed a progressive recurrent network (PReNet), which notably reduced network parameters with unsubstantial degradation in deraining performance. The experiments showed that the PReNet performed favorably on both synthetic and real rainy images. Wang et al. [36] proposed a novel spatial attentive network (SPANet) that could learn to identify and remove rain streaks in a local-to-global spatial attentive manner. 
Extensive evaluations demonstrated the superiority of the proposed method over the state-of-the-art derainers. Fu et al. [37] proposed a lightweight pyramid of networks (LPNet) for single-image deraining. By using the pyramid to simplify the learning problem and adopting recursive blocks to share parameters, LPNet had fewer than 8K parameters while still achieving good performance. In 2020, Jiang et al. [38] explored the multi-scale collaborative representation for rain streaks from the perspective of input image scales and hierarchical deep features in a unified framework, termed multi-scale progressive fusion network (MSPFN) for single image rain streak removal. Experimental results on several synthetic deraining datasets and real-world scenarios showed great superiority of their proposed MSPFN algorithm over other top-performing methods.
The second type of algorithm is deraining and desnowing by using a GAN. In 2018, Qian et al. [5] proposed an attention generative network (ATT-GAN) for raindrop removal from a single image, which used the visual attention mechanism to make the generative network learn the raindrop regions and their surroundings, and the discriminative network could evaluate the local consistency of the recovery area. In 2019, Zhang et al. [39] proposed an image-deraining conditional GAN (ID-CGAN) algorithm, which used a new loss function to reduce the artifacts produced by the GAN, and they designed a multiscale discriminator to improve the ability to distinguish real and fake images. Li et al. [40] proposed a 2-stage network: a physics-based backbone followed by a depth-guided GAN refinement. Extensive experiments showed that their method outperformed the state of the art algorithms on real rain image data, recovering visually clean images with good details. Li et al. [41] proposed a snow removal composition GAN (SR-CGAN), which comprised a clean background module and a snow mask estimation module. The former aimed to generate a clear image from an input snowy image, and the latter was used to produce the snow mask in an input image. The experiments showed that the snow removal results of this algorithm were better than those of other similar algorithms.

C. SEVERE WEATHER IMAGING MODELS
In view of the various image degradation types in severe weather conditions, researchers have proposed many imaging models. These models are described as follows.
The first imaging model is for foggy images, and the commonly used atmospheric scattering model [30], [42], [28] is as follows:
$$I = B \odot t + A(1 - t) \tag{II.1}$$
where $I$ is the image degraded by fog, $B$ is the fog-free scene image, $t$ is the medium transmission map, and $A$ is the global atmospheric light value. Here, $\odot$ denotes elementwise multiplication, and $t \in [0, 1]$. $t = 0$ means that the fog concentration is maximum, the scene is completely invisible, and the image shows the atmospheric light value, $I = A$; $t = 1$ means that there is no fog, the scene is completely visible, and $I = B$; the values from 0 to 1 indicate changes in the medium transmission, and the larger the value of $t$ is, the higher the visibility of the scene is.
The second imaging model is for images with rain streaks; Li et al. [43] proposed the following model:
$$I = B + R \tag{II.2}$$
where $I$ is the original image with rain streaks, $B$ is the clean background scene, and $R$ is the component image of the rain streaks. The clean background scene $B$ can be obtained by subtracting the rain streak component $R$ from the image $I$.
The third imaging model is for images with snowflakes; Liu et al. [14] proposed the following model:
$$I = S \odot M_S + B \odot (1 - M_S) \tag{II.3}$$
where $I$ represents the original image with snowflakes, $B$ represents the snow-free image, and $S$ represents the component image of the snowflakes. $M_S$ is a snowflake mask, which indicates the transparency of the snowflakes in the image. Here, $M_S \in [0, 1]$. $M_S = 0$ means no snowflakes, the scene is completely visible, and $I = B$; $M_S = 1$ means that only snowflakes can be seen, the scene is completely invisible, and $I = S$; the values from 0 to 1 mean that the snowflakes are translucent, and the smaller the value is, the higher the visibility of the scene is.
The fourth imaging model is for images with raindrops; Qian et al. [5] proposed the following model:
$$I = (1 - M_D) \odot B + D \tag{II.4}$$
where $I$ represents the original image with raindrops, $B$ represents the background image, and $D$ is the effect brought by the raindrops, which represents the complex mixture of the background information and the light reflected by the environment and passing through the raindrops that adhere to a lens or windscreen [5]. Here, $M_D$ is a raindrop mask that represents the binary state of the raindrop region in the image. When $M_D = 1$, the background is completely invisible, and $I = D$; when $M_D = 0$, the raindrops are overlaid on the completely visible background, and $I = B + D$.
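The four single-weather models reviewed above can be exercised numerically. The following is a minimal NumPy sketch, assuming the standard published forms of these models (atmospheric scattering, additive rain streaks, the snow mask model of [14] and the binary raindrop mask model of [5]). The function names and the toy arrays are hypothetical and are not part of RASWNet.

```python
import numpy as np

def fog_model(B, t, A):
    """Atmospheric scattering: I = B * t + A * (1 - t), elementwise."""
    return B * t + A * (1.0 - t)

def rain_streak_model(B, R):
    """Additive rain streak model: I = B + R."""
    return B + R

def snow_model(B, S, M_S):
    """Snow model: I = S * M_S + B * (1 - M_S), with M_S in [0, 1]."""
    return S * M_S + B * (1.0 - M_S)

def raindrop_model(B, D, M_D):
    """Raindrop model: I = (1 - M_D) * B + D, with a binary mask M_D."""
    return (1.0 - M_D) * B + D

# Toy 2 x 2 grayscale background and weather layers (illustrative values only).
B = np.array([[0.2, 0.4], [0.6, 0.8]])
S = np.ones_like(B)          # pure white snow layer
D = np.full_like(B, 0.5)     # raindrop effect layer
half_fog = fog_model(B, np.full_like(B, 0.5), A=1.0)
print(half_fog)              # every pixel is pulled halfway toward A
```

With `t = 0.5` and `A = 1.0`, each output pixel is the average of the background value and the atmospheric light, which is the contrast loss that the text attributes to fog.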

III. COMPREHENSIVE IMAGING MODEL
The four imaging models described in Section II.C are effective when processing image degradation caused by a single severe weather condition, but there are two problems when processing degraded images caused by combinations of various severe weather conditions. The first problem is the lack of uniform standards in addressing the masks of fog, rain streaks, raindrops and snowflakes. The mask is the transparency of the image in the fog, rain streak, raindrop and snowflake regions, which is represented by $t$, $M_R$, $M_D$ and $M_S$, respectively.
First, there is no rain streak mask $M_R$ in (II.2) when processing an image with rain streaks. The authors of [43] assumed that, in the rain streak region, the pixel intensity of the background image $B$ does not decrease when rain streaks are overlaid on it. However, it is weakened in the actual image, as shown in Fig. 1(a). The intensity of the red exterior wall in the black box (rain streak region) of the right figure is weaker than that in the white box (no rain region). It can be seen that the rain streak has an impact on the intensity of the image background, and the degree of the impact varies with the location. Therefore, it is unreasonable to omit the rain streak mask $M_R$ from (II.2).
Second, when processing an image with raindrops, (II.4) indicates that, when the raindrop mask $M_D$ is 0, the background is visible without attenuation. However, this is not the case in the actual image, as shown in Fig. 1(b). The intensity of the blue windows in the black box (the raindrop region) is weaker than that of the blue windows in the white box (the nonraindrop region). Therefore, it is unreasonable to restrict the raindrop mask $M_D$ in (II.4) to binary values.
Third, when processing an image with snowflakes, according to (II.3), when the snowflake mask $M_S$ is between 0 and 1, there is translucency. The smaller the value is, the higher the background visibility is, as shown in Fig. 1(c). The gray floor tiles in the black box of the right figure are located in the snowflake region (the reflection of a snowflake), and the intensity value of the floor tiles is weaker than that of the snow-free region in the white box. Therefore, (II.3) is reasonable.
Finally, when processing a foggy image, according to (II.1), when the transmission map t is between 0 and 1, there is translucency. The larger the value is, the higher the background visibility is, as shown in Fig. 1(d). The streetlight in the black box of the left figure is clearly visible, but the streetlights in the red box enlarged in the right figure are faintly visible. Because the former is nearer, t is larger; the latter is farther away, and t is smaller. Therefore, (II.1) is reasonable.
In summary, the intensity values of the background scenes in the fog, rain streak, raindrop and snowflake regions are all weakened; thus, the mask factors $t$, $M_R$, $M_D$ and $M_S$ must be considered when establishing the imaging model. With reference to (II.3), we change the rain streak imaging model from (II.2) to (III.1), and we change the raindrop imaging model from (II.4) to (III.2). The expressions are as follows:
$$I = R \odot M_R + B \odot (1 - M_R) \tag{III.1}$$
$$I = D \odot M_D + B \odot (1 - M_D) \tag{III.2}$$
The second problem is the lack of a comprehensive imaging model of severe weather conditions. We must build an imaging model to address the above four types of severe weather conditions together.
By combining (II.1), (II.3), (III.1) and (III.2), we can obtain a comprehensive imaging model of severe weather conditions, denoted (III.3). If there is only one severe weather condition, (III.3) can be directly converted into (II.1), (II.3), (III.1) or (III.2). If there are multiple severe weather conditions, such as rain streaks and raindrops in an image, then the masks are $M_R$ and $M_D$. Because of the influence of $(1 - M_R)(1 - M_D)$, the intensity of the background scene located in both the raindrop and rain streak regions will be weaker than that in a region with only raindrops or only rain streaks.
If we place the background $B$ on the left side of the equal sign, (III.3) is transformed into (III.4). According to (III.4), the background scene $B$ can be obtained by removing the severe weather features. Note that $t$ cannot be 0 and that $M_R$, $M_D$ and $M_S$ cannot be 1, because these values mean that the background is completely occluded; in that case, $B$ should be 0. Except for the image $I$, all of the variables on the right side of the equation are unknown, which makes this a typical ill-posed problem. If we estimate the atmospheric light value $A$, rain streak $R$, snowflake $S$, raindrop $D$ and each mask and then combine them into a clean background scene $B$, there will be a large cumulative error, and the generated $B$ will be distorted. Consequently, we take the right side of (III.4) as a whole and design a deep neural network. By training the network, the loss value tends to the minimum (i.e., the image with all severe weather features removed approaches the ground truth image), and a clean background scene image $B$ can be obtained.
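A minimal NumPy sketch of the comprehensive model follows. It assumes one form of (III.3) that satisfies every property stated above (reduction to the single-weather models, the $(1 - M_R)(1 - M_D)$ attenuation, and the restrictions $t \neq 0$ and $M \neq 1$), namely $I = B \odot t \odot (1 - M_R) \odot (1 - M_D) \odot (1 - M_S) + A(1 - t) + R \odot M_R + D \odot M_D + S \odot M_S$. This form, the function names and the toy values are assumptions for illustration. RASWNet itself does not estimate these terms separately, as explained above.

```python
import numpy as np

def compose(B, t=1.0, A=0.0, R=0.0, M_R=0.0, D=0.0, M_D=0.0, S=0.0, M_S=0.0):
    """Assumed forward model: attenuated background plus weather layers."""
    visible = t * (1.0 - M_R) * (1.0 - M_D) * (1.0 - M_S)
    return B * visible + A * (1.0 - t) + R * M_R + D * M_D + S * M_S

def recover(I, t=1.0, A=0.0, R=0.0, M_R=0.0, D=0.0, M_D=0.0, S=0.0, M_S=0.0):
    """Assumed inverse: solve the forward model for the background B."""
    visible = t * (1.0 - M_R) * (1.0 - M_D) * (1.0 - M_S)
    if np.any(np.isclose(visible, 0.0)):
        # t = 0 or a mask equal to 1 means the background is fully occluded.
        raise ValueError("background is completely occluded; B is not recoverable")
    return (I - A * (1.0 - t) - R * M_R - D * M_D - S * M_S) / visible

B = np.array([[0.2, 0.4], [0.6, 0.8]])
params = dict(t=0.7, A=0.9, R=0.8, M_R=0.3, D=0.5, M_D=0.2, S=1.0, M_S=0.1)
I = compose(B, **params)
B_hat = recover(I, **params)   # equals B when every parameter is known exactly
```

The round trip only works because all parameters are known here. With a real image none of them is known, which is the ill-posedness that motivates learning the mapping from $I$ to $B$ directly.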

IV. RASWNET
To obtain a clear image in any severe weather condition, we propose an algorithm called RASWNet that can remove all severe weather features from a degraded image. It is based on the GAN, and it uses the technology of the visual attention mechanism, the RNN and the CNN. The overall structure of RASWNet is shown in Fig. 2. It can be seen that RASWNet consists of a generator and a discriminator. The generator consists of an AttDenseGRU network and a Stacked autoencoder network. The input severe weather image is sent to the AttDenseGRU, which uses dense blocks and gated recurrent units (GRUs) to generate attention maps, and it outputs the last attention map to the Stacked autoencoder. The Stacked autoencoder sends the generated image to the discriminator, and the discriminator can distinguish whether the image is real or fake. The generated image also participates in the calculation of the loss function. As a label, the ground truth image is sent to the AttDenseGRU, the Stacked autoencoder and the discriminator. In addition, it is used to calculate the loss function.

A. ATTDENSEGRU NETWORK
The AttDenseGRU network is a structure that combines the CNN and the RNN. Its purpose is to locate the regions of severe weather features from the input image (including fog, raindrop, rain streak or snowflake) that need to be removed and the pixels around them. It can generate the attention maps that highlight the regions that must be removed in the image. On the one hand, the attention map is the reference of the Stacked autoencoder to remove fog, rain streaks, raindrops and snowflakes. On the other hand, it is one of the parameters of the loss function of the Stacked autoencoder and the discriminator. The overall structure of the AttDenseGRU is shown in Fig. 3.
It can be seen that an AttDenseGRU consists of three network parts with the same structure, and each of them is composed of five dense blocks, one GRU and one Conv layer. The input image is sent to the first network part to generate attention map 1. Subsequently, attention map 1 and the input image are concatenated into the second network part to generate attention map 2. Finally, attention map 2 and the input image are concatenated into the third network part to generate attention map 3. In the network, three GRUs are also connected in sequence, thus forming an RNN structure. The most commonly used RNN unit is the long short-term memory (LSTM) unit [44], but its structure is complex. Thus, the GRU, which is simpler than the LSTM, is adopted as a unit of the RNN. The memory ability of the GRU will gradually improve the attention level through each network part. The regions of fog, raindrops, rain streaks and snowflakes to be removed become more and more highlighted in the attention map. The changes of the highlighted fog regions of attention maps 1 to 3 can be seen in Fig. 3. The function of a Conv layer is to generate an attention map. Before training the network, the initial value of the attention map is set to 0.5.
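The data flow through the three network parts can be sketched as a short loop. The code below is only a shape-level illustration under stated assumptions: `stage` is a hypothetical stand-in for one network part (it is not the dense blocks, GRU and Conv layer of Fig. 3), the GRU hidden state is omitted, and feeding the initial 0.5 map into the first part is an assumption, since the text only states the initial value.

```python
import numpy as np

def stage(x):
    """Hypothetical stand-in for one network part.

    Maps an H x W x 4 tensor to a 1-channel map in (0, 1) by applying a
    sigmoid to the channel mean. A trained part would use learned layers.
    """
    return 1.0 / (1.0 + np.exp(-x.mean(axis=-1, keepdims=True)))

def attention_maps(image, n_stages=3, init_value=0.5):
    """Run the stages in sequence and return every attention map."""
    att = np.full(image.shape[:2] + (1,), init_value)  # initial map set to 0.5
    maps = []
    for _ in range(n_stages):
        x = np.concatenate([image, att], axis=-1)  # image (3 ch) + map (1 ch)
        att = stage(x)                             # next attention map
        maps.append(att)
    return maps

image = np.random.default_rng(0).random((4, 6, 3))  # toy H x W x 3 input
maps = attention_maps(image)
```

Each pass sees the input image again together with the previous map, which is what allows later maps to refine earlier ones.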
The internal function of the convolutional GRU [45] is realized by (IV.1). One GRU consists of an update gate $z_t$, a reset gate $r_t$, a new hidden state $\tilde{H}_t$ and a hidden state $H_t$. In this instance, $*$ is the convolution operation, $\sigma$ is the sigmoid function, $\tanh$ is the hyperbolic tangent function, and $b$ is the bias value. The first expression is the update gate $z_t$; it executes convolutions of the input $X_t$ and the previous hidden state $H_{t-1}$ in sequence, and then it performs nonlinear processing with a sigmoid function, where $z_t \in (0, 1)$. It can be seen from the fourth expression that the smaller the value of $z_t$ is, the smaller the proportion of $H_{t-1}$ is, and the larger the proportion of $\tilde{H}_t$ is. The second expression is the reset gate $r_t$; its calculation is similar to that of the update gate, where $r_t \in (0, 1)$. It can be seen from the third expression that, in the calculation of $\tilde{H}_t$, $r_t$ determines the proportion of the previous hidden state $H_{t-1}$. The smaller the value of $r_t$ is, the smaller the proportion of $H_{t-1}$ is.
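For reference, one common form of the convolutional GRU that matches the four expressions described above is written out below. The kernel symbols $W$ and $U$ and the bias subscripts are assumed notation. The placement of $z_t$ in the last line follows the statement that a smaller $z_t$ gives less weight to $H_{t-1}$, and applying $r_t$ to $H_{t-1}$ before the convolution is one of two common variants.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z * X_t + U_z * H_{t-1} + b_z\right) \\
r_t &= \sigma\left(W_r * X_t + U_r * H_{t-1} + b_r\right) \\
\tilde{H}_t &= \tanh\left(W_h * X_t + U_h * (r_t \odot H_{t-1}) + b_h\right) \\
H_t &= z_t \odot H_{t-1} + (1 - z_t) \odot \tilde{H}_t
\end{aligned}
```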
The right side of Fig. 3 shows the internal structure of a dense block. Each block consists of two 3×3×8 Conv layers, two 1 × 1 × 8 Conv layers and three Filter concat layers. The dense block can not only reduce the number of network parameters and calculations but also ensure the ability of feature extraction.
The loss function $L_{ADG}(A, C)$ of the AttDenseGRU is shown in Fig. 3 and (IV.2):
$$L_{ADG}(A, C) = \sum_{t=1}^{n} \varphi^{\,n-t} L_{MSE}(A_t, C) \tag{IV.2}$$
In this instance, $A$ is the attention map, $C$ is the ground truth image, $n = 3$ is the number of attention maps, $\varphi = 0.9$ is the base of the coefficient, and $A_t$ ($t = 1, 2, 3$) denotes attention maps 1 through 3. It can be seen that the loss function $L_{ADG}(A, C)$ calculates the mean square error (MSE) between each attention map and the ground truth image, multiplies each MSE by $0.9^{3-t}$ and sums the three products. When $t$ changes from 1 to 3, the coefficient $0.9^{3-t}$ takes the values $0.9^2$, $0.9^1$ and 1. This coefficient is the weight of each attention map, which indicates that the earlier attention maps have less influence on the loss function and that the later attention maps have more influence.
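The weighting described above can be checked with a few lines of NumPy. This is a minimal sketch that assumes the attention maps and the target have broadcast-compatible shapes; the helper names `mse` and `attention_loss` are hypothetical.

```python
import numpy as np

def mse(a, b):
    """Mean square error between two arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))

def attention_loss(attention_maps, target, phi=0.9):
    """Sum of MSE terms weighted by phi ** (n - t) for t = 1..n."""
    n = len(attention_maps)
    return sum(phi ** (n - t) * mse(A_t, target)
               for t, A_t in enumerate(attention_maps, start=1))

# Weights for n = 3: the last attention map carries the largest weight.
weights = [0.9 ** (3 - t) for t in (1, 2, 3)]

# Toy check: three all-zero maps against an all-one target give MSE = 1 each,
# so the loss is simply the sum of the three weights.
toy_loss = attention_loss([np.zeros((2, 2))] * 3, np.ones((2, 2)))
```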
B. STACKED AUTOENCODER NETWORK
The input of the Stacked autoencoder network concatenates the input image and attention map 3, and the output is the generated image, as shown in Fig. 4. The DilConv layers can make the receptive field expand exponentially and do not affect the resolution of the image. According to [47], [48], the two DeConv layers can double the width and height of the input feature maps. To make the output feature maps smoother, average pooling with unchanged size is used after each DeConv layer. To ensure that the output image is not distorted, four shortcuts are used in the network. In other words, the outputs of Conv 1 and DeConv 2 are added as the input of Conv 8; the outputs of DilConv 1 and DeConv 1 are added as the input of Conv 7; the outputs of DilConv 2 and DilConv 4 are added as the input of Conv 4; and the outputs of Conv 4 and Conv 6 are added as the input of DeConv 1.
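The claim that dilated convolutions enlarge the receptive field without reducing resolution can be illustrated with a small calculation. The sketch below assumes stride-1 layers with 3 × 3 kernels and dilation rates that double from layer to layer (2, 4, 8, 16); these rates are an assumption for illustration, since the text does not state the rates used in the four DilConv layers.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in pixels, per axis) of stacked stride-1 convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

plain = receptive_field(3, [1, 1, 1, 1])      # four ordinary 3 x 3 layers
dilated = receptive_field(3, [2, 4, 8, 16])   # four dilated 3 x 3 layers
print(plain, dilated)
```

Under these assumptions, four ordinary layers see a 9-pixel span while four dilated layers see a 61-pixel span, with the same number of weights and no downsampling.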
The Stacked autoencoder has two loss functions, which are the multiscale loss function $L_M$ and the perceptual loss function $L_P$, as shown in Fig. 4. In this figure, $L_M$ compares the outputs of the 3 × 3 × 3 Conv layers 6b, 7b and 8b with the ground truth image. Because the sizes of the three Conv layers are different (they are 25%, 50% and 100% of Conv 8, respectively), the loss function is called the multiscale loss function. The expression is as follows:
$$L_M = 0.8\,L_{MSE}(Y_6, C_4) + 0.9\,L_{MSE}(Y_7, C_2) + 1.0\,L_{MSE}(Y_8, C) \tag{IV.3}$$
where $Y_6$ and $Y_7$ represent the outputs of Conv 6b and 7b, respectively, and $Y_8$ is the output of Conv 8b after the $\tanh$ function. Here, $C$ is the ground truth image, and $C_4$ and $C_2$ are copies of $C$ at 1/4 and 1/2 of its size, which keeps them consistent with the sizes of $Y_6$ and $Y_7$. $L_{MSE}$ indicates the mean square error. Here, 0.8, 0.9 and 1.0 are three weight values, which indicate that the later network layers have more influence on the loss function.
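The multiscale comparison can be sketched in NumPy as follows. This is a minimal illustration, assuming that the 1/4 and 1/2 scales refer to each spatial dimension and that the ground truth is reduced by average pooling; neither the resizing method nor the helper names come from the paper.

```python
import numpy as np

def mse(a, b):
    """Mean square error between two arrays of equal shape."""
    return float(np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2))

def downsample(img, factor):
    """Average-pool an H x W x C image by an integer factor (assumed method)."""
    h, w, c = img.shape
    h2, w2 = h // factor, w // factor
    img = img[:h2 * factor, :w2 * factor]
    return img.reshape(h2, factor, w2, factor, c).mean(axis=(1, 3))

def multiscale_loss(Y6, Y7, Y8, C, weights=(0.8, 0.9, 1.0)):
    """Weighted sum of MSE terms at 1/4, 1/2 and full scale."""
    C4, C2 = downsample(C, 4), downsample(C, 2)
    return (weights[0] * mse(Y6, C4)
            + weights[1] * mse(Y7, C2)
            + weights[2] * mse(Y8, C))

C = np.random.default_rng(0).random((8, 8, 3))        # toy ground truth
perfect = multiscale_loss(downsample(C, 4), downsample(C, 2), C, C)
coarse_off = multiscale_loss(downsample(C, 4) + 1.0, downsample(C, 2), C, C)
```

A unit error at the coarsest scale contributes 0.8 to the loss, while the same error at full scale would contribute 1.0, which reflects the larger weight on the final output.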
The perceptual loss function $L_P$, proposed by Johnson et al. [49], measures the difference in the feature maps between the generated image $G(I)$ and the ground truth image $C$. Its expression, (IV.4), follows [49]. The two images $G(I)$ and $C$ are sent to the pretrained VGGNet-16 model [50] for forward propagation, and then the first seven Conv feature maps are extracted. Finally, the MSE between these feature maps of $G(I)$ and $C$ is calculated.
Combining (IV.2) through (IV.4), the total loss function of the generator, $L_G$, is obtained as (IV.5), following [5]. The first item on the right side of (IV.5) is the original loss function of the GAN generator [51], which is multiplied by 0.01 to reduce its weight and enhance the role of the other three loss functions.
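One way to write the perceptual loss and the total generator loss that is consistent with this description is given below. The adversarial term uses the original GAN form of [51]. The symbols $D(\cdot)$ for the discriminator output and $\mathrm{VGG}(\cdot)$ for the extracted feature maps, as well as the argument lists, are assumed notation.

```latex
\begin{aligned}
L_P\bigl(G(I), C\bigr) &= L_{MSE}\bigl(\mathrm{VGG}(G(I)),\ \mathrm{VGG}(C)\bigr) \\
L_G &= 10^{-2}\,\log\bigl(1 - D(G(I))\bigr) + L_{ADG}(A, C) + L_M + L_P
\end{aligned}
```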

C. DISCRIMINATOR
The input of the discriminator is the generated image O (O = G(I)) or the ground truth image C. It consists of six Conv layers, one discriminator map, three Filter concat layers, one global average pooling layer and one FC+Sigmoid layer. The network structure of the discriminator is shown in Fig. 5, in which the Conv layers and the Filter concat layers in the dotted box form a dense block to extract the features of the input image. The output of the dense block constitutes a discriminator map. Then, the output of the discriminator map and the output of Filter concat layer 3 are multiplied to highlight the regions of the severe weather features in the feature maps. Subsequently, the sizes of the feature maps are reduced by Conv layers 5 and 6, and then the feature maps are flattened by the global average pooling layer. Finally, the FC+Sigmoid layer determines whether the image is real or fake.
It can be seen from Fig. 5 that the loss function of the discriminator consists of the output of the ground truth image or the generated image through the discriminator, the discriminator map and attention map 3. Its expression, $L_D$, is given by (IV.6). The first two items on the right side of (IV.6) are the original loss function of the GAN discriminator [51]. The third item is called the discriminator map loss function $L_{Dmap}(O, C, A_3)$, which is related to the generated image $O$, the ground truth image $C$ and attention map 3. The expression of this loss function is given by (IV.7), where $D_{map}(O)$ or $D_{map}(C)$ represents the discriminator map produced by the generated image $O$ or the ground truth image $C$ going through the dense block. The first item on the right side of (IV.7) represents the MSE of $D_{map}(C)$ and $D_{map}(O)$, and the second item represents the MSE of $D_{map}(O)$ and attention map 3. The smaller the MSE values of these two items are, the smaller the differences among the generated image, the ground truth image and attention map 3 are.
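The structure of the two discriminator losses can be sketched as follows. This is a minimal illustration under stated assumptions: the two MSE items are summed with equal weight, and the weight `gamma` on the map loss is a hypothetical parameter whose value is not given in the text. The function names are also hypothetical.

```python
import numpy as np

def mse(a, b):
    """Mean square error between two arrays of equal shape."""
    return float(np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2))

def discriminator_map_loss(dmap_O, dmap_C, A3):
    """MSE(D_map(C), D_map(O)) + MSE(D_map(O), A3), equal weights assumed."""
    return mse(dmap_C, dmap_O) + mse(dmap_O, A3)

def discriminator_loss(d_real, d_fake, map_loss, gamma):
    """Original GAN discriminator terms plus a weighted map loss.

    d_real = D(C) and d_fake = D(O) are probabilities in (0, 1).
    gamma is a hypothetical weight; the text does not state its value.
    """
    return -np.log(d_real) - np.log(1.0 - d_fake) + gamma * map_loss

dmap = np.full((4, 4), 0.3)
zero_map_loss = discriminator_map_loss(dmap, dmap, dmap)   # identical maps
undecided = discriminator_loss(0.5, 0.5, zero_map_loss, gamma=1.0)
```

When the discriminator outputs 0.5 for both images, the adversarial part equals $2\ln 2$, the value at which it cannot tell real from generated images.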

V. DATASET
The severe weather images mainly include degraded images caused by fog, rain streaks, raindrops or snowflakes, and thus, we must collect the corresponding synthetic image dataset. After collection and arrangement, the foggy images come from the realistic single image dehazing (RESIDE) benchmark dataset established by Li et al. [52]. The rain streak images come from the dataset of Fu et al. [18]; the raindrop images come from the dataset of Qian et al. [5]; and the snowflake images come from the Snow100K dataset of Liu et al. [14].
We select road traffic scene images from the datasets for the training, validation and testing of the network. Our severe weather image dataset is shown in Table 1. Note that a pair of images consists of one severe weather image and its corresponding ground truth image. First, we select 700 pairs of outdoor foggy images from the RESIDE outdoor training set and 300 pairs of indoor foggy images from the RESIDE indoor training set. Second, we select 1000 pairs of rain streak images from the synthetic rain streak image dataset and 1000 pairs of raindrop images from the raindrop image dataset. Third, we select 422, 289 and 289 (1000 in total) pairs of snowflake images from the large, medium and small snowflake subsets of the Snow100K testing dataset. Finally, we build our dataset from these 4000 pairs of images.
To speed up the training and validation of the network, the images are converted into TFRecord files (including images and labels), which is the binary data format of TensorFlow. The image sequence is shuffled by the program. Subsequently, 3400 pairs of images are randomly selected for training, 200 pairs for validation and 400 pairs for testing. All original images are first converted to the JPG format at a size of 720 × 480. When the TFRecord files are generated, the images are resized to 448 × 308. During training and validation, the images are randomly cropped to 420 × 280. Finally, the image size can be set independently during testing and is 420 × 316 by default. Five pairs of samples in our severe weather image dataset are shown in Fig. 6.
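The split and the random crop described above can be sketched in plain Python. This is only an illustration: the function names, the fixed seed and the list-of-rows image representation are assumptions made for the example, and the actual pipeline stores the data in TFRecord files.

```python
import random


def split_pairs(pairs, n_train=3400, n_val=200, n_test=400, seed=0):
    """Shuffle image pairs and split them into train/validation/test subsets."""
    if n_train + n_val + n_test != len(pairs):
        raise ValueError("split sizes must sum to the number of pairs")
    rng = random.Random(seed)  # fixed seed (assumed) so the split is repeatable
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def random_crop_offsets(width=448, height=308, crop_w=420, crop_h=280, rng=random):
    """Pick the top-left corner of a random crop inside a width x height image."""
    x0 = rng.randint(0, width - crop_w)    # randint is inclusive on both ends
    y0 = rng.randint(0, height - crop_h)
    return x0, y0


def crop(image_rows, x0, y0, crop_w=420, crop_h=280):
    """Crop an image stored as a list of rows, each row a list of pixels."""
    return [row[x0:x0 + crop_w] for row in image_rows[y0:y0 + crop_h]]
```

The same offsets must be applied to the degraded image and to its ground truth image so that the pair stays aligned after cropping.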

VI. EXPERIMENTS
After the severe weather image dataset has been collected, we must perform experimental research on RASWNet. The experiments consist of six parts. Section A is the study of the Stacked autoencoder to optimize its network structure. Section B is the ablation study of RASWNet to verify whether our proposed network model is effective. Section C shows how RASWNet compares with the other algorithms regarding the clarity enhancement results of synthetic images. Section D shows how RASWNet compares with other algorithms regarding the clarity enhancement results of real images. Section E shows the running time of RASWNet compared with other algorithms. Section F shows the improvement in object detection results after using image clarity enhancement algorithms.
The settings of the training and validation hyperparameters are as follows. The initial learning rate is set to 0.0002. Because the network model has two parts, the discriminator and the generator, each part uses its own optimizer: Adam for the discriminator and SGD with a momentum of 0.9 for the generator. To achieve the best training effect, the number of iteration steps is set to 200,000. The batch size is set to 1, that is, one pair of images per step. The GPU memory fraction is 81% during training and 75% during validation because Windows 10 takes up more GPU memory. The PSNR and SSIM are used to evaluate the image quality during training and validation.
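For reference, the PSNR used as a quality measure here can be computed as in the minimal sketch below. It assumes 8-bit images (peak value 255) flattened into plain sequences of pixel values; the function name is hypothetical, and SSIM is omitted because it needs windowed statistics.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """PSNR in dB between two equally sized images given as flat pixel sequences."""
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("images must be non-empty and have the same size")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher PSNR means that the generated image is closer to the ground truth image; identical images give an infinite value.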

A. STUDY OF THE STACKED AUTOENCODER
Before the study, we denote the AttDenseGRU as structure A, the Stacked autoencoder as structure B and the discriminator as structure C. Of these, structure B is required to produce the generated image. Therefore, to eliminate interference from the other structures, only structure B is used in this experiment. The study of the Stacked autoencoder covers three aspects: the number of DilConv layers, the type of activation function, and the number of feature maps. We choose 5 images from each of the fog, rain streak, raindrop and snowflake test sets (20 images in total). Through these three studies, we verify the impact of each aspect on the clarity enhancement results and obtain the best network model of structure B. The experiments are described as follows.
The first study changes the number of DilConv layers in structure B among 0, 2, 4 and 6. Here, 0 means that all DilConv layers in Fig. 4 are replaced by Conv layers; 2 means that DilConv layers 3 and 4 in Fig. 4 are replaced by two Conv layers; 4 is the network in Fig. 4 unchanged; and 6 means that Conv layers 4 and 5 in Fig. 4 are replaced by two DilConv layers, giving dilation rates of 2, 2, 4, 8, 16 and 32 in successive order. The average PSNR and SSIM values of the output images are shown in Table 2. It can be seen that the PSNR and SSIM values are the highest with four DilConv layers, and the clarity enhancement results are worse when there is no DilConv layer or too many DilConv layers. Consequently, it is reasonable to use four DilConv layers in structure B.
The second study changes the type of activation function in structure B. Except for the Conv 8b layer, which always keeps Tanh, the activation functions in the network are set to ReLU, LeakyReLU or Tanh in turn. The slope of the negative part of LeakyReLU is 0.2. The average PSNR and SSIM values of the output images are shown in Table 3. It can be seen that LeakyReLU and ReLU give basically the same result: the SSIM value of LeakyReLU is slightly higher, and the PSNR value of ReLU is slightly higher. We prefer ReLU because LeakyReLU is more complex to compute. Tanh is the worst among the three. Consequently, it is reasonable to choose the ReLU function in structure B.
The third study changes the number of feature maps in structure B among 1/8, 1/4 and 1/2 of the original number and the original number itself. The average PSNR and SSIM values of the output images are shown in Table 4. It can be seen that the original number of feature maps gives the best result, 1/4 of the original number is second, and the other two settings are worse. Consequently, it is reasonable to use the original number of feature maps in structure B.
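To give a sense of why the number of DilConv layers in the first study matters, the sketch below computes the receptive field of a stack of dilated convolutions. The 3 × 3 kernel size and stride of 1 are assumptions made for this illustration (they are not stated in this section), and the function name is hypothetical.

```python
def receptive_field(dilation_rates, kernel_size=3):
    """Receptive field (pixels along one axis) of stacked stride-1 dilated convolutions."""
    rf = 1
    for rate in dilation_rates:
        # Each layer widens the field by (kernel_size - 1) * dilation rate.
        rf += (kernel_size - 1) * rate
    return rf
```

Under these assumptions, the six rates 2, 2, 4, 8, 16 and 32 give a receptive field of 129 pixels, whereas six ordinary convolutions (rate 1) give only 13 pixels. A larger receptive field is not automatically better, which is consistent with the drop in PSNR and SSIM reported for six DilConv layers.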

B. ABLATION STUDY OF RASWNET
To verify the effectiveness of our proposed RASWNet model, we conduct an ablation study with two parts: the first changes the number of GRU-containing network parts in structure A to verify the impact of the attention map on the clarity enhancement result; the second changes the combination of structures A, B and C to verify the impact of each structure on the clarity enhancement result. The experiments are described as follows.
The first ablation study changes the number of network parts in structure A. The structure of each network part is shown in Fig. 3. Each network part generates an attention map. We train, validate and test four models, whose number of network parts n ranges from 2 to 5 (n = 2, 3, 4, 5). The generated images and the PSNR and SSIM values of each model are shown in Fig. 7. It can be seen that the clarity enhancement result is the best when n = 3.
On this basis, we further study the effect of the attention maps on the clarity enhancement results. We also train a model with only one network part (n = 1) in structure A and then compare the three models with n = 1, 2 and 3. For n = 1, 2 and 3, structure A outputs attention map 1, 2 and 3 to structure B, respectively.
The attention maps, the generated images and the PSNR and SSIM values of each model are shown in Fig. 8. It can be seen that the clarity enhancement result is the best when n = 3 and the worst when n = 1. The attention maps show that, when n = 1, the attention of the model is mainly on the road surface, and the raindrops on the road surface are removed incompletely. When n = 2, the attention of the model is mainly on the trees, and the raindrops on the road surface are removed better than with the n = 1 model. When n = 3, the attention of the model is basically on the raindrops, so the result is the best.
The second ablation study changes the combination of structures A, B and C. This gives four network models: B, A+B, B+C and A+B+C. We train, validate and test all four models. The generated images and the PSNR and SSIM values of each model are shown in Fig. 9. It can be seen that the clarity enhancement result of the B+C structure is slightly better than that of B, the A+B structure is better than both, and the A+B+C structure is the best. Consequently, our proposed network model is effective.

C. SYNTHETIC IMAGES
Because the ground truth images are available for reference, the clarity enhancement results of the synthetic severe weather images can be evaluated objectively. Our proposed RASWNet is compared with the image defogging algorithms DCP [8], BCCR [9], DehazeNet [22] and AOD-Net [23], the rain streak removal algorithms GSM [17], DDN [18] and ID-CGAN [39], and the raindrop removal algorithm ATT-GAN [5].
The first comparison of the defogging results of the synthetic images and the corresponding PSNR and SSIM values are shown in Fig. 10. It can be seen that RASWNet has the best defogging quality in both detail and color. The defogging quality of AOD-Net is average; the removal of the fog is uneven, but the image color has no obvious distortion. The blue sky in the defogged images of BCCR is obviously distorted, the color is too saturated, and the white area is larger than normal. The defogged images of DCP are better than those of BCCR in the color of the blue sky, but the road color is darker. From the average values of the PSNR and SSIM, we can see that RASWNet is the best, BCCR is second in the PSNR, and DCP is second in the SSIM.
To further test the effectiveness of the image defogging function of RASWNet, we also test it on another dataset, the O-HAZE dataset of the NTIRE 2018 Challenge on Image Dehazing [25]. The second comparison of the defogging results of the synthetic images and the corresponding PSNR and SSIM values are shown in Fig. 11. It can be seen that RASWNet is also the best on this dataset. The defogging results of DCP and BCCR are similar to those in Fig. 10. The result of DehazeNet is not completely defogged, and the color is darker.
The comparison of the deraining and desnowing results of the synthetic images and the corresponding PSNR and SSIM values are shown in Fig. 12. RASWNet removes nearly all of the rain streaks. It removes the small and medium raindrops, but some black artifacts remain where large raindrops were removed. It removes most of the small and medium snowflakes, and only some of the larger snowflakes leave residues (see Fig. 12(f)). ID-CGAN removes most of the rain streaks and gives the best result among the three rain streak removal algorithms (GSM, DDN and ID-CGAN), but it removes almost no raindrops or snowflakes (see Fig. 12(e)). ATT-GAN removes raindrops slightly better than RASWNet. However, it cannot remove rain streaks, and its ability to remove snowflakes is worse than that of RASWNet (see Fig. 12(d)). DDN removes most of the rain streaks, but its result is not as good as those of RASWNet and ID-CGAN, and it removes almost no raindrops or snowflakes (see Fig. 12(c)). GSM can only remove some of the finer rain streaks and cannot remove raindrops or snowflakes (see Fig. 12(b)). From the average values of the PSNR and SSIM, it can be seen that RASWNet is the best and ATT-GAN is second.

D. REAL IMAGES
The evaluation of the clarity enhancement results on real images differs from that on synthetic images because no ground truth images are available. We compare RASWNet with DCP [8], DDN [18] and ATT-GAN [5]. The comparison of the clarity enhancement results on real severe weather images is shown in Fig. 13. On the foggy images, RASWNet is the best because it has strong defogging ability and no color distortion; DCP is the second best, although the sky color is distorted and the image is darker; ATT-GAN can remove a thin layer of fog; and DDN has almost no effect. On the rain streak images, DDN is the best, RASWNet is the second best, ATT-GAN is the third best, and DCP cannot remove the rain streaks. On the raindrop images, RASWNet is better than ATT-GAN, while DCP and DDN cannot remove the raindrops. On the snowflake images, RASWNet is the best, ATT-GAN can remove the smaller snowflakes, and DCP and DDN have almost no effect.

E. RUNNING TIME COMPARISON
The average running times of the clarity enhancement algorithms are shown in Table 5. We select 20 images (foggy, rain streak, raindrop and snowflake images, 5 images each) from the test set and run all algorithms on the same machine (Intel Core-i7 7700K CPU, 16 GB of memory and an Nvidia GeForce GTX 1070 GPU). The experimental results show that the running speed of RASWNet is in the middle of the seven algorithms: slower than AOD-Net [23], GSM [17] and ATT-GAN [5] but faster than DCP [8], DDN [18] and BCCR [9]. Note that this result is obtained with GPU acceleration.
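An average running time of this kind can be measured with a small harness such as the sketch below. The timing protocol is not described in this section, so the warm-up calls and the function name are assumptions made for illustration; `process` stands for any algorithm wrapped as a callable that takes one image.

```python
import time


def average_running_time(process, images, warmup=1):
    """Average wall-clock seconds per image for `process`, after optional warm-up calls."""
    if not images:
        raise ValueError("at least one image is required")
    # Warm-up calls (assumed) keep one-time setup costs out of the measurement.
    for image in images[:warmup]:
        process(image)
    start = time.perf_counter()
    for image in images:
        process(image)
    elapsed = time.perf_counter() - start
    return elapsed / len(images)
```

When the algorithm runs on a GPU framework that executes asynchronously, `process` should return only after the output image is actually available; otherwise the measured time would be too short.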

F. EVALUATION ON OBJECT DETECTION
Image defogging, deraining and desnowing algorithms can be used as a preprocessing step to improve the performance of other high-level vision tasks, such as face recognition and object detection [5], [6], [14], [23], [26]. However, the above algorithms can only handle one or two severe weather conditions and cannot be used in all weather conditions. Consequently, to demonstrate the performance improvement obtained after clarity enhancement using RASWNet, we evaluated Faster-RCNN [54] on the VOC 2012 dataset.

FIGURE 11. Second comparison of the defogging results of the synthetic images. We selected three foggy images from the O-HAZE Dataset of the NTIRE 2018 Challenge on Image Dehazing [25]. The red numbers indicate the best result. AVE means the average. (a) Input, (b) DCP [8], (c) BCCR [9], (d) DehazeNet [22], (e) RASWNet.

FIGURE 12. Comparison of the deraining and desnowing results of the synthetic images. We selected three rainy and snowy images of road traffic scenes. From the top to the bottom, the first image is from the synthetic rain streak image dataset [18], the second image is from the raindrop image dataset [5], and the third image is from the Snow100K testing dataset [14]. They are not used for training. The red numbers indicate the best result. AVE means the average. (a) Input, (b) GSM [17], (c) DDN [18], (d) ATT-GAN [5], (e) ID-CGAN [39], (f) RASWNet.

First, we selected 102 images of road traffic scenes from the VOC 2012 test set as ground truth images. Using the Weather function of CorelDRAW, these images were made into an equal number of foggy, rain streak and snowflake images. Subsequently, we used RASWNet to process the degraded images and used the pretrained Faster-RCNN model to detect the objects. In this process, the defogging algorithm DCP [8] and the deraining algorithm DDN [18] were also used for comparison with RASWNet. The mean average precision (mAP) and F1-measure values of the object detection results are shown in Table 6. It can be seen that Faster-RCNN achieves only a very low average precision on the degraded images. DDN slightly improved the average precision on the rainy images, DCP improved the average precision on the foggy images, and RASWNet improved the average precision on all images by approximately 47%.
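The F1-measure reported alongside the mAP combines precision and recall into one number. A minimal sketch of the standard definition follows; the function name is hypothetical, and the rule that decides whether a detection counts as a true positive (for example, an overlap threshold) is not stated in this section.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) computed from detection counts."""
    detected = true_positives + false_positives   # all boxes the detector reported
    actual = true_positives + false_negatives     # all annotated objects
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because F1 is the harmonic mean of precision and recall, it is high only when the detector both finds most objects and reports few false detections, which is why it complements the average precision on degraded images.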
Samples of the object detection results after using the clarity enhancement algorithms are shown in Fig. 14. It can be seen that Faster-RCNN has poor detection ability on the degraded images and can only detect one person in the snowy image. After the deraining process by DDN [18], the same detection model can detect two targets in the rainy image and one more person in both the foggy and snowy images. After the defogging process by DCP [8], the same detection model detects the same number of persons in the foggy image as in the ground truth image and detects one more person in the rainy image. After the clarity enhancement process by RASWNet, the detection results of all images are improved. The detection results of the foggy and snowy images are basically the same as those of the ground truth images, and the detection result of the rainy image is better than that of DDN.

VII. CONCLUSION
In this paper, we build a comprehensive severe weather imaging model that can represent the features of fog, rain streaks, raindrops and snowflakes in an image. Subsequently, we propose an algorithm called RASWNet that can remove all of the severe weather features from a degraded image. Based on the GAN, it uses the visual attention mechanism to locate the regions of fog, rain streaks, raindrops and snowflakes; it uses the GRUs to memorize these regions; and it uses dense blocks to extract features from the image. We verify the effectiveness of each structure in the algorithm model by performing a study of the Stacked autoencoder and an ablation study of RASWNet. We also use various synthetic images and real images to test our algorithm, and we compare it with some commonly used defogging, desnowing and deraining algorithms. The experimental results show that RASWNet is not only better than the commonly used algorithms in its clarity enhancement capacity but also useful in any severe weather condition, and it is suitable for ADAS and monitoring systems. However, RASWNet is still relatively slow, and in the future, we will increase its speed while maintaining the clarity enhancement ability of the algorithm.