Single Image Raindrop Removal Using a Non-Local Operator and Feature Maps in the Frequency Domain

Taking a photo on a rainy day may result in an image covered with raindrops. Such raindrops significantly degrade both the visual impression of an image and the accuracy of image recognition systems applied to it. Thus, an automatic, high-quality raindrop removal method is desired for outdoor image processing systems as well as for acquiring good-looking images. Several existing methods have been proposed to tackle this problem, but they often fail to keep global consistency and generate unnatural patterns. In this paper, we tackle this problem by introducing a non-local operator, which combines features at distant locations through matrix multiplication and enforces consistency between them. In addition, high-frequency components such as edges are affected more strongly in images with raindrops. Inspired by the fact that high-frequency components can be separated from other components in the frequency domain, we also propose to process feature maps in the frequency domain; these maps are obtained by the fast Fourier transform and processed by several convolution layers. Experimental results show that our method effectively removes raindrops and achieves state-of-the-art performance.


I. INTRODUCTION
Nowadays, outdoor image processing systems are widely used in many devices such as driving support systems and surveillance cameras. On rainy days, cameras used for these systems often capture images with raindrops. Similar situations occur indoors, where water drops adhere to glass due to condensation or splashing. Raindrops severely hinder the visibility and quality of images, so they can negatively impact visual recognition systems. Therefore, there is a growing demand for automatic raindrop removal systems that can effectively recover areas occluded by raindrops and produce more visually pleasing images.
Recently, several deep learning methods [1], [2], [3] have been proposed for raindrop removal, and these methods use convolution layers as the base structure. The convolution layer has a limited receptive field, which can efficiently and effectively capture local patterns. However, the convolution layer is not good at capturing non-local patterns in images. Stacking several layers enlarges the receptive field, but it is still ineffective in capturing long-range dependencies. Therefore, these previous methods cannot capture long-range dependencies sufficiently and produce unnatural artifacts in images.
To solve the above problem, we incorporate a non-local operation in our model, which helps capture long-range dependencies. Fig. 1 shows an ideal example of raindrop removal using long-range dependencies. Features at distant locations can be used for effective and efficient raindrop removal. The non-local operator incorporates features at distant places using matrix multiplication of the input feature map and its transpose. In our proposed model, we use the structure of the Global Context Network [4] (GCNet) as the non-local operator.
Furthermore, an image can be decomposed into several components such as backgrounds, foregrounds, and edges. In images with raindrops, high-frequency components such as edges are usually affected more. Therefore, many raindrop removal methods process feature maps at multiple resolutions: several smaller feature maps are first generated from prior feature maps, and each is then processed by an independent unit.
Each component in an image is better separated in the frequency domain than in the spatial domain. For example, sharp edges and backgrounds are mixed in the spatial domain, but these components can be cleanly separated into high and low frequencies. Thus, we propose to process feature maps in the frequency domain to restore high-frequency details more realistically. Fig. 2 shows an ideal example of inpainting details occluded by raindrops. Fine details are often occluded in images with raindrops and need to be repaired precisely for effective raindrop removal.
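The separation idea above can be illustrated with a small sketch (illustrative only, not part of the proposed network): an image is split into low- and high-frequency components with the FFT and a circular low-pass mask, whose cutoff radius is an arbitrary choice for demonstration.

```python
import numpy as np

def split_frequencies(img, cutoff=8):
    """Split a grayscale image into low- and high-frequency components."""
    f = np.fft.fftshift(np.fft.fft2(img))          # centre the spectrum
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    # circular low-pass mask around the spectrum centre
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= cutoff ** 2
    low = np.fft.ifft2(np.fft.ifftshift(f * mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(f * (~mask))).real
    return low, high

img = np.random.rand(64, 64)
low, high = split_frequencies(img)
# the two components are complementary and sum back to the original image
assert np.allclose(low + high, img)
```

Because the masks are complementary, the components recombine exactly; edges end up almost entirely in the high-frequency part, which is the part raindrops disturb most.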
This study uses Residual FFT Convolution (Res FFT-Conv) block from Deep Residual Fourier Transform [5] (DeepRFT) as the base architecture of a processing unit to utilize feature maps in the frequency domain. Res FFT-Conv block uses the fast Fourier transform (FFT) to generate feature maps in the frequency domain and processes them with several convolution layers.
Additionally, we use two additional loss functions specialized for restoring high-frequency components to ensure more accurate raindrop removal. One uses images in the frequency domain, and the other uses images processed by the Laplacian filter, which can extract edges.
Our contributions are 1) We introduce a non-local operator to solve the problem that previous methods have in keeping consistencies of pixels in distant locations. 2) We introduce feature maps in the frequency domain and loss functions specialized for high-frequency components to remove raindrops and recover the occluded areas more realistically. Experimental results show that the proposed method is effective for more realistic raindrop removal and achieves competitive performance compared to state-of-the-art methods for raindrop removal.

II. RELATED WORKS
A. IMAGE PROCESSING METHODS FOR BAD WEATHER
There are many methods that tackle images occluded by bad weather, such as haze/fog removal [6], [7], [8], [9] and snow removal [10], [11], [12]. Many deraining methods [13], [14], [15] have also been proposed and showed high performance in rain streak removal. However, these methods cannot effectively restore images with raindrops. This failure occurs due to the shape and size difference between raindrops and other obstacles.
Both video-based (e.g., [16], [17]) and single image-based methods have been proposed for raindrop removal. For video-based raindrop removal, Roser et al. [16] propose to detect raindrops using a photometric raindrop model and then repair the occluded areas using neighboring image frames. You et al. [17] propose to detect raindrops based on differences in both motion speed and intensity between pixels with and without raindrops, and improve the removal procedure by using either neighboring pixels or other frames, depending on how severely the area is occluded by raindrops.
Though these methods are successful to some extent in removing raindrops in video images, they require multiple frames. To acquire visually pleasing images only from a single image with raindrops, we adopt a single image-based method. Additionally, strategies for single image-based methods can also be applied to individual video frames.
For raindrop removal from a single image, Eigen et al. [18] is the first method that tackled raindrop removal using a couple of CNN layers. Though this method works well on images with small and sparse raindrops, it is ineffective for images with large and dense raindrops.
Qian et al. [1] propose a GAN [19] based method that is supported by an attention map. The attention map is generated by residual blocks [20] and LSTM units [21], and acts as a mask that indicates how strongly each position is influenced by raindrops. The output clean image is then generated by several convolution layers from the input image and the generated attention map.
Quan et al. [2] proposes a shape-driven attention module to utilize characteristics of raindrops' shape such as roundness and closedness to better restore a clean image. An edge map extracted from the input image is also used to support the shape-driven attention. The edge map is extracted based on the difference magnitude.
He et al. [3] tackle the problem of restoring images containing both mist and raindrops. Their method uses FFA-Net [7], a network proposed for haze removal, as the base structure. To effectively remove raindrops, the method proposes the interpolation-based pyramid attention (IPA) block and adds several IPA blocks to the base structure. The IPA block processes multi-resolution feature maps to capture information more effectively.
All in One [22] is a method that can deal with multiple types of bad weather, including fog, raindrops, and snow. It adopts adversarial learning; its generator consists of multiple encoders, each specialized for one type of bad weather, and a decoder generalized across them. The encoder architecture combines various kinds of operators corresponding to several fundamental feature-search operations.

B. MULTI-RESOLUTION PROCESSING
Many methods for image processing including [1], [2], [3] use multi-scale feature maps and this architecture has shown effectiveness for various tasks such as image classification [32], image super-resolution [33], and motion prediction [34].
He et al. [3] introduces multi-scale processing to each processing block and generates feature maps in each resolution by image interpolation. Qian et al. [1] and Quan et al. [2] introduce multi-scale processing to the whole model. The two methods use U-Net [23] as the base structure of their models.
U-Net [23] is an early method that uses convolution layers for image-to-image processing and has demonstrated better performance in image segmentation tasks than previous methods. Following [24] and [25], U-Net adopts multi-scale processing, with feature maps at each resolution generated by max pooling. There are two processing groups at each resolution: the encoder block and the decoder block. U-Net improves the multi-scale processing procedure by introducing skip connections: each decoder can utilize not only the upscaled feature maps from the lower resolution but also the encoder feature maps at the current resolution. This allows the network to preserve positional information that is lost through pooling and convolution.
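The skip-connection idea can be sketched with a deliberately tiny encoder-decoder (a toy illustration, not the full U-Net architecture): the decoder concatenates the upsampled low-resolution features with the encoder features at the same resolution.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy single-level U-Net-style model showing one skip connection."""
    def __init__(self, c=8):
        super().__init__()
        self.enc = nn.Conv2d(3, c, 3, padding=1)
        self.down = nn.MaxPool2d(2)                        # halve the resolution
        self.bottleneck = nn.Conv2d(c, c, 3, padding=1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec = nn.Conv2d(2 * c, 3, 3, padding=1)       # 2c: skip + upsampled

    def forward(self, x):
        e = torch.relu(self.enc(x))
        b = torch.relu(self.bottleneck(self.down(e)))
        # skip connection: fuse encoder features with upsampled bottleneck
        return self.dec(torch.cat([self.up(b), e], dim=1))

x = torch.randn(1, 3, 32, 32)
assert TinyUNet()(x).shape == (1, 3, 32, 32)
```

The concatenation is what lets positional detail bypass the pooling path.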
Multi-input multi-output U-Net [26] (MIMO-UNet) has been proposed to address some of the problems in U-Net [23]. One of its improvements is the asymmetric feature fusion (AFF) module, which enables a flexible flow of feature information through the network. Another is the multi-output single decoder, which is used for multi-scale losses. We utilize MIMO-UNet [26] as the base structure of our proposed network so that the processing units at each scale can deal with a specific component.

III. PROPOSED METHOD
We design our proposed network with the following concepts. 1) We utilize MIMO-UNet [26] as the base structure so that each structure can deal with a specific component and features at each scale are flexibly processed. 2) We introduce a module which processes feature maps in the frequency domain to more realistically restore high-frequency components such as edges. 3) We add the non-local operator to help capture features in distant locations.

4) We introduce loss functions specialized for recovering high-frequency components to ensure their precise restoration.
Fig. 3 shows the overall structure of our proposed network. As in MIMO-UNet [26], there are two processing groups at the k-th resolution (k ∈ {1, 2, 3}): the encoder block (EB_k) and the decoder block (DB_k). In Fig. 3, input_k denotes the input image, whose size is (H × W), (H/2 × W/2), or (H/4 × W/4) depending on the scale, and output_k denotes the output image of the corresponding decoder block. Refer to [26] for the detailed structure of the AFF and Shallow Convolutional Module (SCM) blocks. In the following, we describe the details of each component.

A. PROCESSING FEATURE MAPS IN THE FREQUENCY DOMAIN
To restore high-frequency details more realistically, we propose to utilize feature maps in the frequency domain so that each component in the feature map can be processed more independently. Concretely, we introduce the Res FFT-Conv block as a basic module in MIMO-UNet. The Res FFT-Conv block, proposed in [5], uses the FFT to convert feature maps into the frequency domain and processes them with 1 × 1 convolution layers. The lower left of Fig. 3 shows the architecture of a Res FFT-Conv block. Z is the input feature map, and Y_res and Y_fft are the feature maps from the residual block [20] and the frequency-domain branch, respectively. In the flow of Y_fft, the frequency-domain feature map F(Z) ∈ C^(H×W/2×C) is first generated from Z ∈ R^(H×W×C) by the 2D real FFT, where H, W, and C denote the height, width, and number of channels of the input feature map, respectively. After converting F(Z) into R^(H×W/2×2C) by concatenating its real and imaginary parts, the processed frequency-domain feature map f ∈ R^(H×W/2×2C) is generated by applying two 1 × 1 convolution layers and a Rectified Linear Unit (ReLU) activation. Finally, the output feature map Y_fft is generated by the 2D real IFFT after converting f back into C^(H×W/2×C) by splitting it into two along the channel dimension.
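A rough sketch of this flow is shown below. The exact layer configuration and the residual fusion are simplified assumptions rather than the published DeepRFT block, and the real FFT produces W/2 + 1 (not exactly W/2) frequency columns.

```python
import torch
import torch.nn as nn

class ResFFTConvBlock(nn.Module):
    """Simplified Res FFT-Conv block: spatial residual branch + frequency branch."""
    def __init__(self, channels):
        super().__init__()
        # spatial (residual) branch: two 3x3 convolutions
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))
        # frequency branch: 1x1 convolutions on concatenated real/imag parts
        self.freq = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, 1), nn.ReLU(),
            nn.Conv2d(2 * channels, 2 * channels, 1))

    def forward(self, z):
        y_res = self.spatial(z)
        fz = torch.fft.rfft2(z, norm="ortho")       # (B, C, H, W//2+1), complex
        f = torch.cat([fz.real, fz.imag], dim=1)    # (B, 2C, H, W//2+1), real
        f = self.freq(f)
        real, imag = torch.chunk(f, 2, dim=1)       # split back into two C-channel maps
        y_fft = torch.fft.irfft2(torch.complex(real, imag),
                                 s=z.shape[-2:], norm="ortho")
        return z + y_res + y_fft                    # residual fusion (assumed)

x = torch.randn(1, 8, 32, 32)
assert ResFFTConvBlock(8)(x).shape == x.shape
```

Since the 1 × 1 convolutions act per frequency bin, the block can reweight individual frequency components globally, something a small spatial kernel cannot do.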
To speed up the convergence, we replace convolution layers in our proposed network by Depthwise Over-parameterized Convolution (DO-Conv [27]) layers except for 1 × 1 convolution layers.

B. NON-LOCAL OPERATION
NLNet [28] introduces the non-local operator inspired by the non-local means filter [29]. The non-local operator calculates each output by a weighted sum over all points in the feature map. The output of a non-local operator at point i in a feature map is expressed as
y_i = (1/C(x)) Σ_j f(x_i, x_j) g(x_j),
where j denotes a point in the feature map, f(x_i, x_j) denotes the similarity between the features at points i and j, g(x_j) denotes a function that represents the features at j, and C(x) is a normalization factor.
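The generic operation can be sketched directly from this formula. Here the dot-product similarity with softmax normalization is used for f and C(x), and g is taken as the identity; these are simplifying assumptions for illustration.

```python
import numpy as np

def non_local(x):
    """Non-local operation y_i = (1/C(x)) * sum_j f(x_i, x_j) g(x_j).

    x: (N, C) feature map flattened over its N spatial positions.
    """
    sim = x @ x.T                                  # f(x_i, x_j): dot-product similarity
    sim = sim - sim.max(axis=1, keepdims=True)     # numerical stability for exp
    weights = np.exp(sim)
    weights /= weights.sum(axis=1, keepdims=True)  # 1/C(x): softmax over j
    return weights @ x                             # weighted sum of g(x_j) = x_j

feat = np.random.rand(16, 8)                       # 16 positions, 8 channels
out = non_local(feat)
assert out.shape == feat.shape
```

Each output row is a convex combination of all positions, which is precisely what lets information travel between arbitrarily distant locations in one step.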
To solve the problem that existing methods cannot capture long-range dependencies and therefore generate unnatural artifacts, we propose to add a module that can capture long-range dependencies and utilize them for processing. Concretely, we insert Global Context (GC) blocks as the non-local operator immediately after the encoder block at the lowest resolution. We select this position to effectively capture relationships between features at distant locations while suppressing the increase in computational complexity. The lower right of Fig. 3 shows the architecture of a Global Context (GC) block.
In previous non-local operations, including NLNet [28], f(x_i, x_j) is calculated by the matrix product of the input feature map and its transpose for each channel. The GC block, proposed in [4], calculates f(x_i, x_j) with only a convolution layer with channel reduction. The calculation of g is also simplified in the GC block: it is taken to be the identity, i.e., the input feature map itself. The simplification is based on the idea of sharing the same attention weights f(x_i, x_j) at all locations, which suppresses the increase in computational complexity.
In addition, inspired by Squeeze-Excitation (SE) Net [30], GC block adopts channel-wise processing to enhance the effectiveness of the non-local operator. Since the GC block has a simple structure, the increase in the computational cost of our proposed model is suppressed.

C. LOSS FUNCTIONS
We design the loss function used in the training phase by adding loss functions specialized for recovering high-frequency components to restore high-frequency details more realistically. Also, we set all losses as multi-scale loss so that processing blocks in the smaller scale are trained effectively.
Concretely, we define the total loss L as the weighted sum of three losses: Multi-Scale Charbonnier (MSC) loss, Multi-Scale Edge (MSED) loss, and Multi-Scale Frequency Reconstruction (MSFR) loss.
The MSC loss is defined as
L_MSC = Σ_{k=1}^{3} (1/t_k) √(‖Ŝ_k − S_k‖² + ε²),
where S_k (k ∈ {1, 2, 3}) denotes the ground-truth image whose size is (H × W), (H/2 × W/2), or (H/4 × W/4) depending on the scale, Ŝ_k denotes the restored output image of the corresponding size, and t_k denotes the number of elements in S_k. The MSC loss is proposed in [14] and works similarly to the MSE loss; the small constant ε ensures that the result is never zero. The MSED loss is defined as
L_MSED = Σ_{k=1}^{3} (1/t_k) ‖Δ(Ŝ_k) − Δ(S_k)‖_1,
where Δ denotes the Laplacian operation. The Laplacian operation extracts edges in the image, so the MSED loss helps recover edges in the input images.
The MSFR loss is defined as
L_MSFR = Σ_{k=1}^{3} (1/t_k) ‖F(Ŝ_k) − F(S_k)‖_1,
where F denotes the 2D real FFT. The FFT decomposes images into their individual frequency components, so the MSFR loss is specialized for recovering each frequency component. The total loss function is
L = L_MSC + λ_1 L_MSED + λ_2 L_MSFR.
We set λ_1, λ_2, and ε to 0.05, 0.01, and 0.001, respectively.
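The three losses can be sketched as follows. The discrete 4-neighbour Laplacian kernel and the per-scale normalization by the number of elements t_k are assumptions consistent with common practice, not an exact transcription of the paper's implementation.

```python
import numpy as np

EPS = 0.001  # epsilon in the Charbonnier term

def laplacian(img):
    """Discrete 4-neighbour Laplacian with zero-padded borders."""
    out = np.zeros_like(img)
    out[1:-1, 1:-1] = (img[:-2, 1:-1] + img[2:, 1:-1] +
                       img[1:-1, :-2] + img[1:-1, 2:] - 4 * img[1:-1, 1:-1])
    return out

def msc_loss(preds, gts):   # Multi-Scale Charbonnier loss
    return sum(np.sqrt(np.sum((p - g) ** 2) + EPS ** 2) / p.size
               for p, g in zip(preds, gts))

def msed_loss(preds, gts):  # Multi-Scale Edge (Laplacian) loss
    return sum(np.abs(laplacian(p) - laplacian(g)).sum() / p.size
               for p, g in zip(preds, gts))

def msfr_loss(preds, gts):  # Multi-Scale Frequency Reconstruction loss
    return sum(np.abs(np.fft.rfft2(p) - np.fft.rfft2(g)).sum() / p.size
               for p, g in zip(preds, gts))

def total_loss(preds, gts, lam1=0.05, lam2=0.01):
    return (msc_loss(preds, gts) + lam1 * msed_loss(preds, gts)
            + lam2 * msfr_loss(preds, gts))

scales = [np.random.rand(s, s) for s in (64, 32, 16)]
assert msfr_loss(scales, scales) == 0       # zero for identical images
assert total_loss(scales, scales) > 0       # Charbonnier term never reaches zero
```

The edge and frequency terms vanish for a perfect reconstruction, while the Charbonnier term stays at ε, matching the note above that the result is never exactly zero.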

IV. EXPERIMENTS
A. EXPERIMENTAL SETUP
We use Qian et al. [1]'s dataset for training and evaluation of each method. The dataset is composed of pairs of images with and without raindrops, created by photographing carefully chosen backgrounds through two panes of glass: one clean and one sprayed with water. The dataset has 861 pairs for training and 58 pairs for testing. During training, we augment the training dataset using random horizontal flips and random crops to (256 × 256). We set the numbers of Res FFT-Conv blocks and GC blocks in each group (N and N_G) to 19 and 5, respectively, balancing performance against computational cost. We set the batch size to 4 and train for 3,000 epochs. We set the initial learning rate to 0.0001 and halve it every 500 epochs.
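The step schedule described above can be written down explicitly (a sketch of the stated schedule, not code from the paper):

```python
def learning_rate(epoch, base_lr=1e-4, step=500, gamma=0.5):
    """Step decay: start at base_lr and multiply by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

assert learning_rate(0) == 1e-4        # initial learning rate
assert learning_rate(500) == 5e-5      # halved after 500 epochs
assert learning_rate(1250) == 2.5e-5   # halved twice by epoch 1250
```

Over the 3,000-epoch run this yields six halvings, ending at base_lr / 64.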

B. RESULTS
For quantitative comparison, we use PSNR and SSIM, which are widely used to evaluate image restoration tasks, including raindrop removal. Table 1 shows the quantitative comparison with several existing raindrop removal methods. As shown in the table, the proposed method is superior to all existing methods in PSNR. It is also superior to all existing methods except He et al. [3] in SSIM; although He et al. [3] is superior in SSIM, the difference is less than 0.10%. Thus, our proposed method achieves comparable or better results in quantitative evaluation compared to existing raindrop removal methods. Fig. 4 shows the qualitative comparison of each raindrop removal method. Images in the 1st and 2nd columns are the ground truth and input image, respectively, and the others are results from each raindrop removal method. Images in the 2nd and 4th rows are enlargements of the red-boxed regions in the 1st and 3rd rows, respectively. As shown in Fig. 4, unnatural patterns (e.g., whole images in the 2nd row, the sky and large wall in the 4th row) are apparent in the results of existing methods ([2] and [3]). In contrast, our proposed method correctly removes raindrops and outputs images whose appearance is smooth and natural. Furthermore, the proposed method recovers details (e.g., the small windows in the 4th row) more realistically than existing methods.
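For reference, PSNR, the main metric in Table 1, can be computed as follows (a standard definition assuming 8-bit images with a peak value of 255, not code from the paper):

```python
import numpy as np

def psnr(pred, gt, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")             # identical images
    return 10 * np.log10(peak ** 2 / mse)

gt = np.zeros((16, 16))
pred = gt + 10                          # constant error of 10 gray levels
# PSNR for a constant error e is 20*log10(peak/e)
assert abs(psnr(pred, gt) - 20 * np.log10(255 / 10)) < 1e-9
```

SSIM, the second metric, additionally compares local luminance, contrast, and structure rather than raw pixel error.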

C. ABLATION STUDIES
We conduct ablation studies to show the effectiveness of our method over the base methods for raindrop removal. The proposed method is compared with two methods, MIMO-UNet+ [26] and DeepRFT+ [5]. MIMO-UNet+ [26] uses residual blocks instead of Res FFT-Conv blocks in our model and does not have GC blocks. DeepRFT+ [5] uses Res FFT-Conv blocks but does not have GC blocks. Neither method was proposed for raindrop removal, so we trained both using the same dataset as in Section IV-B, with the training configurations described in each paper.
We also conduct studies to show the effectiveness of the three proposed components for better raindrop removal. The first is the introduction of GC blocks to capture long-range dependencies. The second is the addition of Res FFT-Conv blocks in the encoder and decoder blocks to process feature maps in the frequency domain. The third is the MSED loss and MSFR loss to better restore high-frequency components. We compared our proposed method with three models: 1) the proposed network with all GC blocks removed (w-o GC), 2) the proposed network with the frequency-domain branch removed (w-o Y_fft), and 3) the proposed network trained without the MSED and MSFR losses. We use the same learning environment for all models.
Table 2 shows the quantitative comparison of the five methods mentioned above. As shown in the table, our proposed network performs better in PSNR and SSIM than the two base methods. This shows that our improvements to the network structure, such as inserting the non-local operation, have a quantitatively positive effect on raindrop removal. Table 2 also shows that our proposed network with all components performs best in PSNR and SSIM compared to the methods with each component removed. This shows that each component has a positive effect on accurately removing raindrops, and that processing feature maps in the frequency domain makes the most significant contribution.
Fig. 5 shows the qualitative comparison of the proposed method with two ablated models (w-o Y_fft and w-o GC). Images in the 1st and 2nd columns are the ground truth and input image, respectively, from Qian et al. [1]'s dataset, and the others are results from each ablated method. Images in the 2nd and 4th rows are enlargements of the red-boxed regions in the 1st and 3rd rows, respectively. As shown in Fig. 5, though there are no significant differences between the three methods, using GC blocks tends to remove raindrops more cleanly and generate more natural images, since they effectively capture features at distant locations. A similar trend is seen when the feature maps are processed in the frequency domain.

D. FAILURES
Our proposed method shows excellent performance in raindrop removal both quantitatively and qualitatively, and handles various types of raindrops, from small and thin to large and dense. However, there are still some failure cases. For example, the model fails when multiple raindrops are connected and form an irregular shape (first row of Fig. 6). It also fails when trying to inpaint fine patterns (second row of Fig. 6).

V. CONCLUSION
In this paper, we proposed a novel method for single image raindrop removal. Our proposed network utilizes feature maps in the frequency domain by introducing the architecture of the Res FFT-Conv block. This component allows our network to recover high-frequency components more realistically. Additionally, our proposed network adds GC blocks to utilize features at distant places. This component allows our network to keep consistency throughout the output image. Experimental results show that these proposed components are effective in removing raindrops and that our proposed network achieves state-of-the-art performance.