Cloud Removal Based on SAR-Optical Remote Sensing Data Fusion via a Two-Flow Network

Optical remote sensing imagery plays an important role in observing the Earth's surface today. However, it is not easy to obtain complete multitemporal optical remote sensing images because of cloud cover, so reconstructing cloud-free optical images has become a challenging task in recent years. Inspired by remote sensing fusion methods based on convolutional neural network models, we propose a two-flow network to remove clouds from optical images. In the proposed method, synthetic aperture radar (SAR) images, which are not influenced by cloud cover, are used as auxiliary data to guide optical image reconstruction. In addition, a novel loss function called content loss is introduced to improve image quality, and an ablation experiment on the loss function confirms that the content loss is indeed effective. To be more in line with real situations, the network is trained, tested, and validated on the SEN12MS-CR dataset, a global real cloud-removal dataset. The experimental results show that the proposed method outperforms other state-of-the-art methods on many indicators (RMSE, SSIM, SAM, and PSNR).

by clouds. The existing remote sensing image repair approaches can be roughly divided into three types: 1) spatial-based methods, 2) spectral-based methods, and 3) temporal-based methods [8].
Spatial-based methods assume that different parts of an image share similar geometric structures [9], so the missing part of an image can be completed from the remaining part. This is the most traditional of the three kinds of methods, and typical spatial-based methods include interpolation [10], [11] and propagated diffusion methods [12]. Due to the lack of other reference information, spatial-based methods have difficulty generating clear images. In some cases, the premise does not even hold; for this reason, spatial-based methods can hardly reconstruct regions under thick clouds, because the surface is completely blocked. Spectral-based methods use the information complementarity between different spectral bands to reconstruct the image, which requires the existence of complete bands of multispectral data; however, thick clouds block spectral information at all wavelengths, so this approach does not apply to thick cloud removal. Temporal-based methods use remote sensing images of the same area at different time points to provide supplementary information; this method is also known as mosaicking [13], [14]. When the observation conditions of the selected remote sensing images at different time points are similar, this approach can usually achieve excellent results. For temporal-based methods, selecting an appropriate time interval is the key to reconstructing images. If the time interval is too short, the scene may still be covered by clouds; however, if the time interval is too long, the correlation between multitemporal images may disappear. Therefore, temporal-based methods are not easy to apply because of their strict requirements on data.
In addition to the abovementioned three methods, multisource data fusion has also been used to remove clouds in recent years, especially fusion based on synthetic aperture radar (SAR) and optical data. Compared with the other methods, multisource-based approaches have the following significant advantages.
Methods based on multisource data can more easily obtain a reference image with a shorter time interval to the cloud-contaminated image. If the reference image is a SAR image, which is not affected by weather, the reference can even be provided in real time. A shorter time gap means less chance of surface changes, making reference images more reliable. Therefore, cloud removal approaches based on SAR-optical fusion have a natural superiority over other methods. Although multisource-based methods have some advantages over other methods, there are still challenges in converting between remote sensing data from different sources. Because remote sensing images from different sources are produced by different imaging mechanisms, they cannot be fused directly, so it is important to find a suitable method for converting or fusing SAR and optical images.
Latest remote sensing data fusion methods have been gradually transformed from traditional statistical methods to deep learning. In some recent studies, a two-flow network has been proved to have better performance than a single-flow network for the fusion of heterogeneous data [15], [16]. In this article, we propose a robust two-flow network for cloud removal, and the main contributions of the proposed cloud removal approach are as follows.
1) A novel two-flow network is proposed for cloud removal by fusing optical and SAR remote sensing data. 2) Training a network directly with L1 or L2 loss functions tends to produce blurry images; therefore, a novel loss function called content loss is introduced to mitigate this problem.
3) The proposed network is trained on a large real dataset, so the experimental results are more in line with real situations. The rest of this article is organized as follows. In Section II, we briefly review the development of CNNs and the methods of image reconstruction based on CNNs. Section III describes the two-flow network and loss function in detail. In Section IV, the characteristics of the SEN12MS-CR dataset are introduced. The experimental results and discussion are given in Section V. Finally, Section VI concludes this article.

A. Convolutional Neural Networks for Cloud Removal
Recent years have witnessed the transition from traditional approaches to data-driven deep learning approaches in the field of computer vision (CV), in which the convolutional neural network (CNN) is one of the most widely used architectures and facilitates many CV tasks. Influenced by this trend, cloud removal methods based on CNNs have emerged in large numbers in recent years and achieved impressive results.
Before CNNs were widely used in image restoration, most studies of cloud removal concerned prior-based methods. Although prior-based methods can achieve good results in some situations, they need professionals to design different cloud removal methods for different imaging conditions or different sensors. Compared with prior-based methods, CNN-based methods can solve the problem well through end-to-end learning. Recent studies have divided clouds into thick clouds and thin clouds, and the CNN-based removal methods also differ for the different cloud types. This is because thick clouds can completely block the ground information in images, so CNN-based methods usually need auxiliary data as additional input to reconstruct high-quality cloud-free images. Recent studies most commonly used multitemporal data as the auxiliary data for thick cloud removal [17], [18], [19], using CNNs to fuse remote sensing images of the same area with different time stamps to obtain cloud-free images. Although these multitemporal-based cloud removal methods can produce very clear cloud-free images, they usually have strict requirements on data. To mitigate this problem, Zhang et al. [20] proposed a novel deep learning framework that can use arbitrary numbers of temporal images to reconstruct images. Besides multitemporal data, multisource data, especially SAR data, can also be used as auxiliary data for thick cloud removal. Studies on SAR-optical fusion-based methods are discussed in detail in Section II-B. Compared with thick cloud removal, thin cloud removal methods usually focus on suppressing the cloud influence instead of replacing the cloudy pixels. Therefore, CNN-based thin cloud removal methods can directly obtain cloud-free images from cloud-covered images without additional auxiliary data [21], [22].
In addition, because different wavelengths of the spectrum are affected differently by clouds, spectral-based methods are also promising for thin cloud removal. Li et al. [23] proposed a spectral-based method to remove thin clouds by using the information complementarity between different spectral bands. Although these methods have shown the great potential of CNNs for cloud removal, the lack of data has been a major challenge in the field. To solve this problem, Tao et al. [24] proposed a self-paced learning method that reduces the network's dependence on data and does not need paired images to train the network.

B. Cloud Removal Based on SAR-Optical Fusion
With the development of space technology, researchers can obtain remote sensing images from different sources more easily than before. Although these images come from different sensors with different imaging mechanisms, the corresponding gray values of heterogeneous optical images are linearly dependent; for this reason, it is easy to fuse heterogeneous optical images. However, the linear dependence between SAR images and optical images becomes very weak due to the special imaging mode and speckle noise of SAR, which makes the fusion of optical and SAR images difficult. Eckardt et al. [25] were the first to combine optical data with multifrequency SAR images; in their method, geographical weighting was used to remove clouds pixel by pixel. Huang et al. [26] used a sparse representation method to remove clouds, with SAR images as auxiliary data. Both of these representative methods require cloud detection and reconstruction of the part of the image covered by clouds. This requires professional researchers to design different algorithms for different cloud cover types, which is undoubtedly a big challenge. However, with the development of deep learning in the field of computer vision, this problem can be solved implicitly by the end-to-end learning of deep neural networks.
In general, a cloud removal algorithm based on SAR-optical fusion can be regarded as a remote sensing image fusion or restoration task. In recent years, inspired by the success of CNNs in the field of natural images, many CNN-based remote sensing image restoration and fusion methods have been proposed. Among these studies, the fully convolutional network (FCN) [27] and the conditional generative adversarial network (cGAN) [28] are the two most widely used CNN architectures. In [21] and [52], a deep residual FCN is used for cloud removal, which proves that residual learning is beneficial to remote sensing image restoration. For pan-sharpening tasks, a special FCN called a two-flow network is introduced in [15] to fuse remote sensing image information from different sources. When an FCN is used for image reconstruction, L1 loss or L2 loss is usually used as the loss function; although these two losses can reconstruct the pixelwise information of images, they lack an overall perception of the image and, therefore, lose part of its texture information. To solve this problem, a cGAN adds a discriminator network to evaluate the truthfulness of generated images, so cGAN-based methods can generate more realistic images than FCN-based methods. A cGAN-based cloud removal method needs to adjust the model so that its inputs are SAR images and cloudy optical images, and its target images are cloud-free optical images [29]. Another more direct way to realize cGAN-based cloud removal is to convert SAR images into optical images to obtain cloud-free optical images. In [30] and [31], a special cGAN called CycleGAN [32] is introduced to translate SAR images into optical images; it can convert two different styles of images into each other and can be trained without paired images.
Although cloud removal methods based on cGANs have great potential, their drawbacks are also obvious: cGANs are difficult to train and lack robustness when fed with bad data (especially input images covered by large clouds) [33].
To sum up, cGAN-based methods are difficult to train and do not have sufficient robustness for cloud removal, while FCN-based methods using L1 loss or L2 loss struggle to generate high-quality images. Therefore, a more robust and high-performance method needs to be explored for cloud removal.

A. Two-Flow Network
We use a two-flow network instead of a single-flow network for cloud removal. The difference between them is shown in Fig. 1. The two-flow network adds two prefusion networks that extract image features from the optical and SAR images before fusing them; it is worth noting that the prefusion networks do not share parameters. The two-flow network can thus extract features from optical and SAR images with different network parameters, and because the optical and SAR data distributions differ, the two-flow network is more effective at image feature extraction. In addition, this special structure enables the network to fuse images with different spatial resolutions during training. The design details of the network are given in Section III-B.
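As a minimal sketch of this idea (not the actual TF-CRNet layers), the two branches below use independent weight matrices as stand-ins for the unshared prefusion networks, and fusion is a simple channel concatenation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-branch weights: the two prefusion branches do NOT
# share parameters, so each modality gets its own matrix.
W_opt = rng.standard_normal((13, 32))  # optical branch: 13 bands -> 32 features
W_sar = rng.standard_normal((2, 32))   # SAR branch: 2 bands -> 32 features

def two_flow_fuse(optical, sar):
    """Extract features per modality, then fuse by channel concatenation."""
    f_opt = np.maximum(optical @ W_opt, 0.0)  # per-pixel linear map + ReLU
    f_sar = np.maximum(sar @ W_sar, 0.0)
    return np.concatenate([f_opt, f_sar], axis=-1)

# Toy input: 16 pixels with 13 optical bands and 2 SAR polarizations.
optical = rng.random((16, 13))
sar = rng.random((16, 2))
fused = two_flow_fuse(optical, sar)
print(fused.shape)  # (16, 64)
```

A single-flow network would instead concatenate the 13-band and 2-band inputs into one 15-band image before any feature extraction, forcing both modalities through the same weights.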

B. TF-CRNet Network
The proposed network, called TF-CRNet, is based on a pan-sharpening model presented by Liu et al. [15]. Similar to the SAR-optical fusion task, pan-sharpening requires fusing gray values and texture features from two different sources to reconstruct high-resolution images. To reconstruct cloud-free optical imagery, TF-CRNet incorporates SAR imagery to guide the reconstruction of the optical regions covered by thick clouds while preserving the parts of the optical image not covered by clouds. The specific design of the network is shown in Fig. 2. The details of TF-CRNet are as follows.
1) Convolution: There are three kinds of convolution layers in the network, and all convolution kernels have a size of 3×3. The first type has a stride of 1, which keeps the feature map size unchanged. To reduce the loss of image details caused by pooling layers, a convolution layer with a stride of 2 is used for downsampling instead of pooling. Different from the former two, transposed convolution is used for upsampling. Transposed convolution is a special convolution operation that can be regarded as the reverse process of ordinary convolution; in recent studies, it has been used to replace traditional linear interpolation in FCNs, and its implementation details are described in [35].
2) Residual learning: Residual learning was first proposed in the field of image classification [36] to solve the problem that the accuracy of a deep network decreases as its depth increases. Recent studies [37] have pointed out that residual learning can effectively alleviate the shattered gradient problem in DNN training, and it has been applied to many CNN architectures to improve performance [21], [38]. In general, the operating principle of a residual block can be described as

x_{l+1} = h(x_l) + F(x_l, W_l)  (1)

where F is the residual function with weights W_l and h is the shortcut mapping. A common form of h(x_l) is the identity h(x_l) = x_l; a residual block example is shown in Fig. 3. This structure is also applied in TF-CRNet: the convolution kernel size of the residual block is set to 3×3, the convolution stride is 1, and the input and output feature map sizes of the residual block are the same.

3) PReLU: The proposed network uses the parametric rectified linear unit (PReLU) [39] to replace ReLU, the activation most frequently used in DNNs. PReLU uses a nonzero slope to activate negative values, which can avoid information loss during CNN inference.
4) Concatenation: Inspired by the "crop and copy" operation of U-Net [40], a long skip connection architecture is introduced in TF-CRNet so that the network can reuse the semantic information of low-level feature maps. TF-CRNet takes as input a SAR image with 2 bands and a cloudy optical image with 13 bands, both of size 256×256. Downsampling convolutions increase the number and reduce the size of the feature maps, while upsampling convolutions reduce the channel number and expand the size of the feature maps. A batch normalization layer is added before every convolution operation to accelerate network convergence. Experiments on the network configuration on real datasets also verify the effectiveness of these structures.
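The residual shortcut used in these blocks can be illustrated with a toy NumPy example; here the per-pixel linear map plus ReLU is only a stand-in for the block's actual 3×3 convolution stack:

```python
import numpy as np

def residual_block(x, weights):
    """Toy residual block: identity shortcut h(x) = x plus residual F(x).
    The per-pixel linear map + ReLU stands in for the 3x3 conv stack."""
    f = np.maximum(x @ weights, 0.0)
    return x + f  # x_{l+1} = h(x_l) + F(x_l, W_l) with identity h

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8, 16))      # H x W x C feature map
w = rng.standard_normal((16, 16)) * 0.1  # channel-mixing weights

y = residual_block(x, w)
print(y.shape)  # (8, 8, 16): input and output sizes match
# With zero weights the residual vanishes and the block is the identity.
print(np.allclose(residual_block(x, np.zeros((16, 16))), x))  # True
```

Because the shortcut passes the input through unchanged, gradients always have a direct path backward, which is what mitigates the shattered gradient problem in deep stacks.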

C. Loss Function
L1 loss or L2 loss is usually used for image reconstruction. Although L2 loss is more robust than L1 loss, it tends to generate smooth images with a blurring effect. For this reason, in recent studies, L1 loss has become a better choice than L2 loss [41], [42]. The L1 loss for cloud-free image reconstruction is defined as

L_1 = (1/N) Σ |TF-CRNet(SAR, optical) − R|  (2)

where TF-CRNet(SAR, optical) is the cloud-free image output by the network, R is the real target image, and N is the total number of pixels.
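A minimal NumPy sketch of this pixelwise L1 loss, with `pred` standing in for the network output TF-CRNet(SAR, optical) and `target` for the real image R:

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute difference over all N pixels."""
    return np.abs(pred - target).mean()

target = np.array([[0.0, 0.25], [0.5, 0.75]])
pred   = np.array([[0.0, 0.5],  [0.5, 1.25]])
print(l1_loss(pred, target))  # 0.1875 = (0 + 0.25 + 0 + 0.5) / 4
```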
As shown in (2), L1 loss focuses on pixelwise changes while ignoring the texture of the whole image. To solve this problem, a loss function called content loss, proposed by Johnson et al. [43], is introduced.
The content loss is calculated by a pretrained VGG network; in previous studies, the VGG network was usually pretrained on the ImageNet dataset. However, the three-channel input of general color images is inconsistent with the 13-channel Sentinel-2 optical images, so it is essential to retrain another VGG network to calculate the content loss. The new VGG network is trained on the EuroSAT dataset, a novel dataset based on Sentinel-2 satellite images covering 13 spectral bands and consisting of 10 classes with a total of 27 000 labeled and georeferenced images [44]. The resulting VGG network has a classification accuracy of 96% on the test dataset. Fig. 4 shows the calculation process of the content loss, which can be defined as

L_content = (1/N) Σ |φ_j(R) − φ_j(F)|  (3)

where R and F denote the target and predicted cloud-free images, respectively, N is the total number of elements in the feature maps, and φ_j is the activation of the jth layer of the network φ. As shown in (3), the content loss can be regarded as the L1 loss of the feature maps generated by VGG. The content loss is beneficial to image reconstruction because it is calculated on feature maps, which contain more abstract image information than pixelwise values alone.
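The computation can be sketched as follows; note that the fixed random linear layer `phi` is only a toy stand-in for the retrained VGG's jth activation, not the real feature extractor:

```python
import numpy as np

rng = np.random.default_rng(2)
W_feat = rng.standard_normal((13, 8))  # frozen weights: toy stand-in for VGG

def phi(img):
    """Stand-in feature extractor: one fixed linear layer + ReLU playing
    the role of the retrained VGG's j-th activation phi_j."""
    return np.maximum(img @ W_feat, 0.0)

def content_loss(pred, target):
    """Mean absolute difference between the two images' feature maps."""
    return np.abs(phi(pred) - phi(target)).mean()

target = rng.random((16, 13))  # 16 pixels x 13 Sentinel-2 bands
pred = target + 0.05 * rng.standard_normal((16, 13))
print(content_loss(target, target))  # 0.0: identical images, identical features
print(content_loss(pred, target) > 0.0)  # True
```

The key property is that the loss is computed on features rather than raw pixels, so two images can differ pixelwise yet incur little content loss if their feature responses agree.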
For cloud removal tasks, L1 loss can reduce the loss of spectral information, while content loss improves the visual perception of images. Spectral information is very important for quantitative remote sensing, and better visual perception is important for object recognition. Therefore, the final loss function should take both application fields into account, which can be expressed as

L = L_1 + λ_1 L_content  (4)

After extensive experimental finetuning, λ_1 = 0.5 and j = 7 turned out to be the optimal configuration.
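Putting the two terms together, a minimal sketch of the combined loss with λ1 = 0.5 (the fixed random layer again stands in for the VGG feature extractor):

```python
import numpy as np

rng = np.random.default_rng(3)
W_feat = rng.standard_normal((13, 8))  # toy stand-in for fixed VGG weights

def l1(a, b):
    return np.abs(a - b).mean()

def content(a, b):
    # Feature-space L1; the linear + ReLU map stands in for phi_j.
    return l1(np.maximum(a @ W_feat, 0.0), np.maximum(b @ W_feat, 0.0))

def total_loss(pred, target, lam1=0.5):
    """Combined loss: pixelwise L1 plus weighted content term (lam1 = 0.5)."""
    return l1(pred, target) + lam1 * content(pred, target)

target = rng.random((16, 13))
pred = target + 0.1
print(total_loss(target, target))  # 0.0: both terms vanish
print(total_loss(pred, target) >= l1(pred, target))  # True: content adds on top
```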

D. Evaluation Indicators
Four evaluation indicators are used to evaluate the similarity between the output image of the model and the ground truth: root-mean-squared error (RMSE), peak signal-to-noise ratio (PSNR), spectral angle mapper (SAM) [45], and structural similarity index measure (SSIM) [46]. The dynamic range of SSIM and PSNR is set to the maximum pixel value of Sentinel-2 images (10 000). The unit of RMSE is top-of-atmosphere reflectance (ρ_TOA), and SAM is measured in degrees.
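Three of these indicators can be sketched directly in NumPy (SSIM is omitted for brevity); the PSNR uses the 10 000 dynamic range stated above:

```python
import numpy as np

def rmse(pred, target):
    return np.sqrt(((pred - target) ** 2).mean())

def psnr(pred, target, data_range=10000.0):
    """PSNR with the dynamic range set to the Sentinel-2 maximum (10 000)."""
    return 20.0 * np.log10(data_range / rmse(pred, target))

def sam_degrees(pred, target):
    """Mean spectral angle between per-pixel spectra, in degrees."""
    dot = (pred * target).sum(axis=-1)
    denom = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1)
    angles = np.arccos(np.clip(dot / denom, -1.0, 1.0))
    return np.degrees(angles).mean()

rng = np.random.default_rng(4)
target = rng.uniform(0.0, 10000.0, size=(32, 13))
print(round(psnr(target, target + 100.0), 6))  # 40.0: RMSE 100 vs range 10000
# Pure scaling keeps every spectral direction unchanged, so SAM stays near 0.
print(sam_degrees(1.01 * target, target) < 1e-4)  # True
```

SAM is insensitive to uniform brightness changes while RMSE and PSNR are not, which is why the four indicators together give a more complete picture of reconstruction quality.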

E. Data Preprocessing and Training Settings
To remove outliers, the gray values of the Sentinel-2 images are clamped to the range [0, 10 000], and for the Sentinel-1 VV and VH polarizations, the clamped ranges are [−25, 0] and [−32.5, 0] dB, respectively. These images are then normalized separately before being fed to the network.
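A sketch of this preprocessing, assuming a simple min-max normalization of the clamped ranges to [0, 1] (the exact normalization used in the paper may differ):

```python
import numpy as np

def preprocess_s2(img):
    """Clamp Sentinel-2 gray values to [0, 10000], then scale to [0, 1]."""
    return np.clip(img, 0.0, 10000.0) / 10000.0

def preprocess_s1(vv, vh):
    """Clamp Sentinel-1 backscatter (dB) per polarization, then scale
    each channel to [0, 1] and stack them as a 2-band image."""
    vv = (np.clip(vv, -25.0, 0.0) + 25.0) / 25.0
    vh = (np.clip(vh, -32.5, 0.0) + 32.5) / 32.5
    return np.stack([vv, vh], axis=-1)

s2 = preprocess_s2(np.array([-5.0, 2500.0, 12000.0]))
print(s2)  # outliers clamp to the range ends; 2500 maps to 0.25
s1 = preprocess_s1(np.array([-30.0, -12.5, 5.0]), np.array([-40.0, -16.25, 1.0]))
print(s1.shape)  # (3, 2)
```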
The network's parameters are initialized by Kaiming initialization [47] and optimized with the Adam optimizer [48]. The initial learning rate is 0.0001 and gradually decays over the epochs, and the batch size is set to 16 to take full advantage of batch normalization. Models are implemented in PyTorch and trained on NVIDIA GeForce RTX 2080Ti GPUs.

IV. DATASET
The proposed model was trained, tested, and validated on the SEN12MS-CR dataset [49], an evolution of the SEN12MS dataset [50] and a publicly available, large dataset for cloud removal. The SEN12MS-CR dataset consists of coregistered image triplets: a Sentinel-1 SAR image and 13-band cloud-free and cloud-covered Sentinel-2 optical images. The dataset samples 169 nonoverlapping ROIs from four seasons, and the coverage area of each ROI is about 52×40 km². On average, each ROI consists of more than 700 patches, and the size of each patch is 256×256. To obtain more patches, adjacent patches have a 50% overlap, and the total number of patches sampled is 157 521. The cloud-covered portion of each optical patch is about 47.93%±36.08%, which means that about half of each optical image is affected by cloud cover, and there is a huge discrepancy in land cover types and geometric structures among the patches, which ensures the diversity of the dataset without causing CNN overfitting. The pixel values of the optical images are in the range [0, 10 000], and the two bands of the Sentinel-1 SAR data come from two polarization channels (VV and VH) acquired in IW mode.
To the best of the authors' knowledge, SEN12MS-CR is the first large real dataset constructed for optical-SAR-fusion-based cloud removal networks. The dataset collects a large number of scenes from different regions and seasons around the world and has very complete spectral information for experimenters to use. Some example images from the dataset are shown in Fig. 5; these examples show three different types of cloud cover, namely thin cloud cover, partial thick cloud cover, and complete thick cloud cover. To avoid the network overfitting the dataset and affecting the evaluation of the model, it is necessary to strictly enforce complete separation between the training, validation, and test datasets. Therefore, these three datasets consist of different images from different ROIs and different seasons, in which 139 ROIs are used for training, 15 for validation, and 15 for testing. The data of the four seasons are evenly distributed across the datasets.

V. EXPERIMENTAL RESULTS AND ANALYSIS
In recent studies, the pix2pix model is one of the most commonly used models, so it is used as a baseline for comparison with TF-CRNet. In addition, the DSen2-CR model [52], the SOTA method on the SEN12MS-CR dataset, is also used as a baseline. The experimental results are shown in Fig. 6. It can be seen that there is an obvious spectral difference between the output images of pix2pix and the ground truth. In contrast, TF-CRNet can reconstruct images influenced by clouds and cloud shadows better than pix2pix. This conclusion can also be verified by comparing their image metrics in Table I: the proposed model is better than the baseline on all four quantitative indexes. By observing the loss function of the pix2pix model, it is not difficult to analyze the reasons for these results. The loss function of pix2pix can be defined as

L_pix2pix = L_cGAN + λ L_1  (5)

The loss function of pix2pix contains two terms, namely the L1 loss and the cGAN loss, in which the cGAN loss is calculated by the discriminator network. For a cloud-removal network, the generated image should be as close to the real image as possible, which is important for remote sensing applications. However, the cGAN loss is usually used to evaluate changes in image style and is not sensitive to changes in image color, which is not conducive to the true restoration of ground information. In addition, during training, mode collapse and failure of convergence rarely happen for the proposed network, unlike pix2pix and other cGANs.
Similar to the proposed method, the DSen2-CR model also uses the L1 loss function and obtains better results than the pix2pix model. Although the DSen2-CR model shows comparable performance to the proposed method, most of its indicators are slightly inferior to those of the proposed model. Moreover, the number of parameters of the DSen2-CR model is much larger than that of the proposed model.

A. Influence of the Content Loss
To improve image quality, the content loss is introduced in the proposed method. In this section, an ablation experiment is designed to verify the effectiveness of the content loss; the results are tabulated in Table II. We compare the effect of using versus not using the content loss on the quality of the generated images with the four indicators, and using the content loss is clearly better than not using it. The only exception is the RMSE metric; the reason is that the optimization goal of the L1 loss function is to reduce the difference between individual pixel values and the true values, and RMSE is likewise a pixelwise metric.
Fig. 7 shows the differences between the models with and without content loss in a real case. The road in Fig. 7(d), generated by the model without content loss, is very blurry. In contrast, the model with content loss can utilize the attention mechanisms of the CNN, which allow it to restore the salient areas of the image in Fig. 7(e).
There is no doubt that the content loss is helpful for restoring the texture of optical images. However, as shown in Fig. 7(f), if the network is trained only with content loss, the color information is completely lost. So, using both L1 loss and content loss and finding the appropriate weights for them is essential to train the cloud removal network.
To explore the influence of the weights of the two loss functions on the results, we designed a control experiment, with results shown in Fig. 8. As λ_1 varies from 10^{-2} to 10^{3}, the model achieves the best result when λ_1 is around 10^{-1}; after more precise finetuning, the value of λ_1 was finally set to 0.5. According to Table II and Fig. 8, it can be concluded that content loss can enrich the texture information of the predicted images, which is important for generating high-quality images.
In addition, as shown in (3), the depth j of the feature maps used by the content loss can also affect the training results. The value of j usually has two options, j = 4 or j = 7, and the results for different values of j are tabulated in Table III. The results show that when j = 7, all quantitative indexes except RMSE are better than those when j = 4. It can be seen from (3) that the content loss is the L1 loss of feature maps, and when j = 0, the content loss is exactly equivalent to the L1 loss. In addition, it is not difficult to see from Fig. 4 that the content loss and L1 loss are closer when j is smaller. Therefore, the results verify that L1 loss is more beneficial to the pixelwise metric RMSE, which is consistent with the results in Table II.

B. Influence of the Cloud Mask Loss
In [29], a cloud mask loss was used to make the cloud removal approach focus more on recovering the part of the image covered by clouds. In [51], an intensity normalization is used so that the cloud-free pixels can serve as a reference to further improve the estimated results. These two methods make the model focus more on restoring the cloud-covered part of the image and obtain better experimental results on synthetic datasets. In this article, we also explore whether this trick works on a real dataset. To this end, a cloud-shadow mask loss is designed for the proposed method so that it pays more attention to the part of the image affected by clouds; it can be defined as

L_mask = (1/N) Σ |M ⊙ (R − F)|  (6)

In (6), M is a mask matrix with the same spatial dimensions as the images, in which cloud and shadow pixels are set to 1 and uncorrupted pixels are set to 0, ⊙ denotes elementwise multiplication, R is the target image, and F is the predicted image. To detect the clouds and cloud shadows of the optical images, the detection method proposed in [52] is used.
Taking the cloud-shadow mask loss into consideration, the total loss can be defined as

L = L_1 + λ_1 L_content + λ_2 L_mask  (7)

In (7), the value of λ_1 is still 0.5 and λ_2 = 0.001. As tabulated in Table IV, although λ_2 is set to a very small value, the cloud-shadow mask loss still has a negative effect on the predicted images on the real dataset.
It can easily be concluded from Table IV that the mask loss does not play a positive role on the real dataset. The reason is that detecting clouds and shadows in a real dataset is much more difficult than in a synthetic one. In a real situation, the cloud removal network needs to remove both clouds and cloud shadows, which are difficult to simulate in a synthetic dataset. So, although the mask loss works well on synthetic datasets, it does not transfer easily to the real dataset. This result also shows that a network trained on the real dataset is more practical.
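One plausible NumPy sketch of such a cloud-shadow mask loss, restricting the L1 penalty to masked pixels (the exact normalization in the paper's formulation may differ):

```python
import numpy as np

def mask_loss(pred, target, mask):
    """L1 penalty restricted to corrupted pixels: mask is 1 over cloud and
    shadow pixels, 0 over clean pixels, averaged over all N pixels."""
    return np.abs(mask * (target - pred)).mean()

rng = np.random.default_rng(5)
target = rng.random((8, 8))
pred = target + 0.2          # constant error everywhere
mask = np.zeros((8, 8))
mask[:4] = 1.0               # top half marked as cloud/shadow
# Only half of the pixels are penalized, so the loss is about 0.2 / 2.
print(round(mask_loss(pred, target, mask), 6))  # 0.1
print(mask_loss(pred, target, np.zeros((8, 8))))  # 0.0: no masked pixels
```

In the experiment, this term is added to the L1 and content terms with the small weight λ2 = 0.001, so any harm it does comes from the unreliability of the mask itself rather than from the weighting.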

C. Two-Flow Network Versus Single-Flow Network
The difference between a two-flow network and a single-flow network is shown in Fig. 1. In contrast to the proposed network, which feeds the images into two different branches separately, the input of a single-flow network is a 15-band image obtained by stacking the SAR and optical images along the channel dimension. To demonstrate the superiority of the proposed method, we designed a controlled experiment comparing the two-flow network with the single-flow network. It is worth noting that, for a fair comparison, the two networks have the same kernel size, layer number, skip connections, residual blocks, and activation function; the hyperparameters and loss function are also the same during training. The quantitative results of the two architectures are compared in Table V. The results show that the evaluation indexes of the two-flow network are all better than those of the single-flow network except for SAM, indicating that the two-flow architecture outperforms the single-flow one.

VI. CONCLUSION
This article proposes a cloud removal network based on the fusion of optical and SAR images and achieves ideal results. The network is trained on a real dataset to ensure that the results are more in line with real situations. To obtain more realistic images, a novel loss function is designed to optimize the network. Comparison on four quantitative indicators shows that the proposed method is superior to the baselines. In addition, ablation experiments were designed to verify the influence of different settings on the experimental results, which also confirmed the feasibility and validity of the proposed method. Finally, we explored the influence of the cloud-shadow loss, and the results show that, on a real dataset, the cloud-shadow loss does not always play a positive role. Because Sentinel-1 and Sentinel-2 data are readily available, the proposed method is helpful for denser temporal observation of the Earth's surface.
Although the proposed method shows superior performance on the dataset, it is still difficult to generate a very clear image.
In the future, we will draw on the design of super-resolution networks to enable the network to restore more image details.