Multi-Sensor Data Fusion for Cloud Removal in Global and All-Season Sentinel-2 Imagery

This work has been accepted by IEEE TGRS for publication. The majority of optical observations acquired via spaceborne earth imagery are affected by clouds. While there is a large body of prior work on reconstructing cloud-covered information, previous studies are oftentimes confined to narrowly-defined regions of interest, raising the question of whether an approach can generalize to a diverse set of observations acquired at variable cloud coverage or in different regions and seasons. We target the challenge of generalization by curating a large novel data set for training new cloud removal approaches and evaluating them on two recently proposed performance metrics of image quality and diversity. Our data set is the first publicly available to contain a global sample of co-registered radar and optical observations, cloudy as well as cloud-free. Based on the observation that cloud coverage varies widely between clear skies and absolute coverage, we propose a novel model that can deal with either extreme and evaluate its performance on our proposed data set. Finally, we demonstrate the superiority of training models on real over synthetic data, underlining the need for a carefully curated data set of real observations. To facilitate future research, our data set is made available online.

On average, about 55 % of the earth's land surface is covered by clouds [1], impacting the aim of missions like Copernicus to reliably provide noise-free observations at a high frequency, a prerequisite for applications relying on temporally seamless monitoring of our environment, such as change detection or monitoring [2], [3], [4], [5]. The need for cloud-free earth observations hence gave rise to a rapidly growing number of cloud removal methods [6], [7], [8], [9], [10], [11], [12]. While the aforementioned contributions share the common aim of dehazing and declouding optical imagery, the majority of methods are evaluated on narrowly defined and geo-spatially distinct regions of interest. Not only does this specificity pose challenges for a conclusive comparison of methodology, but cloud removal performance on a particular region of interest may also be a poor indicator of performance on other parts of the globe or in different seasons. Moreover, it would be desirable for a cloud removal method to be equally applicable to all regions on earth, at any season. This generalizability would allow for large-scale earth observation without the need for costly re-designing or re-training for each individual scene that a cloud removal method is meant to be applied to.
This concern is sustained by previous analyses demonstrating that land-cover statistics differ across continents [13] and that cloud coverage is highly variable depending on meteorological seasonality [1]. A major reason why these issues remain open is the current lack of available large-scale data sets for both training and testing of modern cloud removal approaches. In this work, we address this issue by curating and releasing a novel large-scale data set for cloud removal containing over 100,000 samples from over 100 regions of interest distributed over all continents and meteorological seasons of the globe. Specifically, we address the challenge of cloud removal in observations from the Copernicus mission's Sentinel-2 (S2) satellites. While optical imagery is affected by bad weather conditions and lack of daylight, sensors based on synthetic aperture radar (SAR) as mounted on Sentinel-1 (S1) satellites are not [14] and thus provide a valuable source of complementary information. Recent advances in cloud removal combine multi-modal data with deep neural networks recovering the affected areas [6], [7], [15], [12]. However, many networks are trained on synthetic data, or on real data while making strong assumptions on the type and amount of cloud coverage. Moreover, the majority of methods do not explicitly model the amount of cloud coverage and treat each pixel alike, thereby making unneeded changes to cloud-free areas. In this work we address the problem of cloud removal in optical data by means of SAR-optical data fusion. To remedy the current lack of sufficiently-sized and heterogeneous earth observation data for cloud removal, we release a novel large-scale global data set of co-registered optical cloudy, optical cloud-free and SAR observations to train and test de-clouding methods on. Our data set consists of over a hundred thousand samples, allowing the training of large models for cloud removal and capturing a diverse range of observations from all continents and meteorological seasons. In addition, we propose a novel generative architecture that reaches competitive performance, as evidenced by two very recently proposed metrics of generated image quality and diversity. Finally, we show that the synthetic data utilized in previous studies is a poor substitute for real cloud coverage data, underpinning the need for the novel data set proposed in our work.

A. Related Work
The first deep neural architecture to reconstruct cloud-covered images combined near-infrared and red-green-blue (RGB) optical bands by means of a conditional generative adversarial network (GAN) [6], motivated by the infrared bands being impacted by cloud coverage to a lesser extent. Subsequent studies replaced the infrared input with SAR observations [7], [15], since SAR microwaves are not affected by clouds at all [14]. While the early work of [6], [7] provides a proof-of-concept solely on synthetic data of simulated Perlin noise [16], the networks of [15], [8] were the first to demonstrate performance on real-world data, although focusing primarily on the removal of filmy clouds. Comparable to these studies, we investigate the benefits of SAR-optical data fusion for cloud removal. Unlike the prior work, we address de-clouding on a carefully curated data set of real imagery sampled over all continents and meteorological seasons, relying neither on synthetic data nor making any strong assumptions about the type and percentage of cloud coverage. Building on the previous studies, the models of [8], [17] replace the conditional GAN by a cycle-consistent architecture [18], relaxing the preceding models' requirements for pixel-wise corresponding training data pairs. While [8] relies solely on cloudy optical input data at inference time, only SAR observations are utilized in [17]. Similar to these two networks, the model we propose uses a cycle-consistent GAN architecture. We combine cloudy optical with SAR observations and extend the previous models by incorporating a focus on local reconstruction of cloud-covered areas. This is in line with very recent work [19], [12] that proposed an auxiliary loss term to encourage the model to reconstruct information of cloud-covered areas in particular. The network of [12] is noteworthy for two reasons: first, for departing from the previous generative architectures by using a residual network (ResNet) [20] trained in a supervised manner on a globally sampled data set of paired data; second, for adding a term to the local reconstruction loss that explicitly penalizes the model for modifying off-cloud pixels. Comparable to [12], our network explicitly models cloud coverage and minimizes changes to cloud-free areas. Unlike the model of [12], our architecture follows that of a cycle-consistent GAN and has the advantage of not requiring pixel-wise correspondences between cloudy and non-cloudy optical training data, thereby also allowing for training or fine-tuning on data where such a requirement may not be met. Complementary to the SAR-optical data fusion approach to cloud removal, recent contributions proposed integrating information of repeated observations over time [10], [11]. This line of work indicates promising results but trades temporal resolution for obtaining a single cloud-free observation, whereas our approach predicts one cloud-free output per cloudy input image and thus allows for sequence-to-sequence translation. Moreover, current multi-temporal approaches make strong assumptions about the maximum permissible amount of cloud coverage affecting individual images in the input time series, which is required to be no more than 25 or 50 % of cloud coverage for the method of [10] and 10-30 % in the work of [11]. Our curated data set evidences that such strict requirements on the percentage of cloudiness may oftentimes not be met in practice. Consequently, our model makes no assumptions on the maximum amount of tolerable cloud coverage per observation and can gracefully deal with samples ranging from cloud-free to widely obscured skies, thanks to minimizing changes to cloud-free pixels and using SAR observations unaffected by clouds.

II. METHODS
We propose a novel model to recover cloud-occluded information in optical imagery. Our network explicitly processes a continuous-valued mask of cloud coverage, computed on the fly as described in section II-A, to preserve cloud-free pixels while making data-driven adjustments to cloudy areas. The continuous-valued assignment of each pixel in the processed cloud mask can be interpreted as the likelihood of the pixel being cloud-covered, according to the cloud detector algorithm of [21]. Our model explicitly processing cloud coverage information is in contrast to previous generative architectures that are agnostic to cloud coverage [6], [8] and to networks that only utilize binary cloud mask information [12], as opposed to the more fine-grained continuous-valued masks proposed in this work. A cycle-consistent generative architecture, detailed in section II-B, allows for training without the need for co-registered cloudy and non-cloudy observations with strict pixel-wise one-to-one correspondences, in contrast to earlier approaches that required strict pixel-wise alignment [15], [7]. We adapt the architecture to integrate SAR with optical observations and propose a new auxiliary cloud map regression loss that enforces sparse reconstructions to minimize modifications of cloud-free areas, as described in section II-C.

A. Cloud detection & mask computation
To evaluate the cloud coverage statistics of our collected data set and to model cloud coverage explicitly while reconstructing cloud-covered information, we compute cloud probability masks m. The masks m are computed online for each cloudy optical image and contain continuous pixel values within [0, 1], indicating for a given pixel its probability of being cloud-covered. We compute m via the classifier s2cloudless of [21], which demonstrated cloud detection accuracies on par with the multi-temporal classifier MAJA [22] while running on single-shot observations. While s2cloudless originally applies classification to compute a sparsified binary cloud mask, we wish to obtain a continuous-valued cloud map. We therefore take the intermediate continuous-valued representation of the pipeline of [21], apply a high-pass filter to keep only values above 0.5 intensity and finally convolve with a Gaussian kernel of width σ = 2 to get a smoothed cloud map with pixel values in [0, 1]. We note that m may alternatively be computed by a dedicated deep neural network [23], but our solution is lightweight and thus well-suited to support methods running on very large data sets, at almost no additional computational cost in either memory or run time. Exemplary samples of cloud probability masks are presented in appendix A.
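As a concrete illustration, the following is a minimal sketch of this mask computation, assuming the s2cloudless Python package and SciPy are available; the detector settings and the handling of the 0.5 cutoff reflect our reading of the description above rather than the released code.

    import numpy as np
    from scipy.ndimage import gaussian_filter
    from s2cloudless import S2PixelCloudDetector

    def cloud_probability_mask(s2_toa, sigma=2.0, cutoff=0.5):
        """Continuous-valued cloud mask in [0, 1] for one 13-band Sentinel-2 L1C patch.

        s2_toa: float array of shape (H, W, 13), top-of-atmosphere reflectances scaled to [0, 1].
        """
        detector = S2PixelCloudDetector(all_bands=True)            # uses all 13 L1C bands
        # s2cloudless expects a batch dimension: (N, H, W, 13) -> (N, H, W)
        prob = detector.get_cloud_probability_maps(s2_toa[None, ...])[0]
        prob = np.where(prob > cutoff, prob, 0.0)                  # keep only confident cloud pixels
        prob = gaussian_filter(prob, sigma=sigma)                  # smooth into a soft cloud map
        return np.clip(prob, 0.0, 1.0)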

B. Architecture
The model proposed in this work follows the architecture of a cycle-consistent GAN [18], i.e. we use two generative networks G S1→S2 and G S2→S1 that translate images from the source domain of S1 to the target domain of S2 and vice versa. Distribution Ṡ1 (or Ṡ2) denotes the target when the generator performs a within-domain identity mapping, preserving the input image's sensor characteristics. For each domain there exists an associated discriminator network, denoted as D S1 and D S2 respectively, classifying whether a given image is a sample from the domain's true distribution S1 (or S2) or from the synthesized distribution Ŝ1 (or Ŝ2). An overview of our model ensemble is given in Fig. 2. While we keep the network G S2→S1 as in the original work, we apply spectral normalization [24] to both discriminators and make adjustments as follows: G S1→S2 receives an image from domain S1 as input and is additionally conditioned on the corresponding cloudy image from S2 as well as the cloud probability mask m. For our cloud-removal network we keep the encoder-decoder architecture of the generator but add a long-skip connection [20] such that the output is given by

Ŝ2 = tanh( S2 res + tanh⁻¹(S2 cloudy) ),

where S2 res denotes the residual mapping learned by the generator. To demodulate the effects of the output non-linearity on the long-skipped pixels, the inverse hyperbolic tangent is applied to the cloudy input image from S2 before adding the residual mapping. Furthermore, we insert a regression layer taking the residual maps S2 res as input and returning a prediction m̂ of the cloud map m. The purpose of the regressor is to enforce a meaningful relation between the learned S2 res and the conditioning m, making the residual maps sparse. Here, sparseness refers to the residual maps being (close to) zero over non-cloudy areas, as opposed to having widespread small values, which would indicate many unneeded changes made to cloud-free pixels. We enforce sparseness of the residual maps by formulating an L1 loss on the cloud mask regression as defined in section II-A. The loss term effectively acts as a regularizer on changes made to non-cloudy areas, penalizing unnecessary adjustments. The regression layer consists of a [3×3] convolutional kernel mapping the generated three-channel image to a single-channel map and thus adds little to the overall number of learnable parameters. The architecture of generator G S1→S2 is depicted in Fig. 3, and details on its parametrization are provided in Table I. Discriminator D S2 is likewise conditioned on the cloud probability maps m. Importantly, we forward the (unpaired) non-cloudy optical images to the discriminator D S2 , which learns the non-cloudy patch-wise statistics and thus implicitly forces G S1→S2 to synthesize cloud-free images. In sum, our main contribution with respect to architectural changes is two-fold: First, we adjusted the generator predicting cloud-free optical images to learn a residual mapping by introducing a long-skip connection forwarding optical information, removing the previous need to reconstruct (even cloud-free) pixels from scratch. Second, our generator learns to constrain modifications to cloud-covered pixels while keeping clear areas unchanged, which is encouraged by introducing a novel layer regressing the cloud coverage map from the learned residual map.
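To make the output head concrete, the following PyTorch sketch combines the long-skip residual with the inverse hyperbolic tangent of the cloudy input and regresses the cloud map from the residual. The module name, the clamping constant and the sigmoid on the regressed map are assumptions for illustration, not the original implementation.

    import torch
    import torch.nn as nn

    class ResidualCloudHead(nn.Module):
        """Combines the learned residual with the long-skipped cloudy S2 input and
        regresses a cloud-map prediction from the residual (sketch, not the original code)."""

        def __init__(self, eps=1e-4):
            super().__init__()
            self.eps = eps
            # 3x3 conv mapping the 3-channel residual image to a single-channel cloud map
            self.cloud_regressor = nn.Conv2d(3, 1, kernel_size=3, padding=1)

        def forward(self, s2_res, s2_cloudy):
            # s2_res: residual learned by the decoder; s2_cloudy: cloudy optical input in (-1, 1)
            skip = torch.atanh(s2_cloudy.clamp(-1 + self.eps, 1 - self.eps))   # undo the output tanh
            s2_pred = torch.tanh(s2_res + skip)                                # long-skip residual prediction
            m_pred = torch.sigmoid(self.cloud_regressor(s2_res))               # predicted cloud map in [0, 1]
            return s2_pred, m_pred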

C. Losses
We adjust the losses such that regions regressed as cloud-free in map m remain untouched while cloudy areas are recovered given the information from domain S1. The total loss minimized by the generators is

L all = λ adv · L adv + λ cyc · L cyc + λ idt · L idt + λ aux · L aux ,

where λ adv = 5.0, λ cyc = 10.0, λ idt = 1.0 and λ aux = 10.0 are hyper-parameters to linearly combine the individual losses within L all . The loss weightings are set similar to those in [18], with minor adjustments made manually. L adv is the adversarial loss originally proposed in LSGAN [25], implementing a least-squares error function on the classifications of the discriminators D S1 and D S2 . L cyc and L idt are as introduced in [18] but weighted pixel-wise with the cloud map m. The purpose of the cycle-consistency loss L cyc is to regularize the mapping S1 → S2 by requiring S2 → S1 to be able to reconstruct the original input again (likewise for the direction S2 → S1 → S2), constraining the potential mappings between both domains. The idea behind L idt is to motivate the generators to perform an identity mapping and limit unneeded changes in case the provided input is already a sample of the target domain. L aux is the loss associated with the cloud map regression in G S1→S2 , introduced to enforce sparseness of the learned residual feature maps S2 res such that the non-cloudy pixels of S2 experience little to no adjustments. Our modified generator architecture, the usage of probabilistic cloud maps and the adjusted losses are showcased in the context of a cycle-consistent GAN ensemble, but we remark that they may as well be used within alternative models such as conditional GAN [26] or ResNet architectures [20].
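A hedged sketch of how these terms could be combined in code, following the weights and the pixel-wise weighting with m stated above; the exact reductions and argument conventions are assumptions, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    # weights as stated in the text
    lam_adv, lam_cyc, lam_idt, lam_aux = 5.0, 10.0, 1.0, 10.0

    def generator_loss(d_fake_logits, s2_cycled, s2_cloudy, s2_idt, m, m_pred):
        """Sketch of the combined generator objective for the S2 direction.

        d_fake_logits: discriminator scores for generated images (LSGAN target: 1).
        s2_cycled:     S2 -> S1 -> S2 reconstruction, compared to the cloudy input.
        s2_idt:        identity mapping of a (cloudy) S2 input.
        m, m_pred:     cloud probability map of shape (B, 1, H, W) and its prediction.
        """
        l_adv = F.mse_loss(d_fake_logits, torch.ones_like(d_fake_logits))   # least-squares GAN loss
        l_cyc = (m * (s2_cycled - s2_cloudy).abs()).mean()                  # cloud-map-weighted cycle loss
        l_idt = (m * (s2_idt - s2_cloudy).abs()).mean()                     # cloud-map-weighted identity loss
        l_aux = F.l1_loss(m_pred, m)                                        # cloud-map regression (sparseness)
        return lam_adv * l_adv + lam_cyc * l_cyc + lam_idt * l_idt + lam_aux * l_aux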
TABLE I: Architecture of our generator G S1→S2 . The architecture is divided into 4 components as illustrated in Fig. 3; information flow is from left to right across components and top to bottom within components. Symbols: R (ReLU), N (instance normalization), C (convolution), T (transposed convolution). For (transposed) convolutions the parametrization is (kernel height × kernel width, number of filters, stride, padding size). The architecture of generator G S2→S1 is as the 9-ResNet-block generator in [18] and the two discriminators are kept as the PatchGAN discriminators in [18].

A. Data
To conduct our experiments we gather a novel large-scale data set called SEN12MS-CR for cloud removal. For this purpose, we build upon the openly available SEN12MS data set [27] of globally sampled co-registered S1 plus cloud-free S2 patches and complement the data set with co-registered cloudy images close in time to the original observations. SEN12MS-CR consists of 169 non-overlapping regions of interest (ROI) evenly distributed over all continents and meteorological seasons. The ROI have an average size of approximately 52 × 40 km² ground coverage, corresponding to complete-scene images of about 5200 × 4000 px². Each complete-scene image is checked manually to ensure it is free of noise and artifacts. The cloud-free optical images of four exemplary ROI observed in four different meteorological seasons are depicted in Fig. 4 to highlight the heterogeneity of land cover captured by SEN12MS-CR. Each scene in the data set is subsequently translated into the Universal Transverse Mercator coordinate system and then partitioned into patches of size 256 × 256 px² with a spatial overlap of 50 % between neighboring patches, yielding an average of over 700 patches per ROI. Each patch consists of a triplet of orthorectified, geo-referenced cloudy and cloud-free 13-band multi-spectral Sentinel-2 images, as well as the corresponding Sentinel-1 image (see Fig. 1 for examples of SAR, cloud-free and cloudy optical patch triplets). Paired images of the three modalities were acquired within the same meteorological season to limit surface changes. The Sentinel-2 data is from the Level-1C top-of-atmosphere reflectance product. Finally, each patch triplet is automatically controlled for potential imaging artifacts and exclusively artifact-free patches are preserved to constitute the final cleaned-up version of SEN12MS-CR.
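The tiling step can be illustrated with a short sketch; the helper name and array layout are hypothetical, and the actual data set preparation additionally involves reprojection and the artifact screening described above.

    import numpy as np

    def tile_scene(scene, patch=256, overlap=0.5):
        """Split a (C, H, W) scene into patches of size patch x patch with the given fractional overlap."""
        stride = int(patch * (1.0 - overlap))   # 128 px stride for 50 % overlap
        _, h, w = scene.shape
        patches = []
        for top in range(0, h - patch + 1, stride):
            for left in range(0, w - patch + 1, stride):
                patches.append(scene[:, top:top + patch, left:left + patch])
        return np.stack(patches)

    # applied to a full ROI scene; after artifact filtering the data set retains
    # on average over 700 patches per ROI, as stated above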
Evaluating the cloudiness of each patch with the algorithm of [21], as described in section II-A, yields a mean cloud coverage of circa 47.93 ± 36.08 percent, i.e. about half of all optical images' information is affected by clouds and the amount of coverage varies considerably. This amount of coverage is notably close to the approximately 55 percent of global cloud fraction over land that has previously been observed empirically [1]. The distribution of cloud coverage is shown in Fig. 5 and is relatively uniform over the entire domain, with slightly more samples showing (almost) no clouds or being entirely cloud-covered. Note that the computed cloud probability masks are not used to filter any observations or to actively guide the data set creation in any manner; they are solely used post hoc to quantify the distribution of cloudiness. For the sake of comparability across models in our experiments and for further studies, we define a train split and a split of hold-out data which is reserved for the purpose of testing. The train split consists of 114325 patches sampled uniformly across all continents and seasons and is open to be used entirely for training or in parts for training and validating. The test split consists of 7893 geospatially separated images sampled from ten different ROI distributed across all continents and meteorological seasons, capturing a heterogeneous subset of the data.

B. Experiments and Results
A total of three experiments are conducted. First, we train our network and extend it by adding supervised losses such that the model benefits from the paired non-cloudy and cloudy optical observations in our data set at training time. We systematically vary the amount of available supervision to investigate its effect on model performance. Second, we evaluate the model against a set of baselines. Third, we re-train the architectures from the previous experiment on synthetic data of generated cloudy observations and evaluate them on real data in order to quantify to which extent models trained on simulated data are capable of generalizing to real-world scenarios. To our knowledge, none of these experiments has previously been conducted in depth. All experiments were run on a machine with 8 × Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz, 16 GB of DIMM DDR4 Synchronous 2667 MHz RAM and an NVIDIA GeForce RTX 2080, running Ubuntu 18.04. Wall-clock time for the training procedure may vary according to the overall task load but is estimated to be about 7 days for model ours-0 and about 10-12 days for model ours-100.
1) Metrics to quantify the goodness of cloud removal: In order to evaluate model performance quantitatively, we utilize the recently developed metrics of improved precision and recall [28], proposed in the context of generative modeling and improving on previous metrics such as the Inception score or the Fréchet Inception distance [29], [30]. Improved precision and recall are measures of goodness quantifying similarities between two sets of images in a high-dimensional feature embedding space. Precision is a metric of sample quality, assessing the fraction of generated images that are plausible in the context of the target data distribution. In our context, a generated image is plausible if its high-dimensional feature embedding is sufficiently close to the high-dimensional feature embedding of a cloud-free target image. The distance between both embeddings is sufficiently small if no more than a fixed number of neighboring embeddings are closer to the target embedding than the query embedding is. For the formalities behind this metric and a motivation of the chosen parametrization please see Appendix B. Recall measures the diversity in generated images and the extent to which the distribution of target data is covered. Analogous to the metric of precision, a target image is recalled if its high-dimensional feature embedding is sufficiently close to the high-dimensional feature embedding of a generated cloud-free image. Note that this allows interpreting recall as a measure of generated image diversity, as the metric can score high only if the generated samples are spread out in the feature embedding space and provide sufficient coverage of the distribution of target images, capturing the heterogeneity of the target images. To summarize, in the context of our data set of section III-A, precision specifies the closeness of cloud-recovered information to its cloud-free counterpart, whereas recall captures how well the de-clouded images capture the heterogeneity of the test data (e.g. its diversity in land cover, seasonality, etc.).
While we emphasize the benefit of both measures to disentangle image quality and image heterogeneity, we also define the F1 score as

F1(X, Y) = 2 · PR(X, Y) · RC(X, Y) / ( PR(X, Y) + RC(X, Y) ),

where X, Y are sets of images to be compared and PR, RC denote the functions of precision and recall, respectively. In contrast to the first two experiments, the generation of synthetic data in the third experiment guarantees a one-to-one pixel-wise correspondence between cloudy and ground-truth cloud-free images (i.e. perfect co-registration, no atmospheric disturbances other than the simulated noise, no land-cover and daylight changes between both observations), ensuring that pixel-wise metrics are well-defined. Therefore, complementary to the previous measures of goodness, we additionally assess performances on synthetic data in the third experiment by means of mean absolute error (MAE), root mean square error (RMSE), peak signal-to-noise ratio (PSNR), structural similarity (SSIM) [31] and spectral angle mapper (SAM) [32], given by

MAE(x, y) = 1/(C·H·W) · Σ_{c,h,w} | x_{c,h,w} − y_{c,h,w} | ,
RMSE(x, y) = sqrt( 1/(C·H·W) · Σ_{c,h,w} ( x_{c,h,w} − y_{c,h,w} )² ) ,
PSNR(x, y) = −20 · log10( RMSE(x, y) ) ,
SSIM(x, y) = ( 2 µ_x µ_y + ε₁ )( 2 σ_xy + ε₂ ) / ( ( µ_x² + µ_y² + ε₁ )( σ_x² + σ_y² + ε₂ ) ) ,
SAM(x, y) = arccos( ⟨x, y⟩ / ( ‖x‖ · ‖y‖ ) ) ,

where x, y are images to be compared with pixel values x_{c,h,w}, y_{c,h,w} ∈ [0, 1], dimensions C = 3, H = W = 256, means µ_x , µ_y , standard deviations σ_x , σ_y , covariance σ_xy and small numbers ε₁ , ε₂ to stabilize the computation. MAE and RMSE are pixel-level metrics quantifying the mean deviation between target and predicted images in absolute terms and in units of the measure of interest, respectively. PSNR is an image-wise metric measuring how good a reconstruction, in terms of signal-to-noise, a recovered image is with respect to a clear target image. SSIM is a second image-wise metric quantifying structural differences between the target and predicted images. It is designed to capture perceived change in structural information between two given images, as well as differences in luminance and contrast [31]. The SAM metric is an image-wise measure quantifying the spectral angle between two images, measuring their similarity in terms of rotations in the space of spectral bands [32]. Further technical information with respect to the metrics utilized in our experiments to quantify the goodness of predictions is provided in appendix B.
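For reference, a NumPy sketch of these pixel- and image-wise metrics under the stated conventions (values in [0, 1], a single global SSIM window); the stabilizing constants and the choice of degrees for SAM are assumptions.

    import numpy as np

    def pixel_metrics(x, y, eps1=1e-4, eps2=9e-4):
        """x, y: arrays of shape (C, H, W) with values in [0, 1]. Returns a dict of metrics (sketch)."""
        diff = x - y
        mae = np.abs(diff).mean()
        rmse = np.sqrt((diff ** 2).mean())
        psnr = -20.0 * np.log10(rmse)                                 # peak value is 1 for [0, 1] data
        # global (single-window) SSIM statistics
        mx, my = x.mean(), y.mean()
        sx, sy = x.std(), y.std()
        sxy = ((x - mx) * (y - my)).mean()
        ssim = ((2 * mx * my + eps1) * (2 * sxy + eps2)) / ((mx**2 + my**2 + eps1) * (sx**2 + sy**2 + eps2))
        # spectral angle mapper, averaged over all pixel locations, reported in degrees here
        dot = (x * y).sum(axis=0)
        norm = np.linalg.norm(x, axis=0) * np.linalg.norm(y, axis=0) + 1e-12
        sam = np.degrees(np.arccos(np.clip(dot / norm, -1.0, 1.0))).mean()
        return {"MAE": mae, "RMSE": rmse, "PSNR": psnr, "SSIM": ssim, "SAM": sam}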
2) Quantifying the benefits of paired data: First, we train the architecture described in section II without using any pixel-wise correspondences, in a manner conventional for cycle-consistent GAN. For our generative model we consider the VV and VH channels of images from the S1 domain and add a third mean(VV, VH) channel to satisfy the dimension-preservation requirement of cycle-consistent architectures. For images from the S2 domain all multi-spectral information is used when computing cloud probability maps, while the S1-S2 mapping uses exclusively the three RGB channels. All images are value-clipped and rescaled to contain values within [−1, 1], while the cloud probability map values are within [0, 1]. Value-clipping is within the ranges [−25; 0] and [0; 10000] for S1 and S2, respectively. Notably, before training we perform an image-wise shuffling of the optical data of paired cloudy and cloud-free observations to remove the pixel-wise correspondences that would be satisfied if cloudy and cloud-free patches were available as sorted tuples. That is, the optical cloudy and non-cloudy patches presented at one training step may no longer be strictly aligned or could reflect differences in land cover and atmosphere, reflecting practical challenges commonly encountered when gathering data for remote sensing applications. We train our network on a 10,000-image multi-region subset of the training split introduced in section III-A. Network weights w are initialized by sampling from a Gaussian distribution w ∼ N(µ = 0, σ² = 0.02). The optimizer as well as the hyper-parameters for the optimizer and loss weightings are set as in [18]: we use ADAM with an initial learning rate lr = 0.0002, momentum parameters β = (0.5, 0.999) for computing sliding averages of the gradients as well as their squares and a small constant of 10⁻⁸ added to the denominator to ensure numerical stability of the optimizer. Instance normalization [33] is applied to the generators as in the original architecture [18], with adjustments detailed in Fig. 3 and Table I. Spectral normalization [24] is applied to the discriminators as in [34] in order to prevent mode collapse during training [35]. The networks are trained for n_iter = 50 epochs at the initial learning rate lr, then for another n_decay = 25 epochs with a multiplicative learning rate decay given by lr_decay(n_current) = 1.0 − max(0, 1 + n_current − n_iter)/(n_decay + 1), where n_current denotes the current epoch number. The gentle learning rate decay over a long period of epochs serves to ensure a well-behaved optimization process during training [18], [35]. All our generator networks are trained on center-cropped 200 × 200 px² patches but tested on full-sized 256 × 256 px² patches of the hold-out split, as the generator architecture is fully convolutional. As proposed in [36] and implemented in [18] we maintain two pools of the last 50 generated images to update the discriminators with a random sample from the respective image buffers, such that oscillations during training are reduced [18], [35].
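The learning-rate schedule quoted above can be expressed as a simple multiplier function; the optional hook into a PyTorch scheduler is only a usage suggestion.

    def lr_multiplier(n_current, n_iter=50, n_decay=25):
        """Multiplicative factor applied to the initial learning rate at epoch n_current."""
        return 1.0 - max(0, 1 + n_current - n_iter) / (n_decay + 1)

    # e.g. with PyTorch: torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_multiplier)
    # yields a constant rate for the first 50 epochs and a linear decay over the following 25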
Representative qualitative outcomes are depicted in Fig. 1. The results highlight that our model can reconstruct cloud-covered areas while preserving information that is not obscured. A quantitative evaluation of the described model (ours-0) is given in Table II. Second, we re-train the model as described above, but on paired cloudy-cloudless optical observations in order to assess the benefits of paired training data, as provided by our data set. To let the cycle-consistent architecture described in section II benefit from paired training data, we combine the losses defined in section II-C with cost functions defined on paired images: first, a pixel-wise L1 loss penalizing prediction errors between generated and paired target images as in [37]; second, perceptual losses for features and style [38], as evaluated on the features extracted at ReLU layers 11, 20 and 29 of an auxiliary pre-trained VGG16 network [39]. We re-train our network with these losses and systematically vary the percentage of paired cloudy and cloud-free optical data available. The paired patches are equally spaced across the training split at the beginning of the training procedure and patch pairings are fixed across epochs. During training, the presentation of paired and unpaired samples occurs in a random order. Table II shows the different models' performances. The base model trained on unpaired data (ours-0) performs worst, while the model fully trained on paired samples (ours-100) achieves the best performances. In general, the more paired samples are available, the better the model performs. All models beat the lower-bound performances established by the raw data, except on the recall metric. The full models perform better than the ablation models without the cloud-sensitive loss and cloud probability masks. Model ours-100 performs best in terms of precision and F1 score. Note that the results depict a pronounced trade-off between precision and recall, analysed in detail in [28].
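A sketch of such a perceptual feature-and-style loss on the named VGG16 ReLU layers follows, assuming a recent torchvision; the use of L1 distances, the Gram-matrix normalization and the omission of ImageNet input normalization are simplifications rather than the exact formulation of [38].

    import torch
    import torch.nn as nn
    from torchvision.models import vgg16, VGG16_Weights

    class PerceptualLoss(nn.Module):
        """Feature and (Gram-matrix) style loss on VGG16 activations at ReLU layers 11, 20 and 29."""

        def __init__(self, layers=(11, 20, 29)):
            super().__init__()
            self.vgg = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features.eval()
            for p in self.vgg.parameters():
                p.requires_grad_(False)          # the auxiliary network is frozen
            self.layers = set(layers)

        def _features(self, x):
            feats = []
            for idx, layer in enumerate(self.vgg):
                x = layer(x)
                if idx in self.layers:
                    feats.append(x)
            return feats

        def forward(self, pred, target):
            loss = 0.0
            for fp, ft in zip(self._features(pred), self._features(target)):
                loss = loss + nn.functional.l1_loss(fp, ft)                    # feature matching
                gp = torch.einsum('bchw,bdhw->bcd', fp, fp) / fp[0].numel()    # Gram matrix (style)
                gt = torch.einsum('bchw,bdhw->bcd', ft, ft) / ft[0].numel()
                loss = loss + nn.functional.l1_loss(gp, gt)
            return loss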

3) Model ablation experiment:
To put the results of the previous experiment into perspective and to further evaluate the factors benefiting a robust reconstruction of cloud-covered information, we conduct an ablation study. Specifically, we investigate the effectiveness of the novel cloud detection mechanism explained in section II-A and the local cloud-sensitive loss introduced in section II-C. For this purpose, we re-train the model ours-0 as described in section II, but omit the cloud-sensitive terms by fixing the values of all pixels in the cloud probability masks m to 1.0. The effect of this is that the ablated model is no longer encouraged to minimize changes to areas free of cloud coverage, thus potentially resulting in unneeded changes. As additional baselines, we evaluate the goodness of simply using the S1 observations (VV- or VH-polarized) as well as the cloud-covered S2 images as predictions and comparing against their cloud-free counterparts.
Table III reports the de-clouding performance of baseline models and our models (0 and 100 percent paired data from Table II). Our network trained with 100 % paired data performs best in terms of precision and F1 score.
The raw S1 and S2 observations perform relatively poorly, except for the cloudy optical images scoring high on image diversity due to random cloud coverage. While it may be useful to consider the raw data as baselines, it is necessary to keep in mind that modalities like SAR may be at a disadvantage when directly compared against the cloud-free optical target images.

4) Assessing the goodness of synthetic data: To compensate for the lack of any large-scale data set for cloud removal, previous work simulated artificial data [6], [7], [40], [41], [10] of synthetic cloudy optical images. This raises the question of the goodness of the simulated observations, i.e. how good an approximation such simulations are to real data. In this experiment we consider the two architectures ours-0 and ours-100 from Table III and re-train them on synthetic data to subsequently evaluate the re-trained models on the real test data and assess whether performance generalizes to real-world scenarios. Two approaches to generating synthetic data are evaluated: (1) Perlin: We generate cloudy imagery via Perlin noise [16] and alpha-blending as in the preceding studies of [6], [7], [40]. This approach has the limitation of adding Perlin noise to all of the multi-spectral bands evenly, due to the lack of a better physical model of multi-spectral cloud noise. Since cloud detectors trained on real observations are expected to fail in such a case, we substitute the cloud map of section II-A by the synthesized alpha-weighted Perlin noise. (2) Copy: We generate cloudy imagery by taking the ground-truth cloud-free optical observations and combining them via alpha-blending with clouded observations, following the approach of [10]. Different from [10], we benefit from our curated data set and alpha-blend paired cloudy-cloudless observations, whereas the prior study mixed two unrelated images. Moreover, we alpha-blend weighted by the cloud map of section II-A, whereas the original study alpha-blended via sampled Perlin noise.
We believe these modifications better preserve the spectral properties of real observations and keep cloud distribution statistics closer to those of real data, as shown in Fig. 6 and 7. Furthermore, this allows for synthesizing coverage ranging from semi-transparent to fully occluded clouds, which would be less straightforward with unpaired observations. Exemplary observations generated by both simulation approaches as well as empirical observations are presented in Fig. 6.
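The copy-paste blending described above can be sketched in a few lines; variable names are hypothetical and the continuous cloud map of section II-A serves as the per-pixel blending weight.

    import numpy as np

    def alpha_blend_clouds(s2_clear, s2_cloudy, cloud_map):
        """Synthesize a cloudy image by alpha-blending a cloud-free target with its paired
        cloudy observation, weighted per pixel by the continuous cloud map in [0, 1].

        s2_clear, s2_cloudy: arrays of shape (C, H, W); cloud_map: array of shape (H, W).
        """
        alpha = cloud_map[None, ...]                  # broadcast the (H, W) map over the channel axis
        return (1.0 - alpha) * s2_clear + alpha * s2_cloudy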
The outcomes of this experiment are presented in Table IV. For all data simulation approaches, training a network on generated data and subsequently evaluating it on synthetic test data overestimates the performance on the corresponding real test data. This observation holds for both models evaluated in the experiment. The models display a drop in performance when moving from synthetic to real test data. That the drop is considerably smaller in the case of copy-paste data than for Perlin noise data may be due to the copy-pasted data more closely resembling the real data and its underlying statistics of cloud coverage and spectral distributions. In this context it is instructive to investigate spectral distortions by means of SAM, which indicates that models trained and tested on synthetic data are considerably poorer at predicting spectral distributions on Perlin-simulated data as compared to the copy-pasted observations, which are arguably more alike to real data in terms of their spectral properties. The findings in this experiment underline the need for synthetic data to closely capture the properties of real data; yet, even when real and synthetic observations may be hardly distinguishable by eye (as in the examples shown in Fig. 6), there persist important discrepancies unaccounted for that hinder models trained on synthetic samples from performing equally well on real data.

IV. DISCUSSION
The contribution of our work is in providing a large-scale and global data set for cloud removal and in developing a new model for recovering cloud-covered information to highlight the data set's benefits. With about 55 % of the earth's land surface covered by clouds [1], the ability to penetrate cloud coverage is of great interest to the remote sensing community in order to obtain continuous and seamless monitoring of our planet. While the focus of this work is on providing the first globally sampled multi-modal data set for general-purpose cloud removal, future research should also address the benefits of cloud removal approaches for particular applications common in remote sensing. An example application is semantic segmentation, which necessitates clear-view observations for accurate land-cover classification. Another, in the context of having consecutive observations over time, would be change or anomaly detection, where cloud removal methods may be beneficial particularly for the purpose of early-stage detection, which could otherwise be delayed in the presence of clouds. A limitation of our proposed cloud removal model is its restriction to work on a subset of the optical observation's spectral bands. While this constraint results from the choice of network architecture used in our experiments, we are certain that it will be beneficial for future research to consider the full spectral information. To allow for this, our curated global data set is released with all available information for both modalities, including the full spectrum of bands for the optical observations.

V. CONCLUSION
We demonstrated the de-clouding of optical imagery by fusing multi-sensor data, proposed a novel model and released the, to our knowledge, first global data set combining over a hundred thousand paired cloudy, cloud-free as well as co-registered SAR sample triplets. Statistical analysis of our data set shows a relatively uniform distribution of cloud coverage, with clear images occurring about as frequently as widely and densely occluded ones, indicating the need for flexible cloud removal approaches able to handle either case. Our proposed network explicitly models cloud coverage and thus learns to retain cloud-free information while also being able to recover information of areas covered by wide or dense clouds. We evaluated our model on a globally sampled test set and measured the goodness of predictions with recently proposed metrics that capture both prediction quality and coverage of the target distribution. Moreover, we showed that our model benefits from supervised learning on paired training data as provided by our large-scale data set. Finally, we evaluated the goodness of synthetically generated data of cloudy-cloudless image pairs and showed that strong performance on synthetic data may not necessarily translate to equal performance on real data. Importantly, when testing on real data, networks trained on real observations consistently outperform models trained on synthetic observations, indicating the existence of properties of the real observations not modeled sufficiently well by the simulated data. This underlines the need for a set of real observations numerous enough to train large models on, as provided by the data set released in this work. In further studies, we will address the fusion of multi-temporal and multi-sensor data, combining and comparing across both currently segregated approaches. To support future research and make contributions comparable, we share our global data set of paired cloudy, cloud-free as well as co-registered SAR imagery and provide our test data split for benchmarking purposes.

APPENDIX A CLOUD DETECTION
We present exemplary cloudy optical observations and cloud maps in Fig. 7. The cloud masks are as predicted by our cloud detection pipeline detailed in section II-A. The illustrated examples show that our proposed method can reliably detect clouds and provide continuous-valued cloud masks.

APPENDIX B IMPROVED PRECISION AND RECALL
We provide a definition of improved precision and recall in line with the definitions in [28]. For further details the interested reader is referred to the original publication.
Definition: Improved precision and recall [28]. Let X_r ∼ P_r and X_g ∼ P_g denote paired samples drawn from the real and generated distributions of cloud-free images, where P_g is the distribution learned by the generator network whose quality is to be assessed. Each sample is mapped via an auxiliary pre-trained network M into a high-dimensional feature space to obtain latent representations φ_r = M(X_r) and φ_g = M(X_g), such that the two sets of samples are mapped into two feature sets Φ_r and Φ_g . A distribution P ∈ {P_r , P_g} is approximated by computing pairwise distances between feature embeddings of the observed samples Φ ∈ {Φ_r , Φ_g} and, centered at each feature φ ∈ Φ, forming a hypersphere with a radius corresponding to the distance to its k-th nearest neighbor embedding N_k(φ). Hence, whether an embedded sample φ falls on the manifold of Φ or not is given via

f(φ, Φ) = 1 if ‖φ − φ′‖₂ ≤ ‖φ′ − N_k(φ′)‖₂ for at least one φ′ ∈ Φ, and f(φ, Φ) = 0 otherwise.

The fractions of samples that fall on the paired distribution's manifold are then defined in [28] as

precision(Φ_r , Φ_g) = 1/|Φ_g| · Σ_{φ_g ∈ Φ_g} f(φ_g , Φ_r)  and  recall(Φ_r , Φ_g) = 1/|Φ_r| · Σ_{φ_r ∈ Φ_r} f(φ_r , Φ_g).

We set the parameters |Φ| = 7893, corresponding to the size of the test split of SEN12MS-CR, and k = 10, because every sample has up to 50 % overlap with its neighboring samples. This setting removes the paired target itself plus its eight overlapping samples when computing N_k(φ).
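A minimal NumPy sketch of this definition, assuming the feature embeddings have already been extracted by M; the brute-force pairwise distance computation is for illustration only and is quadratic in memory for the full test split.

    import numpy as np

    def knn_radii(feats, k):
        """Distance of each embedding to its k-th nearest neighbor within the same set."""
        d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
        return np.sort(d, axis=1)[:, k]          # column 0 is the zero distance to itself

    def coverage(query, support, k):
        """Fraction of query embeddings falling inside at least one support hypersphere."""
        radii = knn_radii(support, k)
        d = np.linalg.norm(query[:, None, :] - support[None, :, :], axis=-1)
        return float(np.mean((d <= radii[None, :]).any(axis=1)))

    # precision: generated embeddings covered by the real manifold; recall: the reverse, e.g.
    # precision = coverage(phi_g, phi_r, k=10); recall = coverage(phi_r, phi_g, k=10)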

APPENDIX C CLOUD COVERAGE STATISTICS ON TEST SPLIT
In addition to the cloud coverage statistics on the entire data set, as reported in section III-A, Fig. 8 provides the empirically observed distribution of cloud coverage on the data set's test split. Even though the histogram of the test split is less smooth than that of the complete data set, due to the test split being much smaller, both distributions are considerably alike.

APPENDIX D EXEMPLARY PROBLEMATIC CASES
For the sake of completeness we discuss cases that we consider challenging for cloud removal approaches, specifically our method, and present exemplary data and predictions of such cases in Fig. 9. We consider the following challenges: (1) Limitations of our cloud detection algorithm: there exist cloud masks where, even for completely cloud-free images, pixels are assigned a non-zero (albeit rather low) probability of being cloudy.
(3) Correct reconstruction of cloud-covered information. In particular for the case of complete coverage by large and dense clouds, this is a very challenging problem. We observed cases where the information reconstructed by our model did not match the target image's; for instance, urban-like land cover was predicted in place of agricultural areas.

Fig. 1: Exemplary raw data and de-clouded images. Rows: S1 data (in grayscale), S2 data (in RGB), predicted Ŝ2 data, cloud-free (target) S2 data. Columns: three different samples. The outcomes show that our model learns to preserve optical data of cloudless areas while replacing cloudy regions by the translation from the SAR domain.

Fig. 2: An overview of our model ensemble, based on cycle-consistent generative adversarial networks [18]. The model consists of two generative networks G S1→S2 and G S2→S1 that translate images from the source domain of S1 to the target domain of S2 and vice versa. Distribution Ṡ1 (or Ṡ2) denotes the target when the generator performs a within-domain identity mapping, preserving the input image's sensor characteristics. For each domain there exists an associated discriminator network, denoted as D S1 and D S2 respectively, classifying whether a given image is a sample from the domain's true distribution S1 (or S2) or from the synthesized distribution Ŝ1 (or Ŝ2). The network architectures are as in [18], except for the generator G S1→S2 , which is modified as detailed in the main text and in Fig. 3. The losses L adv , L cyc , L idt and L aux are defined in section II-C of the main text.

Fig. 3: Detailed architecture of the generator G S1→S2 of Fig. 2. The generator receives S1, m and the cloudy S2 as input, the latter of which is long-skip forwarded and modified by the learned residual map S2 res . The result is passed via a non-linearity as input to the next network, or treated as output. In parallel, S2 res is used to regress m in order to enforce sparseness of the residual map.

Fig. 4: Cloudless S2 imagery of four exemplary ROI, illustrating the diversity of SEN12MS-CR. The four different scenes are of four different meteorological seasons from the test split of the data set. On average a ROI is split into over 700 patch samples, each observation of size 256 × 256 px².

Fig. 5: Distribution of cloud coverage over the complete SEN12MS-CR data set (see section III-A).

Fig. 8: Statistics of cloud coverage on the test split of SEN12MS-CR. As for the statistics on the complete data set, an average of circa 50 % occlusion is observed.

TABLE II: Effect of the percentage of paired training data on the performance of the cloud removal model. The more paired training data is available, the better the resulting performance.

TABLE III: Cloud-removal performance of baseline methods and our models on the test split of SEN12MS-CR. Rows S1 VV and VH refer to the raw S1 image, channels VV and VH respectively, compared to the gray-scale cloud-free S2 image. S2 cloudy refers to the raw cloudy S2 image, compared to the RGB cloud-free S2 image.

TABLE IV: Cloud-removal performance of models ours-0 and ours-100 from Table III, re-trained on synthetic cloud data (either Perlin-simulated or copy-pasted) and tested on synthetic as well as real data. Both models, when trained on synthetic data, perform much better on synthetic test data than on real test data. Importantly, the test performance of models trained on synthetic and tested on real data is considerably poorer than that of the same architectures trained on real data (reported in Table III).