SEN12MS-CR-TS: A Remote Sensing Data Set for Multi-modal Multi-temporal Cloud Removal

About half of all optical observations collected via spaceborne satellites are affected by haze or clouds. Consequently, cloud coverage limits the remote sensing practitioner's ability to monitor our planet continuously and seamlessly. This work addresses the challenge of optical satellite image reconstruction and cloud removal by proposing a novel multi-modal and multi-temporal data set called SEN12MS-CR-TS. We propose two models highlighting the benefits and use cases of SEN12MS-CR-TS: First, a multi-modal multi-temporal 3D convolutional neural network that predicts a cloud-free image from a sequence of cloudy optical and radar images. Second, a sequence-to-sequence translation model that predicts a cloud-free time series from a cloud-covered time series. Both approaches are evaluated experimentally, with their respective models trained and tested on SEN12MS-CR-TS. The conducted experiments highlight the contribution of our data set to the remote sensing community as well as the benefits of multi-modal and multi-temporal information for reconstructing noisy information. Our data set is available at https://patrickTUM.github.io/cloud_removal


I. INTRODUCTION
The majority of our planet's land surface is covered by haze or clouds [1]. Such atmospheric distortions impede the capability of spaceborne optical satellites to reliably and seamlessly record noise-free data of the Earth's surface. The presence of clouds is detrimental to typical remote sensing applications, for instance land cover classification [2], semantic segmentation [3], [4] and change detection [5], [6].

M. Schmitt is with the Remote Sensing Technology Institute, German Aerospace Center, 82234 Wessling, Germany, and also with the Chair of Earth Observation, Bundeswehr University Munich, 85577 Neubiberg, Germany (e-mail: michael.schmitt@unibw.de).
X. X. Zhu is with Data Science in Earth Observation, Technical University of Munich, 80333 Munich, Germany, and also with the Remote Sensing Technology Institute, German Aerospace Center, 82234 Wessling, Germany (e-mail: xiaoxiang.zhu@dlr.de).

Fig. 1: Columns: Samples at two different time points. Rows: S1 data (in grayscale), cloudy S2 data (in RGB), predicted cloud-free Ŝ2 data, reference cloud-free S2 data of a later point in time. The results highlight that our network is able to integrate multi-modal and multi-temporal information to predict a clear-view sequence of multi-spectral observations, even in the presence of heavy cloud coverage.
The work at hand aims to combine both preceding approaches and thus considers the challenge of cloud removal in optical satellite imagery by integrating information across time and within different modalities. For this purpose, we curate a new data set called SEN12MS-CR-TS, which contains multi-temporal and multi-modal satellite observations. Specifically, SEN12MS-CR-TS consists of 1-year long time series of co-registered radar Sentinel-1 (S1) as well as multi-spectral Sentinel-2 (S2) observations acquired in a paired manner, covering regions of interest (ROI) from all over the world. We highlight the benefits of the proposed data set by training and testing two different models on our data set: First, a multi-modal multi-temporal 3D convolutional neural network that predicts a cloud-free image from a sequence of cloudy optical and radar images. Second, a sequence-to-sequence translation model that predicts a cloud-free time series from a cloud-covered time series. Both approaches are evaluated experimentally, with their respective models trained and tested on SEN12MS-CR-TS. The conducted experiments highlight the contribution of our curated data set to the remote sensing community as well as the benefits of multi-modal and multi-temporal information for reconstructing noisy information.

A. Related Work
As the presence of clouds in optical satellite imagery poses a severe hindrance for remote sensing applications, there has been plenty of preceding research on cloud removal methods [7], [8], [9], [20], [10], [3], [12], [13], [14]. The focus of this overview is on data sets for cloud removal methods. Much of the early work on cloud removal considered simulated cloudy observations [3]. Copying cloudy pixel values from one image to another, clear-view one [3] captures the spectral properties of naturally cloudy observations more faithfully than synthetic noise (e.g. Perlin noise [21]) [7], [8], [20], but neither precisely reproduces the statistics of satellite images containing natural cloud occurrences [14]. Consequently, our data set contains cloud-free as well as naturally cloud-covered optical satellite recordings. The SEN12MS-CR data set [14] provides a globally distributed collection of co-registered mono-temporal Sentinel-1 as well as cloudy and cloud-free Sentinel-2 observations. Our data set is an extension of SEN12MS-CR in the sense that we collect repeated measures per ROI and thereby provide a time series of co-registered S1 and S2 observations, gathered such that matched observations of both modalities are no more than two weeks apart. In comparison to the preceding data set, ours allows integrating information not solely across different sensors, but also across different points in time distributed throughout the year. Similarly, the work of [12] allows for time-series cloud removal by providing a collection of tri-temporal RGB(NIR)-channel optical data and corresponding models. Our contribution extends this work by providing truly multi-modal data recorded by two distinct sensors: synthetic aperture radar Sentinel-1 measurements as well as 13-band multi-spectral Sentinel-2 observations. Furthermore, the length of each time series is increased considerably, from 3 to 30 samples. Finally, [12] exclude observations with more than 30 percent cloud coverage from their data set, which deviates from real conditions. Our approach aims to model the complete spectrum of cloud coverage, including conditions commonly encountered by remote sensing practitioners. In sum, our work and its main contribution, a large-scale multi-modal multi-temporal data set for cloud removal in optical satellite imagery, build on a history of research and improve upon the current state of image reconstruction in remote sensing by providing a novel, carefully curated data set.

II. DATA
This work introduces SEN12MS-CR-TS, a multi-modal and multi-temporal data set for training and evaluating global and all-season cloud removal methods. The data set consists of 53 globally distributed regions of interest, curated as detailed in section II-A. The ROI are over 4,000 × 4,000 px each, covering about 40 × 40 km² of land, such that the total surface area covered by the data set is over 80,000 km². Of all collected regions of interest, 40 are defined as a training split and 13 as a hold-out split to evaluate cloud removal approaches on. For every ROI we collect 30 co-registered and paired S1 and S2 full-scene images, evenly spaced in time throughout the year of 2018. Each acquired image is inspected and quality-controlled manually. The cloud-free Sentinel-2 (RGB-channel) observations of four example ROI illustrating the diversity of our data set are shown in Fig. 6. The spatial distribution of all ROI is depicted in Fig. 2 and highlights the global sampling of our data set. The empirical distribution of the cloud coverage of all optical observations is computed as detailed in section II-C and the statistics are presented in Figs. 4 and 5 for the train and the test splits, respectively. Importantly, the data set is curated without excluding any interval of cloud coverage, such that the collected observations also reflect conditions of high cloud coverage as commonly encountered in practice [1]. The data is made available at https://patrickTUM.github.io/cloud_removal. It is about 2 TB in size and compatible with the SEN12MS-CR data set [14]. That is, no train ROI of SEN12MS-CR are part of our data set's test ROI and vice versa.

A. Data collection
All curated data are recorded via the SAR Sentinel-1 and multi-spectral Sentinel-2 (level 1-C top-of-atmosphere reflectance product) instruments of ESA's Copernicus mission. The recorded observations are acquired via Google Earth Engine [22] and a custom semi-automatic processing pipeline. We randomly sample the geospatial locations of 53 regions of interest from SEN12MS-CR [14]. To minimize mosaicing, observations of cells covered by a single pass are collected. The samples are referenced within the World Geodetic System 1984 (WGS84) coordinate system. For every ROI, 30 time intervals are evenly spaced throughout the year of 2018. For every time interval, a co-registered, geo-referenced and full-scene S1 image as well as a paired full-scene S2 image (level 1-C) is acquired.

Fig. 2: Spatial distribution of the ROI constituting SEN12MS-CR-TS. Areas belonging to the training split are denoted in blue, regions of the testing split are colored in green. The ROI of SEN12MS-CR [14], non-overlapping and compatible with our data set, are depicted in gray. Any graphical overlap of the semi-transparently plotted dots is rendered in darker tones so close-by dots can be discerned more easily.

Fig. 3: Example data, preprocessed as stated in section II-B. Rows: S1 data (in grayscale), S2 data (in RGB), binary cloud masks (as per s2cloudless [19]). Columns: Samples of five different time points. The illustrations show that the observed region is affected by variable atmospheric disturbances and covered by a dynamic extent of clouds, changing over time. The detected cloud coverage at the individual time points is 36, 49, 23, 48 percent, with an average of about 39 percent across all illustrated samples. While some pixels are clear at least at one point in the series and may thus be reconstructed by integrating across time, others are cloud-covered throughout the sequence and require spatial context or cloud-robust sensor information to be reconstructed.

B. Preprocessing
To prepare the collected raw data and translate it into a format that neural networks for cloud removal can handle, the following preprocessing steps are taken: Each band of every observation is upsampled to 10 m resolution (i.e. to the native resolution of Sentinel-2's bands 2, 3, 4 and 8). Every full-scene image is sliced into non-overlapping patches of dimensions 256 × 256 px. The S1 observations are processed via the Sentinel-1 toolbox [23] (including border and thermal noise removal, radiometric calibration, orthorectification) and decibel-transformed. An example patch-wise tuple of paired S1 and S2 data is illustrated in Fig. 3. Input patches to any ResNet model [24] are preprocessed in line with the pipeline of [13] as follows: the VV and VH channels of S1 observations are value-clipped to the ranges [−25; 0] and [−32.5; 0], respectively, and rescaled to the interval [0; 2], while S2 patches are value-clipped to [0; 10000] and normalized to the range [0; 5]. For all other networks with a different backbone architecture, preprocessing is done as follows: each patch is value-clipped and then rescaled such that every pixel takes normalized values within the unit range [0; 1]. The modalities S1 and S2 are value-clipped within the intervals of [−25; 0] and [0; 10000], respectively. This way, we follow the preprocessing protocol of the preceding work and avoid any unnecessary adjustments, for the sake of simplicity. For evaluation, the pixel values of all input patches, target images and predictions are re-mapped to the unit interval [0; 1], where the goodness of predictions is assessed according to the metrics stated in section IV-A.
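The clipping-and-rescaling steps above can be sketched as follows. This is a minimal per-pixel sketch using the value ranges stated in the text; function names are illustrative, not taken from the released code:

```python
def rescale(value, old_min, old_max, new_min, new_max):
    """Clip `value` to [old_min, old_max], then map it linearly to [new_min, new_max]."""
    value = max(old_min, min(old_max, value))
    return (value - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

# S1 (ResNet pipeline): VV clipped to [-25, 0] dB, VH to [-32.5, 0] dB, rescaled to [0, 2]
def preprocess_s1_vv(db):
    return rescale(db, -25.0, 0.0, 0.0, 2.0)

def preprocess_s1_vh(db):
    return rescale(db, -32.5, 0.0, 0.0, 2.0)

# S2 (ResNet pipeline): reflectances clipped to [0, 10000], normalized to [0, 5]
def preprocess_s2(refl):
    return rescale(refl, 0.0, 10000.0, 0.0, 5.0)

# All other backbones: clip as above, then rescale to the unit interval [0, 1]
def preprocess_s2_unit(refl):
    return rescale(refl, 0.0, 10000.0, 0.0, 1.0)

print(preprocess_s1_vv(-12.5))   # mid-range VV value maps to 1.0
print(preprocess_s2(5000.0))     # mid-range reflectance maps to 2.5
```

In practice these transforms are applied band-wise to whole patch arrays; the scalar form above only makes the value mapping explicit.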

C. Cloud detection & mask computation
In order to analyze the statistics of cloud coverage in SEN12MS-CR-TS and to model the spatio-temporal extent of clouds, we compute binary cloud masks m. For each optical image, the masks m are computed on-the-fly via the cloud detector of s2cloudless [19], which provides a binary mask of pixel-wise values in {0, 1} that indicate cloud-free and cloud-covered pixels, respectively. The cloud mask accuracy of s2cloudless is reported to be on par with the multi-temporal classifier MAJA [25], yet the considered detector can be applied on mono-temporal satellite observations. Note that, as an alternative to s2cloudless, the masks m may be computed via a dedicated neural network for cloud detection [26], [27]. However, s2cloudless has proven to be lightweight and to provide sufficient performance at little extra computational cost in run time or memory, making it an appealing cloud detector to apply on a large-scale data set such as SEN12MS-CR-TS. Example cloud detections are illustrated in Fig. 3.
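Given binary masks like these, the per-image cloud coverage underlying the statistics in Figs. 4 and 5 reduces to the fraction of cloudy pixels. A minimal sketch, assuming the mask is laid out as a nested list of {0, 1} values with 1 denoting a cloudy pixel:

```python
def cloud_coverage(mask):
    """Fraction of cloud-covered pixels in a binary mask (1 = cloudy, 0 = clear)."""
    flat = [px for row in mask for px in row]
    return sum(flat) / len(flat)

# A 2x4 toy mask with 3 cloudy pixels -> 37.5 % coverage
mask = [[0, 1, 1, 0],
        [0, 0, 1, 0]]
print(cloud_coverage(mask))  # 0.375
```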

III. METHODS
We consider two distinctly different methods to highlight the benefits of our curated data set and the diverse tasks it allows to approach. The first method is a neural network reconstructing cloud-covered pixels in time series of multi-modal data to predict a single target image acquired at a cloud-free time point. The second approach introduces a neural network that performs sequence-to-sequence cloud removal, that is, it predicts a time series of cloud-free observations of the same length as the cloudy input sequence.

A. Multi-temporal multi-modal cloud removal
For multi-temporal multi-modal cloud removal we consider a deep neural network that builds on the generator of [12]. Our model receives a sequence of t = 1, ..., n input tuples (S1, S2)_t and predicts a cloud-removed multi-spectral image Ŝ2. The architecture of the proposed model uses a ResNet [24] backbone, with Siamese residual branches processing the individual time points until their information gets integrated. That is, we replace the pairwise concatenation of 2D feature maps in [12] by stacking features in the temporal domain, followed by 3D convolutions. Moreover, as the first part of the generator of [12] is effectively a single-time-point cloud removal sub-network (each time point is processed individually up to this point), we substitute this component by the established ResNet-based [24] cloud removal network of [13]. Subsequently, the feature maps are stacked in the temporal dimension and 3D convolutions are applied to integrate information across time. The output of the network is a single cloud-free image prediction Ŝ2. A schematic overview of the described architecture is shown in Fig. 7.

B. Internal Learning for sequence-to-sequence cloud removal
The sequence-to-sequence cloud removal method [28] follows the 3D encoder-decoder architecture of [29], consisting of an encoder as well as a decoder component. Both components are arranged symmetrically in the style of U-Net [30] and linked via skip connections between paired layers. The input to the network is a sequence of multi-temporal S1 samples and its output is a sequence of multi-temporal cloud-removed S2 predictions. With regard to its input-to-output mapping, the proposed architecture resembles earlier SAR-to-optical translation methods [31], [32]. Similar to these earlier domain translation approaches, our network learns information of the target domain (i.e. the optical imagery) via the supervision signal. Different from these approaches, the internal learning framework described below removes clouds and directly learns to denoise the target image sequence.
The architecture of the network is summarized in Fig. 8. Note that the key difference between the given model and the sequence-to-point method of section III-A (depicted in Fig. 7) is in the output dimensions: Whereas the sequence-to-point architecture maps a sequence of n cloudy inputs to a single cloud-removed prediction, the sequence-to-sequence approach preserves the temporal information by mapping to a time series of n cloud-removed outputs. Moreover, the point estimator receives tuples of S1 and S2 inputs, whereas the network of Fig. 8 is driven solely by S1 data (or Gaussian noise, as proposed in [33], [29]). Finally, the sequence-to-point network of Fig. 7 builds on the Siamese architecture of [12] with a ResNet backbone [13] plus 3D convolutions, whereas the sequence-to-sequence approach of Fig. 8 follows a 3D convolutional variant of U-Net [30], as proposed in [29].
The training procedure of the sequence-to-sequence network follows that of internal learning for image inpainting [33], [29], which is formalized in Algorithm 1. In this framework, for a given target sequence, a neural network is trained from scratch directly on the target sequence (without any need for additional or cloud-free training data) in order to reconstruct its noisy pixels. The observations exhibit spatio-temporal regularities and patterns (i.e. signal in the data), which are modeled and learned by the network first. The irregularities in the sequence (i.e. noise in the target data) are internalized only afterwards, similar to a conventionally trained network overfitting to noise in its training data. The internal learning approach exploits this signal-noise dichotomy and teaches a model to reconstruct cloud-covered pixels in the target sequence of S2 observations, without need for any external or cloud-free training data. In detail, a neural network is initialized and trained from scratch directly on the target sequence. At each iteration, the model receives input driving its activations (e.g. Gaussian noise or S1 recordings) and predicts a sequence Ŝ2. The predictions Ŝ2 are compared against the target sequence S2 (e.g. according to a cost function L_all as in section IV-D) and the network learns to reproduce the cloud-free pixels. The training stops before the network overfits to internalizing the cloudy pixels.
With respect to its application and functionality, our sequence-to-sequence neural network resembles classical low-rank & sparse signal decomposition methods [34], [35], [36], [37]: First, while neural networks are typically trained on a dedicated training data set separated from the test observations, such signal decomposition methods can be utilized directly on the data of interest. Similarly, our model can be applied directly on the test data. Second, unmixing of signals is very generic and can be applied to matrices as well as tensors. In comparison, the deep image prior approach applies to single images as well as time series [33], [29], too. Finally, the decomposition itself is into a low-rank part and a sparse component. The low-rank part denotes the data's compact representation and regularities, that is, spatial, spectral or temporal (auto-)correlations such as the land cover mapped by a satellite. The sparse component consists of the irregular part of the data which has only a few non-zero entries, such as the appearance of clouds. In comparable terms, the internal learning technique allows our network to discover the regularities in the data and generalize them to cloud-covered samples, before overfitting to the noise.

IV. EXPERIMENTS AND RESULTS
This section details the experimental design and the corresponding results for the considered cloud removal methods as well as their ablation variants. Section IV-A specifies the measures of goodness used to assess the quality of the individual techniques' predictions. Section IV-B introduces the baselines compared against the proposed model of section III-A on the sequence-to-point cloud removal task. Sections IV-C and IV-D detail the experiments and outcomes for the sequence-to-point and sequence-to-sequence cloud removal tasks, respectively.

Fig. 8: A conceptual illustration of the 3D encoder-decoder architecture G_seq2seq employed in the sequence-to-sequence cloud removal model [28]. The network is based on the architecture of [29] and consists of encoder and decoder parts arranged symmetrically in the style of U-Net [30], with skip connections between paired layers. Input to the network is a batch of multi-temporal S1 observations. The output is a predicted batch of multi-temporal multi-spectral S2 observations. For the ablation model considered in section IV-D, Gaussian noise is used as an input as in [33], [29].

A. Metrics
We evaluate quantitative performance in terms of normalized root mean square error (NRMSE), Peak Signal-to-Noise Ratio (PSNR), structural similarity (SSIM) [38] and Spectral Angle Mapper (SAM) [39]. The metrics compare images x, y via their respective pixel values x_{c,h,w}, y_{c,h,w} ∈ [0, 1], with dimensions C = 3, H = W = 256, means μ_x, μ_y, standard deviations σ_x, σ_y, covariance σ_xy as well as constants ϵ_1, ϵ_2 to stabilize the computation. NRMSE belongs to the class of pixel-level metrics and quantifies the average discrepancy between target and predicted pixels in units of the measure of interest. PSNR is evaluated on the whole image and quantifies the signal-to-noise ratio of the prediction as a reconstruction of the target image. SSIM is another image-wise measure that builds on PSNR and captures the structural similarity of the prediction to the target in terms of perceived change, contrast and luminance [38]. The SAM measure is a third image-level metric that provides the spectral angle between the bands of two multi-channel images [39]. For further analysis, the pixel-wise NRMSE is evaluated in three manners: 1) over all pixels of the target image (as per convention), 2) only over cloud-covered pixels (visible in none of the input optical samples) to measure reconstruction of noisy information, as well as 3) only over cloud-free pixels (visible in at least one input optical patch), quantifying preservation of information. The pixel-wise masking is performed according to the cloud mask given by the detector of [19].
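The equation bodies were lost in extraction; the standard definitions consistent with the symbols above read (a reconstruction of the commonly used formulas, not the paper's verbatim equations):

```latex
\mathrm{NRMSE}(x, y) = \sqrt{\frac{1}{C H W} \sum_{c,h,w} \left( x_{c,h,w} - y_{c,h,w} \right)^2}

\mathrm{PSNR}(x, y) = 20 \cdot \log_{10} \left( \frac{1}{\mathrm{NRMSE}(x, y)} \right)

\mathrm{SSIM}(x, y) = \frac{(2 \mu_x \mu_y + \epsilon_1)(2 \sigma_{xy} + \epsilon_2)}{(\mu_x^2 + \mu_y^2 + \epsilon_1)(\sigma_x^2 + \sigma_y^2 + \epsilon_2)}

\mathrm{SAM}(x, y) = \arccos \left( \frac{\langle x, y \rangle}{\lVert x \rVert \cdot \lVert y \rVert} \right)
```

Since all pixel values lie in [0, 1], the RMSE is already normalized; SAM is evaluated on the spectral vectors of corresponding pixels and averaged over the image.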

B. Baseline Methods
To put the performances of our proposed model and ablations into context, we consider the following baseline methods. First ("least cloudy"), taking the least-cloudy input observation and forwarding it without further modification to be compared against the cloud-free target image. This provides a measure of how hard the cloud removal task is with respect to the extent of cloud coverage present in the data. Second ("mosaicing"), we perform a mosaicing method that averages the values of pixels across cloud-free time points, thereby integrating information across time. That is, for any pixel, if there is a single clear-view time point then its value is copied, for multiple cloud-free samples the mean is formed, and in case no cloud-free time point exists, a value of 0.5 is taken as a proxy. This is to avoid any extreme values, such as cloudy pixels of high intensity. The mosaicing technique provides a measure of how much information can be reconstructed across time, from multi-spectral optical observations exclusively. Third, ResNet refers to a residual neural network as described and trained in sections IV-B as well as IV-C. The architecture is based on the model of [13], and serves as a relevant baseline because parts of this model are used as Siamese residual branches within our model, as detailed in section III-A. It provides an estimate of how well a point-to-point cloud removal model can perform as a baseline. Fourth, the baseline STGAN denotes the "Branched ResNet generator (IR)" architecture of [12]. It is a sequence-to-point cloud removal model, and the architecture of our own sequence-to-point neural network closely follows its design, as detailed in section III-A. In sum, the purpose of assessing these baselines is to analyze whether trivial solutions to the multi-modal multi-temporal sequence-to-point cloud removal problem exist, and how any more sophisticated deep learning approach compares against these methods and our proposed model trained on SEN12MS-CR-TS.
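The "mosaicing" baseline can be sketched per pixel: copy a single clear observation, average multiple clear ones, and fall back to the proxy value 0.5 when every time point is cloudy. A minimal pure-Python sketch over one pixel's time series (mask convention: 1 = cloudy):

```python
def mosaic_pixel(values, cloudy):
    """Average a pixel across cloud-free time points; return 0.5 if it is never clear.

    values: pixel intensities over time, cloudy: matching binary cloud flags (1 = cloudy).
    """
    clear = [v for v, c in zip(values, cloudy) if c == 0]
    if not clear:              # cloud-covered at every time point -> proxy value
        return 0.5
    return sum(clear) / len(clear)

print(mosaic_pixel([0.25, 0.9, 0.75], [0, 1, 0]))  # mean of the two clear values: 0.5
print(mosaic_pixel([0.9, 0.8, 0.95], [1, 1, 1]))   # never clear -> proxy value 0.5
```

Applied independently to every pixel of the time series, this yields the mosaiced baseline image compared against in Table I.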

C. Sequence-to-Point Cloud Removal
This section details the training specifics of the sequence-to-point cloud removal architecture introduced in section III-A. As detailed there, up to the temporal concatenation layer we use a version of the ResNet-based [24] cloud removal network of [13], pre-trained on SEN12MS-CR [14] according to the training specifics of [13]. All our considered sequence-to-point cloud removal networks and ablation models share this pre-trained single-temporal cloud removal network as a starting point, for the sake of comparability and in order to reduce the duration of training. The networks are trained for a total of 10 epochs on 1 tuple of patches per location for every ROI in the training split. For training, the input S2 patches are filtered to exhibit between 0 and 50 percent cloud coverage. The target S2 patch is selected to be the sample showing the minimum cloud coverage over the given time series, i.e. it is not necessarily temporally preceding or following the input patches. For the first 25000 steps of the training procedure, the networks are trained with the initial ResNet Siamese components frozen, exclusively optimizing the subsequent 3D-convolution layers. After these steps with the pre-trained weights frozen, once the deeper layers have been calibrated to the initial network's latent feature maps, the full network is trained end-to-end for the remainder of the process. During training, the network minimizes the loss L_all with λ_L1 = 100 according to [12] and λ_perc = 1 as hyper-parameters weighting the individual pixel-wise loss L_L1 and the perceptual loss L_perc. The perceptual loss is computed by means of an auxiliary VGG16 network [40], resulting in sharper image reconstructions [41]. In contrast to VGG16 networks pre-trained on classical computer vision data sets such as ImageNet [42] and thus limited to RGB-channel data, we pre-trained a VGG16 for land cover classification on the SEN12MS data set [43] according to the training protocol of [2]. The proposed sequence-to-point cloud removal network (and its ablation variants) are optimized via ADAM [44], with a learning rate of 0.0002 and momentum parameters [0.5, 0.999] as in [12]. A batch size of 1 tuple of samples per iteration is used for training.
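The training objective combines the two loss terms with the weights stated above. A hedged sketch of the linear combination only; the scalar inputs stand in for the actual pixel-wise L1 term and the VGG16-based perceptual term:

```python
LAMBDA_L1, LAMBDA_PERC = 100.0, 1.0  # weights as stated in the text

def total_loss(l1_term, perc_term,
               lambda_l1=LAMBDA_L1, lambda_perc=LAMBDA_PERC):
    """L_all = lambda_L1 * L_L1 + lambda_perc * L_perc."""
    return lambda_l1 * l1_term + lambda_perc * perc_term

# e.g. an L1 error of 0.01 and a perceptual error of 0.5:
print(total_loss(0.01, 0.5))                   # 1.5
print(total_loss(0.01, 0.5, lambda_perc=0.0))  # 1.0 (perceptual-loss ablation)
```

Setting `lambda_perc=0.0` corresponds to the perceptual-loss ablation reported in the experiments.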
To evaluate performances on the test split, samples containing S2 observations from the complete range of cloud coverage (between 0 and 100%) are considered for input. Table I compares the results of our proposed model with the baselines detailed in section IV-B. The results show that the proposed network outperforms the baselines in the majority of metrics, except for PSNR (where mosaicing comes first) and the NRMSE (clear) preservation metric (where the "least cloudy" approach performs best). This demonstrates that a deep neural network approach can typically outperform trivial solutions to the multi-modal multi-temporal cloud removal problem. Exemplary outcomes for the considered baselines on four different samples from the test split are presented in Fig. 9. The considered cases are cloud-free, partly cloudy, cloud-covered with no visibility except for a single time point, and cloud-covered with no visibility at any time point. The results show that the considered models typically outperform the simple heuristics. One exceptional case is least cloudy in the absence of clouds, which manages to accomplish a faithful prediction in such settings. Moreover, the illustrations underline that multi-temporal and multi-modal data may benefit image reconstruction: While most methods perform well in the cloud-free or partly cloudy cases, multi-source integration is needed if individual time points contain dense cloud coverage over wide areas. When all input data is covered by thick clouds, this poses a severe challenge for all approaches considered. To analyze the benefits of including S1 SAR data, we perform an ablation study and compare a multi-sensor model against one only utilizing multi-spectral S2 input. Table II compares the results of the multi-modal model with an ablation version not using S1 SAR data. The comparison illustrates the benefits of including SAR data when reconstructing cloud-covered pixels. Next, we conduct an ablation experiment to assess the additional benefits of utilizing the introduced perceptual loss. Table III compares the results of our proposed model with an ablation version not using the perceptual loss (i.e. setting λ_perc = 0 in eq. 1). The outcomes imply that the usage of a perceptual loss results in cloud-removed predictions of a higher quality. Finally, we consider the extension of the proposed model into networks integrating 4 and 5 time points of input information. Table IV compares the performance of our model as a function of input time points (n = 3, 4, 5). The results indicate that considering longer time series may provide further improvements in terms of reconstructing cloud-covered information. In a final experiment on sequence-to-point cloud removal, Table V reports the performance of our proposed model (n = 3, with S1 and perceptual loss) as a function of cloud coverage. That is, for a given interval of cloud coverage, all n = 3 input images are sampled to contain a corresponding extent of clouds. The outcomes show that image reconstruction performance is highly dependent on the percentage of cloud coverage. While the performance decrease is not strictly monotonic with an increase in cloud coverage, a strong association persists.

Fig. 9: Exemplary predictions and cloud-free target images for all baselines reported in Table I. Columns: Four different samples from the test split. The considered cases are cloud-free, partly cloudy, cloud-covered with no visibility except for a single time point and cloud-covered with no visibility in any time point. Rows: Predictions of least cloudy, mosaicing, ResNet, STGAN, ours (n = 3) as well as the cloud-free reference image. The results show that the considered models outperform the simple heuristics. Moreover, the illustrations underline that multi-temporal and multi-modal data may benefit image reconstruction.

D. Sequence-to-Sequence Cloud Removal
A key characteristic of training the sequence-to-sequence cloud removal model described in section III-B is that the model is trained directly on the time series of images one aims to remove clouds from, without the use of any external training data, as in [33], [29]. More specifically, the training procedure teaches the network to replicate cloud-free pixels and inpaint cloud-covered ones in the target sequence S2 according to the cost function L_all formulated in [29] as L_all = λ_L2 · L_2 + λ_perc · L_perc, where λ_L2 = 1 and λ_perc = 0.01 refer to hyper-parameters that linearly combine the terms constituting L_all. L_2 is a pixel-wise reconstruction loss evaluated over the cloud-free pixels, and L_perc is a perceptual loss computed via an auxiliary VGG16 network [40] as explained before. The pseudo-code formalizing the internal learning procedure is given in Algorithm 1 described in section III-B, and further justifications are stated in the original work of [33]. For a given target sequence, the network is trained for 20 passes with batches of n = 5 samples consisting of temporally adjacent images, for 100 iterations per pass. The network is optimized via ADAM [44] with a learning rate of 0.01 and the hyper-parameters of Algorithm 1 set as stated in [29].
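The masked pixel-wise term of this objective can be sketched as follows: the L_2 reconstruction error is averaged over cloud-free pixels only, then linearly combined with a perceptual term. A pure-Python sketch over flattened pixel lists; the perceptual term is passed in as a scalar stand-in for the VGG16 feature distance:

```python
LAMBDA_L2, LAMBDA_PERC = 1.0, 0.01  # weights as stated in the text

def masked_l2(pred, target, cloud_mask):
    """Squared error averaged over cloud-free pixels only (mask: 1 = cloudy)."""
    pairs = [(p, t) for p, t, c in zip(pred, target, cloud_mask) if c == 0]
    return sum((p - t) ** 2 for p, t in pairs) / len(pairs)

def loss_all(pred, target, cloud_mask, perc_term):
    """L_all = lambda_L2 * L_2 + lambda_perc * L_perc."""
    return LAMBDA_L2 * masked_l2(pred, target, cloud_mask) + LAMBDA_PERC * perc_term

pred, target = [0.5, 0.2, 0.8], [0.5, 0.4, 0.9]
mask = [0, 0, 1]  # the third pixel is cloudy and excluded from L_2
print(loss_all(pred, target, mask, perc_term=0.0))
```

Because the loss is evaluated only where the target is cloud-free, the network receives no supervision to reproduce clouds, which is what lets internal learning inpaint the cloudy pixels before it starts overfitting to them.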
To quantitatively evaluate the considered model on SEN12MS-CR-TS, we propose the following protocol for the sequence-to-sequence cloud removal task: For a given target sequence, the least cloud-covered S2 observation is identified and denoted as the target image S2_t. The most cloudy S2 sample is selected and denoted as the source image S2_s. The cloud-covered pixels of S2_s, according to a cloud mask m, are alpha-blended with the cloud-free pixels of S2_t, similar to the approach of [3]. Finally, the cloud-removed prediction Ŝ2_t is compared against the originally cloud-free S2_t in order to obtain a measure of goodness of cloud removal. Table VI shows the results of the proposed network on the sequence-to-sequence cloud removal task following the aforementioned protocol. Furthermore, the considered model is compared against an ablation model conditioned on random Gaussian noise, as in [33], [29], in place of the meaningful S1 input observations. Example outcomes of sequence-to-sequence cloud removal on a given ROI are depicted in Fig. 1. Furthermore, Fig. 11 provides a qualitative comparison between the predictions conditioned on SAR versus no prior information, underlining the benefits of multi-modal information. The results highlight that the internal learning approach can learn to reconstruct cloud-covered pixels from a very limited amount of data. Furthermore, the results demonstrate that including SAR data results in performance benefits over the single-sensor baseline.
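The blending step of this protocol can be sketched per pixel: where the source's cloud mask fires, the cloudy source value is alpha-blended over the clear target value; elsewhere the target value is kept. A minimal sketch; the blending factor alpha is an illustrative assumption, not a value taken from the paper:

```python
def blend_pixel(source, target, cloudy, alpha=0.5):
    """Construct a synthetic cloudy evaluation input: on cloudy pixels,
    alpha-blend the cloudy source value over the clear target value;
    elsewhere keep the target. (alpha = 0.5 is an illustrative choice.)"""
    if cloudy:
        return alpha * source + (1.0 - alpha) * target
    return target

print(blend_pixel(1.0, 0.5, cloudy=True))   # 0.5*1.0 + 0.5*0.5 = 0.75
print(blend_pixel(1.0, 0.5, cloudy=False))  # clear pixel: target value 0.5
```

The model's prediction for the blended input is then scored against the untouched cloud-free target S2_t with the metrics of section IV-A.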

V. DISCUSSION
The main contribution of this work is in curating and providing SEN12MS-CR-TS, a multi-modal, multi-temporal data set for cloud removal in optical satellite imagery. Our large-scale data set covers a heterogeneous set of ROI sampled from all over Earth, acquired in different seasons throughout the year. Given that the contained observations cover clear-view conditions, filmy haze, as well as non-transparent dense clouds, the objective of reconstructing cloud-covered information poses a challenging task for the considered methods and future approaches. To demonstrate the usefulness of the presented data set, we propose a sequence-to-point as well as a sequence-to-sequence cloud removal network. The considered methods are evaluated in terms of pixel-wise and image-wise metrics. We provide evidence that taking time series information into account facilitates the reconstruction of cloudy pixels and that including multi-sensor measurements further improves the goodness of the cloud-removed predictions, justifying the design of SEN12MS-CR-TS to include multi-temporal and multi-modal data. The major difference to the preceding mono-temporal SEN12MS-CR data set [15] for cloud removal is that SEN12MS-CR-TS features a time series of 30 samples per ROI. This allows for developing methods that integrate information across time to more faithfully reconstruct cloud-obscured measurements. The sensitivity to temporal information may be particularly valuable for future research investigating the benefits of cloud removal for time-sensitive applications, such as change detection. On the other hand, there is a trade-off in terms of size: while SEN12MS-CR-TS is more than twice as large as its mono-temporal precursor, the latter contains about two times as many ROI sampled over all continents. However, both data sets are fully compatible, meaning that holdout ROI of one belong to the test split of the other data set and vice versa. As there is no geospatial overlap across splits between both
data sets, they can be combined for training or validation purposes. Finally, the two data sets exhibit a comparable extent of cloud coverage, about 50 % and 48 %, respectively, both covering the full spectrum from semi-transparent haze to thick and dense clouds. A discrepancy between both data sets is that SEN12MS-CR has between 25 and 50 % overlap between neighboring patches (following the design of [43]), whereas SEN12MS-CR-TS has no intersection between adjacent samples. SEN12MS-CR contains 122,218 patch triplets of S1, cloudy S2 and cloud-free S2 data, whereas SEN12MS-CR-TS consists of 30 time samples for each of its 15,578 patch-wise observations, for every S1 and S2 measurement. Due to differences in pre-processing, the two data sets are not co-registered patch-wise but, importantly, they share a common definition of ROI as well as train and test splits. This way, they are compatible with one another such that SEN12MS-CR-TS can be utilized for time-series cloud removal while SEN12MS-CR can provide further geospatial coverage of additional ROI at individual time points. Thanks to the different designs of both data sets, they may prove beneficial in facilitating a variety of downstream tasks, such as semantic segmentation [43], scene classification [2] or change detection [5], even in the presence of clouds.
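The per-sensor sample count of SEN12MS-CR-TS follows directly from the figures above; a quick cross-check (constants taken from the text, total count derived by simple arithmetic):

```python
# Bookkeeping of the data set sizes as stated in the text.
N_TRIPLETS_CR = 122_218     # SEN12MS-CR: (S1, cloudy S2, cloud-free S2) patch triplets
N_PATCHES_CR_TS = 15_578    # SEN12MS-CR-TS: patch-wise observations
N_TIME_SAMPLES = 30         # time points per patch, for S1 and for S2 each

# Per sensor, SEN12MS-CR-TS thus contains 15,578 * 30 = 467,340 patch samples.
samples_per_sensor = N_PATCHES_CR_TS * N_TIME_SAMPLES
```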
Beyond the design of our novel data set, additional contributions of this work are in introducing the internal learning approach to cloud removal in optical satellite data, as well as demonstrating that SAR-to-optical cloud removal performs better than the original noise-to-optical translation framework. While our data set aims to provide a global distribution of samples, we think that the internal learning approach to cloud removal may be of particular interest for remote sensing practitioners focusing on a single, spatially confined ROI, as no further external data is necessary.

VI. CONCLUSION
As a large extent of our planet is covered by haze or clouds at any given point in time, such atmospheric distortions pose a severe constraint to the ongoing monitoring of Earth. To approach this challenge, our work presented SEN12MS-CR-TS, a multi-modal and multi-temporal data set for training and evaluating global and all-season cloud removal methods. Our data set contains Sentinel-1 and Sentinel-2 observations from over 80,000 km² of land cover, distributed globally and recorded throughout the year. The globally distributed ROI are each large-sized and capture a heterogeneous range of land cover. We demonstrated the practicality of SEN12MS-CR-TS by considering two methods: First, a model for sequence-to-point cloud removal. Second, a network for sequence-to-sequence cloud removal which, to our knowledge, provides the first instance of a model preserving temporal information proposed in the context of cloud removal. Both methods benefited from the presence of co-registered and paired SAR measurements contained in our data set. The conducted experiments highlight the contribution of our curated data set to the remote sensing community as well as the benefits of multi-modal and multi-temporal information for reconstructing noisy information. SEN12MS-CR-TS is made public to facilitate future research in multi-modal and multi-temporal image reconstruction.

Fig. 10: Illustration of baseline methods for the sequence-to-sequence cloud removal task. The presented results show a cloudy image to be de-clouded, as well as the predictions via Riemannian Robust Principal Component Pursuit (RPCP) [45], Nonnegative Matrix Factorization Incremental Subspace Learning (NMFISL) [46], Probabilistic Non-negative Matrix Factorization (PNMF) [47], Manhattan Non-negative Matrix Factorization (MNMF) [48] and Online Stochastic Tensor Decomposition (OSTD) [49]. The results indicate that the presence of large and dense clouds poses a severe challenge for the considered methods. Most baselines de-cloud the image except for some residual artifacts; some techniques display discoloration. For comparison with ours (no S1), ours (with S1) and the cloud-free target image, see Fig. 11.

ACKNOWLEDGMENT
The authors would like to thank ESA and the Copernicus program for making the Sentinel observations used in this work publicly available. Furthermore, we would like to thank Rewanth Ravindran for assisting us in the data curation process.

APPENDIX A TEMPORAL COINCIDENCE OF PAIRED OBSERVATIONS
Full-scene observations of Sentinel-1 and Sentinel-2 are collected within a 14-day time window in a paired manner, as specified in section II-A. To further analyze the temporal distance within paired data, Fig. 12 illustrates the empirically observed coincidences within SEN12MS-CR-TS. The mean time difference across all paired observations is 2.61 (± 2.41) days, which is considerably smaller than the interval bound and implies a close temporal proximity between paired samples.
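A nearest-neighbor pairing under the 14-day constraint can be sketched as follows; the function name and the greedy per-S2 matching are illustrative assumptions, not the exact curation procedure:

```python
from datetime import date

MAX_GAP_DAYS = 14  # pairing window, as specified in section II-A

def pair_observations(s1_dates, s2_dates):
    """Hedged sketch: for each S2 acquisition date, pick the temporally
    closest S1 acquisition and keep the pair if it falls within the window."""
    pairs = []
    for d2 in s2_dates:
        d1 = min(s1_dates, key=lambda d: abs((d - d2).days))
        gap = abs((d1 - d2).days)
        if gap <= MAX_GAP_DAYS:
            pairs.append((d1, d2, gap))
    return pairs
```

Averaging the gaps of all returned pairs would yield the empirical mean temporal distance reported above.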
This work was partially supported by the Federal Ministry for Economic Affairs and Energy of Germany in the project "AI4Sentinels - Deep Learning for the Enrichment of Sentinel Satellite Imagery" (FKZ50EE1910). The work of X. Zhu is jointly supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. ERC-2016-StG-714087, Acronym: So2Sat), by the Helmholtz Association through the Framework of Helmholtz AI (grant number: ZT-I-PF-5-01), Local Unit "Munich Unit @Aeronautics, Space and Transport (MASTr)", and the Helmholtz Excellent Professorship "Data Science in Earth Observation - Big Data Fusion for Urban Research" (grant number: W2-W3-100), by the German Federal Ministry of Education and Research (BMBF) in the framework of the international future AI lab "AI4EO - Artificial Intelligence for Earth Observation: Reasoning, Uncertainties, Ethics and Beyond" (grant number: 01DD20001), and by the German Federal Ministry of Economics and Technology in the framework of the national center of excellence "ML4Earth" (grant number: 50EE2201C). (Corresponding Author: Xiao Xiang Zhu) P. Ebel and Y. Xu are with Data Science in Earth Observation, Technical University of Munich, 80333 Munich, Germany (e-mail: patrick.ebel@tum.de).

Fig. 1: Example observations and cloud-free predictions. Columns: Samples at two different time points. Rows: S1 data (in grayscale), cloudy S2 data (in RGB), predicted cloud-free Ŝ2 data, reference cloud-free S2 data of a later point in time. The results highlight that our network is able to integrate multi-modal and multi-temporal information to predict a clear-view sequence of multi-spectral observations, even in the presence of heavy cloud coverage.

Fig. 6: Four different regions contained in SEN12MS-CR-TS, highlighting the diversity of sampled land covers. The depicted S2 observations (RGB channels) are cloud-free samples of their respective time series. The average ROI covers about 40 × 40 km² and is split into over 700 patch samples, each patch of size 256 × 256 px.

Fig. 7: A conceptual illustration of the sequence-to-point cloud removal architecture G_seq2point. The network is based on the architecture of [12] and consists of n siamese ResNet branches [13] performing single-time-point cloud removal on n individual time points. Subsequently, the feature maps are stacked in the temporal dimension and 3D convolutions are applied to integrate information across time. The output of the network is a single cloud-free image prediction.
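At the level of tensor shapes, the dataflow described in this caption can be sketched as follows. The channel counts (64 branch features, 15 = 2 S1 + 13 S2 input channels, 13 output bands) are assumptions for illustration, not values taken from [12]:

```python
# Shape-level sketch of G_seq2point: n siamese branches -> temporal stacking -> 3D conv fusion.
# Shapes are (channels, height, width) tuples; no actual convolution is computed here.

def branch(image_shape, out_channels=64):
    """Stand-in for one siamese ResNet branch, mapping (C, H, W) -> (out_channels, H, W).
    All branches share weights, so the same mapping applies at every time point."""
    _, h, w = image_shape
    return (out_channels, h, w)

def seq2point_shapes(image_shapes, out_bands=13):
    feats = [branch(s) for s in image_shapes]   # per-time-point feature maps
    c, h, w = feats[0]
    stacked = (c, len(feats), h, w)             # stack along a new temporal dimension
    fused = (out_bands, h, w)                   # 3D convs integrate across time into one image
    return stacked, fused
```

For example, three time points of 15-channel 256 × 256 inputs would be stacked into a (64, 3, 256, 256) feature volume before being fused into a single (13, 256, 256) prediction.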

Fig. 11: Illustration of the effect of prior guidance via SAR information. Columns: SAR input to the SAR-conditioned model, cloud-free prediction of the model conditioned on Gaussian noise, cloud-free prediction of the model conditioned on SAR information, and a cloud-free observation as reference image. The structural information in the SAR input provides a strong prior to the model, guiding it towards learning to remove clouds in the cloudy input time series.

Fig. 12: Histogram of temporal differences between paired observations. The mean time difference across all paired observations is 2.61 (± 2.41) days, indicating a close temporal proximity between paired samples.

TABLE I: Quantitative evaluation of the proposed sequence-to-

TABLE II: Comparison of the proposed sequence-to-point model including SAR observations versus an ablation version without SAR observations in terms of normalized root mean squared error (NRMSE), peak signal-to-noise ratio (PSNR), structural similarity (SSIM) [38] and the Spectral Angle Mapper (SAM) [39] metric. The comparison illustrates the benefits of including SAR data when reconstructing cloud-covered pixels.
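Two of the metrics used throughout the evaluation, PSNR and SAM, can be sketched in a few lines. This is a minimal reference implementation over flat pixel and spectral-band lists, assuming intensities normalized to [0, 1]; the reported numbers are of course computed over full multi-spectral patches:

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB over flat pixel lists."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return float("inf") if mse == 0 else 10 * math.log10(max_val ** 2 / mse)

def sam(pred, target):
    """Spectral Angle Mapper: angle (in radians) between two spectra."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    return math.acos(max(-1.0, min(1.0, dot / (norm_p * norm_t))))
```

Higher PSNR is better, while lower SAM (a smaller spectral angle) indicates more faithful spectral reconstruction.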

TABLE IV :
Quantitative evaluation of the proposed sequenceto-sequence model with varying numbers of time points [39] 3, 4, 5)in terms of normalized root mean squared error (NRSME), peak signal-to-noise ratio (PSNR), structural similarity[38](SSIM) and the Spectral Angle Mapper (SAM)[39]metric.Our multi-temporal network with SAR guidance outperforms the multi-temporal ablation model without prior SAR information.