U-TILISE: A Sequence-to-Sequence Model for Cloud Removal in Optical Satellite Time Series

Satellite image time series in the optical and infrared spectrum suffer from frequent data gaps due to cloud cover, cloud shadows, and temporary sensor outages. How to best reconstruct the missing pixel values and obtain complete, cloud-free image sequences has been a long-standing problem in remote sensing research. We approach that problem from the perspective of representation learning and develop U-TILISE, an efficient neural model that implicitly captures spatio-temporal patterns of the spectral intensities and can therefore be trained to map a cloud-masked input sequence to a cloud-free output sequence. The model consists of a convolutional spatial encoder that maps each individual frame of the input sequence to a latent encoding; an attention-based temporal encoder that captures dependencies between those per-frame encodings and lets them exchange information along the time dimension; and a convolutional spatial decoder that decodes the latent embeddings back into multi-spectral images. We experimentally evaluate the proposed model on EarthNet2021, a dataset of Sentinel-2 time series acquired all over Europe, and demonstrate its superior ability to reconstruct the missing pixels. Compared to a standard interpolation baseline, it increases the PSNR by 1.8 dB at previously seen locations and by 1.3 dB at unseen locations.


I. INTRODUCTION
Modern satellite images have made it possible to continuously and systematically monitor the Earth's surface. Remotely sensed imagery, and products derived from it by image classification [1]-[3], segmentation [4], [5], or regression [6]-[8], have become an important data source for applications ranging from environmental monitoring [9]-[11] to agricultural management [12]-[14]. Moreover, multi-temporal satellite image sequences provide unprecedented opportunities to explore the temporal dynamics of natural processes, such as the evolution of land cover phenology [15]. Among Earth observation missions that systematically revisit the same locations and capture such sequences, most acquire images in the optical and near-infrared part of the electromagnetic spectrum. On the one hand, optical images are particularly suitable for visual interpretation by humans. On the other hand, spectral signatures in that part of the spectrum contain information that makes it possible to distinguish between different land cover types and to quantify the health and vitality of vegetation. Notably, several indicators for vegetation density and productivity are based on (heuristic, non-linear) combinations of spectral intensities between different spectral bands.
Unfortunately, the effective availability of optical satellite images is considerably lower than the nominal revisit frequency of the satellites. Data gaps routinely occur, either due to occlusions or because no image is captured during an overpass for technical reasons, such as sensor maintenance or conflicting imaging requests. The primary cause for data gaps is the weather, i.e., clouds, haze, and cloud shadows that partially or fully obscure the observed scene. A study over 12 years of optical data acquired by the MODIS sensor concluded that, on average, clouds occlude 67% of the Earth's surface and 55% of the land surface at any point in time [19]. The large proportion of data gaps, which moreover are irregularly distributed, calls for measures to ensure the usability of monitoring systems in the presence of frequent clouds.
A first, rudimentary approach adopted by several satellite processing pipelines is to discard images affected by clouds before further analysis. Processing solely cloud-free image observations may make data handling and visual inspection more convenient, but it also discards a lot of data that may still be usable, since often only a moderate fraction of an image or time series is affected by clouds. Moreover, [20] has shown that learning-based image classifiers trained on curated, cloud-free training datasets do not perform all that well when applied to images with even a small amount of clouds, let alone to images with moderate or severe cloud cover. More recent pipelines operate on all available input images, regardless of their degree of cloud cover, and learn to ignore uninformative pixels at an algorithmic level, for instance, via data-driven attention mechanisms [1]. An alternative strategy is to remove clouds and cloud shadows in advance, such that the subsequent processing is no longer affected by them, rather than (explicitly or implicitly) ignore them during image analysis. An advantage of such a two-step approach is that multiple image analysis pipelines tailored to different tasks can all use the same cloud-free images. These pipelines can then be more efficient and often also more robust because they need not each devote a significant share of their model capacity to the detection and handling of data gaps [21].
In the last few years, encoder-decoder style neural networks have become the prevalent approach to recover missing data in optical satellite images. Several methods have tried to directly learn the mapping from a cloudy to a cloudless RGB image [22]-[24], others even try to translate Synthetic Aperture Radar (SAR) images to multi-spectral optical images [25]-[29]. A recent trend has been to combine the two inputs and perform SAR-optical data fusion [30]-[32]. SAR systems employ active illumination with much longer wavelengths, which are unaffected by clouds and shadows and may provide complementary information to guide the reconstruction of occluded content. However, one must bridge a considerable domain gap to infer optical reflectance values from SAR amplitudes. Both are likely to change at land cover boundaries, so SAR may help to restore image gradients, but it can hardly be expected to offer much information about the actual spectral values, colors, and texture details. In contrast, sequences of optical images depicting the same location exhibit strong correlations along the temporal dimension. These spatio-temporal patterns provide evidence about the spatial structure and the spectral properties. Importantly, this holds not only for static or gradually changing land cover but also includes more complex temporal dynamics, like seasonal variations. The idea to leverage multi-temporal images for cloud removal is rather obvious; but most existing approaches [31], [33]-[36] collapse the multi-temporal, cloudy input into a single gap-free image, often additionally relying on spatially and temporally co-registered SAR observations to guide the reconstruction [31].
Here, we focus on full sequence-to-sequence translation: our goal is to convert the cloudy input time series into a gap-free product with the same time steps, but containing a clean, cloud-free image at every frame (no matter whether the original frame was cloud-free, cloudy, or entirely missing). To that end, we introduce U-TILISE, a neural image sequence model that captures spatio-temporal relationships between the spectral intensities in an image time series and is, therefore, able to impute missing pixels. U-TILISE operates in three dimensions, with 2D convolutions to encode multi-scale local relationships in space and 1D (self-)attention to encode non-local relations in time. By design, its output has the same spatial and temporal extent as the input, such that it jointly reconstructs complete, gap-free time series. The model supports additional, auxiliary input channels and is therefore, in principle, able to use SAR amplitudes, too. But empirically this brings only a tiny improvement (see appendix).
We experimentally test U-TILISE on the EarthNet2021 dataset [37], which contains thousands of 30-frame, 4-channel (R, G, B, NIR) time series of Sentinel-2 data. Compared to standard interpolation between the temporally nearest unoccluded observations, our model improves the peak signal-to-noise ratio (PSNR) of the reconstructed spectral values by 1.8 dB at previously observed locations, and by 1.3 dB at unseen locations.
The remainder of this article is organized as follows: first, we provide an overview of cloud removal methods in Section II, with a focus on learned (mono-temporal as well as sequence-based) approaches. In Section III, we introduce the U-TILISE model, set out its components (Sections III-C and III-D), and describe the associated training and inference procedures (Section III-E). Next, we explain the data (Section IV) and the experimental setup (Section V) used in our evaluation. Section VI reports and discusses experimental results, followed by a conclusion (Section VII). Complementary experiments with additional SAR observations on top of optical time series are given in the appendix.

II. RELATED WORK
Reconstructing missing pixels in remotely sensed imagery has been a long-standing research problem. Early efforts toward thin cloud and haze removal build on physical relations [38]-[40] or signal processing considerations [41], [42] to describe the process of light transmission and interaction with clouds. Methods designed to recover image content occluded by thick clouds have been based on tensor factorization [43]-[45], on mosaicking of multi-temporal images [46]-[49], or they adopt statistical image processing methods originally developed for single-image inpainting [50]-[52]. In the following, we concentrate on data-driven, learning-based methods for cloud removal. For completeness, we note that video inpainting methods like [53] share conceptual similarities with satellite time series imputation, but they are beyond the scope of the present literature review.

A. Mono-temporal cloud removal
A natural formulation of cloud removal is as an image-to-image translation task, where the mapping from the cloudy input to the cloud-free output is learned in a data-driven manner. For instance, [22] employ a conditional generative adversarial network (cGAN) [54] to map from a cloudy to a cloudless RGB satellite image. The mapping is conditioned on the NIR channel of the input, arguing that near-infrared wavelengths partially penetrate clouds and may thus capture information about the observed scene that is unavailable in the visible spectrum. In [23], the NIR channel is replaced with conditioning on a SAR image, as clouds are completely transparent at radar wavelengths. The works of [22], [23] both do not go beyond a proof-of-concept; the underlying neural networks are trained and evaluated exclusively on synthetically generated images with clouds simulated by Perlin noise [55]. It has since been shown that such simulations generalize poorly to real cloudy images [56]. To side-step the need for training examples where cloudy and cloud-free images are in exact, pixel-wise correspondence, [24], [56] rely on a cycle-consistency loss [57]. In this way, the networks can be trained directly on images with real data gaps, eliminating the potential domain gap between training and test data. While the method in [24] is limited to thin clouds, [56] do not impose any restrictions on the maximum permissible cloud coverage or density. Furthermore, [56] combine explicit modeling of cloud densities with a residual learning strategy to better preserve the pixel values in cloud-free image regions. The methods mentioned so far have in common that they are limited to images with three optical bands. To address that limitation, several SAR-to-optical image translation approaches [25]-[29] learn the mapping from a SAR image to the full stack of multi-spectral bands, often also using cGANs.
Recent advances in learned cloud removal tend to rely on image fusion, i.e., they synergistically use the cloudy optical image and a cloud-free SAR image to impute the missing pixels in the former. In [30], a Sentinel-2 image and a temporally close SAR image of the same scene are stacked together along the channel dimension and fed into a neural network that regresses a residual reflectance value at every pixel. Those per-pixel corrections are then added to the input to recover the missing data. [58] combine SAR-to-optical translation and SAR-optical data fusion in a cascaded fashion. First, a GAN is trained to map the SAR input to an optical image. That synthetic image is stacked with the original SAR data and the cloudy optical input and fed into a second GAN, trained to map the multi-modal input to a cloud-free optical image.
The recent work of [32] found that stacking optical and SAR observations into a multi-modal image and processing them together does not optimally exploit the two inputs, as feature extraction from SAR is hampered by speckle noise. Instead, the authors propose separate embedding branches per modality, together with an attention-based mechanism that gradually and selectively fuses features from the two branches.
SAR-optical data fusion approaches have demonstrated that complementary information in the form of SAR observations can help to compensate for missing data in optical images. Still, fusing optical and SAR data remains challenging due to the large domain gap between the two modalities. One must also keep in mind that SAR can hardly contribute to restoring actual spectral reflectance information, like different hues or fine-grained textures. Its role is to add spatial context, such as land-cover boundaries, which appear as gradients also in the SAR amplitude. An alternative approach is to inject contextual information from other optical sensors, as, for instance, in [59]. Clearly, this greatly reduces the domain gap, but on the other hand, there is no guarantee that a temporally close and largely cloud-free image can be found.

B. Sequence-based cloud removal
Sequence-to-point methods [31], [33]-[36], [60], [61] consume a multi-spectral time series with data gaps and output a single, gap-free image. In most cases, that output does not have a well-defined time stamp but rather is seen as representative of the entire time period between the start and end dates of the time series. Even if the output is associated with a specific time, e.g., the middle frame, the method would have to be run iteratively to reconstruct an entire time series. Typically, the input time series only have three to five images. Optionally, the reconstruction can additionally be guided by a single SAR image [61] or by a SAR time series (approximately) aligned frame-to-frame with the input [31]. Some of these sequence-to-point approaches impose tight restrictions on the maximum cloud coverage per image, e.g., [33] require at most 10-30% cloud cover, and [60] require the first and last of three frames to be cloud-free (0% cover). Furthermore, these methods often assume short temporal intervals with minimal land cover changes over time since they accumulate spectral information along the temporal dimension to create the output image. That assumption rules out systematic land cover dynamics, in particular seasonal cycles of vegetation and agriculture.
To our knowledge, [62]-[64] are the only published sequence-to-sequence models, i.e., they output a cloud-free, multi-spectral image for every frame of the input time series. In [63], the sequence model is parameterized as a recurrent neural network with a two-layer GRU [65] architecture and used to learn the mapping from a SAR time series to a time series of Landsat images, only for pixels belonging to a specific land cover class, namely, rice fields. This yields reconstructions of limited quality (PSNR < 28 dB), possibly due to the well-known difficulties of learning multi-layer recurrent models [66]. A rather different approach is taken by [62], who adopt a recent video inpainting technique [67] that extends the deep image prior (DIP) [68] to videos. The initial input is a SAR time series (as opposed to random noise in the original DIP), such that the network effectively performs a mapping from SAR to optical time series. We point out that while the DIP employs a neural network parameterization, it is not a learned model. The convolutional network structure, which favors lower amplitudes for high-frequency signals, serves as a hard-wired low-level prior of optical image statistics. It does not store any a priori information extracted from training data. Instead, the network weights are optimized individually for every input sequence at inference time. The method most related to our proposed U-TILISE model is [64], an adversarial approach that internally splits the computation into a first, coarse round of imputation and a subsequent refinement network. The coarse imputation model is a 3D spatio-temporal encoder-decoder architecture with separate backbones for the optical and SAR inputs, followed by a transformer-style attention mechanism in the bottleneck to fuse the latent embeddings of the two encoder branches. The model does not implicitly learn to ignore cloudy observations; instead, the contribution of cloudy input pixels is suppressed by explicitly modulating the learned attention masks according to the given cloud masks before applying them to the latent embeddings coming out of the optical encoder branch.

III. METHOD
To impute missing image content, we design U-TILISE, a learnable image sequence model in the form of a neural network. Once trained, the weights of that network encode a prior over spatio-temporal patterns of multi-spectral reflectance. When a time series with data gaps is fed into the network, the prior fills in the missing values to obtain a complete, gap-free time series. In contrast to existing cloud removal pipelines, our approach does not rely on auxiliary SAR observations to guide the imputation. Instead, we exploit spatial and temporal patterns within the multi-spectral time series itself to reconstruct the spatio-temporal evolution of the depicted land cover. Furthermore, our model jointly reconstructs all images in a given time series, as opposed to pipelines that reconstruct a single frame, considered representative of the entire sequence.

A. Problem formulation
Let X ∈ ℝ^{T×C×H×W} denote a multi-spectral time series, represented as a 4D tensor with T the temporal length, C the number of spectral bands, and H × W the spatial extent. Our goal is to regress a reflectance value for every spatio-temporal location, so as to obtain a complete, gap-free multi-spectral time series Ŷ ∈ ℝ^{T×C×H×W}.

Fig. 1. Overview of the proposed model: U-TILISE is a neural sequence-to-sequence model that takes as input a multi-spectral satellite image time series in which missing reflectance values are masked and outputs a complete and gap-free time series with the same dimensions. U-TILISE employs a convolutional U-Net architecture [69] over the spatial and spectral dimensions and a transformer-style self-attention mechanism along the temporal dimension. The attention masks operate on the (spatial) bottleneck between the encoder and decoder parts of the U-Net as well as on the skip connections (after suitable upsampling).
Our model assumes that the spatio-temporal locations to be imputed are marked by a binary mask M ∈ ℝ^{T×1×H×W}, where the value 1 denotes pixels with a valid observation and 0 denotes missing data values. Note that we do not impose any assumptions or requirements on the mask M: it may denote any type of data gaps, including clouds and cloud shadows, but also frames that are entirely missing, for instance, due to sensor maintenance.

B. Overview
Fig. 1 gives a graphical overview of our method. At its core is U-TILISE, a learned sequence-to-sequence model that captures spatio-temporal relationships between the spectral intensities, and thereby is able to reconstruct realistic, complete, and gap-free multi-spectral time series. For efficiency, the mask M ∈ ℝ^{T×1×H×W} of missing pixels is not separately fed into the model but imprinted directly on the multi-spectral input X ∈ ℝ^{T×C×H×W} by setting all masked pixels to the maximum intensity 1. U-TILISE consists of three components. First, a shared multi-scale convolutional encoder transforms every image of the masked sequence into a latent embedding. Next, an attention-based temporal encoder combines the per-frame embeddings across time to impute missing values in the latent sequence representation. Last, a shared convolutional decoder reconstructs every image from its latent embedding to obtain a gap-free time series with the spatial, spectral, and temporal dimensions of the input.
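For illustration, the masking convention can be written down in a few lines of PyTorch (a minimal sketch; the tensor and function names are ours, not part of the released code):

```python
import torch

def mask_input(x: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    """Imprint the validity mask on the time series by setting masked pixels to 1.

    x: multi-spectral time series of shape (T, C, H, W), reflectances in [0, 1].
    m: mask of shape (T, 1, H, W) with 1 = valid observation and 0 = data gap.
    """
    return x * m + (1.0 - m)  # masked pixels take the maximum intensity 1
```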

C. 3D spatio-temporal sequence-to-sequence model
U-TILISE builds on recent advances in learned time series processing. Its architecture is inspired by U-TAE [4], a model originally developed for crop mapping, which maps a time series to a (mono-temporal) panoptic segmentation. In a nutshell, U-TAE combines convolutions for multi-scale spatio-spectral encoding with a lightweight non-local temporal attention mechanism [2]. Intuitively, the latter learns to focus on the most salient observations within a given time series. By design, U-TAE collapses the input along the temporal dimension to produce a mono-temporal output. We take inspiration from the design of modern transformer models [70], [71] and extend the architecture to a full 3D spatio-temporal sequence-to-sequence model that preserves the temporal dimension.
U-TILISE consists of (i) symmetric multi-scale spatio-spectral encoding and decoding modules in the style of U-Net [69] and (ii) a lightweight temporal encoding module based on multi-head self-attention [72], see Fig. 1. We now describe each of these components in more detail.
1) Spatial encoder: The spatial encoder gradually transforms the masked time series of size (T × C × H × W) into a multi-scale latent embedding via a sequence of convolutional blocks. Each block comprises a 3 × 3 convolutional layer with stride 1 and d filter channels, followed by a rectified linear unit (ReLU) as non-linear activation function and a residual 3 × 3 convolution with stride 1, d′ filter channels, and ReLU activation. Between the convolutional blocks, a strided convolution equipped with ReLU activation decreases the spatial resolution of the intermediate embeddings by a factor of 2. After encoding all images in the time series (individually and in parallel), we temporally stack their latent representations to produce a multi-temporal sequence embedding with dimensions (T × D × H/8 × W/8), with D the channel depth of that embedding.
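As an illustration, the encoding block and the per-frame processing can be sketched as follows in PyTorch (module names and channel counts are our own choices, not the released implementation):

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """3x3 conv + ReLU, followed by a residual 3x3 conv + ReLU (assumes d == d_out)."""
    def __init__(self, in_ch: int, d: int, d_out: int):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch, d, 3, padding=1), nn.ReLU())
        self.conv2 = nn.Sequential(nn.Conv2d(d, d_out, 3, padding=1), nn.ReLU())

    def forward(self, x):
        x = self.conv1(x)
        return x + self.conv2(x)            # residual connection

# Strided 3x3 convolution + ReLU halves the spatial resolution between blocks.
def downsample(ch: int) -> nn.Module:
    return nn.Sequential(nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())

def encode_frames(encoder: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Encode every frame of a (B, T, C, H, W) series independently, then re-stack."""
    b, t, c, h, w = x.shape
    z = encoder(x.reshape(b * t, c, h, w))  # fold time into the batch dimension
    return z.view(b, t, *z.shape[1:])       # e.g., (B, T, D, H/8, W/8) after 3 downsamplings
```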
2) Temporal encoder: The temporal encoder operates individually on the spatial locations (low-resolution "pixels") of the latent embedding. For each such pixel, it captures the pair-wise dependencies between the values in all pairs of different frames and uses them to fill in missing information. The temporal encoder is based on the Lightweight Temporal Attention Encoder (L-TAE) of [2], which, in turn, is a simplified version of the multi-head self-attention mechanism of the transformer architecture [72]. Unlike [2], we employ data-driven queries to preserve the temporal dimension of the input. Moreover, we use residual skip connections as in the original transformer model [72]. We retain the computational simplifications of [2] and use a channel grouping, where the G attention heads process mutually exclusive subsets of D/G channels of the embedding. The learned attention scores are directly applied to the embedding vectors that come from the encoder (without first modulating them with a fully-connected layer). Following recent findings about neural sequence-to-sequence models, we prefer pre-normalization with the groupnorm scheme [73]. Furthermore, we employ GELU activations [74] rather than the classical ReLU activations in the multi-layer perceptron (MLP) of the attention block.
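The following sketch illustrates this grouped temporal self-attention in simplified form (our own, reduced re-implementation; the exact placement of normalization layers and projections in U-TILISE may differ):

```python
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Simplified grouped temporal self-attention over a (B, T, D, h, w) embedding.

    Each of the G heads attends over time within its own group of D/G channels,
    with data-driven queries/keys; the scores are applied directly to the grouped
    input values, as in the text above.
    """
    def __init__(self, d: int = 128, n_heads: int = 4, d_k: int = 4):
        super().__init__()
        self.g, self.d_k = n_heads, d_k
        self.norm = nn.GroupNorm(n_heads, d)               # pre-normalization
        self.to_q = nn.Linear(d // n_heads, d_k)           # data-driven queries
        self.to_k = nn.Linear(d // n_heads, d_k)
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, z):                                  # z: (B, T, D, h, w)
        b, t, d, h, w = z.shape
        x = self.norm(z.reshape(b * t, d, h, w)).view(b, t, d, h, w)
        x = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, self.g, d // self.g)
        q, k = self.to_q(x), self.to_k(x)                  # (B*h*w, T, G, d_k)
        att = torch.einsum('ntgc,nsgc->ngts', q, k) / self.d_k ** 0.5
        att = att.softmax(dim=-1)                          # weights over input frames
        out = torch.einsum('ngts,nsgc->ntgc', att, x)      # weighted grouped values
        out = out.reshape(b, h, w, t, d).permute(0, 3, 4, 1, 2)
        out = out + z                                      # residual connection
        out = out + self.mlp(out.permute(0, 1, 3, 4, 2)).permute(0, 1, 4, 2, 3)
        att = att.view(b, h, w, self.g, t, t).permute(0, 3, 4, 5, 1, 2)
        return out, att                                    # att: (B, G, T_out, T_in, h, w)
```

In this sketch, the returned attention tensor is what gets reused to weight the skip connections, as described below.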
3) Spatial decoder: After the latent representation has been passed through the attention module, the spatial decoder progressively restores multi-spectral images from the individual per-image embeddings. These images have the same spectral and spatial resolution as the input to the network but no more missing values. The structure of the spatial decoding blocks is the same as for the spatial encoding blocks, except that fractionally strided, transposed convolutions with stride 1/2 replace the strided convolutions. Once the native spatial resolution of the input has been reached, a final convolutional block maps the latent embedding to the spectral space. The final layer uses sigmoid activations instead of ReLU, so as to regress reflectances in the range [0, 1]. Finally, the reconstructed frames are stacked along the temporal dimension to recover the complete, gap-filled time series.
4) Skip connections: Skip connections from encoder to decoder levels of equal spatial resolution are a key component of the U-Net [69] architecture to propagate high-frequency details and localization information that is lost during spatial downsampling operations. We adopt the same strategy as in [4] and temporally weight the information transferred between corresponding layers of the spatial encoder and decoder. The attention masks learned by the temporal encoder serve as weights, which we spatially upsample to the adequate spatial resolution using bilinear interpolation. The temporally weighted output of the encoding layers is processed with a shared 1 × 1 convolutional layer followed by ReLU activation before channel-wise concatenation with the output of the corresponding decoding layers for further processing.
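A sketch of this temporal re-weighting, assuming the attention tensor layout of the sketch above and averaging the heads for simplicity (U-TILISE instead assigns heads to channel groups):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def weighted_skip(att: torch.Tensor, feat: torch.Tensor, proj: nn.Module) -> torch.Tensor:
    """Temporally re-weight encoder features with upsampled attention masks.

    att:  attention masks of shape (B, G, T_out, T_in, h_low, w_low).
    feat: encoder features at the current skip level, shape (B, T_in, D, H, W).
    proj: shared 1x1 convolution + ReLU applied to the weighted features.
    """
    b, t_in, d, h, w = feat.shape
    t_out = att.shape[2]
    a = att.mean(dim=1)                                    # (B, T_out, T_in, h_low, w_low)
    a = F.interpolate(a.flatten(1, 2), size=(h, w),        # bilinear spatial upsampling
                      mode='bilinear', align_corners=False).view(b, t_out, t_in, h, w)
    # Weighted sum over the input frames, separately for every output frame.
    out = torch.einsum('btshw,bsdhw->btdhw', a, feat)      # (B, T_out, D, H, W)
    return proj(out.flatten(0, 1)).view(b, t_out, d, h, w)

# Example shared projection for a skip level with D = 64 channels (our assumption):
proj = nn.Sequential(nn.Conv2d(64, 64, kernel_size=1), nn.ReLU())
```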

D. Sinusoidal positional encoding
By itself, the self-attention mechanism is agnostic to the sequence order. To provide positional information, we follow the standard procedure for transformers [72] and add a positional encoding (PE) to the input of the temporal encoder before applying self-attention. PE(t, k) consists of fixed sinusoidal functions with predefined wavelengths and describes the position of the t-th observation in the sequence, with D the channel depth of the embedding and k the channel coordinate of the positional encoding. We set τ = 1000, as in [2]. Contrary to [72], we do not directly encode the ordinal position t in the sequence. Instead, we encode the observation date day(t), expressed as the number of days since the 1st of January of the respective calendar year. This strategy has proved beneficial for learned time series processing [2], [75], since it preserves information about seasonal patterns (e.g., lighting conditions or phenology of the vegetation) and accounts for irregular temporal sampling.
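A sketch of the day-of-year encoding, following the standard sinusoidal formulation of [72] with base τ = 1000 (the exact channel indexing is our assumption):

```python
import torch

def positional_encoding(days: torch.Tensor, d: int = 128, tau: float = 1000.0) -> torch.Tensor:
    """Sinusoidal encoding of acquisition dates.

    days: (T,) tensor with the acquisition date of every frame, expressed as the
          number of days since the 1st of January of the respective calendar year.
    Returns a (T, d) tensor that is added to the input of the temporal encoder.
    """
    days = days.to(torch.float32)
    k = torch.arange(0, d, 2, dtype=torch.float32)          # even channel indices
    freq = days[:, None] / tau ** (k[None, :] / d)          # (T, d/2)
    pe = torch.zeros(len(days), d)
    pe[:, 0::2] = torch.sin(freq)
    pe[:, 1::2] = torch.cos(freq)
    return pe
```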

E. Training and inference
It is not possible to quantitatively assess the performance of time series imputation for real data gaps due to the lack of ground truth reflectances. Therefore, we train and evaluate U-TILISE with simulated data (generated by masking cloud-free frames with real cloud masks taken from other sequences, cf. Section IV-C) and examine its capability to generalize to sequences with actual data gaps due to real clouds. We train U-TILISE in a supervised manner by minimizing the pixel-wise absolute differences between the imputed time series Ŷ and the corresponding ground truth time series Y, i.e., an L1 loss averaged over the N training sequences, where T_i denotes the temporal length of the i-th sequence. We train U-TILISE for a fixed temporal length of T = 10, i.e., the input is a time series that comprises at most 10 images. Shorter sequences are padded with no-data frames. During training, longer sequences are randomly cropped if T_i > T. At test time, we retain the full time series and process it in one shot if T_i ≤ T, or in sliding window fashion if T_i > T.
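The objective and the temporal length handling can be sketched as follows (the uniform averaging over frames, bands, and pixels is our assumption; padding uses the no-data value 1, consistent with the masking convention of Section III-B):

```python
import torch

def l1_loss(y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Pixel-wise absolute difference between imputed and ground truth series.

    y_hat, y: tensors of shape (T, C, H, W); the mean runs over all frames,
    bands, and pixels (averaging convention assumed here).
    """
    return (y_hat - y).abs().mean()

def pad_or_crop(x: torch.Tensor, t_max: int = 10) -> torch.Tensor:
    """Bring a training sequence to the fixed temporal length T = 10."""
    t = x.shape[0]
    if t < t_max:                                       # pad with no-data frames
        pad = torch.ones(t_max - t, *x.shape[1:], dtype=x.dtype)
        return torch.cat([x, pad], dim=0)
    start = torch.randint(0, t - t_max + 1, (1,)).item()
    return x[start:start + t_max]                       # random temporal crop
```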

IV. DATA
We evaluate our method on EarthNet2021, a large, publicly available dataset of Sentinel-2 time series. Note, however, that our method is sensor-agnostic and can adapt to the properties of different multi-spectral imaging sensors, given suitable training data.

A. EarthNet2021 dataset
The EarthNet2021 [37] dataset was originally designed for satellite image forecasting, conditioned on future meteorological variables. It includes more than 32 000 Sentinel-2 time series collected over Central and Western Europe from November 2016 to May 2020. Each time series consists of 30 images with Level-1C top-of-atmosphere (TOA) reflectances. The images are acquired at a regular temporal interval of five days, where acquisition dates without an observation are encoded as images of NaN values. Every image is composed of the four spectral bands B2 (blue), B3 (green), B4 (red), and B8 (near-infrared) and covers a spatial extent of 128×128 pixels (2.56 × 2.56 km in scene space), resampled to a resolution of 20 m. For every observation, the dataset further includes a pixel-wise cloud probability map obtained via the S2Cloudless algorithm [76] and a binary cloud and cloud shadow mask based on heuristic rules similar to [30].
In our experiments, we reserve ≈ 20% of the training sequences for validation, where training and validation tiles are mutually exclusive. For testing, we use the iid and ood test splits. Time series in the iid test split stem from the same Sentinel-2 tiles as the training data and the ones in the ood test split from previously unseen locations.

B. Preprocessing
We adopt the preprocessing protocol of prior works [37] and value-clip the optical images to the range [0, 10 000], followed by normalization to the unit range of [0, 1].
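A minimal sketch of this normalization, assuming reflectances stored as digital numbers scaled by 10 000:

```python
import numpy as np

def normalize_s2(img: np.ndarray) -> np.ndarray:
    """Clip TOA reflectances (scaled by 10 000) to [0, 10 000] and rescale to [0, 1]."""
    return np.clip(img.astype(np.float32), 0.0, 10_000.0) / 10_000.0
```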
When training a system for cloud removal that regresses time series rather than a single image, we found experimentally that pixel-wise supervision for every spatio-temporal location is crucial for learning seasonal changes and land surface dynamics over time. Since obtaining such ground truth for time series with real data gaps is impossible, we resort to cloud-free time series and introduce synthetically generated data gaps during training and evaluation, as described in Section IV-C. Starting from a time series with real data gaps, we first identify all images with partially occluded pixels or images that are occluded/missing entirely by applying a threshold to the cloud probability maps (if available) or the binary cloud masks. We choose the threshold in a conservative manner to minimize the number of undetected data gaps. We then remove all images with data gaps to produce cloud-free time series that exhibit a valid observation for every spatio-temporal location. Second, we discard time series with fewer than five remaining images, as we deem such sequences too short for learning spatio-temporal patterns. See the appendix for a summary of the number of time series, their temporal lengths, and the temporal resolution before and after filtering images with data gaps.

C. Simulation of data gaps
Realistically simulating cloud cover in satellite images is notoriously difficult. Synthetic images generated with existing physics-inspired simulation methods, like the well-known Perlin noise model [55], do not match the radiometry of real data well enough: it has been shown that cloud removal methods trained with such synthetic images do not generalize well to images containing actual clouds [56]. To create time series with artificial data gaps for training and evaluation, we thus refrain from rendering synthetic clouds. Instead, we adopt a strategy commonly employed for image inpainting [77]-[79] and completely mask out invalid pixels by setting them to the maximum intensity value 1. In this way, one only has to realistically simulate binary cloud masks, which is straightforward: all one needs to do is randomly sample real cloud masks from other acquisition times and/or locations within the same Sentinel-2 tile and apply them to a gap-free time series. With that strategy, we obtain time series with data gaps of realistic shapes and sizes and with known ground truth reflectances at all masked pixels, in a controlled, fully automatic manner.
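A sketch of this mask sampling strategy (function names and the exact frame sampling are our assumptions; the fractions follow the setup in Section V-A):

```python
import random
import torch

def simulate_gaps(x: torch.Tensor, mask_bank: list, max_frac: float = 0.5):
    """Superimpose real cloud masks onto a cloud-free time series.

    x:         cloud-free series of shape (T, C, H, W), reflectances in [0, 1].
    mask_bank: list of real binary cloud masks of shape (1, H, W), float tensors
               with 1 = valid and 0 = cloudy, drawn from other acquisitions.
    At most `max_frac` of the frames are masked, with a minimum of one.
    """
    t = x.shape[0]
    n_masked = random.randint(1, max(1, int(max_frac * t)))
    frames = random.sample(range(t), n_masked)
    m = torch.ones(t, 1, *x.shape[2:])
    for f in frames:
        m[f] = random.choice(mask_bank)
    x_masked = x * m + (1.0 - m)        # masked pixels set to the maximum intensity 1
    return x_masked, m
```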

V. EXPERIMENTS

A. Setup
Sequence-to-point methods typically use a temporally close, cloud-free image of the same location to quantify the fidelity of the synthesized output. [31], one of the few existing sequence-to-sequence approaches, regresses a time series of multi-spectral Sentinel-2 images but restricts the evaluation to a single time step, namely, the one with the lowest cloud cover. Such single-frame evaluation protocols are, in our view, problematic. On the one hand, they do not actually measure the quality of the regressed time series, as the temporal aspect is completely ignored. On the other hand, metrics computed only from the least cloudy image will likely be too optimistic, since the imputation task becomes more challenging with increasing occlusions.
We argue that the evaluation should take into account all images with missing pixels, irrespective of the degree of cloud cover. As ground truth reflectances are unavailable for time series with real data gaps, we quantitatively evaluate U-TILISE on cloud-free sequences with synthetically added data gaps. Additionally, we apply the learned model (without further fine-tuning) to time series featuring real data gaps to qualitatively assess performance in the true application setting.
We follow the procedure described in Sections IV-B and IV-C to generate time series with synthetic data gaps. Unless stated otherwise, we randomly trim the cloud-filtered time series to a maximum length of T = 10 images during training. At test time, we process the full-length sequences in sliding window fashion if their length exceeds T. To simulate data gaps, we randomly superimpose at most 50% of the images per time series with cloud (and cloud shadow) masks drawn randomly from the dataset, with a minimum of one masked image per sequence. When processing sequences with real data gaps, the actual cloud masks are used to mark missing pixels.

B. Implementation details
U-TILISE is implemented in PyTorch [80]. For training, we use an NVIDIA GeForce RTX 2080 Ti GPU for time series with four spectral bands and an NVIDIA TITAN RTX GPU for time series with 13 spectral bands. Source code and pretrained models are available at https://github.com/prs-eth/U-TILISE.
In our experiments, we use 64 filter channels for the spatial encoding and decoding convolutional layers, except at the lowest spatial resolution, where we use 128 filter channels. Accordingly, the temporal encoder has a latent feature dimension of 128. Temporal self-attention employs G = 4 heads, with a dimension of four for the data-driven keys and queries. To augment the training data, we randomly rotate all images in a sequence by α ∈ {0°, 90°, 180°, 270°} and randomly flip them along the x and y axes.
We train with the Adam optimizer [81] with hyper-parameters {β1 = 0.9, β2 = 0.999}, a batch size of three, and no weight decay. During the first 250 epochs, the initial base learning rate of 2·10⁻⁴ is halved every 50 epochs. Training is stopped once the L1 loss (cf. Section III-E) on the validation set has converged. In our experiments, this took about 1 000 epochs, or 20 days of training on a single GPU. The computational cost for applying the trained model is low: the forward pass for a time series consisting of (at most) 10 images takes ≈ 0.02 seconds. Longer sequences are processed in sliding window fashion, which on average takes 0.13 seconds for a 30-frame sequence.

C. Evaluation metrics
We adopt a suite of metrics commonly used to evaluate cloud removal and inpainting methods: mean absolute error (MAE), root mean square error (RMSE), spectral angle mapper (SAM) [82], peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM) [83]. In the following, Ŷ_i denotes the i-th predicted time series with C spectral bands and T_i frames, Y_i the corresponding ground truth time series, and (x, y) and t the spatial and temporal coordinates.
MAE, RMSE, and PSNR are popular metrics that quantify the pixel-wise reconstruction error. MAE and RMSE are expressed relative to the TOA reflectance range ρ_TOA (recall that reflectances have been rescaled to ρ_TOA ∈ [0, 1]). PSNR is given in decibels (dB). SAM measures the spectral fidelity of the reconstructed pixels, defined as the average angle (in degrees) between predicted and ground truth spectral vectors. SSIM is a unitless per-image metric that measures the overall structural similarity between a reconstructed image and the corresponding ground truth. We compute the pixel-based metrics over all imputed pixels (according to the input mask). Similarly, to compute the average SSIM, we only take images into account that contain masked pixels, denoted as T_i^m. To generate the cloud-free reference time series for evaluation, we define a global threshold on the cloud probability maps (if available) or the binary cloud masks. That threshold may not be ideal for every single image; consequently, small clouds or haze may remain undetected. To alleviate the impact of these remaining data gaps (in the ground truth) on the quantitative analysis, we compute the pixel-based metrics only for those imputed pixels (t, x, y) for which a cloud-free reference reflectance is available according to the original cloud masks, denoted as Ω_i for the i-th time series.
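For reference, the pixel-based metrics in their standard form, restricted to the imputed pixels Ω_i with a cloud-free reference, read as follows (a sketch of the usual definitions; the exact averaging conventions of the paper may differ slightly). SSIM is computed per image in the standard way and averaged over the T_i^m images that contain masked pixels.

```latex
\begin{align}
\mathrm{MAE}  &= \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|\Omega_i|\,C}
   \sum_{(t,x,y)\in\Omega_i}\sum_{c=1}^{C}
   \bigl|\hat{Y}_i(t,c,x,y)-Y_i(t,c,x,y)\bigr|,\\
\mathrm{RMSE} &= \frac{1}{N}\sum_{i=1}^{N}\sqrt{\frac{1}{|\Omega_i|\,C}
   \sum_{(t,x,y)\in\Omega_i}\sum_{c=1}^{C}
   \bigl(\hat{Y}_i(t,c,x,y)-Y_i(t,c,x,y)\bigr)^{2}},\\
\mathrm{SAM}  &= \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|\Omega_i|}
   \sum_{(t,x,y)\in\Omega_i}\arccos
   \frac{\langle \hat{Y}_i(t,\cdot,x,y),\,Y_i(t,\cdot,x,y)\rangle}
        {\lVert \hat{Y}_i(t,\cdot,x,y)\rVert_2\,\lVert Y_i(t,\cdot,x,y)\rVert_2},\\
\mathrm{PSNR} &= 20\,\log_{10}\!\bigl(1/\mathrm{RMSE}\bigr).
\end{align}
```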

D. Baseline methods
We compare U-TILISE against several natural baselines that operate independently on every spatial location over time. The simplest baseline (last) imputes missing pixels by copying the last valid observation before the current frame. The next baseline (closest) copies either the last preceding or the next following observation, depending on which one is closer in time. The third baseline linearly interpolates between the last and next valid observations (according to the absolute time span in days), an approach frequently employed in operational practice [84].
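The linear interpolation baseline can be sketched as follows (a slow but explicit per-pixel reference implementation; names are ours):

```python
import numpy as np

def interpolate_gaps(x: np.ndarray, valid: np.ndarray, days: np.ndarray) -> np.ndarray:
    """Per-pixel linear interpolation over time between valid observations.

    x:     time series of shape (T, C, H, W).
    valid: boolean mask of shape (T, H, W), True where the pixel is observed.
    days:  acquisition day of every frame, shape (T,), monotonically increasing.
    Gaps at the sequence boundaries are filled by the nearest valid observation.
    """
    out = x.copy()
    T, C, H, W = x.shape
    for i in range(H):
        for j in range(W):
            v = valid[:, i, j]
            if v.all() or not v.any():
                continue
            for c in range(C):
                # np.interp extrapolates with the edge values, i.e., the nearest
                # valid observation, at the start and end of the sequence.
                out[:, c, i, j] = np.interp(days, days[v], x[v, c, i, j])
    return out
```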

VI. RESULTS

A. Imputation quality of U-TILISE
We begin by evaluating our method on Sentinel-2 time series augmented with synthetic data gaps. Quantitative results are shown in Table I, visual examples are given in Figs. 2 and 3. U-TILISE generates coherent and gap-free optical time series that capture the natural evolution of the depicted land cover. It can handle occlusions of various shapes and sizes, recover images that suffer from severe occlusions, and reconstruct complete time series from input sequences that include multiple consecutive frames of missing data. Note how the model has implicitly learned to adapt to radiometric variations within the time series, such that imputed pixels seamlessly blend into the surrounding image content, even if the frames that show the same regions unoccluded are radiometrically different. Besides naturally adapting reflectance values, U-TILISE is able to recover plausible color transitions and impute missing information in scenes with non-trivial temporal dynamics (Fig. 2, row 5). Furthermore, it preserves high reflectance values not associated with clouds (Fig. 3, row 3). Abrupt scene changes, such as harvested agricultural fields, are sometimes missed (Fig. 3, row 7), which is natural since their exact timing depends on weather conditions that the model has no access to.
Quantitatively, the predicted reflectances agree well with the true values across all optical bands. See Fig. 4. U-TILISE yields a MAE below 1% of the intensity range and a PSNR on the order of 38 dB. Recall, these errors are averaged only over imputed pixels and not inflated by the trivial reconstruction of observed values (cf. Table I, iid test split). The SSIM, averaged over all images with data gaps, amounts to 0.97. Notably, there is only a moderate performance penalty when applying the model to previously unseen locations (cf. Table I, ood test split). The MAE increases by 27% to roughly 0.01 and the SAM by 10% from 1.9 degrees to 2.1 degrees. Yet, with a PSNR of almost 37 dB and a SSIM of 0.96, this still amounts to high fidelity and visual quality.

B. Comparison to baselines
Table I compares U-TILISE against the three baselines. We first discuss the performance on iid test sequences, corresponding to the imputation of new sequences acquired at previously seen locations. As expected, the last baseline, which copies the last valid observation, yields the largest reconstruction errors. Cloning the temporally closest observation instead improves the reconstruction quality and the spectral fidelity markedly, reducing MAE, RMSE, and SAM by more than 10% and increasing PSNR by about 1 dB; mostly because the last heuristic degrades for longer data gaps where the same location is repeatedly occluded. Linear interpolation between the most recent and the next available observation brings further gains. MAE, RMSE, and SAM drop by another ≈ 14%, and PSNR improves by 1.5 dB to 36.0 dB. U-TILISE consistently outperforms all baselines by a significant margin. Compared to linear interpolation, the MAE, RMSE, and SAM values decrease by 20%, and the PSNR increases by more than 1.5 dB. We observe similar trends when evaluating on the ood test set of previously unseen locations. We note in passing that the ood test set is objectively more difficult: all methods perform slightly worse on it, although the baselines do not involve any learning and can, by definition, not overfit to specific geographic locations.

U-TILISE predicts a reflectance value for every spatio-temporal location in the output sequence, including pixels with valid input observations. In principle, those predicted values could deviate from the actual, observed values at cloud-free pixels. For a complete evaluation, we thus also evaluate the spectral fidelity at pixels with valid input reflectances (Table I, columns 8-9). By construction, all baselines achieve the same, maximal performance, since they do not alter valid reflectance values. U-TILISE, on the other hand, must learn to preserve the reflectances at unoccluded (spatio-temporal) locations. It does that astonishingly well, with a MAE around 1/5000 of the intensity range, less than the radiometric sensitivity of Sentinel-2. The estimates of U-TILISE are also qualitatively superior to those of the baselines, especially in the presence of significant spectral changes over time (cf. Fig. 2) and of remaining atmospheric effects (cf. Fig. 3, rows 1-4). A striking example is the southernmost, circular field in Fig. 2, for which U-TILISE smoothly transitions from dark to bright brown and then changes abruptly to dark green, which agrees well with the true evolution of the depicted scene. In contrast, copying or simple interpolation lead to evident visual artifacts.

C. Learned multi-temporal attention
The attention masks of the temporal self-attention mechanism encode the contribution of each input pixel to the regression of the output pixels. This makes them a useful visual cue to determine on which input frames the learned model bases its predictions. Figs. 5 and 6 depict the attention masks for the first time series in Fig. 3, demonstrating that U-TILISE indeed discovers, in a data-driven manner, which observations are most useful for its task. Not only is the attention low in data gaps (cf. Fig. 6), the model has also learned to preferentially attend to temporally close observations; to use information from unoccluded regions of the current frame, presumably to match its radiometry; and to let the attention heads specialize on different portions of the sequence (cf. Fig. 5).

Fig. 6. Self-attention in the temporal encoder. For the time series in row 1, rows 2-9 show the attention masks for one of the four heads, with rows corresponding to frames of the output (in temporal order). The masks are bilinearly upsampled to the native spatial resolution of the input and color-coded from black (no attention) to yellow (maximum attention). Note how the attention progressively moves through time to focus on unoccluded inputs and unoccluded pixels of partially occluded inputs, while it borrows information from temporally nearby frames where needed.

D. Importance of temporal encoding
We go on to study the influence of different network configurations, starting with the temporal encoder. To this end, we conduct two ablation experiments: (i) we replace the temporally weighted skip connections between corresponding layers of the spatial encoder and decoder with ordinary skip connections, and (ii) we remove the temporal encoder altogether, resulting in a standard U-Net architecture that processes each frame of the input sequence independently.
As expected, we observe a significant deterioration in all evaluation metrics when U-TILISE cannot exploit the time series to fill in missing image content (Table II, rows 2-3). Restoring the temporally weighted skip connections of the full model yields a further gain of more than 20% in terms of SAM (Table II, 1st row), and PSNR improves by another 1.7 dB.

E. Choice of positional encoding
Next, we evaluate the influence of the positional encoding scheme (cf. Section III-D) on the model output. We found experimentally that information about the image order and the relative temporal distance between observations is crucial to reconstruct realistic, gap-free time series. Without any positional encoding, the performance of U-TILISE drops significantly; MAE, RMSE, and SAM increase by ≈ 40% and PSNR drops by almost 3 dB (Table II, 6th row). Injecting ordinal information improves MAE and RMSE by 13% and SAM by 17%, while also improving the perceptual similarity to the target sequence (Table II, 5th row). Encoding the temporal offset from the first observation in the sequence rather than the ordinal position brings another improvement of ≈ 20% in MAE, RMSE, and SAM and boosts PSNR by 2 dB (Table II, 4th row). We find only a tiny difference between encoding the temporal distance to the first observation and encoding the acquisition date of every observation relative to the 1st of January of the respective calendar year. We speculate that the latter strategy suffers from a bias inherent in our training data: most of the EarthNet2021 time series are captured between May and October; likely, the dataset does not offer sufficient variability to extract an expressive prior over seasonal patterns.

F. Influence of input sequence length T
To test the sensitivity of U-TILISE to the number of images processed as one sequence, we define two model variants by varying the temporal window length T. In detail, we retrain U-TILISE once with training sequences that are randomly trimmed to a maximum temporal length of five images (T = 5), and once with time series that comprise up to 15 images (T = 15). At test time, we always process the full-length time series, employing a sliding window scheme if the temporal length T_i is larger than the maximum length T used during training.
We find that all evaluation metrics remain relatively stable when varying the temporal window of U-TILISE (Table II, rows 7-8), with fluctuations below 2% compared to our default setting (T = 10). For a more fine-grained analysis, we thus separately measure the performance for time series consisting of (i) at most 9 images, (ii) 10 to 14 images, or (iii) 15 or more images. As shown in Fig. 8, the MAE at imputed pixels decreases slightly with longer time series and temporal window T, indicating that U-TILISE can exploit long temporal context if needed.

G. Number of attention heads
U-TILISE is fairly robust to varying numbers of temporal attention heads. We find marginal quantitative gains when adding more heads (Table II, rows 9-11), and also only small differences in visual quality (Fig. 7(c)).

H. Time series with real data gaps
In the last experiment, we use the trained U-TILISE model, unaltered, to impute missing pixels in time series with real data gaps. This scenario corresponds to the practical application case, where the masked pixels are truly unobserved. Of course, this also implies that the resulting outputs can only be assessed through visual inspection, since no ground truth exists for the masked areas. Note that the application to real cloudy time series may, from a machine learning perspective, involve some degree of generalization. As clouds do not have sharp boundaries, the unmasked regions just outside the cloud mask may, in some cases, still be affected by thin clouds and haze.
During training, where cloud-free images are synthetically masked, the model has not been exposed to such a situation. Fig. 9 depicts imputation results for two representative 30-frame time series from the EarthNet2021 test set. The original, observed time series suffer from severe data gaps due to clouds, shadows, haze, and missing images. Furthermore, the second example exhibits a rather long period without any valid observations. Despite these challenges, U-TILISE creates realistic, gap-free time series of high visual quality. The reconstructed time series do occasionally suffer from remaining clouds or haze, if these were missed by the cloud masking algorithm (Fig. 9, 5th-to-last frame of the second example).

VII. CONCLUSION
We have presented U-TILISE, a learned sequence-to-sequence model for data imputation in optical satellite image time series. The model combines 2D convolutions over the spatial and spectral dimensions and 1D self-attention across time into an efficient prior over multi-spectral and multi-temporal reflectance patterns. Given an optical time series in which invalid reflectance values are masked, U-TILISE creates a coherent time series with a clean, cloud-free image at every time step of the input.
In a series of experiments, we have shown that U-TILISE reconstructs gap-free Sentinel-2 sequences with high accuracy. It removes clouds and cloud shadows of various shapes and sizes, manages to recover multiple consecutive frames of missing data, and generalizes to previously unseen geographical locations. On the EarthNet2021 dataset, the average MAE within the data gaps is on the order of 1% of the intensity range, and the PSNR is ≈ 38 dB.
A limitation of the current approach is that it relies on cloud (and cloud shadow) masks as auxiliary input. Detecting cloudy pixels in remotely sensed imagery is challenging and still not completely solved. Mistakes of the preceding cloud detector limit the performance of U-TILISE, since it implicitly learns to preserve input reflectances that are valid according to the masks. Furthermore, the model may of course reconstruct radiometrically plausible but incorrect transitions in cases where the available information is too sparse to determine when a sudden change has occurred, such as, for instance, a harvesting event during a multi-frame data gap.
An interesting future research direction is to eliminate the need for external cloud masks and instead design the model such that it implicitly performs the detection of invalid pixels. Another useful extension for users of the U-TILISE output will be to integrate a probabilistic deep learning scheme and supplement the output with spatially and temporally resolved uncertainty estimates.

APPENDIX COMPLEMENTARY EXPERIMENTS
Some authors [31], [61] have advocated the use of co-registered SAR observations, which are largely unaffected by the atmosphere, to support optical image imputation. It is a natural idea to also augment U-TILISE with a time series of spatially and temporally co-registered SAR images. Technically, this is straightforward: we add the two-channel SAR images (ortho-rectified VV/VH log-amplitude) as additional input channels that need not be reconstructed and, accordingly, increase the filter depth of the first layer by two. We train and test this multi-modal variant of U-TILISE in a simulated setting using the SEN12MS-CR-TS dataset [31], a multi-modal and multi-temporal dataset specifically designed for multi-modal cloud removal. Unfortunately, it turns out that the dataset is not only considerably smaller than EarthNet2021 (see Table III), but its sequences also exhibit comparatively lower temporal variability and dynamics and do not allow a conclusive comparison. We refrain from drawing firm conclusions about the impact of complementary SAR observations and instead conclude that a more informative dataset must be created to investigate the issue, possibly by augmenting EarthNet2021 with SAR observations.

SEN12MS-CR-TS dataset
SEN12MS-CR-TS [31] comprises about 15 000 globally sampled Sentinel-2 time series (Level-1C with top-of-atmosphere reflectances) from 2018 with a spatial extent of 256×256 pixels (2.56 × 2.56 km in scene space). Each time series contains 30 images, with varying temporal spacing of 5 to 15 days between consecutive observations. The images encompass all 13 spectral bands, upsampled to 10 m resolution. Furthermore, every optical image is paired with a spatially co-registered, temporally close (but not synchronous) C-band SAR image with two channels representing the σ⁰ backscatter coefficients in the VV and VH polarizations, in units of decibels (dB). The temporal offset between paired optical and SAR observations is three days on average, although, for 5.5% of the pairs, the temporal difference is over one week. The dataset also includes pixel-wise cloud probabilities and binary cloud masks, produced with the S2Cloudless detector [76].

Experimental setup
We adopt the preprocessing protocol of [31], [32] and value-clip the optical images to the range [0, 10 000] and the SAR images to [−25, 0], followed by normalization to the unit range [0, 1]. As for EarthNet2021, we extract cloud-free optical time series for training and evaluation (cf. Section IV-B). In rare cases, SEN12MS-CR-TS sequences exhibit data gaps of several consecutive months. To limit potential land cover and seasonal changes to a reasonable range, we temporally trim the time series such that the temporal spacing between adjacent valid frames is at most four weeks. The resulting time series are, on average, shorter than those of EarthNet2021, and they have about 50% larger temporal spacing between consecutive frames (cf. Table III).
After removing real data gaps, we introduce synthetic gaps into the optical images (cf. Section IV-C) and concatenate the resulting, masked optical time series with the (unmodified) SAR time series along the channel dimension to produce the multi-modal input for U-TILISE.
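The resulting multi-modal input is a simple channel-wise concatenation (a sketch; tensor names are ours):

```python
import torch

def build_multimodal_input(opt_masked: torch.Tensor, sar: torch.Tensor) -> torch.Tensor:
    """Concatenate masked optical and SAR series along the channel dimension.

    opt_masked: (T, 13, H, W) optical series with synthetic gaps set to 1.
    sar:        (T, 2, H, W) co-registered VV/VH log-amplitudes, rescaled to [0, 1].
    Returns a (T, 15, H, W) tensor; only the 13 optical bands are reconstructed.
    """
    return torch.cat([opt_masked, sar], dim=1)
```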

Training details
We use the Adam optimizer [81] with hyper-parameters {β1 = 0.9, β2 = 0.999}, batch size 3, and a weight decay of 10⁻⁵. The base learning rate of 2·10⁻⁴ is reduced by 50% every 80 training epochs. Due to the larger spatial dimensions of the input time series (256×256 pixels, compared to 128×128 pixels in EarthNet2021), we add an additional convolutional block in the spatial encoder and decoder, such that the (spatial) dimension of 16×16 pixels in the bottleneck is the same as for EarthNet2021.

Results
We provide quantitative results in Table IV and a visual example in Fig. 10. As already observed with EarthNet2021, linear interpolation between the most recent and the next available observation performs significantly better than the more widely used baselines that replicate either the last or the temporally closest observation. U-TILISE achieves marginally better error metrics with SAR guidance than without, but the differences (< 0.05% of the intensity range in MAE and RMSE, < 0.2° in SAM, < 0.2% in SSIM) are negligible and well within the stochastic fluctuations of deep network training. Moreover, the linear interpolation baseline is on par with both variants, and all three results remain well below the fidelity achieved on EarthNet2021 (Table I), although the numbers are not directly comparable since SEN12MS-CR-TS includes 13 bands of which 9 have been upsampled to 10 m GSD, whereas EarthNet2021 consists of 4 bands that were downsampled to 20 m GSD. Upon inspection, we find that SEN12MS-CR-TS contains many sequences where the land cover is static and largely homogeneous. The results we obtain on SEN12MS-CR-TS neither confirm nor rule out a possible benefit through SAR guidance. We believe that a larger dataset with more non-linear temporal variations will be needed to carry out a conclusive comparison.

Fig. 2. Visual comparison of U-TILISE with selected baselines for a Sentinel-2 time series of the EarthNet2021 iid test split with predominantly gradually changing land cover. We show the true-color RGB composite for every image in the time series. The number above an image denotes its temporal distance to the previous observation in the sequence or its MAE, evaluated over all pixels that have been masked in the corresponding input image and across all spectral bands (R, G, B, and NIR).

Fig. 3. Visual comparison (true-color RGB composites) of two Sentinel-2 time series of the EarthNet2021 iid test split, gap-filled using either U-TILISE or linear interpolation over time. Rows 1-4 depict a static scene, while rows 5-8 show a scene with sudden land cover changes due to agricultural activities by humans. The number above an image indicates its temporal distance to the previous observation in the sequence or its MAE, evaluated over all pixels that have been masked in the corresponding input image and across all spectral bands (R, G, B, and NIR).

Fig. 5. Self-attention in the temporal encoder. We show the attention scores for imputing the 5th image in the example time series (highlighted with a red frame in row 1), displayed separately for each of the four attention heads (rows 2-5). The attention masks are bilinearly upsampled to the native spatial resolution of the input time series and color-coded from black (no attention) to yellow (maximum attention).

Fig. 7. Visual comparison of different U-TILISE variants for the 8th time step in the sequence from Fig. 2 (the 2nd totally masked image). (a) U-TILISE with ordinary skip connections; (b) U-TILISE with default parameter settings; (c) U-TILISE with 16 attention heads; (d) ground truth.

Fig. 8. MAE as a function of the full sequence length T_i and the temporal window T of U-TILISE.

Fig. 9. Imputation results for two time series of the EarthNet2021 dataset with real data gaps due to clouds, cloud shadows, and missing frames (shown in black). Example 1 is from the ood test split and example 2 from the iid test split.

Fig. 10. Visual comparison of two U-TILISE variants on the SEN12MS-CR-TS dataset. On the left are the inputs for an exemplary sequence. For multi-spectral data, the RGB channels are displayed; SAR data are rendered as two-channel image composites (VV/VH amplitude). Numbers above the inputs indicate the temporal spacing from the preceding image. U-TILISE predictions from only the optical time series (top) and from combined optical and SAR input (bottom) are visually indistinguishable. Numbers above the predictions are mean absolute errors over all masked pixels, across all 13 spectral bands.

ACKNOWLEDGMENT
This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via Contract #2021-21040700001. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.


TABLE I. Quantitative comparison of different imputation methods for time series of the EarthNet2021 dataset, separately evaluated for geographic locations similar to the training set (iid) and for previously unseen locations (ood). Columns 3-7 report metrics computed over all pixels (or images, in the case of SSIM) with missing input data, whereas columns 8-9 measure the quality of output pixels (or images) with valid input values.

TABLE II. Quantitative results of different U-TILISE variants. We ablate the influence of the temporal encoder and temporally weighted skip connections in the spatial U-Net (rows 2-3), the strategy used for positional encoding (rows 4-6), the number of images T processed in one shot (rows 7-8), and the number of attention heads G of the temporal encoder (rows 9-11). The metrics are computed over all pixels (or images, in the case of SSIM) with missing data in the input time series.

TABLE III. Acquisition details of the two Sentinel-2 datasets. We list the number of sequences, their average length, and the temporal resolution, separately for the training, validation, and test parts (for EarthNet2021, further divided into iid and ood test sets). We also show the statistics for the simulated training sequences, after removing images with actual clouds.

TABLE IV. Quantitative comparison of U-TILISE with conventional baselines for the SEN12MS-CR-TS dataset. The metrics are computed over all pixels (or images, in the case of SSIM) with missing data in the input time series. We train and evaluate U-TILISE once only with optical time series and once with additional SAR input.