Robust Spatiotemporal Fusion of Satellite Images: A Constrained Convex Optimization Approach

This article proposes a novel spatiotemporal (ST) fusion framework for satellite images, named robust optimization-based ST fusion (ROSTF). ST fusion is a promising approach to resolve a tradeoff between the temporal and spatial resolution of satellite images. Although many ST fusion methods have been proposed, most of them are not designed to explicitly account for noise in observed images, despite the inevitable influence of noise caused by the measurement equipment and environment. Our ROSTF addresses this challenge by formulating noise removal and ST fusion as a unified optimization problem. First, we define observation models for satellite images that may be contaminated with random noise, outliers, and/or missing values. Next, we introduce certain assumptions that naturally hold between the observed images and the target high-resolution image. Then, based on these models and assumptions, we formulate the fusion problem as a constrained optimization problem and develop an efficient algorithm based on a preconditioned primal–dual splitting method (P-PDS) for solving the problem. The performance of ROSTF was verified using simulated and real data. The results show that ROSTF performs comparably to several state-of-the-art ST fusion methods in noiseless cases and outperforms them in noisy cases.


I. INTRODUCTION
The analysis of temporal image series is necessary and important in many remote sensing applications, such as vegetation/crop monitoring and estimation [1], evapotranspiration estimation [2], atmosphere monitoring [3], land-cover/land-use change detection [4], surface dynamic mapping [5], ecosystem monitoring [6], soil water content analysis [7], and detailed analysis of human-nature interactions [8]. These applications require time series of high spatial resolution images to properly model the ground surface. In addition, time series of high temporal resolution images are also needed to capture the changes in the ground surface that occur over short periods of time.
However, there is a trade-off between the temporal and spatial resolution of satellite sensors, and no single sensor can satisfy both requirements. For example, the Landsat sensors can acquire images with a high spatial resolution of 30 m, but they have a revisit period of up to 16 days. On the other hand, the Moderate Resolution Imaging Spectroradiometer (MODIS) sensors can acquire images of the same scene at least once per day, but the images are at a low spatial resolution of 500 m [9]. Therefore, the simultaneous acquisition of image series with both high spatial and high temporal resolution is a major challenge in the remote sensing community [10]. A direct solution to this challenge would be to estimate an unobserved high spatial resolution (HR) image from the single corresponding low spatial resolution (LR) image by super-resolution [11], [12]. However, this is too difficult because the spatial resolution gap between the two satellite images is often quite large.
Spatiotemporal fusion (ST fusion) addresses this challenge by referring to pairs of HR and LR images on reference dates that are temporally close to the target date. Specifically, the unobserved HR image on the target date is estimated by combining detailed spatial structure extracted from the HR images on the reference dates and spectral changes captured from the differences between the LR images on the reference and target dates. In ideal situations where a large number of reliable reference images are available, accurate ST fusion would be easy to achieve because the correct spatial structure and spectral changes are readily available. In real-world applications, however, such situations are very rare. Therefore, to achieve the desired ST fusion in real-world applications, the following two requirements are important.
• Minimal reference dates: ST fusion methods that can be used with minimal reference dates are preferable. In many remote sensing applications, only one pair on a reference date may be available due to cloud contamination, time inconsistency of image acquisition, or other reasons. In addition, in some cases, it is very time-consuming to prepare another pair of images. Therefore, ST fusion methods using a single pair of HR and LR images on a reference date are applicable to many more cases than those using multiple pairs, although such a situation is obviously challenging [13].
• Robustness to noise: Due to the measurement equipment and/or environment, satellite images are often contaminated with various types of noise, such as random noise, outliers, and missing values [14], [15]. An estimated HR image that remains noisy would significantly affect subsequent processing. Therefore, it is crucial to develop a method that is robust to noise.

A. Prior Research
Many ST fusion methods have been proposed over the past decades. They are roughly categorized into five groups [16]: unmixing-based, weight-function-based, Bayesian-based, learning-based, and hybrid methods. Unmixing-based methods estimate the pixel values of an HR image through unmixing the pixels of the input LR images based on the linear spectral mixing theory [17], [18]. Weight-function-based methods generate an HR image by combining information of all the input images based on weight functions [9], [19]. Bayesian-based methods use Bayesian estimation theory to fuse the input images in a probabilistic manner [20], [21]. In the Bayesian framework, ST fusion can be considered as a maximum a posteriori (MAP) problem, i.e., the goal is to obtain an HR image on the target date by maximizing its conditional probability relative to the input HR and LR images. Learning-based methods model the relationship between observed HR and LR image pairs and then predict an unobserved HR image using machine learning algorithms such as sparse representation learning [22], [23], random forest [24], and neural networks [25]–[31]. Hybrid methods integrate two or more techniques from the above four categories [32], [33].
Some of the unmixing-based, weight-function-based, and Bayesian-based methods allow an arbitrary number of reference dates, i.e., they can handle the cases with a single reference date. However, they are sensitive to noise because they estimate an HR image at the pixel level based entirely on reference images, which may be noisy. Thus, if the input images are noisy, the output image will be severely degraded.
On the other hand, in the context of learning-based methods, a robust spatiotemporal fusion network (RSFN) [28] has been established to account for Gaussian noise. RSFN automatically filters noise and prevents predictions from being corrupted by using convolutional neural networks (CNNs), generative adversarial networks (GANs), and an attention mechanism. Specifically, RSFN uses the attention mechanism to ignore noisy pixels in two reference HR images and focus on clean pixels. This method only works in situations where noisy pixels in the two reference HR images do not appear at the same location. In real-world measurements, however, such situations are rare because noise in general contaminates the entire image, not just parts of it. Furthermore, as mentioned above, RSFN requires two reference dates.

B. Contributions and Paper Organization
Now, a natural question arises: How can robust ST fusion be achieved with only a single reference date? In this paper, we propose a novel ST fusion framework for satellite images, named robust optimization-based ST fusion (ROSTF), to estimate an HR image on the target date while simultaneously denoising an HR image on a single reference date.
Before formulating our optimization problem, we newly define two observation models (they will be detailed in Sec. III-A).
• The first model describes the relationship between an observed noisy image and the oracle noiseless image. The model is designed under the assumption that the observed images are contaminated not only with random noise but also with outliers and missing values. Specifically, random noise is modeled by Gaussian noise, while outliers and missing values are modeled by sparse noise.
• The second model represents the relationship between an HR image and the corresponding LR image, based on a super-resolution model [34].
We also introduce the following two assumptions about satellite images (they will be detailed in Sec. III-B).
i) The reflectance may change significantly between the reference and target dates, but the land structure (the locations of the edges) does not. This is a very natural assumption in the context of ST fusion.
ii) An HR image and the corresponding LR image have similar brightness. This assumption is necessarily true if the HR and LR sensors have similar spectral resolution, such as Landsat and MODIS [9].
Based on the observation models and assumptions, we formulate the fusion problem as a constrained convex optimization problem and develop an efficient algorithm based on the preconditioned primal-dual splitting method (P-PDS) [35] with an operator-norm-based design method of variable-wise diagonal preconditioning, named OVDP [36], which can automatically determine the appropriate stepsizes for solving the optimization problem.
The main contributions of the paper are given as follows.
1) Robustness to Random Noise, Outliers, and Missing Values: As described above, no existing ST fusion method can handle random noise, outliers, and missing values, although this type of noise contaminates satellite images due to the measurement equipment and environment. Thanks to the formulation that incorporates the first observation model developed in Sec. III-A, ROSTF is robust to such mixed noise.
2) A Single Reference Date: Assumption i) is very simple but effective for ST fusion. By incorporating this assumption as a constraint in the optimization problem, we realize a mechanism that promotes consistency of the estimated HR image on the target date and the denoised HR image on the reference date with the assumption, i.e., they are encouraged to have edges at similar locations. Thanks to this mechanism, ROSTF performs well even with a single reference date.
3) Facilitation of Parameter Adjustment: The objective function of the optimization problem of ROSTF consists only of image regularization terms that promote spatial piecewise smoothness. The other components, corresponding to data fidelity based on the two observation models and our two assumptions, are imposed as hard constraints. Such a formulation using constraints instead of adding terms to the objective function has the advantage of simplifying parameter setting [37]–[41]: the appropriate parameters in the constraints do not depend on each other and can be determined independently for each constraint.
4) Automatic Stepsize Adjustment: In order to solve our optimization problem for ST fusion, we develop an efficient algorithm based on P-PDS with OVDP. The appropriate stepsizes of the standard PDS [42] (and of most other optimization methods) differ depending on the problem structure, which means that they must be selected manually. In contrast, P-PDS with OVDP can automatically determine the appropriate stepsizes based on the problem structure, and thus our algorithm is free from this troublesome task.
In the following sections, we first cover the mathematical preliminaries for our method in Sec. II and then move on to the establishment of our method in Sec. III. In Sec. IV, we demonstrate the performance of ROSTF and the effectiveness of each key component of ROSTF through comparative experiments and ablation studies, respectively. Experimental results show that ROSTF performs comparably to several state-of-the-art ST fusion methods for noiseless images and better than them for noisy images, thanks to the effective work of each key component.
The preliminary version of this work, without considering sparse noise, mathematical details, comprehensive experimental comparison, deeper discussion, or implementation using P-PDS with OVDP, has appeared in conference proceedings [43].

A. Notations
Let R be the set of all real numbers. Vectors and matrices are denoted by bold lowercase and uppercase letters, respectively. We treat a multispectral image with spatial resolution N_1 × N_2 and spectral resolution (the number of bands) B as a vector x ∈ R^{N_1 N_2 B} obtained by stacking the bands. The hyperslab with center ω and radius α (with respect to a vector a) is denoted by {x ∈ R^N : ω − α ≤ ⟨a, x⟩ ≤ ω + α}. Also, the ℓ_p-norm ball (p = 1, 2) with center c and radius ε is denoted by {x ∈ R^N : ∥x − c∥_p ≤ ε}. The indicator function ι_C : R^N → (−∞, ∞] of a nonempty closed convex set C is defined as ι_C(x) = 0 if x ∈ C and ι_C(x) = ∞ otherwise.

B. Proximal Tools
The optimization problem of ROSTF that will be formulated in Sec. III-B consists of nonsmooth convex functions. To solve such a problem, we introduce the notion of the proximity operator of index γ > 0 of f ∈ Γ_0(R^{NB}), defined as prox_{γf}(x) := argmin_y f(y) + (1/(2γ))∥y − x∥²_2. Thanks to Moreau's identity [44], the proximity operator of the conjugate f* is efficiently calculated as prox_{γf*}(x) = x − γ prox_{γ⁻¹f}(γ⁻¹x). Below, we show the specific proximity operators of the functions that we use in this paper. The proximity operator of the ℓ_{1,2}-norm is given by group-wise soft-thresholding. The proximity operator of the indicator function of the hyperslab in (1) is the metric projection onto it, expressed with η_1 = ω − α and η_2 = ω + α. The proximity operators of the indicator functions of the ℓ_2-norm ball and the ℓ_1-norm ball in (2) are calculated by the closed-form projection onto the ℓ_2-ball and by a fast ℓ_1-ball projection algorithm [45], respectively.
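These operators all have closed forms and are cheap to evaluate. As a hedged illustration (the function names and array layouts are ours, not the paper's; the hyperslab is taken as {x : ω − α ≤ ⟨a, x⟩ ≤ ω + α}), the operators might be sketched in Python/NumPy as follows:

```python
import numpy as np

def prox_l12(x, gamma):
    """Group soft-thresholding: prox of gamma * ||.||_{1,2}.
    x has shape (N, B); each row (one pixel across bands) is one group."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    scale = np.maximum(1.0 - gamma / np.maximum(norms, 1e-12), 0.0)
    return scale * x

def proj_hyperslab(x, a, omega, alpha):
    """Projection onto the hyperslab {x : omega - alpha <= <a, x> <= omega + alpha}."""
    ax = float(a @ x)
    t = min(max(ax, omega - alpha), omega + alpha)  # clip <a, x> into the interval
    return x + (t - ax) / float(a @ a) * a

def proj_l2_ball(x, c, eps):
    """Projection onto the l2-norm ball {x : ||x - c||_2 <= eps}."""
    d = x - c
    n = np.linalg.norm(d)
    return x.copy() if n <= eps else c + (eps / n) * d
```

Since each proximity operator is an elementary shrinkage or projection, the per-iteration cost of the resulting algorithm is dominated by the linear operators, not by these steps.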

C. P-PDS with OVDP
The standard PDS [42] is a versatile and efficient proximal algorithm that can solve a wide class of nonsmooth convex optimization problems without using matrix inversion. However, it is troublesome to manually set the appropriate stepsizes of the standard PDS. Therefore, we adopt P-PDS [35] with OVDP [36], a method that automatically determines the appropriate stepsizes according to the problem structure.
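To make the role of the stepsizes concrete, the following is a minimal, hedged sketch (not the authors' Algorithm 1) of the standard PDS iteration applied to a toy problem, min_x (1/2)∥x − b∥²_2 + λ∥Dx∥_1 with D the 1-D forward-difference operator. Here the stepsizes τ and σ must be chosen by hand subject to τσ∥D∥² ≤ 1, which is precisely the burden that P-PDS with OVDP removes:

```python
import numpy as np

def pds_tv1d(b, lam=1.0, n_iter=2000):
    """Standard primal-dual splitting for 1-D TV denoising:
    min_x 0.5*||x - b||^2 + lam*||D x||_1, D = forward differences."""
    n = len(b)
    D = np.diff(np.eye(n), axis=0)                 # (n-1, n) difference operator
    tau = sigma = 0.5                              # tau*sigma*||D||^2 <= 1 (||D||^2 <= 4)
    x, y = b.copy(), np.zeros(n - 1)
    for _ in range(n_iter):
        # primal step: prox of tau * 0.5*||. - b||^2
        x_new = (x - tau * (D.T @ y) + tau * b) / (1 + tau)
        # dual step: prox of the conjugate of lam*||.||_1 = projection onto l_inf ball
        y = np.clip(y + sigma * (D @ (2 * x_new - x)), -lam, lam)
        x = x_new
    return x

# Usage: denoise a noisy piecewise-constant signal.
rng = np.random.default_rng(0)
b = np.concatenate([np.zeros(50), np.ones(50)]) + 0.1 * rng.standard_normal(100)
x = pds_tv1d(b, lam=2.0)
```

The design choice illustrated here is that both updates are simple proximity operators; no matrix inversion is needed, which is what makes PDS-type methods attractive for large-scale image problems.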

III. PROPOSED METHOD
In what follows, we focus on the case of a single reference date. Specifically, we consider a situation where both HR and LR sensors observe the same scene on the single reference date, but on the target date, only the LR sensor observes that scene. Let the HR image on the reference date, the LR image on the reference date, and the LR image on the target date be h_r ∈ R^{N_h B}, l_r ∈ R^{N_l B}, and l_t ∈ R^{N_l B}, respectively. Our method, ROSTF, estimates the desired noiseless HR image on the target date, denoted by h_t ∈ R^{N_h B}, based on these three observed images, while simultaneously denoising h_r. A general diagram of ROSTF is shown in Fig. 1.

A. Observation Models
Let h ∈ R^{N_h B} and l ∈ R^{N_l B} be a noiseless HR image and a noiseless LR image, respectively, taken on the same date. We introduce observation models for a noisy HR image h̄ ∈ R^{N_h B} and a noisy LR image l̄ ∈ R^{N_l B}. To be more specific, we consider that the observed satellite images h̄ and l̄ are possibly contaminated with random noise, outliers, and missing values. Random noise added to h and l is modeled by Gaussian noise n_h and n_l with standard deviation σ_h and σ_l, respectively, while outliers and missing values affecting h and l are modeled by sparsely distributed noise s_h and s_l with superimposition ratio r_h and r_l, respectively. By modeling the noise in this manner, the observation models for h̄ and l̄ are described as h̄ = h + n_h + s_h and l̄ = l + n_l + s_l (13). Here, σ_h > σ_l and r_h > r_l generally hold since HR images often contain more severe noise than LR images. This is because the amount of light received per pixel decreases as the number of pixels increases [46]. On the other hand, l can be approximated by the image obtained by blurring and down-sampling h, known as a typical super-resolution model [34], as l = SBh + m (14), where B is the spatial spread transform (blurring) matrix, S is the down-sampling matrix, and m ∈ R^{N_l B} is the modeling error. Such a model has been used in the ST fusion literature [48].
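For intuition, the first observation model can be simulated as in the following hedged sketch. We mimic the sparse noise by overwriting a fraction r of pixels with 0 or 1 (salt-and-pepper noise, as in the experiments), assuming reflectance values in [0, 1]; the function name is ours:

```python
import numpy as np

def add_mixed_noise(img, sigma, r, rng):
    """Observation model h_bar = h + n + s: Gaussian noise with standard
    deviation sigma plus sparse noise overwriting a fraction r of pixels."""
    noisy = img + sigma * rng.standard_normal(img.shape)
    mask = rng.random(img.shape) < r                      # locations of sparse noise
    spikes = rng.integers(0, 2, size=img.shape).astype(float)  # 0 or 1 (salt/pepper)
    return np.where(mask, spikes, noisy)
```

Setting sigma = 0 and r = 0 recovers the noiseless observation, which corresponds to Case1 of the experiments in Sec. IV-C.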

B. Problem Formulation
We introduce the following two assumptions about the noiseless HR and LR images on the reference and target dates, i.e., h_r, l_r, h_t, and l_t.
i) The reflectance may change significantly between the reference and target dates, but the land structure does not. This is a very natural assumption in ST fusion and is implicitly accepted in previous studies. If the land structure has not changed significantly, the edges of h_r and h_t appear at almost the same locations, implying that the difference between D h_r and D h_t tends to be small. We measure the similarity of these edges using the ℓ_p (p = 1 or 2) norm as ∥D h_r − D h_t∥_p.
ii) The HR and LR images on the same date have similar average brightness per band. For example, the difference in average brightness of the b-th band of h_t and l_t, i.e., |1⊤[h_t]_b/N_h − 1⊤[l_t]_b/N_l|, is expected to be small. This is necessarily true if the HR and LR sensors have similar spectral resolution, as is the case with Landsat and MODIS [9].
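Both quantities are directly computable from image arrays. The following hedged sketch (D taken as stacked horizontal/vertical forward differences on an (H, W, B) array; names are ours) also illustrates the essence of Assumption i): a constant reflectance shift changes brightness but leaves the edge term unchanged.

```python
import numpy as np

def edge_difference(h_ref, h_tgt, p=1):
    """||D h_r - D h_t||_p with D as horizontal/vertical forward differences."""
    def D(x):
        return np.concatenate([np.diff(x, axis=0).ravel(),
                               np.diff(x, axis=1).ravel()])
    d = D(h_ref) - D(h_tgt)
    return float(np.abs(d).sum()) if p == 1 else float(np.linalg.norm(d))

def brightness_gap(h, l):
    """Per-band difference in average brightness between two images."""
    return np.abs(h.mean(axis=(0, 1)) - l.mean(axis=(0, 1)))
```

A uniform brightness change of +0.25 yields an edge difference of exactly zero, since the differencing operator annihilates constants; the brightness gap, in contrast, becomes 0.25 in every band.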
Based on these assumptions and the observation models in (13) and (14), we formulate the fusion problem as the constrained convex optimization problem (16), where λ is a balancing parameter. The variables ĥ_r and ĥ_t correspond to the estimates of h_r and h_t, respectively, and ŝ_{h_r}, ŝ_{l_r}, and ŝ_{l_t} correspond to the estimates of the sparse noise superimposed on h̄_r, l̄_r, and l̄_t, respectively. Each term in the objective function and each constraint have the following roles.
• The two terms in the objective function promote spatial piecewise smoothness of ĥ_r and ĥ_t, with the hyperspectral total variation (HTV) [49] as regularization.
• The first constraint encourages D ĥ_r and D ĥ_t to be similar, based on Assumption i). The parameter α controls the degree of similarity. Hereafter, this constraint is referred to as the edge constraint.
• The second constraint is designed based on Assumption ii). Since l̄_t is contaminated by noise, we do not use the average brightness of [l̄_t]_b itself, i.e., 1⊤[l̄_t]_b/N_l, but the parameter c_b, which is determined based on 1⊤[l̄_t]_b/N_l and the noise intensity. The parameter β_b controls the strength of this constraint. Hereafter, this constraint is referred to as the brightness constraint.
• The third constraint serves as a data-fidelity constraint based on the observation model in (13). The parameter ε_h depends on the Gaussian noise intensity on the HR image, i.e., σ_h.
• The fourth and fifth constraints ensure that the solutions follow the observation model in (14). The parameter ε_l depends on the Gaussian noise intensity on the LR images, i.e., σ_l.
• The last three constraints characterize the sparse noise using the ℓ_1-norms. The parameters η_h and η_l depend on the sparse noise intensity on the HR image and the LR images, respectively, i.e., r_h and r_l.
Using constraints instead of adding terms to the objective function in this way simplifies the parameter setting [37]–[41]: we can determine the appropriate parameters for each constraint independently because they are decoupled. The detailed setting of these parameters is discussed in Sec. IV-C.
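Collecting the objective and the eight constraints described above, the problem has the following general shape. This is a hedged reconstruction from the component descriptions, not a verbatim copy of (16); symbols follow Sec. III-A, hats denote estimates, and bars denote observed noisy images:

```latex
\begin{aligned}
\min_{\hat{h}_r,\,\hat{h}_t,\,\hat{s}_{h_r},\,\hat{s}_{l_r},\,\hat{s}_{l_t}}\;
 & \|\mathbf{D}\hat{h}_r\|_{1,2} + \lambda \|\mathbf{D}\hat{h}_t\|_{1,2} \\
\text{s.t.}\quad
 & \|\mathbf{D}\hat{h}_r - \mathbf{D}\hat{h}_t\|_p \le \alpha, \\
 & \bigl|\mathbf{1}^\top[\hat{h}_t]_b/N_h - c_b\bigr| \le \beta_b \quad (b = 1,\dots,B), \\
 & \|\hat{h}_r + \hat{s}_{h_r} - \bar{h}_r\|_2 \le \varepsilon_h, \\
 & \|\mathbf{S}\mathbf{B}\hat{h}_r + \hat{s}_{l_r} - \bar{l}_r\|_2 \le \varepsilon_l, \qquad
   \|\mathbf{S}\mathbf{B}\hat{h}_t + \hat{s}_{l_t} - \bar{l}_t\|_2 \le \varepsilon_l, \\
 & \|\hat{s}_{h_r}\|_1 \le \eta_h, \qquad \|\hat{s}_{l_r}\|_1 \le \eta_l, \qquad \|\hat{s}_{l_t}\|_1 \le \eta_l.
\end{aligned}
```

Every constraint set here is a hyperslab, an ℓ_2-ball, or an ℓ_1-ball, so each has an efficiently computable projection, which is what makes the problem amenable to P-PDS.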
The algorithm for solving (16) is summarized in Algorithm 1. The stepsizes are automatically determined based on OVDP [36].

D. Detailed Computations and Their Complexity
Table I shows the computational complexity (in O-notation) of each operation used in Algorithm 1, from which the computational complexity of each step in one iteration of Algorithm 1 follows.

IV. EXPERIMENTS
We demonstrate the effectiveness of our ST fusion method, ROSTF, through comprehensive experiments using simulated and real data for two sites. Our experiments aim to verify the following three items.
• ROSTF is as effective as state-of-the-art ST fusion methods in noiseless cases and performs better than them in noisy cases. We conducted comparative experiments on four cases of noise contamination. The experimental results for simulated data and real data are presented in Sec. IV-D and Sec. IV-E, respectively.
• Key components of ROSTF, such as the assumption-based constraints and the denoising mechanism, work effectively as expected. To measure their influence, ROSTF without each key component is compared with the original ROSTF in terms of fusion accuracy and convergence speed in Sec. IV-F.
• ROSTF is practical in terms of computational time. For a fair comparison, only non-deep-learning-based methods are compared to ROSTF in Sec. IV-G.
Note that there are two options for ROSTF: p = 1 or 2, where p corresponds to the choice of the norm in the first constraint in (16), i.e., the edge constraint. Hereafter, ROSTF with p = 1 and ROSTF with p = 2 are referred to as ROSTF-1 and ROSTF-2, respectively.

A. Data Description
We tested our methods on both real and simulated data. In the case of satellite observations, radiometric and geometric inconsistencies exist between two different image sensors. This means that the fusion capability of each method cannot be accurately evaluated in experiments using real data because these inconsistencies affect performance, as also addressed in [32]. Therefore, we generated simulated data based on the observation models and verified the performance of each method using the simulated data in addition to the real data. Specifically, in the experiments using simulated data, the simulated LR images were generated from the corresponding real HR images according to (14) with m = 0, while the real HR images were used as HR images.
We used MODIS and Landsat time-series images for the following two sites in our experiments.
Site1: The first site is located in the Daxing district in the south of Beijing city (39.0009

B. Compared Methods
Our method was compared with STARFM [9], VIPSTF [33], RSFN [28], RobOt [23], and SwinSTFM [31]. Table II shows the characteristics of these methods and ROSTF. Note that unlike the other methods, RSFN requires the preparation of input images obtained on two reference dates. Since our experiments assume a situation where there is only a single reference date, the same HR-LR image pair was input as two different reference image pairs for RSFN.
As the parameters of these existing methods, we used the values recommended in each reference. It should be noted that RSFN and SwinSTFM require much more data than ours and the other existing methods due to their training and validation processes. In the experiments, we trained and validated RSFN and SwinSTFM using a different set of images from those used for the tests described in Sec. IV-A. Specifically, 24 groups from Site1 and two groups from Site2 were used for training, and one group from Site1 and two groups from Site2 were used for validation.

C. Experimental Setup
Our method, ROSTF, is implemented in MATLAB. The source code is available on GitHub (https://github.com/MDI-TokyoTech/ROSTF). In these experiments, the spatial spread transform matrix B in (14) was implemented as a blurring operator with an averaging filter of size k. The down-sampling matrix S in (14) was designed to take the top-left pixel value in each k × k window. The parameter k was set to 20 to account for the difference in spatial resolution between the Landsat and MODIS images. The balancing parameter λ in (16) was set to 1 since h_r and h_t have the same number of elements and are expected to be piecewise smooth to the same degree. The parameters appearing in the constraints in (16) were set as shown in Table III. We set the maximum number of iterations for Algorithm 1 to 50000, and the stopping criterion of Algorithm 1 was defined in terms of the relative change of the estimate between successive iterations. To verify both the pure fusion capability and the robustness against noise of the existing methods and ours, we conducted experiments on the following four combinations of Gaussian noise with different standard deviations and sparse noise (salt-and-pepper noise) with different rates.
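Under this setup, when the k × k averaging window of B is anchored so that the top-left pixel of each window averages exactly that window, the composition SB reduces to per-block means, a common way to simulate LR images. The following is a hedged Python sketch of this composition (the actual MATLAB implementation may anchor the filter differently):

```python
import numpy as np

def sb_downsample(h, k):
    """Composition S B of (14) under the experimental setup: k x k
    averaging followed by taking the top-left pixel of each k x k window,
    which reduces to per-block means when the window is anchored at the
    pixel itself.  h has shape (H, W) or (H, W, B) with H, W divisible by k."""
    H, W = h.shape[:2]
    assert H % k == 0 and W % k == 0
    return h.reshape(H // k, k, W // k, k, *h.shape[2:]).mean(axis=(1, 3))
```

For the Landsat/MODIS setting of the paper, k = 20, so a 400 × 400 HR patch maps to a 20 × 20 LR patch.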
Case1: The observed HR and LR images are noiseless, i.e., σ_h = σ_l = r_h = r_l = 0 in (13).
Case2: The observed HR images are contaminated with Gaussian noise with standard deviation σ_h = 0.05, while the observed LR images are noiseless, i.e., σ_h = 0.05 and σ_l = r_h = r_l = 0 in (13).
Case3: The observed HR images are contaminated with sparse noise with superimposition rate r_h = 0.05, while the observed LR images are noiseless, i.e., r_h = 0.05 and σ_h = σ_l = r_l = 0 in (13).
Case4: The observed HR images are contaminated with Gaussian noise with standard deviation σ_h = 0.05 and sparse noise with superimposition rate r_h = 0.05, while the observed LR images are noiseless, i.e., σ_h = r_h = 0.05 and σ_l = r_l = 0 in (13).
For the quantitative evaluation, we used the following four metrics: the root mean square error (RMSE), where ĥ_t and h_t denote an estimated HR image and a ground-truth HR image, respectively, on the target date; the spectral angle mapper (SAM) [51], where ê_n and e_n ∈ R^B denote the spectral vectors of the n-th pixel of ĥ_t and h_t, respectively; the mean structural similarity over all bands (MSSIM) [52]; and the correlation coefficient (CC), where s_{ĥ_t h_t} denotes the covariance of ĥ_t and h_t, and s_{ĥ_t} and s_{h_t} denote the standard deviations of ĥ_t and h_t, respectively. We calculated RMSE to gauge the difference between the estimated image and the ground-truth at the pixel level. SAM was used to measure spectral fidelity. Lower RMSE and SAM indicate better estimation performance. We used MSSIM to evaluate the similarity of the overall structure. CC shows the strength of the linear relationship between the estimated image and the ground-truth. Higher MSSIM and CC indicate better estimation performance.
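Three of the four metrics have compact standard definitions, sketched below under our own naming (MSSIM requires a full SSIM implementation and is omitted here); images are assumed to be reshaped so that the SAM input has one spectral vector per row:

```python
import numpy as np

def rmse(est, ref):
    """Root mean square error between estimate and ground truth."""
    return float(np.sqrt(np.mean((est - ref) ** 2)))

def sam_degrees(est, ref, eps=1e-12):
    """Mean spectral angle (degrees) over pixels; inputs shaped (N_pixels, B)."""
    num = np.sum(est * ref, axis=1)
    den = np.linalg.norm(est, axis=1) * np.linalg.norm(ref, axis=1) + eps
    ang = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
    return float(ang.mean())

def cc(est, ref):
    """Pearson correlation coefficient between estimate and ground truth."""
    return float(np.corrcoef(est.ravel(), ref.ravel())[0, 1])
```

A perfect estimate gives RMSE = 0, SAM = 0 degrees, and CC = 1, matching the stated directions of the metrics (lower RMSE/SAM and higher MSSIM/CC are better).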

D. Experimental Results with Simulated Data
Table IV shows the RMSE, SAM, MSSIM, and CC results in the experiments with the simulated data. In Case1, STARFM, VIPSTF, RobOt, SwinSTFM, ROSTF-1, and ROSTF-2 perform equally well. In contrast, the results of RSFN are not good for either the Site1 or the Site2 data. This may be because the training data described above were not large enough to train RSFN sufficiently. Thus, if more training data were used, RSFN might have produced better results. However, in actual applications, it is difficult to collect noise-free training data, and this is the situation considered in these experiments. Next, we focus on the results in Case2, Case3, and Case4, where the observed reference images are contaminated with noise. While STARFM, VIPSTF, RobOt, SwinSTFM, and RSFN perform significantly worse due to the influence of noise, ROSTF-1 and ROSTF-2 show no significant performance degradation in Case2, Case3, and Case4. This is because ROSTF estimates the target HR image while simultaneously denoising the reference HR image.
Fig. 2 and Fig. 3 show the estimated results in Case1 for the Site1 and Site2 simulated data, respectively. In the zoomed-in areas, there are large temporal changes in brightness between the reference HR image h̄_r and the target HR image (ground-truth). For the Site1 data, it can be visually seen that ROSTF-1 and ROSTF-2 capture these changes most accurately and estimate the brightness closest to that of the ground-truth compared with the other methods. This is thanks to the brightness constraint and the fifth constraint in (16), which promote the brightness of the estimated image to be close to that of the observed LR image l̄_t on the target date, based on Assumption ii) and the observation model in (14). Following ROSTF, RobOt also captures the temporal changes well. STARFM and VIPSTF are greatly affected by the reference HR image h̄_r and estimate brightness closer to that of h̄_r than to that of the ground-truth. The result of RSFN is not good due to the lack of training data. The SwinSTFM result appears to show spectral distortion. For the Site2 data, Fig. 3 shows that ROSTF-1 captures the changes in brightness of the zoomed-in area most accurately. ROSTF-2 and RobOt also obtain good estimates, following ROSTF-1. STARFM and SwinSTFM do not capture the large temporal changes. The result of VIPSTF has different tints throughout. RSFN estimates brightness closer to that of h̄_r than to that of the ground-truth.
Next, we focus on the results in the noisy cases. Fig. 4 shows the estimated results for the Site2 data in the noisy cases, i.e., Case2, Case3, and Case4. The results of STARFM, VIPSTF, and RobOt are contaminated with noise similar to that of h̄_r because they estimate the target image at the pixel level based entirely on the noisy reference image h̄_r. The results of RSFN are corrupted by the unexpected inputs because RSFN was not trained on noisy data. Furthermore, using noisy data to train RSFN may not have produced good results. This is because RSFN is robust to noise only in situations where noisy pixels in the two reference HR images do not appear at the same location. SwinSTFM generates unstable noisy results due to the noisy unexpected inputs. On the other hand, ROSTF-1 and ROSTF-2 provide good estimates even when the observed reference images are noisy, without much loss of accuracy compared to the noiseless case. This is thanks to the framework that simultaneously performs noise removal and ST fusion.

E. Experimental Results with Real Data
Table V shows the RMSE, SAM, MSSIM, and CC results in the experiments with the real data. Compared to the results for the simulated data in Table IV, the performance of ROSTF-1 and ROSTF-2 degrades due to radiometric and geometric inconsistencies between the Landsat and MODIS sensors. Despite this performance degradation, ROSTF-1 and ROSTF-2 perform as well as the existing methods in the noiseless case, i.e., Case1, and outperform them in the noisy cases, i.e., Case2, Case3, and Case4, as in the experiments with the simulated data. Therefore, we can say that ROSTF is robust to noise even for the real data. Fig. 7 and Fig. 8 show the estimated results in Case1 for the Site1 and Site2 real data, respectively. Compared to the results for the simulated data in Fig. 2 and Fig. 3, ROSTF-1 and ROSTF-2 perform worse because there are modeling errors between the real HR images h_r, h_t and the real LR images l_r, l_t due to radiometric and geometric inconsistencies. Fig. 9 shows the estimated results for the Site2 data in the noisy cases, i.e., Case2, Case3, and Case4. The results of the existing methods are not good; in particular, STARFM, VIPSTF, and RobOt generate noisy outputs. The estimated images of ROSTF seem blurred in the zoomed area in Fig. 9. This is due to oversmoothing by the HTV regularization terms in (16), which might have undesirable effects in some applications. However, according to the difference maps in Fig. 10, the results of ROSTF have the least error, and the accuracy evaluation in Table V also shows that ROSTF achieves the best performance in all metrics.

F. Ablation Study
We conducted ablation experiments with a focus on the following three items:
• The edge constraint, ∥D ĥ_r − D ĥ_t∥_p ≤ α, which encourages the land structure, i.e., the edges, of the reference and target HR images to be similar based on Assumption i).
• The brightness constraint, |1⊤[ĥ_t]_b/N_h − c_b| ≤ β_b, which is designed based on Assumption ii) so that the estimated target HR image has an average brightness similar to that of the target LR image.
• The denoising mechanism for the reference HR image h̄_r, i.e., the first regularization term ∥D ĥ_r∥_{1,2} and the third, fourth, sixth, and seventh constraints in (16).
We tested ROSTF with each of the above three components removed. In the following, we first present the ablation studies on the two constraints, the edge constraint and the brightness constraint, followed by the ablation study on the denoising mechanism. The hyperparameters in each optimization problem and the stopping criterion of each P-PDS-based algorithm were set as in the original ROSTF. On the other hand, the stepsizes in each algorithm were set to the values computed according to the operator-norm-based design method of variable-wise diagonal preconditioning in (11).
1) Edge and Brightness Constraints: First, we measure the effectiveness of the two constraints based on Assumption i) and Assumption ii), i.e., the edge constraint and the brightness constraint, in terms of convergence speed and fusion performance.
Table VI shows the average number of iterations spent before each algorithm stopped and the average performance results over all sites (Site1 and Site2), data types (real and simulated data), and noise cases (Case1 to Case4). Note that each algorithm was stopped at the maximum number of iterations, 50000, if the stopping criterion was not met. The original ROSTF performs best with the fewest iterations on average, indicating that both constraints contribute to achieving higher fusion performance with fewer iterations.

Fig. 11 shows the transition of the objective function values and the RMSE values for the original ROSTF-2, ROSTF-2 without the edge constraint, and ROSTF-2 without the brightness constraint in the experiment on the Site1 simulated data in Case1. ROSTF-2 without the edge constraint does not meet the stopping criterion within 50000 iterations. This is likely because the solution space of the optimization problem without the edge constraint is too large to reach an optimal solution efficiently, which suggests that the edge constraint makes the solution space moderately small and speeds up convergence. On the other hand, ROSTF-2 without the brightness constraint converges faster than ROSTF-2 without the edge constraint but shows unstable behavior, especially in the early iterations. This may be because the variable h_t is no longer directly updated as a primal variable when the brightness constraint is removed: the update equation corresponding to the brightness constraint is the only update equation for the primal variable h_t, as in lines 8 to 10 of Algorithm 1, while the update equations corresponding to the other components involving h_t update the dual variables z_2, z_3, and z_6. Directly updating the primal variable leads to faster convergence and more stable behavior than updating it only through the dual variables. By introducing the brightness constraint, the variable h_t is updated directly as the primal variable, which improves convergence speed and stability.
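The contrast between direct primal updates and updates routed through dual variables can be illustrated with a generic primal–dual splitting iteration. The following is a minimal sketch in the Chambolle–Pock form applied to a 1-D total-variation toy problem, not the ROSTF problem itself; the helper name `tv_denoise_pds` and the step sizes are our own illustrative assumptions.

```python
import numpy as np

def tv_denoise_pds(b, lam=0.5, tau=0.25, sigma=0.25, n_iter=500):
    """Primal-dual splitting (Chambolle-Pock form) for the toy problem
    min_x 0.5*||x - b||_2^2 + lam*||D x||_1, with D the forward difference."""
    n = b.size
    x = b.copy()          # primal variable, updated directly (cf. h_t)
    x_bar = x.copy()
    z = np.zeros(n - 1)   # dual variable attached to D x

    def D(v):             # forward difference operator
        return v[1:] - v[:-1]

    def Dt(z):            # adjoint of D
        return np.concatenate(([-z[0]], z[:-1] - z[1:], [z[-1]]))

    for _ in range(n_iter):
        # dual step: prox of (lam*||.||_1)^* is projection onto [-lam, lam]
        z = np.clip(z + sigma * D(x_bar), -lam, lam)
        # primal step: prox of 0.5*||. - b||^2 has a direct closed form
        x_new = (x - tau * Dt(z) + tau * b) / (1.0 + tau)
        x_bar = 2.0 * x_new - x
        x = x_new
    return x

# noisy step signal as a toy input
rng = np.random.default_rng(0)
clean = np.concatenate([np.zeros(50), np.ones(50)])
noisy = clean + 0.2 * rng.standard_normal(100)
denoised = tv_denoise_pds(noisy, lam=0.5)
```

The step sizes satisfy tau*sigma*||D||^2 <= 1 (since ||D||^2 <= 4), which is the standard convergence condition for this scheme.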
Fig. 12 shows a visual comparison of the original ROSTF-2 and ROSTF-2 without these constraints. The image estimated by ROSTF-2 without the edge constraint (a) loses spatial structure, and ROSTF-2 without the brightness constraint (b) produces an image with incorrect brightness. In contrast, the original ROSTF-2 (c) estimates brightness close to that of the ground truth while preserving spatial structure, indicating that both constraints work effectively.
2) Denoising Mechanism: Next, we move on to the ablation study of the denoising mechanism of ROSTF. The optimization problem of ROSTF without the denoising mechanism is formulated as in (25). Since the objective function contains only one regularization term for h_t, no balancing parameter is needed. Also, note that the edge constraint of (25) is different from that of the original ROSTF in (16) because this optimization problem has no variable h_r.
Table VII shows the average RMSE, SAM, MSSIM, and CC results for each noise case. In the noisy cases, i.e., Case2, Case3, and Case4, ROSTF without the denoising mechanism performs, as expected, significantly worse than the original ROSTF due to the direct effect of noise. This shows that the denoising mechanism works effectively to make ROSTF robust to noise. On the other hand, we newly found that in the noiseless case, i.e., Case1, ROSTF without the denoising mechanism achieves slightly better fusion results than the original one. This result indicates that in noiseless cases, the observed HR image h_r can be used as the reference as it is, and there is no need to estimate h_r. Furthermore, by removing the variable h_r, ROSTF does not need the reference LR image l_r as input in (25). Thus, if noiseless images can be observed, ROSTF can be applied in situations where only two input images are available: a reference HR image h_r and a target LR image l_t.

Fig. 13 shows the fusion results of ROSTF-1 without the denoising mechanism for the Site2 simulated data. It can also be visually confirmed that in Case1, ROSTF-1 without the denoising mechanism obtains a good result based on only two input images. On the other hand, in Case2, Case3, and Case4, the results of ROSTF without the denoising mechanism are contaminated with noise. This is because the edge constraint copies not only the true edges and spatial structure but also the noise in the reference HR image to the estimated target HR image. The results of this ablation study confirm that the denoising mechanism plays an effective role in avoiding such noise effects.
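The noise-copying effect described above can be mimicked in a toy setting: if an estimate is forced to reproduce the gradients of a noisy reference, the reference noise is transferred to the estimate. The sketch below is a hypothetical 1-D caricature (the names `est_from_noisy` and `est_from_clean` are ours), not ROSTF's actual edge constraint.

```python
import numpy as np

rng = np.random.default_rng(1)
clean_ref = np.linspace(0.0, 1.0, 200)               # clean 1-D "reference image"
noisy_ref = clean_ref + 0.1 * rng.standard_normal(200)

# Build estimates by copying the reference gradients exactly
# (integrating them from the true starting value): a caricature of an
# edge constraint applied without any denoising mechanism.
est_from_noisy = np.concatenate(([0.0], np.cumsum(np.diff(noisy_ref))))
est_from_clean = np.concatenate(([0.0], np.cumsum(np.diff(clean_ref))))

rmse_noisy = np.sqrt(np.mean((est_from_noisy - clean_ref) ** 2))
rmse_clean = np.sqrt(np.mean((est_from_clean - clean_ref) ** 2))
```

Copying clean gradients reconstructs the signal exactly, while copying noisy gradients reproduces the noise in the estimate, which is the failure mode the denoising mechanism is designed to avoid.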

G. Computational Cost
We measured the actual running times using MATLAB (R2022b) on a Windows 11 computer with an Intel Core i9-13900 1.0 GHz processor, 32 GB of RAM, and an NVIDIA GeForce RTX 4090 GPU. We used the Site1 and Site2 data with 400 × 400 pixels and six bands. The measurement results show that the average number of iterations per second for Algorithm 1 was 59.20.

H. Summary
We summarize insights from the experiments as follows.
• From the results of the experiments in Case1, we see that ROSTF is comparable to state-of-the-art ST fusion methods in noiseless cases. Therefore, the observation model in (14) and the assumptions introduced in Sec. III-B are valid for ST fusion.
• The results of the experiments in Case2, Case3, and Case4 confirm that ROSTF performs well even when observed HR images are degraded by random noise, missing values, and outliers.
• The ablation studies demonstrate that the key components, i.e., the edge constraint, the brightness constraint, and the denoising mechanism, work effectively as expected.

V. CONCLUSION
We have proposed an optimization-based ST fusion method, named ROSTF, which is robust to mixed Gaussian and sparse noise in observed satellite images. We have formulated the fusion problem as a constrained optimization problem and have developed an optimization algorithm based on P-PDS with OVDP. ROSTF was tested through experiments using both simulated and real data. The experimental results demonstrate that ROSTF achieves performance comparable to state-of-the-art ST fusion methods in noiseless cases and significantly better performance in noisy cases. We expect ROSTF to have a strong impact on the field of remote sensing, for example, in the estimation of satellite image series with high spatial and temporal resolution from observed image series taken in measurement environments with severe degradation.
the vector of the b-th band pixel values whose n-th value is x_{b,n} ∈ R. Let Γ_0(R^{NB}) be the set of all proper lower-semicontinuous convex functions defined on R^{NB}. The ℓ_1-norm, the ℓ_2-norm, and the ℓ_{1,2}-norm of x are defined as

\|\mathbf{x}\|_1 := \sum_{b}\sum_{n} |x_{b,n}|, \quad \|\mathbf{x}\|_2 := \Big(\sum_{b}\sum_{n} |x_{b,n}|^2\Big)^{1/2}, \quad \|\mathbf{x}\|_{1,2} := \sum_{n}\Big(\sum_{b} |x_{b,n}|^2\Big)^{1/2},

respectively. For an image x ∈ R^{NB}, let D_v and D_h ∈ R^{NB×NB} denote the matrices that compute the differences between vertically/horizontally adjacent pixels, respectively, and let D
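For concreteness, the three norms above can be computed for a multiband image stored as a (B, N) array as follows; this is a sketch with our own helper name `norms`, not code from the paper.

```python
import numpy as np

def norms(x):
    """l1, l2, and mixed l1,2 norms of a multiband image stored as a
    (B, N) array: bands along axis 0, pixels along axis 1."""
    l1  = np.sum(np.abs(x))                           # sum_b sum_n |x_{b,n}|
    l2  = np.sqrt(np.sum(x ** 2))                     # sqrt(sum_b sum_n |x_{b,n}|^2)
    l12 = np.sum(np.sqrt(np.sum(x ** 2, axis=0)))     # sum_n sqrt(sum_b |x_{b,n}|^2)
    return l1, l2, l12

x = np.array([[3.0, 0.0],
              [4.0, 1.0]])      # B = 2 bands, N = 2 pixels
l1, l2, l12 = norms(x)          # l1 = 8.0, l2 = sqrt(26), l12 = 5.0 + 1.0 = 6.0
```

The ℓ_{1,2}-norm groups the bands of each pixel, which is what makes it suitable for modeling sparse outliers that hit all bands of a pixel at once.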
and 28: O(N_h B) when p = 2.
• Step 9: O(N_h).
• Steps 17, 18, 29, and 30: O(N_l B).
• Steps 11 and 27: O(N_h B log(N_h B)) when p = 1.
• Steps 12 and 13: O(N_l B log(N_l B)).
In total, one iteration of the algorithm has an overall computational complexity of O(N_h B log(N_h B)).
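The O(N_h B log(N_h B)) terms come from FFT-based implementations of the convolution operators. As a rough illustration (assuming periodic boundaries and using our own helper names, not the paper's implementation), a circular blur computed via FFT matches a direct O(kN) circular convolution while costing O(N log N) regardless of the kernel size:

```python
import numpy as np

def blur_fft(x, kernel):
    """Circular convolution via FFT: O(N log N) regardless of kernel size."""
    n = x.size
    k = np.zeros(n)
    k[:kernel.size] = kernel
    k = np.roll(k, -(kernel.size // 2))   # center the kernel for zero phase shift
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))

def blur_direct(x, kernel):
    """Circular convolution computed directly: O(kN) for a k-tap kernel."""
    n, ks = x.size, kernel.size
    out = np.zeros(n)
    for i in range(n):
        for j in range(ks):
            out[i] += kernel[j] * x[(i + ks // 2 - j) % n]
    return out

x = np.random.default_rng(2).standard_normal(64)
kernel = np.array([0.25, 0.5, 0.25])   # normalized 3-tap blur
```

Since the kernel sums to one, circular convolution also preserves the total intensity of the signal.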

Fig. 7. ST fusion results for the Site1 real data in Case1.

Fig. 11. The behavior of the original ROSTF-2, ROSTF-2 without the edge constraint, and ROSTF-2 without the brightness constraint in the experiment on the Site1 simulated data in Case1. (a), (b) The transition of the objective function values until the algorithms stopped and in early iterations, respectively. (c), (d) The transition of the RMSE values until the algorithms stopped and in early iterations, respectively.

Fig. 13. ST fusion results of ROSTF-1 without the denoising mechanism for the Site2 simulated data.
B. Introducing auxiliary variables z_1, z_2, z_3, z_4, z_5, and z_6, we can transform (17) into an equivalent problem in h_r, h_t, s_{h_r}, s_{l_r}, and s_{l_t}.
• S x: O(NB/k), where k is the window size of S.
• S^⊤ x: O(NB/k), where k is the window size of S.
• B x: O(kNB), where k is the filter size of B.
• B^⊤ x: O(kNB), where k is the filter size of B.

TABLE IV
THE RMSE, SAM, MSSIM, AND CC RESULTS IN THE EXPERIMENTS WITH SIMULATED DATA

TABLE V
THE RMSE, SAM, MSSIM, AND CC RESULTS IN THE EXPERIMENTS WITH REAL DATA

TABLE VI
THE AVERAGE NUMBER OF ITERATIONS SPENT BEFORE EACH ALGORITHM STOPPED AND THE AVERAGE PERFORMANCE RESULTS FOR ALL THE SITUATIONS

TABLE VII
THE AVERAGE PERFORMANCE RESULTS FOR EACH NOISE CASE IN THE ABLATION STUDY OF THE DENOISING MECHANISM

Table VIII shows the average running times of the non-deep-learning-based comparison methods (STARFM, VIPSTF, and RobOt) and our methods (ROSTF-1 and ROSTF-2). For ROSTF, the average number of iterations of Algorithm 1 is also given in parentheses. Note that only the non-deep-learning-based methods were compared with ROSTF for a fair comparison, because deep-learning-based methods require training processes in addition to ST fusion processes. ROSTF-1 and ROSTF-2 took about 4 to 10 minutes. STARFM and VIPSTF took longer than ROSTF because they estimate the target HR image pixel by pixel. On the other hand, RobOt ran much faster than ROSTF, possibly because the least absolute shrinkage and selection operator (LASSO) problem in RobOt has a closed-form solution in our experiment with only one reference date. This result indicates that ROSTF is slower than RobOt, but we believe that ROSTF is still practical in terms of computational cost.
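For reference, when the LASSO design matrix reduces to an orthonormal (here, identity) operator, the problem indeed has a closed-form solution given by elementwise soft-thresholding. The sketch below illustrates this generic fact with our own helper name `soft_threshold`; it is not RobOt's actual implementation.

```python
import numpy as np

def soft_threshold(v, lam):
    """Closed-form LASSO solution for an identity design:
    argmin_x 0.5*||x - v||_2^2 + lam*||x||_1 = sign(v) * max(|v| - lam, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([-2.0, -0.3, 0.0, 0.5, 3.0])
x = soft_threshold(v, 1.0)   # -> [-1.0, 0.0, 0.0, 0.0, 2.0]
```

Because the solution is obtained in a single elementwise pass rather than by iterating, such a subproblem costs O(N) per call, which is consistent with RobOt's short running times in this setting.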