A Rigorously-Incremental Spatiotemporal Data Fusion Method for Fusing Remote Sensing Images

The spatiotemporal remote sensing images have significant importance in forest ecological monitoring, forest carbon management, and other related fields. Spatiotemporal data fusion technology of remote sensing images combines high spatiotemporal and high temporal resolution images to address the current limitation of single sensors in obtaining high spatiotemporal resolution. This technology has gained widespread attention in recent years. However, the current models still exhibit some shortcomings in dealing with land cover changes, such as poor clustering results, inaccurate incremental spatiotemporal calculations, and sensor differences. In this article, we propose a rigorously-incremental spatiotemporal data fusion method for fusing remote sensing images with different resolutions to address the aforementioned problems. The proposed method utilizes the particle swarm optimization Gaussian mixture model to extract endmembers and establishes a linear relationship between sensors to obtain accurate time increments. Furthermore, bicubic interpolation is used instead of thin plate spline interpolation for spatial interpolation, and also support vector regression is used to calculate weights for obtaining a weighted sum of temporal and spatial increments. In addition, sensor errors are allocated to the calculation of residuals. The experimental results show the efficacy of the proposed algorithm for fusing fine image Landsat with coarse image MODIS data and conclude that the proposed algorithm presents a better solution for heterogeneous data with strong phenological changes and regions with changes in surface types, which provides a better solution for remote sensing image fusion and, hence, improves the accuracy, stability, and robustness of data fusion.


I. INTRODUCTION
F ORESTS, as the largest carbon pool in terrestrial ecosystems, have the characteristics of wide distribution and diverse types [1]. Therefore, accurate and comprehensive forest resource monitoring is significant for maintaining the ecosystem's capacity, but it also faces significant technical challenges [2], [3], [4]. With the high development of satellite sensors in the last decades, remote sensing images have provided a powerful technical support for the dynamic monitoring of large-scale forest resources and accurate forest observation data [5] and have introduced many advantages such as wide coverage and high precision [6], [7]. However, due to hardware and technological limitations of sensors, a single-satellite product cannot obtain remote sensing images with both high temporal and high spatial resolution at the same time [8]. The emergence of spatiotemporal fusion technology provides an effective solution for processing remote sensing images [9].
The spatiotemporal fusion algorithm for remote sensing images has been developed since the 1990s and has made a significant progress in the past decade [10]. The core idea behind spatiotemporal fusion is to fuse remote sensing images from multiple sensors to compensate their respective shortcomings and generate high spatiotemporal resolution remote sensing images [11], as illustrated in Fig. 1. For example, the reflectivity images obtained by Landsat series, advanced land observation satellite (ALOS), and GF series satellites have a good spatial resolution of 3-30 m [12]. However, these images require long satellite revisit cycles. The natural obscuration of clouds and complex terrain limits the variety of high spatial resolution images in terms of fast surface features (these images are referred as "fine images" in this article). While satellites such as moderateresolution imaging spectroradiometer (MODIS) have a revisit time of one day (these images are referred as "coarse images" in this article), and their spatial resolution is low, from 250 m to 1 km, which cannot capture spatial details [13]. Therefore, generating a remote sensing image with a resolution of 30 m and a revisit period of 1 day by fusing fine and coarse images can improve data accuracy, completeness, and timeliness. This provides a reliable data support for accurately describing and simulating changes on the Earth's surface [14]. This  The traditional spatiotemporal fusion algorithms predominantly fall into three categories: 1) weight-based, 2) spectralunmixing-based, and 3) learning-based. The weight-based and spectral-unmixing-based are two categories of the first spatiotemporal fusion algorithm [15]. The weight-based methods calculate changes in surface emissivity by weighing the pixel values of remote sensing images to predict corresponding time images [16]. Typical-weight-based methods include the SpatioTemporal Adaptive Reflectance Fusion Model (STARFM) [17], Enhanced STARFM (ESTARFM) [18], Fit-FC [19], among others. These methods improve the accuracy of the weighted spatiotemporal fusion algorithm, and they work efficiently in homogeneous areas without requiring external data support while also having high fusion efficiency. However, they do not perform well in heterogeneous areas and have poor details in reconstructed images [20], [21]. Whereas the spectral unmixing based on mixed pixel decomposition is proposed to predict unknown fine images by spectral separation and endmember abundance calculation of coarse images [22], [23], [24]. However, a significant disadvantage of this method is that it is carried out under the assumption that the land type does not change, which cannot meet the occurrence of sudden events. On the other hand, learning-based spatiotemporal fusion algorithms are trained using existing datasets and treat the prediction of high-resolution images as the generation of supervised superresolution images [25], [26], [27]. Deep-learningbased methods have powerful feature extraction capabilities, and models such as the deep convolutional spatiotemporal fusion network [28] and depth networks based on spatiotemporal data fusion [29] have improved the accuracy of spatiotemporal fusion by optimizing the network and loss function. However, most deep-learning-based models are trained based on idealized datasets, which require at least three pairs of images as input [30]. This ignores the difficulty of obtaining ideal data in realistic studies due to weather and cloudiness. Furthermore, in impact fusion, the architecture and loss functions should be fully applied in the algorithm, which is computationally intensive [31].
To address the deficiencies mentioned above in spatiotemporal fusion algorithms, researchers have started to combine and optimize existing algorithms to enhance their generality by integrating the advantages of two or more models. Better results have been achieved in heterogeneous regions and regions with sudden changes in land type [32]. Zhu et al. [33] proposed a Flexible Spatiotemporal DAta Fusion (FSDAF) algorithm. FSDAF combines STARFM [17] and the unmixing-based data fusion [34] algorithms and integrates thin plate spline (TPS). Compared to other spatiotemporal fusion algorithms, FSDAF only requires the input of one pair of fine and coarse images at T B and the coarse images at the moment of T P , which reduces the input data needed [35]. FSDAF also performs well in heterogeneous data, as it can capture more information on physical changes in coarse images and has a high fusion accuracy. Recently, several improved models based on FSDAF have been developed. For instance, Improved Flexible Spatiotemporal DAta Fusion (IFSDAF) [36], which predicts the normalized difference vegetation index (NDVI); subpixel class fraction-based flexible spatiotemporal data fusion [37], which extracts endmember abundance based on subpixel information that consequently improves the accuracy of heterogeneous data prediction; FSDAF 2.0 [38], which solves the problem of boundary pixel mixing and effectively restores land cover changes; and object-based spatiotemporal fusion model [39] which combines nonpixel-based image segmentation with a weighting function that achieves good results in homogeneous physical changes. Like FSDAF, all the aforementioned algorithms assume that there is no change in land cover type during the temporal prediction phase and calculate the change in fine images directly from the coarse image change. However, this approach leads to bias in the prediction results [40], [41].
In conclusion, the hybrid spatiotemporal data fusion model has proven to be effective in dealing with land cover changes and has become the mainstream approach for remote sensing image fusion. It has been successfully applied in various fields such as land surface temperature monitoring [42], [43], vegetation coverage detection [44], [45], and forest resource change monitoring [46]. These applications demonstrate the potential of hybrid algorithms in addressing practical problems in remote sensing.
Currently, the hybrid spatiotemporal fusion algorithm, represented by FSDAF, still exhibits some shortcomings, which can be classified into the following aspects.
1) In terms of temporal change dimension: Spectral unmixing can roughly preserve the surface cover structure. However, poor clustering results due to randomly selected initial values in the clustering algorithm can seriously affect the accuracy of mixed image element decomposition, leading to reduced prediction accuracy and stability of the model. Additionally, this method tends to ignore the details inside the image. The predicted time of FSDAF assumes that there is no land type change occurring between T B and T P , and also the change of each end element is the same in the coarse resolution image. However, this assumption can introduce a great uncertainty to the time prediction. 2) In terms of spatial variation dimension: The rapid growth of forest crops and human activities can cause significant spectral changes on the surface. TPS interpolators are commonly used in FSDAF and some existing algorithms can perform well in homogeneous regions. However, their interpolation results are often too smooth in heterogeneous regions, which ignores important spatial details. Additionally, the assumption that fusion errors come from homogeneous landscapes in FSDAF lacks the theoretical basis.
3) In terms of sensor differences: Existing spatiotemporal fusion algorithms do not fully consider the differences between sensors in fine and coarse images and their impact on fusion. This issue has attracted many scholars' attention, and a linear model has been developed to address the problem of sensor differences between two sensors [47]. However, this solution does not completely solve the problem of alignment errors. To deal with the above difficulties and problems, this article proposes a rigorously-incremental spatiotemporal data fusion (RISDAF) method for fusing remote sensing images with different resolutions, which is verified by the fusion of fine images and coarse image. The proposed algorithm provides a better solution for heterogeneous data with strong phenological changes and areas with changes in surface types. Moreover, the proposed algorithm exhibits good stability and robustness.
In this article, Landsat and MODIS data are adopted as fine and coarse images, respectively. The main contributions of this work are concluded as follows.
1) To address the issue of inaccurate prediction of temporal changes, we use a particle swarm optimized Gaussian mixture model for end element extraction called particle swarm optimization Gaussian mixture model (PSO-GMM). This method overcomes problems associated with poor clusterings, such as unbalanced samples and noncylindrical data, and improves the accuracy and adaptability of mixed image element decomposition. Additionally, the linear regression algorithm is used to correct sensor errors. Furthermore, the difference between fine and coarse pixels is standardized when allocating time changes to fine pixels. 2 To overcome the issue of inaccurate prediction of spatial changes, we use bicubic interpolation for spatial interpolation instead of TPS interpolation, which is commonly used for spatial prediction. This method preserves the connectedness of spatial increments and improves computational speed, scalability, and smoothness. Furthermore, the article does not incorporate temporal and spatial prediction results directly into the computational process. Rather, a weighting algorithm is employed to combine the weights of temporal and spatial increments. These weights are calculated via the support vector regression (SVR) algorithm, which enhances the robustness and accuracy of data fusion. 3) To resolve the issue of sensor errors, this article introduces sensor errors into the residuals and assigns them to each fine pixel. This correction improves the spatial distribution of image fusion results for reliability and reduces the impact of sensor errors on image fusion. The rest of this article is organized as follows. Section II presents the specific architecture and implementation process of the proposed algorithm RISDAF. Section III describes the dataset and experimental settings, while Section IV presents the experimental results and analysis. Section V provides a discussion and shows some necessary intermediate experimental results during the experiments. Finally, Section V concludes this article.

II. METHODS
The proposed RISDAF algorithm takes the coarse and fine images at the T B moment and the coarse image at the T P time as inputs to predict the fine image at the T P time. Here, T B and T P represent the base date and predicted date, respectively. The proposed model aims to address the following problems: inaccurate endmember division in unsupervised classification during the spectral unmixing process, the accuracy deviation of fusion results caused by sensor errors, and the difference in spectral changes caused by strong spatial changes. The overall idea can be summarized as shown in (1). The fine image pixel value F P at the moment T P equals the sum of the fine image pixel value F B at the moment T B , the increment ΔF ST and the residual ε is The proposed model can be divided into four main steps as follows. The first step is the temporal prediction based on spectral unmixing. The second step is the spatial variation based on land cover combined with the temporal prediction results by a weighting algorithm. The third step is the residual correction that enhances the fusion accuracy by introducing the residual r i and sensor error r e . Finally, the last step is the enhanced neighborhood prediction by spatial filtering. Fig. 2 shows the flowchart of the proposed RISDAF.

1) Mixed Pixel Decomposition Based on PSO-GMM:
The spectral unmixing of remote sensing images acquired by land satellites is a challenging task due to multiple surface heterogeneous coverage types within a single pixel. In this regard, we perform endmember determination and image boundary extraction on the fine image at T B prior to spectral unmixing. However, traditional clustering algorithms used in spectral unmixing suffer from poor stability and sensitivity to cluster center selection. To address this issue, the proposed approach utilizes a PSO-GMM to extract endmembers. The PSO algorithm optimizes the objective function in GMM clustering while adjusting the position pbest i = (p i1 , p i2 , . . . , p iD ) and velocity ν i = (ν i1 , ν i2 , . . . , ν iD ) of each particle by t iterations to ensure convergence of the algorithm within a certain range. PSO-GMM addresses the problem of poor clustering for unbalanced samples and noncylindrical data, which are the most challenging tasks for the traditional clustering algorithms such as ISODATA and K-means algorithms. Using PSO-GMM to decompose the surface reflectance, the endmember hard classification map is generated, and the abundance value of each endmember is calculated for each fine pixel. This method reduces the problem of poor clustering results caused by the random selection of initial values in the GMM clustering algorithm and improves the accuracy of endmember extraction. The abundance value of each endmember is expressed as shown in where N C (x i , y i ) is the number of fine pixels belonging to m class at the coarse pixel (x i , y i ), and k is the number of fine pixels in a coarse pixel.
The RISDAF algorithm assumes that the land type remains unchangeable during the time prediction period. Based on the mixed pixel theory, we assume that the pixel value of the image is a linear combination of the endmember values and their corresponding abundances. Therefore, the pixel values of Landsat and MODIS at T B can be expressed as where b represents the band b, n represents the endmembers of the fine image at T P , m represents the mth endmember according to the linear mixed model. A represents the abundance value of the endmember, E represents the reflectivity of each endmember, and k represents a coarse-resolution pixel containing k fine-resolution pixels. As shown in the following equation, in the mixed pixel decomposition, the time change of the coarse pixel is the weighted sum of all the category changes that it contains: 2) Adjust the Differences of Sensors: Sensor errors can arise from various sources, including the specific design of the sensor, such as its bandwidth, imaging angle, and spectral response function, among others. These errors can lead to nonuniform reception of images that result in varying capabilities to capture land surface information. Due to the inherent of sensor differences, errors are inevitable during the imaging stage. Therefore, correcting the differences between the imaging of the two types of sensors is crucial. In this article, we propose a linear model to adjust the relationship between the two types of sensors, which standardizes the differences and corrects the sensor errors, as follows: 3) Temporal Increment Prediction: In FSDAF, the time change increment from T B to T P is assumed to be calculated directly as the difference between the coarse images at the two-time points, which can lead to significant uncertainty in the time prediction results. To address this issue, we propose to employ a subpixel-based approach for predicting the time of fine images at T P . The proposed RISDAF algorithm assumes that there is no change in land cover during the spatiotemporal fusion process, which means that endmembers and abundances of Landsat pixels remain constant. The time change from T B to T P can then be calculated as follows: Since A is only known in (7), the change in time cannot be calculated. However, the time change of the coarse image can be expressed as follows: In this article, the SVR algorithm is selected instead of the traditional least squares method to solve for ΔE C , which improves Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
the robustness of abnormal data. The preliminary results of the time increment prediction can be shown as in

B. Spatiotemporal Increment Prediction for Land Cover
Change at T P If the land cover category within the coarse pixel scale undergoes significant changes, it is demonstrated that the coarse image contains information that can provide insight into the changes in vegetation cover within the image. The incremental value of spatial dependence can be estimated by interpolation as follows: Therefore, the change information can be directly obtained. However, previous studies have shown that the TPS interpolator performs better in regions with similar characteristics but results in oversmoothed interpolations in heterogeneous regions, disregarding important spatial details. To address this issue, we employ the bicubic interpolation method to interpolate the coarse image at time T P to the fine scale, and hence obtaining spatial prediction results for the time of T P . This method enhances the information contained in the transition resolution image at T P and also improves the accuracy of the spatial prediction results.
Since the temporal and spatial predictions are two separate parts. The temporal prediction maximizes the spatial detail and accuracy of remote sensing images but fails to capture the overlay changes during the spectral unmixing. On the other hand, the spatial prediction captures the overlay changes but ignores the spatial details of the image. Therefore, the spatial prediction results and temporal prediction results can be combined together by a weighting algorithm to improve the robustness and accuracy of the fusion algorithm. In this article, the SVR algorithm is used to solve the optimal temporal and spatial incremental weights, and the objective function of the weight increments can be expressed as where W represents the weights of the features, Q is the regularization constant, and is the sitar insensitive loss function. ω s and ω t represent the weights of the spatial and temporal increments, respectively. Theoretically, the sum of these two weights is 1. ΔC S h , ΔC T h , andΔ C h are the spatially relevant increment, the temporally relevant increment and the true increment of the h th coarse pixel, respectively. The final incremental weighted sum of spatial and temporal increments is shown as

C. Distribution of Residuals
Although the incremental combination ΔF ST after the weight distribution combination can capture the fine change increment, there are still some errors. This calculation process is similar to FSDAF, we introduce a residual between T B and the predicted value T P , assuming that the residual R is related to the heterogeneity between the images, and calculates the spatial distribution of the pixel residual of the fine image between the predicted value and the real pixel value. The specific calculation process can be referred to (14)- (19) in FSDAF [33] and assign the residual distribution to each fine pixel in the image to obtain Strong temporal variations and errors between sensors can also lead to differences in data during the fusion process. TPS interpolation is used in FSDAF to guide the residual distribution by applying the equilibrium index HI from the classification map of fine images at T B , but when the land type at T P changes, there will be a large error in the estimation of residuals. In this article, we propose new residuals by combining the differences that exist between sensors in the fusion process to address the issue of strong temporal variations and errors between sensors. In the calculation of time prediction results, the reflectivity of the m-terminal element in the b-band at the T B moment, denoted as E B F (m, b), can be calculated from (15). However, the calculated result is not the pixel value at the real T B moment and it is instead denoted by E B F (m, b). A linear model is constructed between these two values to calculate the error value formula of each pixel as Since σ only calculates the difference between the two, it does not consider the impact of its difference on the spatiotemporal fusion results. Therefore, on this basis, this article proposes the reliability distribution residual r q (x i , y i , b) under the Gaussian distribution to normalize it as where mean represents the standard deviation of the data and stddv represents the standard deviation, to ensure that the residual is within a reasonable range. Assigning it to the fine pixels results in r e (x ij , y ij , b). Ultimately, the overall residuals assigned to each fine pixel are expressed by R(x ij , y ij , b) as

D. Spatial Filtering Based on Sliding Window
Due to spectral discontinuity at the boundary of low spatial resolution pixels, there is a blocking effect when the image is mixed, resulting in a loss of spatial details. Therefore, spatial filtering is used to mitigate this problem and achieve fine image prediction at T P . In spatial filtering, pixels with similar spectral and land cover information are considered to be similar pixels. In the sliding window, up to v adjacent pixels with similar spectra are selected for each target fine pixel based on spectral distance.  The weight of each similar pixel is represented by the Euclidean distance from it to the target fine pixel as follows: where L represents half of the sliding window size (for instance, L = 10 represents the window size of 21 × 21 fine pixels). Therefore, the weight of each fine pixel benefits from the calculation method in ESTARFM [18], which can be calculated as follows: In this article, the size of the sliding window is optimized through a series of trial-and-error experiments. The experimental results have led to the decision to set the sliding window size to 41 × 41, in order to balance the prediction accuracy and computational efficiency in the fusion process. The details of the comparison experiment can be found in Section IV. The threshold for similar pixels is set to 10-30. If the number of similar pixels exceeded 30, only the top 30 pixels are used. The resulting high spatial resolution image prediction at the final T P can be expressed as follows:

A. Study Areas and Datasets
In the experiments, two publicly available datasets proposed by Emelyanova et al. [48] are used to validate the effectiveness and stability of the RISDAF. The dataset consists of two sites, namely the heterogeneous Coleambally Irrigation Area (CIA) and the homogeneous Lower Gwydir Catchment (LGC), which have large-scale land cover changes.
The first study area CIA is located in the southern region of New South Wales, Australia, which is located at 34.0034E, 145.0675S and covers an area of 2193 km 2 . The CIA dataset mainly covers areas of agricultural rice fields and woodlands with neat boundaries and large extent, and after the summer season, the plants grow luxuriantly, and the landmarks have more obvious physical and spatial changes. In this experiment, Landsat images from Landsat-7 ETM+ are used as fine images while MODIS images from Terra MODTRAN4 are used as coarse images. The CIA dataset uses the image pairs on November 25, 2001, and the MODIS image on January 12, 2002, as shown in Fig. 3 to predict the Landsat image on January 12, 2002. The actual Landsat image on January 12, 2002, as shown in Fig. 3(d) is used for verification.
The second study area, LGC data are located in the northern region of New South Wales, Australia, which is located at 149.2815E, 299.0855S, and covers an area of 5440 km 2 . In this experiment, Landsat images from Landsat-5 TM are used   Fig. 4(d) is used for verification.
The MODIS and Landsat images in the dataset are acquired on the same date and undergo preprocessing, including atmospheric and geographic correction. To match the bands in the dataset, six groups of bands similar to Landsat and MODIS are selected, and a scale factor of 20 is applied between the two images. The image size of the CIA dataset is 800 × 800 pixels, and the image size of the LGC dataset is 1200 × 1200 pixels. To meet the experimental requirements and match the Landsat resolution, the MODIS image is upsampled to 25-m resolution through nearest-neighbor interpolation. In this article, all images use the combination of NIR-red-green bands for RGB visualization, thereby facilitating more straightforward identification of land object types such as vegetation and water.

B. Experimental Settings 1) Implementation Details:
The experiments are conducted on A620-G30 servers, each of which is configured with two AMD EPYC 7281 16-Core processors and 256 GB memory. In order to ensure fair comparisons, we use the same settings for all methods that we replicate. Experimental parameters are set to default values, as specified by the authors of each corresponding paper.
2) Experimental Design: In this article, we design a threepart experiment to verify the proposed RISDAF method. First, RISDAF is compared with several widely used algorithms, including the traditional weight-based algorithm STARFM, the flexible mixing-solution-based spatiotemporal fusion algorithm FSDAF, and all of the above algorithms using fine and rough pair of images as input. Although there are many scholars making improvements based on the FSDAF algorithm, the overall algorithm structure is similar, and FSDAF has representativeness and stability, and it has been widely used in various fields. Therefore, FSDAF is chosen as a benchmark algorithm for experimental comparisons. To ensure the fairness and authenticity of the comparative experiments, all algorithms use the default parameters provided by their respective authors during the experimentation process. First, quantitative metrics are employed to compare and analyze the prediction results of the three algorithms. Second, the experimental results are compared at the visual level by visualizing the partially enlarged results within their respective subregions. Scatter plots of the predicted and measured data in NIR bands are also plotted to aid in the analysis. Finally, the usability of the proposed algorithm is separately analyzed through ablation experiments.
3) Accuracy Assessment: To evaluate the effectiveness of the proposed method, we conduct a computational comparison of the experimental results with the corresponding real images. Five metrics are used to measure accuracy in the experiments: the root-mean-square error (RMSE), the correlation coefficient (CC), the structure similarity (SSIM), the spectral angle mapper (SAM), and the enhanced reconstruction of grayscale and aerial signal (ERGAS). These metrics are commonly used for evaluating the spatiotemporal fusion of remote sensing images. Specifically, RMSE and CC measure the differences between predicted values while SAM indicates the degree of spectral distortion, and SSIM measures the degree of texture similarity between spectra. A smaller value of RMSE, SAM, and ERGAS typically corresponds to a larger value for CC and SSIM, which indicates better fusion results.

A. Results and Analysis of Heterogeneous Regions
The quantitative measures of the CIA dataset are presented in Table I, and the best results are indicated in bold font. Overall, compared with STARFM and FSDAF, the proposed RISDAF algorithm has the lowest RMSE, ERGAS, and SAM, and the highest CC and SSIM. Among the results calculated in six bands, most of them are the best, except for the ERGAS of the blue and green bands, indicating the overall best performance of the predicted results. As the two sets of data in the experiment are collected during the vegetation growth period, as the cell structure in the vegetation leaves will strongly reflect near-infrared light, resulting in very bright reflections in the NIR band. Therefore, the NIR band is extensively used in vegetation growth monitoring. We calculated the percentage improvement of RISDAF over FSDAF in the six bands across four metrics: 1) RMSE, 2) CC, 3) SSIM, and 4) ERGAS. We found that the improvements in the NIR band outperformed the remaining five bands, with percentage increases sequentially reaching 12.2%, 12.1%, 11.5%, and 11.1%. As vegetation rapidly grows, the NIR band exhibits the greatest uncertainty in RISDAF, indicating that RISDAF is more accurate in capturing heterogeneous land and ecological changes.
In this experiment, the heterogeneous dataset CIA does not undergo any significant category changes, but there are obvious physical changes in two time periods. Therefore, the experimental results focus on observing the ecosystem dynamics of the predicted images and processing the image edges. Fig. 5 shows the original Landsat image and the prediction results of the three algorithms. The experimental outcomes from STARFM exhibit substantial boundary blurring, and distortion is evident in some images. Considering the CIA dataset, which lacks prominent changes in land cover types, the predicted images generated by RISDAF demonstrate higher precision across the full spectral range compared to FSDAF. Compared to FSDAF, the predicted images produced by RISDAF demonstrate higher precision across the entire spectral range. RISDAF more accurately models and predicts spectral diversity under conditions of spatiotemporal heterogeneity. Fig. 6 shows the zoomed-in orange and yellow areas of Fig. 5 to compare the differences between the predicted results and  the actual images. Both regions exhibit significant phenological changes during the vegetation growth period, and the predictions generated by the three algorithms are generally consistent with the real Landsat images. Based on visual interpretation, all three methods accurately capture the phenological changes between November 25, 2001, andJanuary 12, 2002. However, the structure of the proposed RISDAF algorithm is seemed to be closer to the original images than those of STARFM and FSDAF. The enlarged orange area is shown in Fig. 6(a)-(d). The RISDAF algorithm can capture the small irregular objects and pixel value changes more accurately and has more significant advantages in monitoring small farmland. Only the proposed method can correctly predict the physical changes and boundary areas in the orange area in Fig. 5(a), STARFM and FSDAF boundaries are blurred. This is because the proposed RISDAF combines temporal and spatial increments in its computation process, leading to advancements in restoring spatial details. Additionally, the magnified yellow regions in Fig. 5(a)-(d) correspond to what is shown in Fig. 6(e)-(h). Although the results after spectral rendering imaging predicted by the three algorithms are not exactly close to the original image, the image predicted by the proposed RISDAF algorithm has the clearest image structure as well as acquires sharper image boundaries, which is superior to STARFM and FSDAF.
In vegetation remote sensing, the reflection in the nearinfrared region is highly influenced by the internal structure of the leaves. Therefore, we select the predicted and actual data of the near-infrared band to create a scatter plot. In the CIA dataset, the scatter plots of the NIR bands based on STARFM, FSDAF, and the proposed RISDAF are shown in Fig. 7. As can be seen from Fig. 7, there is no significant bias among the three algorithms. However, by calculating R 2 , the results from the proposed RISDAF are superior to STARFM and FSDAF, being closer to the 1:1 line, indicating better fitting performance and smaller errors between the actual values and predicted values.

B. Results and Analysis of Homogeneous Regions
The experimental quantitative evaluation indicators of the homogeneous LGC dataset are shown in Table II. Compared with STARFM and FSDAF, the results of RISDAF prediction have the lowest RMSE, ERGAS, and SAM, the highest CC and SSIM, and the best effect. This shows that the proposed RISDAF algorithm has more powerful spectral retrieval and image reconstruction capabilities when the land scale changes on a large scale. The NIR band shows the most tremendous uncertainty in RISDAF and FSDAF, and the prediction effect is the best. In addition to the NIR band, each band also shows a good performance. The accuracy improvement in SWIR1 band and SWIR2 band is obvious, second only to the NIR band, and the RMSE single band index is increased by 5.5%. Therefore, the proposed RISDAF algorithm can predict large-scale land changes better compared with other algorithms. Fig. 8 shows the Landsat image on February 14, 2005, and the prediction results of the three algorithms. It can be seen from the map that due to the impact of floods in two time periods, after the flood, the recovery of the surface caused the change of land cover to a certain extent, and some features did not recover as before. Fig. 9 shows the enlarged orange and yellow areas of Fig. 8 to compare the differences between the predicted results      Fig. 8. The corresponding results for the subarea in orange in Fig. 8(a)-(d) are (a)-(d), the corresponding results for the subarea in yellow in Fig. 8(a)-(d) are (e)-(h). and the actual images. As can be seen in Fig. 9(a)-(d), Although STARFM and FSDAF also accurately capture physical changes due to seasonal changes, the predictions in the border areas are not accurate, small fields in some areas appear to be mixed, and there are traces of flooding that have not recovered in the FSDAF predicted images. By visual comparison, the STARFM predictions are largely accurate in Fig. 9(e)-(h), but the results of STARFM and FSDAF simulations generate images with less spatial detail, compared to RISDAF, which retains more image details with sufficient spectral similarity.
In the LGC dataset, the scatter plots of the NIR bands based on STARFM, FSDAF, and RISDAF are shown in Fig. 10. It can be seen from Fig. 10 that the proposed RISDAF is obviously closer to the 1:1 line. Furthermore, R 2 reached 0.91628, which was notably higher than that of STARFM and FSDAF, indicating that the detailed information can be better retained and the prediction accuracy can be improved in the case of surface-type mutation.    changes can be captured after decomposing mixed pixels, spatial prediction better preserves the spatial structure and details of the original image, making them clearer. Residual distribution can weaken the impact of time and space prediction as well as sensor differences. Final spatial filtering can eliminate the impact of the block effect. The classification of ground objects is clearer with more distinct boundaries and stronger spatial consistency, which results in more clear images. Fig. 12 shows the time prediction, joint spatial prediction, residual correction, and spatial filtering prediction results of the proposed RISDAF algorithm on the LGC dataset based on mixed pixel decomposition. The RMSE values are 0.02988, 0.02706, 0.02673, and 0.02580, respectively. The predicted fine image results on February 14, 2005, gradually converge to accuracy. During the period of image prediction, changes occurred due to flooding, leading to alterations in the types of ground cover. As a result, the focus of the ablation study results was on how to restore the flood-stricken areas, track the phenological changes of vegetation, and predict changes in types of ground cover. While the temporal prediction based on spectral unmixing can roughly predict the cover status and physical changes after the flood. However, since the assumption in the temporal prediction is that the type of coverage has not changed, the areas affected by the flood are somewhat blurred, resulting in low precision of the experimental results and poor model robustness and stability. Following spatial prediction, the accuracy of the experimental results significantly improved, the RMSE value increases by 9.4% compared to the first step, which is higher than the CIA dataset, due to the occurrence of flooding. However, RISDAF is able to retain better mutation details and predict more accurately for feature types covered by floods. To account for sensor error and spatiotemporal prediction error, residual correction is employed to recover the previously unrecovered portion. Spatial filtering is also applied to retain more spatial details, which result in improving prediction accuracy and phenological change trend.

V. DISCUSSION
As remote sensing technology continues to develop, numerous spatiotemporal fusion model algorithms have been proposed to integrate remote sensing images with different temporal and spatial resolutions. This integration enables the monitoring of long-time series through remote sensing images. However, in practical applications, fusion accuracy is often compromised by strong surface cover type shape changes, such as floods and disturbances. Through the extensive experiments conducted in this article, the proposed RISDAF algorithm achieves better prediction in both late recovery and strong physical change prediction of heterogeneous data. Our ablation experiments have proven the algorithm's indispensability in every part. In this article, we analyze and discuss the parts that are not analyzed in detail in the previous studies and present some necessary intermediate experimental results during the experiments.

A. Difference Between Proposed RISDAF and FSDAF in Time Increment Prediction
In the process of mixed pixel decomposition, clustering algorithms play a crucial role in endmember extraction and assigning each pixel to appropriate categories based on the similarity of pixels within the same category. However, calculating temporal increments is based on the similarity of pixels within the same category, making clustering algorithms very important. Our article proposes a new approach, the Gaussian mixture model clustering algorithm based on particle swarm optimization (PSO-GMM), to replace the ISODATA clustering algorithm in FSDAF. The ISODATA algorithm requires manual parameter tuning and is vulnerable to noise interference in large-scale calculations. In contrast, the PSO-GMM algorithm is capable of better exploring space [49], allowing the GMM clustering algorithm to quickly converge to optimal solutions and adaptively adjust parameters for different types and scales of datasets. Moreover, running the PSO-GMM algorithm multiple times reduces the impact of random initialization on clustering results and enhances the robustness of the clustering process. By accurately extracting different endmembers, this method significantly improves the accuracy of spectral unmixing.
Traditional temporal increment calculations in spatiotemporal fusion typically involve directly converting coarse pixel changes into fine pixel changes within a certain timeframe. This approach, however, overlooks the characteristic differences between various sensors and their response disparities to specific cover types, which could lead to errors in the fusion results. Furthermore, the alignment error between remote sensing images can impact the accuracy of spatiotemporal fusion outcomes, as it reflects the discrepancy between observed and underlying variables. To address these issues, our research introduces a linear model that registers the coarse and fine pixels and includes them in the computation of the temporal increment. This allows the model to consider the characteristic differences among sensors during the calculation process, thereby enhancing the precision of the fusion results. Experimental results show that, compared to FSDAF, the proposed RISDAF more effectively retains intraclass spatial details, particularly in predicting images after sudden changes in cover type. RISDAF demonstrates superior adaptability and accuracy, implying that the proposed RISDAF has greater adaptability and predictive capabilities in handling complex spatiotemporal fusion challenges. This offers a novel, more precise solution for spatiotemporal fusion in remote sensing imagery.

B. Advantages of Combining Time and Space Increments
Most hybrid spatiotemporal data fusion algorithms such as FSDAF can capture land cover changes through spatial increments [50]. The IFSDAF builds on this by introducing constrained least squares (CLS) to combine the temporal predictions after spectral unmixing with the number of spatial changes after TPS interpolation to obtain the best predicted amount. In contrast, the RISDAF proposed in this article applies bicubic interpolation to calculate spatial increments, which has the following advantages over TPS interpolation in predicting spatial increments. 1) Due to the variety of surface types of datasets, the bicubic interpolation considers more surrounding data points, leading to the better fitting of local variations and generating smoother results. 2) On large-scale datasets measured in pixels, TPS interpolation requires solving a large-scale linear equation system, which can be relatively slow, especially on large datasets. In contrast, bicubic interpolation can precompute the coefficient matrix and perform interpolation using simple matrix multiplication, resulting in faster computation. The CLSs method is replaced by an SVR algorithm to solve the weights and combine the temporal and spatial increments. The SVR is based on the idea of a support vector machine (SVM), which has better generalization ability on multidimensional data. It maps low-dimensional data to high-dimensional data through kernel functions and flexibly controls the fitting accuracy and robustness of outliers by setting the parameters of the loss function. The proposed RISDAF improves the performance and robustness of the model by calculating the weighted sum of temporal and spatial increments through weight allocation. This approach enables the provision of more accurate and robust prediction results.

C. Improvement of Residual Allocation
In the spatiotemporal fusion model, the residual is defined as the difference between the predicted image and the true image, which is used to guide the generation of the predicted image [51]. Residual correction, as an important step to improve model accuracy, has been widely applied in spatiotemporal fusion algorithms. FSDAF introduces the homogeneity index (HI) for residual allocation to capture land cover changes. However, FS-DAF suffers from serious collinearity problems where changes in independent variables can cause variance changes in residuals, thereby affecting the accuracy and reliability of spatiotemporal fusion models.
The proposed RISDAF algorithm introduces sensor errors into the residual calculation to address this issue. As different types of errors exist between different sensors such as systematic errors and random errors, it is necessary to consider their effects when allocating residuals. Linear correction is used to adjust differences between different sensors, and residual coefficients (r e ) are proposed to convert them into the same reflectance values. Specifically, a fitting method is used for linear correction to establish reflectance conversion between different sensors and ensure that residuals are evenly distributed, thereby improving the accuracy and reliability of model prediction results.

D. Algorithm Performance Analysis Influenced by Moving Window Size
Consideration of the remote sensing imaging edge effect is crucial in the spatiotemporal fusion of remote sensing images. Due to the intricate classification of image features, the features of the image typically change during prediction. In computing the pixel value of the target pixel, neighboring pixels with similar spectra within the sliding window can be selected for computation. Generally, selecting a large sliding window increases the computational workload within the window and decreases the correlation between the center target pixel and the edge pixel. On the other hand, selecting a small sliding window may not yield distinct feature calculation results for the central target pixel. Therefore, choosing an appropriate sliding window size to select similar pixels can significantly improve image prediction accuracy.
In this experiment, the heterogeneous dataset CIA and homogeneous dataset LGC are used for the calculation under different window sizes, which are 11 × 11, 21 × 21, 31 × 31, 41 × 41, 51 × 51, and 71 × 71. The experimental results of the CIA dataset with different sliding window sizes on January 12, 2002, are shown in Table III. The experimental accuracy does not improve because the sliding window size increased, and the five evaluation metrics of the proposed RISDAF algorithm are   ON LGC DATASET optimal when the sliding window size is 41 × 41 OLI pixels. The experimental results of the LGC dataset with different sliding window sizes are shown in Table IV. The experimental accuracy is optimal when the sliding window size is 41 × 41 OLI pixels, and compared with the window size of 11 × 11 OLI pixels, RMSE improves by 3.78%, CC improves by 3.81%, SSIM improves by 3.39%, ERGAS improves by 5.19%, and SAM improves by 8.02%. Therefore, the sliding window of 41× 41 OLI pixels is the best parameter for the experiment, which achieves the smoothing effect while retaining the spatial details.

E. Further Improvement of RISDAF
The aforementioned experimental results and analysis demonstrate that the proposed spatiotemporal fusion algorithm RISDAF provides an improved solution for heterogeneous data with strong phenological changes and regions with surface-type variations. This enhancement improves the accuracy of fusion, yet it is undeniable that the algorithm has certain limitations. Most of the current spatiotemporal fusion algorithms, including the one presented in this article, are predominantly based on public datasets for experiments and analysis, thereby heavily relying on the quality of the input data. In the process of spectral unmixing, RISDAF depends on the accuracy of land classification. If applied in real-world scenarios, the algorithm's effectiveness might be reduced due to the possibility of multiple types coexisting within a single pixel in a heterogeneous landscape. Furthermore, when handling large real-world datasets or generating and analyzing long-term sequence data, the algorithm demands substantial computational resources. This constrains the feasibility of the model in scenarios where resources or processing time are limited. Therefore, improving model precision and fusion efficiency for real data types will be the focal point of future research.

VI. CONCLUSION
This article proposes a RISDAF method to address the difficulties and problems of fusing remote sensing images with different resolutions. The RISDAF uses Landsat and MODIS data as fine and coarse images, which are compared with two excellent spatiotemporal fusion algorithms in terms of quantitative metrics and visual interpretation experiments. The RISDAF method provides a better solution for heterogeneous data with strong phenological changes and areas with changes in surface types, improving the accuracy and adaptability of mixed image element decomposition, scalability, and smoothness. Furthermore, through ablation experiments, it has been verified that each part of the proposed model in this article has an irreplaceable role.
In summary, RISDAF provides a more reliable solution to improve the accuracy of mixed pixel decomposition, optimize spatial details by calculating the weighted sum of temporal and spatial increments, and reduce the impact of sensor differences on spatiotemporal fusion, which improves the stability and robustness of the algorithm. This improvement is beneficial for effective dynamic land surface monitoring through satellite imagery.