Unmixing-Based Spatiotemporal Image Fusion Based on the Self-Trained Random Forest Regression and Residual Compensation

Spatiotemporal satellite image fusion (STIF) has been widely applied in land surface monitoring to generate reflectance images with both high spatial and high temporal resolution from satellite sensors. This article proposes a new unmixing-based spatiotemporal fusion method composed of a self-trained random forest machine learning Regression (R), low resolution endmember Estimation (E), high resolution surface reflectance image Reconstruction (R), and residual Compensation (C), that is, RERC. RERC uses a self-trained random forest to train and predict the relationship between spectra and the corresponding class fractions. This process is flexible, requires no ancillary training dataset, and does not suffer the limitation of linear spectral unmixing, which requires the number of endmembers to be no more than the number of spectral bands. The running time of the random forest regression is about 1% of that of the linear mixture model. In addition, RERC adopts a spectral reflectance residual compensation approach to refine the fused image and make full use of the information in the low resolution image. RERC was assessed in fusing a prediction time Moderate Resolution Imaging Spectroradiometer (MODIS) image with a known time Landsat image using two benchmark datasets, and was assessed in fusing images with different numbers of spectral bands by fusing a prediction time Landsat image (seven bands used) with a known time very-high-resolution PlanetScope image (four spectral bands). RERC was also assessed in fusing MODIS-Landsat imagery over large areas at the national scale for the Republic of Ireland and France. The code is available at https://www.researchgate.net/profile/Xiao_Li52.


I. INTRODUCTION
Accurate monitoring of the Earth's land surface is crucial for understanding the environment and its dynamics [4], [5]. Monitoring of large-area Earth surface dynamics has been greatly facilitated by the development of satellite remote sensing techniques. For instance, the Advanced Very High Resolution Radiometer (AVHRR) enabled monitoring of the Earth at a spatial resolution of approximately 1 km and on a potentially daily basis. The Moderate Resolution Imaging Spectroradiometer (MODIS) can also monitor the Earth's surface on a daily basis with a spatial resolution of 250-500 m. High resolution (HR) imagery provides monitoring capabilities at resolutions typically finer than 100 m, but less frequently than low resolution (LR) imagery. The Landsat series has been monitoring the Earth every 16-18 days at a 15-80 m resolution for approximately 50 years [7]. Sentinel-2 multispectral imagery has provided a revisit frequency of approximately 5 days since 2016 [10]. Although multiple optical satellite remote sensing systems allow the monitoring of the Earth, satellite remote sensing is often limited by the trade-off between spatial and temporal resolutions [11], [12], [13], [14].
Different from the aforementioned STIFs, the other group of STIF models is the unmixing-based STIF, which reconstructs an image at the spatial resolution of the HR image at a known time while preserving spectral reflectance information from the LR image at the prediction time [6], [62], [63]. The classic unmixing-based STIF is composed of four steps: 1) classifying the HR image into a multiclass land cover map; 2) calculating the LR class fraction images based on the HR multiclass land cover map; 3) estimating the LR image endmember spectra based on the LR image and the corresponding class fraction images at the prediction time; and 4) assigning the estimated endmember spectra to the HR land cover map according to the corresponding pixel labels. Unlike the aforementioned STIF models such as the spatial and temporal adaptive reflectance fusion model (STARFM) [1] and flexible spatiotemporal data fusion (FSDAF) [38], which require the HR and LR images to have the same or similar spectral bands, the unmixing-based STIF can fuse HR and LR images with different spectral bands and is more flexible than other fusion methods. For instance, the unmixing-based STIF has been used to fuse Landsat TM and MERIS to generate a 30 m resolution multispectral image that preserves MERIS's spectral information of 15 spectral bands [63], [68]. The unmixing-based STIF can fuse HR and LR images acquired on different dates, unlike spatial-spectral fusion or pansharpening, which requires the HR and LR images to be acquired from the same sensor or at the same or similar dates. In addition, the unmixing-based STIF requires fewer inputs than other STIF models: a prediction time LR image and a known time (either predating or postdating the prediction time) HR image (see Fig. 1). It is thus more flexible than the STARFM-like and FSDAF-like STIFs, which require both the HR and LR imagery at the known time to be available.
The first unmixing-based STIF was a multisensor multiresolution image fusion [62], and the approach has been greatly improved in recent years. Zurita-Milla et al. [63] fused Landsat TM with MERIS imagery to generate a 30 m resolution multispectral image that preserves MERIS's spectral information. Amorós-López et al. [68] proposed a model suitable for complex heterogeneous regions by fusing Landsat with MERIS imagery for crop monitoring. Liu et al. [66] applied the linear spectral unmixing model to generate an HR class fraction image, which was used as a substitute for the pixel-based hard classification map to enhance fusion accuracy in regions of heterogeneous land cover. Wang et al. [55] proposed a block-removed unmixing that effectively reduced the blocky effect in the STIF. Unmixing-based STIF has also been applied to generate multispectral or hyperspectral reflectance imagery and the corresponding vegetation indices at high spatiotemporal resolutions. For instance, Busetto et al. [69] estimated time series sub-pixel normalized difference vegetation index (NDVI) images by the fusion of MODIS and Landsat imagery, and Zurita-Milla et al. [6] estimated vegetation indices to monitor the seasonal changes in vegetation by the fusion of MERIS full-resolution and Landsat imagery. Unmixing-based STIF has also been combined with weighted-function-based fusion to further exploit the advantages of each fusion method. For instance, Gevaert and García-Haro [65] combined the strengths of unmixing-based STIF and STARFM to make the fusion result less sensitive to reflectance changes. Xu et al. [70] modified the unmixing-based STIF by adding a regularization term on the endmember spectra to ensure that the extracted endmember spectra did not differ greatly from the predefined endmember spectra. Jiang and Huang [71] used two spectral unmixing approaches and STARFM to reduce blurring problems.
Although the unmixing-based STIF has several advantages over other STIFs, several limitations exist.
First, most unmixing-based fusion methods assume that the HR image pixels are pure and assign each HR pixel to a single class based on clustering or classification algorithms applied to the HR image. If neighboring HR pixels within the same LR pixel are labeled with the same class, they are assigned the same spectra in the fused image. As a result, the fusion homogenizes the spectra of neighboring HR pixels of the same class, that is, the homogenization effect. This phenomenon makes these unmixing-based STIFs unable to represent the intraclass spectral variability of neighboring same-class HR pixels [55], [63], [66]. To address the mixed pixel problem, which is also common in HR imagery, and to reduce the homogenization effect in the unmixing-based STIF, multitemporal fusion (MTF) uses a soft clustering algorithm to map HR class fraction images [68]. However, the MTF unmixing result may be sensitive to the fuzzy parameter used in the soft clustering algorithm: the unmixing resembles the result of hard classification if the fuzzy parameter is small, and yields similar class fractions for all classes in an HR pixel if the fuzzy parameter is very large [72]. The linear spectral unmixing-based spatiotemporal data fusion model (LSUSDFM) has used the fully constrained least squares linear spectral mixture analysis (FCLS) algorithm to spectrally unmix the HR reflectance image [66]. However, FCLS involves an inversion problem that is ill-posed when the number of clusters exceeds the number of spectral bands in the HR image; it also fails to consider the intraclass spectral variability of the endmembers and cannot handle multiple scattering effects with a linear mixture model. Moreover, the FCLS inversion is an optimization approach that is usually time-consuming.
The second limitation of the current unmixing-based STIF model is that it only uses the LR image at the prediction time for endmember extraction and fails to fully use the LR image. In particular, the unmixing-based STIFs estimate LR image endmember spectra based on a set of LR pixels in a local window, with the target LR pixel as the window center, and assign the endmember spectra to the HR scale according to the pixel labels generated from the HR image to generate the fused image. The estimated endmembers do not reflect the information of solely the target LR pixel, but represent the averaged spectral information within the local window. As a result, the estimated LR endmembers may not represent a drastic reflectance change that occurred mainly in the target LR pixel. The use of residual compensation, by comparing the fused image with the observed LR image at the prediction time, can enhance the accuracy of the FSDAF-like STIFs [38] and their derivations [36], [37], [42]. However, the FSDAF-like STIFs require an additional LR image at the known time in the residual compensation, based on the assumption that the HR and LR images have similar spectral bands. This approach is thus not suitable for the unmixing-based STIFs, which do not input the LR image at the known time and which may deal with HR and LR images with different spectral bands. To the best of our knowledge, the use of the LR image for residual compensation in the unmixing-based STIF has not yet been reported.
Finally, although the application of STIF in large areas has been studied, these STIFs have mainly been applied to single-band data such as vegetation indices [3], [47] and the land surface temperature band [51], [56]; STIF for multispectral reflectance imagery remains challenging because it involves more input bands and a more complicated solution. For instance, the benchmark datasets [15], [23] and most test imagery used in STIF studies are limited in spatial span, typically no larger than the range of one Landsat scene of 185 × 185 km (34,225 km²), as shown in Table I. In STIF for a very large area, it is necessary to mosaic a series of HR images acquired at different times as the HR image at the known time. Although the composition of large-area cloud-free images has been greatly facilitated by the online cloud computing platform of Google Earth Engine [73], it is very difficult to generate the corresponding LR images in which each pixel has the same acquisition time as the corresponding HR pixels. This greatly limits the use of popular STIFs such as STARFM and FSDAF, which require the HR and LR imagery before the prediction time to be acquired at the same or a similar time, in large-area image fusion. In contrast, the unmixing-based STIF inputs only the HR image without the corresponding LR image at the known time; it is thus not limited by the same/similar-time requirement for the known time HR-LR image pair, is more flexible, and has considerable potential for fusion over a very large area, unlike the other STIFs. However, to the best of our knowledge, the use of large-area Google Earth Engine-composited imagery in unmixing-based STIF has not been reported.

[Table I. Location and area of study regions used in the recent and state-of-the-art STIFs for multi- and hyper-spectral reflectance image fusion. The fusions of single-band data such as NDVI, temperature, and evapotranspiration are not included.]
In this article, a novel sub-pixel unmixing-based data fusion composed of a self-trained random forest machine learning Regression (R), LR endmember Estimation (E), HR surface reflectance image Reconstruction (R), and residual Compensation (C), i.e., RERC, is proposed to address the limitations of the current STIFs. RERC uses the HR class fraction information to reduce the homogenization effect of pixel-based classification. RERC uses a self-trained machine-learning regression model, which is automatic and computationally efficient, to map HR class fractions, overcoming the limitations of current linear and soft clustering algorithms. In addition, RERC compares the fused image with the LR image at the prediction time and compensates the residuals to refine the prediction image. This residual compensation step differs from the FSDAF-like methods because RERC neither requires an LR image at the known time nor requires the HR and LR images to have similar spectral bands. The method has the minimum input among STIFs, namely an HR reflectance image at a known time and an LR reflectance image at the prediction time, and does not require the LR reflectance image at the known time; it is thus more flexible than STARFM and FSDAF. RERC was assessed in three experiments. The first experiment compared RERC with state-of-the-art unmixing-based STIFs in fusing prediction time MODIS images with a known time Landsat image using two open-source benchmark datasets. The second experiment assessed the proposed method in fusing a prediction time Landsat image with a known time very-high-resolution PlanetScope image (four spectral bands) to assess RERC in the fusion of imagery with different spectral bands. The third experiment assessed the proposed method in fusing MODIS-Landsat imagery over very large areas at the national scale for the Republic of Ireland (∼70,273 km²) and France (∼551,500 km²).

II. METHODOLOGY
The proposed RERC (see Fig. 2) has three main steps: 1) generating the HR class fraction images at the known time (t 0 ) using a self-trained regression; 2) estimating the endmember spectra at the prediction time (t p ) and reconstructing an HR image at t p according to the linear mixture model; and 3) refining the HR image to generate the final image at t p based on residual compensation.
A. Unmixing the HR Images at t 0 Based on the Self-Trained Regression

The RERC extracts HR class fraction images from the HR image at t 0 . First, RERC applies the unsupervised k-means clustering algorithm [74] to the HR image at t 0 to automatically generate a pixel-based HR land cover map with n classes. Then, RERC uses a self-trained random forest regression to learn the relationship between the spectral image and the corresponding class fractions at a coarse resolution scale with a predefined scale factor z. In particular, the input HR reflectance image is spatially degraded by averaging the spectra of all the HR pixels within each z × z window, and the HR land cover map at t 0 is spatially degraded to class fraction images by dividing the number of HR pixels of each class by z². In RERC, the scale factor z was set to 10 so that the minimal interval between two class fractions was 1%. If z is too small, the corresponding class fractions can take only a limited number of values. For instance, if z is set to 2, then each spatially degraded pixel contains four HR pixels, and the corresponding class fraction for a class is one of 0% (no pixels belong to this class), 25% (one pixel), 50% (two pixels), 75% (three pixels), and 100% (four pixels). In contrast, if z is too large, the training data would be too complex [75], [76].
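The degradation step described above can be sketched as follows. This is a minimal NumPy illustration (the function name and array layout are ours, not from the released code), assuming the HR image is an (H, W, B) reflectance array and the k-means result an (H, W) integer label map:

```python
import numpy as np

def degrade_for_training(hr_image, hr_classmap, n_classes, z=10):
    """Build the self-training pairs used by RERC's regression step.

    hr_image:    (H, W, B) HR reflectance at t0
    hr_classmap: (H, W) integer class labels from k-means
    z:           degradation scale factor (z=10 gives 1% fraction steps)
    Returns (coarse_spectra, coarse_fractions) with shapes
    (H//z * W//z, B) and (H//z * W//z, n_classes).
    """
    H, W, B = hr_image.shape
    h, w = H // z, W // z
    # Average the spectra of all HR pixels inside each z x z window.
    coarse_spectra = hr_image[:h*z, :w*z].reshape(h, z, w, z, B).mean(axis=(1, 3))
    # Count HR pixels per class inside each window and divide by z^2.
    onehot = np.eye(n_classes)[hr_classmap[:h*z, :w*z]]           # (H, W, n)
    coarse_fractions = onehot.reshape(h, z, w, z, n_classes).mean(axis=(1, 3))
    return coarse_spectra.reshape(-1, B), coarse_fractions.reshape(-1, n_classes)
```

With z = 10, each training sample pairs a 10 × 10 block-averaged spectrum with class fractions quantized in 1% steps, matching the interval described above.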
With the spatially degraded reflectance image and class fraction image as the training dataset, RERC uses random forest machine learning, a supervised ensemble-learning nonlinear regression algorithm based on regression trees [77], to construct the regression relationship between the image pixel spectra and the corresponding class fractions. A random forest regression model was constructed for each class separately according to the degraded reflectance and class fraction images. For each class, RERC uses the B HR × 1 sized (B HR is the number of HR image bands) spectral vectors from all the degraded pixels as the input of the random forest regression model and the corresponding class fraction values of that class from all the degraded pixels as the output. The trained random forest regression model for each class was then applied to the HR image at t 0 to predict class fractions at the HR scale. A class fraction regression model trained at the LR scale has been proven effective for predicting class fractions at a finer scale [75], [76]. For each HR pixel, the fraction for each class is divided by the sum of all class fractions in that HR pixel, such that all class fractions sum to one. A flowchart of the self-trained regression is shown in Fig. 3.
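The per-class training and prediction described above can be sketched with scikit-learn's `RandomForestRegressor`; this is an illustrative sketch under our own naming conventions, not the authors' implementation (hyperparameters such as `n_estimators` are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def self_trained_fractions(coarse_spectra, coarse_fractions, hr_pixels, seed=0):
    """Train one RF regressor per class on the degraded pairs, then
    predict class fractions for every HR pixel and renormalize.

    coarse_spectra:   (N, B)  degraded reflectance (training inputs)
    coarse_fractions: (N, n)  degraded class fractions (training targets)
    hr_pixels:        (M, B)  HR pixel spectra at t0
    Returns (M, n) HR class fractions that sum to one per pixel.
    """
    n_classes = coarse_fractions.shape[1]
    preds = np.empty((hr_pixels.shape[0], n_classes))
    for c in range(n_classes):
        rf = RandomForestRegressor(n_estimators=100, random_state=seed)
        rf.fit(coarse_spectra, coarse_fractions[:, c])
        preds[:, c] = rf.predict(hr_pixels)
    preds = np.clip(preds, 0.0, None)  # fractions cannot be negative
    # Divide by the per-pixel sum so all class fractions sum to one.
    preds /= np.maximum(preds.sum(axis=1, keepdims=True), 1e-12)
    return preds
```

The final renormalization mirrors the sum-to-one constraint stated in the text; the clipping of negative predictions is our own safeguard.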

B. Estimating the LR Endmember Spectra and Reconstructing the HR Image at t p

[Fig. 3. Flowchart of the self-trained regression model that is used to estimate HR class fraction images at t 0 . Scale factor z is set to 10 (the minimal interval between two class fraction values is, therefore, 1%). The different colors in the HR land cover map represent different land cover classes, and different gray images in the spatially degraded and HR class fraction images at t 0 represent different class fractions.]

This step (step (2) in Fig. 2) is the same as in the unmixing-based STIFs that explore the sub-pixel land cover information, such as MTF and LSUSDFM. The endmember spectra at t p are estimated based on the LR reflectance image at t p and the corresponding class fractions at t 0 . First, the HR class fraction image predicted by the random forest is spatially degraded to the LR scale by averaging the HR class fractions within each LR pixel according to s, the scale factor between the input LR and HR imagery. With the LR image at t p and the degraded class fractions at t 0 , the endmember spectra for each LR pixel at prediction time t p are estimated by inverting a linear mixture model. To avoid collinearity in the estimation of endmembers, regularization-based linear unmixing is used to estimate the endmembers for each LR pixel (i.e., the local endmembers) [66], [68], [70]. The endmembers used for regularization (i.e., the global endmembers) are selected directly from the LR image at t p to avoid the impact of atmospheric conditions and spectral bias [68]. For a target class, the spectral values of a set of LR pixels at t p with the highest abundance of the target class are averaged as the global endmember of that class [66]. Finally, for each LR pixel, the local endmember spectra are estimated by inverting a linear spectral mixture model over the group of LR pixels within a w × w sliding window, with the target LR pixel as the window center. For the ith LR pixel, the local endmember spectra of the different classes in the bth band (b ∈ 1, . . . , B LR , where B LR is the number of LR image bands) are estimated from the neighboring LR pixels in the window by minimizing the regularized least square error

$$\hat{e}_{b,i} = \arg\min_{\hat{e}_{c,b,i}} \left[ \sum_{k=1}^{w^2} \Big( y_{b,k} - \sum_{c=1}^{n} a_{c,k}\,\hat{e}_{c,b,i} \Big)^{2} + \alpha \sum_{c=1}^{n} \big( \hat{e}_{c,b,i} - \bar{e}_{c,b} \big)^{2} \right] \qquad (1)$$

where n is the number of classes, k indexes the LR pixels in the window, w is the size of the local window, $\hat{e}_{b,i}$ collects the local endmember spectra in the bth band for the ith LR pixel, $y_{b,k}$ is the spectrum in the bth band of the kth LR pixel, $a_{c,k}$ is the class fraction of the cth class (c = 1, 2, . . . , n) in the kth LR pixel, $\hat{e}_{c,b,i}$ is the cth class spectrum in the bth band for the ith target LR pixel, $\bar{e}_{c,b}$ is the global endmember spectrum of the cth class in the bth spectral band, and α is the regularization parameter. RERC predicts the reflectance image by linearly combining the estimated LR endmember spectra with the HR class fractions based on a linear mixture model

$$\hat{y}_{j} = e_{j}\, a_{j} \qquad (2)$$

where $e_j$ is a B LR × n sized local endmember spectra matrix for the jth HR pixel, $a_j$ is an n × 1 sized class fraction vector of the different classes in the jth HR pixel, and $\hat{y}_j$ is the predicted B LR × 1 sized spectrum of the jth HR pixel.
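The regularized unmixing and the linear-mixture reconstruction can be sketched as below. The ridge-style problem of Eq. (1) has a closed-form solution via its normal equations, which this NumPy sketch uses (the function names and the per-band vectorization are our own choices, not the paper's implementation):

```python
import numpy as np

def estimate_local_endmembers(Y_win, A_win, e_global, alpha=0.1):
    """Solve the regularized unmixing of Eq. (1) for one target LR pixel.

    Y_win:    (K, B_LR) spectra of the K LR pixels in the w x w window
    A_win:    (K, n)    class fractions of those LR pixels (degraded from HR)
    e_global: (n, B_LR) global endmembers used as the regularization prior
    Returns the (n, B_LR) local endmember matrix, all bands at once.
    """
    K, n = A_win.shape
    # Normal equations of the ridge problem, pulled toward e_global:
    #   (A^T A + alpha I) e = A^T Y + alpha * e_global
    lhs = A_win.T @ A_win + alpha * np.eye(n)
    rhs = A_win.T @ Y_win + alpha * e_global
    return np.linalg.solve(lhs, rhs)            # (n, B_LR)

def reconstruct_hr_pixel(e_local, a_hr):
    """Eq. (2): linear mixture of local endmembers with HR class fractions."""
    return e_local.T @ a_hr                     # (B_LR,)
```

The regularization term keeps the local endmembers close to the global ones when the window's fraction matrix is near-singular, which is the collinearity problem the text describes.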
C. Residual Compensation for Generating the Final HR Image at t p

RERC refines the fused HR image using information from the LR reflectance image based on residual compensation. RERC spatially degrades the fused HR image at t p to the LR scale by averaging the reflectance values within each LR pixel according to the scale factor s, and compares it with the observed LR image at t p to generate a spectral difference image. Assume $y_{b,i}$ is the observed spectrum in the bth LR band (b ∈ 1, . . . , B LR ) of the ith LR pixel at t p , and $\hat{y}_{b,j,i}$ is the estimated spectrum in the bth band of the jth HR pixel within the ith LR pixel. The spectral residual $\xi_{b,i}$, i.e., the difference between the observed spectrum and the synthetic spectrum in the bth band of the ith LR pixel, is calculated as

$$\xi_{b,i} = y_{b,i} - \frac{1}{s^{2}} \sum_{j=1}^{s^{2}} \hat{y}_{b,j,i} \qquad (3)$$

$\xi_{b,i}$ is calculated at the LR scale, whereas the fused image spectrum in (2) is at the HR scale. To bridge this resolution gap, $\xi_{b,i}$ is spatially interpolated to the HR scale, yielding $\xi^{interpolation}_{b,i}$, using bicubic spatial interpolation with the scale factor s. Direct summation of $\xi^{interpolation}_{b,i}$ with the estimated HR spectrum may cause a blurring effect [38], [55], so the residual image is refined using a weighted sum over spectrally similar pixels within an m × m sized local square window in the HR image at t 0 , assuming that pixels that are spectrally similar at t 0 undergo similar spectral changes [1], [38].
The spatial prediction value $SP_{b,j}$ for the jth HR pixel in the bth LR image spectral band is calculated as

$$SP_{b,j} = \sum_{l=1}^{L} w_{l}\, \xi^{interpolation}_{b,l} \qquad (4)$$

where L is the number of spectrally similar pixels in the inputted HR image. The spectrally similar pixels are the set of L HR pixels with the smallest spectral differences from the target pixel in the reflectance image at t 0 [1], [36], [37], [38]. The weight $w_l$ of the lth (l ∈ L) HR pixel in (4) is determined by the geometric distance $d_{l,j}$ between the jth target HR pixel and the lth neighborhood pixel as

$$w_{l} = \frac{1/d_{l,j}}{\sum_{l=1}^{L} 1/d_{l,j}} \qquad (5)$$

The final spectrum of the jth HR pixel in the bth LR image band (b ∈ 1, . . . , B LR ) for the unmixing-based STIF, when fusing HR and LR imagery with the same spectral bands, is calculated as a weighted sum of the fused image and the spatial prediction image as

$$\hat{y}^{final}_{b,j} = \hat{y}_{b,j} + HI_{b,j}\, SP_{b,j} \qquad (6)$$

where $HI_{b,j}$ denotes the heterogeneous index for the jth HR pixel in the bth band (B LR = B HR ). The spatial prediction image is generated from a spatial interpolation image at the LR scale, so the fusion may be blurred if the HR pixel is located at the boundary between two land cover classes. The weight $HI_{b,j}$ in (6) is therefore low if the jth target pixel is located in a boundary region, to preserve the edge, and high if the jth target pixel is located in a homogeneous region, to smooth the region. The heterogeneous index $HI_{b,j}$ is calculated as

$$HI_{b,j} = 1 - Std_{b,j} \qquad (7)$$

where $Std_{b,j}$ is the standard deviation of the spectral reflectance, ranging from 0 to 1, in the bth band within a 7 × 7 HR local window (the window size was optimized through trials) in the inputted HR image at t 0 , with the jth HR pixel as the window center. Note that if the input HR image at t 0 contains digital number (DN) values instead of spectral reflectance ranging from 0 to 1, the DN values can be divided by the maximum DN value in the corresponding band so that the calculated $Std_{b,j}$ ranges from 0 to 1.
HI b, j is relatively small if Std b, j is high, which means that the target pixel is located in a heterogeneous region. HI b, j is relatively large if Std b, j is small, which means that the target pixel is located in a homogeneous region.
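The residual compensation chain can be sketched for a single band as follows. This is a compact NumPy/SciPy illustration under several simplifications of our own (similar pixels are judged on one band only, and the window handling is minimal), so it should be read as a sketch of the idea rather than the paper's implementation:

```python
import numpy as np
from scipy.ndimage import zoom, generic_filter

def residual_compensate(fused_hr, lr_obs, hr_t0, s, m=16, L=20):
    """Single-band sketch of RERC's residual compensation.

    fused_hr: (H, W) fused HR band at tp
    lr_obs:   (H//s, W//s) observed LR band at tp
    hr_t0:    (H, W) HR band at t0 (for similar pixels and heterogeneity)
    """
    H, W = fused_hr.shape
    # Residual at LR scale = observed LR - block-averaged fused image.
    degraded = fused_hr.reshape(H // s, s, W // s, s).mean(axis=(1, 3))
    xi = lr_obs - degraded
    # Bicubic interpolation of the residual back to the HR grid.
    xi_hr = zoom(xi, s, order=3)[:H, :W]
    # Weighted sum of residuals over the L spectrally most similar
    # pixels (judged on the t0 image) in an m x m window.
    sp = np.empty_like(fused_hr)
    half = m // 2
    for i in range(H):
        for j in range(W):
            r0, r1 = max(0, i - half), min(H, i + half + 1)
            c0, c1 = max(0, j - half), min(W, j + half + 1)
            win_t0 = hr_t0[r0:r1, c0:c1]
            diffs = np.abs(win_t0 - hr_t0[i, j]).ravel()
            idx = np.argsort(diffs)[:L]
            rr, cc = np.unravel_index(idx, win_t0.shape)
            d = 1.0 + np.hypot(rr + r0 - i, cc + c0 - j)   # geometric distance
            wgt = (1.0 / d) / (1.0 / d).sum()              # normalized weights
            sp[i, j] = (wgt * xi_hr[r0:r1, c0:c1][rr, cc]).sum()
    # Heterogeneity index from the local std of the t0 band (7 x 7 window):
    # low near edges (keep the fused value), high in homogeneous regions.
    std = generic_filter(hr_t0, np.std, size=7)
    hi = 1.0 - np.clip(std, 0.0, 1.0)
    return fused_hr + hi * sp
```

The pixel loop is written for clarity, not speed; a practical implementation would vectorize the similar-pixel search.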
In the fusion of HR and LR imagery with dissimilar spectral bands, the averaged heterogeneous index in (8), calculated using all spectral bands in the HR image, is used as a substitute for the per-band heterogeneous index in (7):

$$HI_{j} = \frac{1}{B_{HR}} \sum_{b=1}^{B_{HR}} \big( 1 - Std_{b,j} \big) \qquad (8)$$

It is clear that a flood was present on 12 December 2004, and this dataset was used to assess the unmixing-based STIF when dealing with abrupt land cover change associated with a flood event.
2) PlanetScope and Landsat Imagery Experiment: In this experiment, a very-high-resolution PlanetScope image and a Landsat 8 image were adopted. PlanetScope has a 3 m spatial resolution and four spectral bands: blue, green, red, and near-infrared (NIR). The multispectral Landsat 8 image, with seven bands including the coastal aerosol, blue, green, red, NIR, and two shortwave infrared (SWIR) bands, was selected; the panchromatic, cirrus, and two thermal infrared bands were excluded from the experiment. RERC was used to generate a 3 m resolution image with the Landsat spectral bands (seven spectral bands).
The study area (114.45°E, 31.23°N) is located near Wuhan, China, with an area of 144 km² (see Fig. 6). The Landsat subset image acquired on 20 October 2019 was used as the LR image at the prediction time t p [see Fig. 6(b)]. The PlanetScope subset image acquired on 21 October 2019 was used as the HR image for validation. The PlanetScope subset image acquired on 19 September 2019 was used as the HR image at the known time t 0 . There is no cloud-free Landsat image acquired near the known time t 0 ; only the unmixing-based STIF can therefore be applied, whereas the STARFM-like and FSDAF-like methods cannot.

3) Experiment in Very Large Areas for the Republic of Ireland and France: In this experiment, Landsat and real MODIS imagery for the entire Republic of Ireland were adopted to assess RERC for national-scale image fusion. The Republic of Ireland is located in the North Atlantic Ocean [see Fig. 7(a)]. Ireland has a temperate oceanic climate with frequently cloudy and wet weather, so cloud- and shadow-free Landsat imagery is rare for this region. STIF was used to generate imagery at the spatial resolution of Landsat with MODIS spectral reflectance.
In this study, the 463 m MODIS MCD43A4 daily surface reflectance image acquired on 12 August 2022 was adopted as the LR image at t p . The MODIS image was cloud-free, and was reprojected to the WGS 1984 projection and resized to a resolution of 480 m [see Fig. 7(d)]; the scale factor s between the MODIS and Landsat imagery was therefore 16. The Landsat 8 OLI products were used as the known time and validation data. The mosaicked Landsat 8 OLI images acquired on 12 August 2022, which were mostly cloud- and shadow-free, were used for validation [see Fig. 7(e)]. Because the prevailing climatic conditions often result in variable cloud cover, which makes it difficult to form a Landsat image for the entire nation at the same or a similar time, the known time HR image used in the unmixing-based STIF is a synthetic composited and mosaicked Landsat 8 OLI image acquired at different times from Google Earth Engine [see Fig. 7(b)]. The composite and mosaic image was generated from all Landsat 8 surface reflectance images that were atmospherically corrected. The cloud and cloud-shadow pixels in all Landsat 8 images were masked using the quality assessment band of the Landsat 8 level 2 product. The composited Landsat imagery was generated using the median of all cloud- and shadow-free values for each pixel between 1 January 2021 and 1 August 2022, from a total of 341 Landsat images. Using the median value has the benefit of removing clouds (which have high values) and shadows (which have low values) that are not masked by the quality assessment band [78]. Almost all Landsat pixels in the composited image are cloud- and shadow-free. All the composited images were then mosaicked [see Fig. 7(b)]. The number of valid cloud- and shadow-free pixels during this period in the corresponding composite Landsat 8 OLI image is shown in Fig. 7(c). Although most areas shown in yellow and red in Fig. 7(c) have more than ten valid cloud- and shadow-free Landsat-8 observations during this period, about 13.52% of the regions, shown in green in Fig. 7(c), have fewer than ten cloud- and shadow-free Landsat-8 observations, including 0.25% of regions with no more than three such observations.
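The per-pixel median compositing described above can be sketched as follows. This is a small NumPy analogue of the Google Earth Engine compositing step (the function name and array layout are illustrative assumptions), showing why the median rejects unmasked clouds and shadows:

```python
import numpy as np

def median_composite(stack, cloud_mask):
    """Per-pixel median composite over a time stack, ignoring cloud/shadow
    observations, mimicking the Earth Engine compositing step.

    stack:      (T, H, W) reflectance observations over time
    cloud_mask: (T, H, W) boolean, True where the pixel is cloud/shadow
    Returns the (H, W) median composite and the (H, W) count of valid
    (cloud- and shadow-free) observations.
    """
    masked = np.where(cloud_mask, np.nan, stack)
    valid = (~cloud_mask).sum(axis=0)
    # The median also rejects residual clouds (high values) and shadows
    # (low values) that slipped past the quality-assessment mask.
    composite = np.nanmedian(masked, axis=0)
    return composite, valid
```

Pixels with no valid observation yield NaN, which corresponds to the small fraction of regions with too few cloud-free Landsat-8 observations discussed above.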
Landsat and real MODIS imagery for France were also used to assess RERC. France has historically been one of the world's major agricultural centers [see Fig. 8(a)] and has various natural land cover types, including forests, croplands, moorlands, and grasslands, which may present different reflectance features throughout the year. The STIF, which generates imagery at the spatial resolution of Landsat with MODIS spectral reflectance, is helpful for understanding the phenology and variations of these land covers, especially agricultural land.
The 463 m MODIS MCD43A4 image acquired on 13 August 2022 (see Fig. 8) was used as the LR image at t p . The known time HR image used in the unmixing-based STIF is a synthetic composited and mosaicked Landsat 8 OLI image acquired during the period between 1 April 2022 and 1 August 2022, generated from Google Earth Engine using the median of all cloud- and shadow-free values for each pixel during this period, based on a total of 653 Landsat images. The number of valid cloud- and shadow-free pixels between 1 April 2022 and 1 August 2022 in the corresponding composited and mosaicked Landsat 8 OLI image is shown in Fig. 8(c). Approximately 70.21% of regions, shown in green in Fig. 8(c), have fewer than ten cloud- and shadow-free Landsat-8 observations during this period, including 9.64% of regions with no more than three such observations.

B. Model Comparison and Parameter Settings
The proposed RERC was compared to three established unmixing-based STIF methods in the CIA and LGC experiments. Since only the HR image at t 0 and the LR image at t p are available, the STARFM-like and FSDAF-like methods, which require an LR image at t 0 as input, were not compared.
The first comparator method was the unmixing-based data fusion (UBDF) proposed by Zurita-Milla et al. [63]. The UBDF assumes that the HR pixels at t 0 are pure and directly assigns the estimated endmember spectra from the LR image to the corresponding HR image pixels. The second comparator method was the MTF proposed by Amorós-López et al. [68]. The MTF uses a soft clustering algorithm to generate the HR class fractions using the Mahalanobis distance between the HR pixel spectra and the cluster centroids. The third comparator method was the LSUSDFM proposed by Liu et al. [66]. The LSUSDFM uses FCLS to generate HR class fraction images.
The performance of the proposed RERC depends on several parameters. For all methods, the LR sliding window size used for endmember estimation was set to k = 11 [68]. The regularization parameter was set to α = 0.1 according to previous studies [66], [68]. In RERC, the fusion accuracy and computational efficiency are related to the number of clusters n. A larger cluster number is more suitable for heterogeneous landscapes, but it increases the computation time. The cluster number is usually set to a relatively large value (usually 30) for unmixing-based STIF to reduce the homogenization effect, in which the predicted reflectances are identical for neighboring HR pixels of the same class located within the same LR pixel. The optimal number of clusters was set to n = 16 for MTF [68]. The optimal cluster number in the LSUSDFM depends on the landscape complexity of the study site. The cluster number was set to n = 10 for both the LSUSDFM and RERC, considering landscape heterogeneity and computational efficiency. In RERC, the size of the window used for selecting spectrally similar pixels was set to m = 16, which equals the scale factor s, and the number of spectrally similar pixels was set to L = 20 [37], [38]. In the experiment over very large areas for the Republic of Ireland and France, the mosaicked Landsat image was divided into 2400 × 2400 pixel patches considering computational efficiency and computer memory, and all the pixels were used for training the regression model.

C. Accuracy Assessment
Many quantitative metrics have been used to assess the different unmixing-based STIFs. The metrics included the root mean square error (RMSE), average absolute deviation (AAD), correlation coefficient (CC), and structural similarity (SSIM) [8], [37], [38]. Smaller RMSE and AAD and larger CC and SSIM indicate a better match between the fused and reference images.
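For concreteness, the four metrics can be computed per band as in the following sketch; note that the single-window SSIM used here is a global simplification of the sliding-window SSIM usually reported in the literature:

```python
import numpy as np

def fusion_metrics(fused, ref, c1=1e-4, c2=9e-4):
    """Per-band accuracy metrics between a fused and a reference image.

    Smaller RMSE/AAD and larger CC/SSIM indicate a better match.
    The SSIM below is computed over the whole band at once, a
    simplification of the usual sliding-window SSIM.
    """
    f, r = fused.ravel().astype(float), ref.ravel().astype(float)
    rmse = np.sqrt(np.mean((f - r) ** 2))       # root mean square error
    aad = np.mean(np.abs(f - r))                # average absolute deviation
    cc = np.corrcoef(f, r)[0, 1]                # correlation coefficient
    mf, mr = f.mean(), r.mean()
    vf, vr = f.var(), r.var()
    cov = np.mean((f - mf) * (r - mr))
    ssim = ((2 * mf * mr + c1) * (2 * cov + c2)) / (
        (mf ** 2 + mr ** 2 + c1) * (vf + vr + c2))
    return {"RMSE": rmse, "AAD": aad, "CC": cc, "SSIM": ssim}
```

A perfect prediction yields RMSE = AAD = 0 and CC = SSIM = 1, which is a convenient sanity check for the implementation.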

IV. RESULTS
A. Experiment on the CIA and LGC Sites

1) Simulated MODIS-Like Image Experiment on the CIA Site: The study site, which is located in a region of heterogeneous cropland cover, experienced a drastic change in the spectral reflectance values from time t0 in Fig. 9(a) to time tp in Fig. 9(b) and (c). The UBDF, which assigns each HR pixel to a unique endmember spectrum, usually homogenizes the spectral reflectance values for spatially adjacent pixels, such as in regions I and II in the zoomed areas (see Fig. 9). In contrast, the sub-pixel MTF, LSUSDFM, and RERC generated more variable spectral reflectance values in these regions and greatly reduced the homogenization effect. In the regions highlighted with circles in zoomed areas A and B, the RERC result in Fig. 9(g) is more similar to the reference image in Fig. 9(c) than those of UBDF, MTF, and LSUSDFM. For instance, in region V in zoomed area B, the patch has a darker yellow color in the Landsat image at t0 in Fig. 9(a), similar to its adjacent patches, which have a relatively darker color than the surrounding patches in the reference image in Fig. 9(c). UBDF, MTF, and LSUSDFM predicted this patch with a light cyan color in region V, similar to the surrounding patches. In contrast, the RERC in Fig. 9(g) predicted a darker color for this patch in region V, similar to that in the reference image in Fig. 9(c).
The quantitative accuracy metrics obtained for the outputs from the different STIFs at the CIA site are listed in Table II. The pixel-based UBDF method typically generated the highest RMSE and AAD and the lowest CC and SSIM in different spectral bands, whereas the sub-pixel-based methods improved the accuracy of these metrics. The proposed RERC outperformed the comparator STIFs with a decrease in the RMSE and AAD and an increase in the CC and SSIM in all spectral bands.
2) Real MODIS Image Experiment on the CIA Site: In zoomed area A in Fig. 10, the region highlighted with circle I experienced a drastic spectral reflectance change due to land cover changes such as crop rotation and is represented as a white color in the reference Landsat image at tp. The unmixing-based STIFs of UBDF, MTF, and LSUSDFM predicted the region highlighted with circle I with a cyan color, whereas RERC predicted this region with a white color, which is the most similar to the reference. In zoomed area B, the subregion highlighted in circle II is represented by a dark red color, and the subregion highlighted in circle III is represented by a dark green color in the reference Landsat image at tp. The UBDF and LSUSDFM predicted reflectance dissimilar to the reference image in circles II and III, and the MTF predicted reflectance dissimilar to the reference image in circle III. In contrast, the proposed RERC predicted an image that is the most similar to the reference in circles II and III.
The quantitative accuracy metrics obtained for the outputs from the different STIFs at the CIA site are listed in Table III. The pixel-based UBDF method typically generated the highest RMSE and AAD and the lowest CC and SSIM in different spectral bands, whereas the sub-pixel-based methods improved the accuracy of these metrics. RERC outperformed the comparator STIFs with a decrease in the RMSE and AAD and an increase in the CC and SSIM in all spectral bands.
3) Simulated MODIS-Like Image Experiment on the LGC Site: The study site experienced a flood event, which is observable in the MODIS and reference Landsat images at tp in Fig. 11. UBDF, MTF, and LSUSDFM predicted only parts of the flood, as shown in zoomed areas A and B, whereas RERC mapped almost all the regions covered by the flood. In particular, in region I in zoomed area A, UBDF, MTF, and LSUSDFM predicted a flood with a discontinuous color, whereas RERC mapped the flood with a more continuous color. In region II in zoomed area A, UBDF, MTF, and LSUSDFM predicted a flood with a disconnected shape, whereas RERC predicted a flood that was spatially connected and continuous. In zoomed area B, UBDF, MTF, and LSUSDFM failed to map the flood or only partly mapped the flood in regions III and IV, whereas RERC successfully mapped the flood in both regions. In general, compared with the UBDF and LSUSDFM images, the RERC images in the entire and zoomed areas in Fig. 11(g) are more similar to the reference. Similar to the results for the CIA site, the sub-pixel scale MTF, LSUSDFM, and RERC decreased the RMSE and increased the CC and SSIM in comparison with UBDF. RERC generated the lowest RMSE and AAD and the highest CC and SSIM in Table IV.

4) Real MODIS Image Experiment on the LGC Site:
The study site experienced a flood event, which is observable in the MODIS and reference Landsat images at tp in Fig. 12. Both zoomed areas A and B experienced floods according to the known [see Fig. 12(a)] and prediction time [see Fig. 12(c)] Landsat images. In zoomed area A, only the proposed RERC predicted a flood, as highlighted by circle I. In zoomed area B, UBDF failed to predict the flood, as highlighted in circle II, and MTF and LSUSDFM predicted only part of the flood, as highlighted in circles II and III. In contrast, the RERC predicted a flood that was more similar to that in the reference Landsat image. The region highlighted in circle IV did not experience floods. The UBDF, MTF, and LSUSDFM incorrectly predicted some floods with a light blue color in this region, as shown in Fig. 12, whereas the RERC prediction is more similar to the Landsat image at tp. This experiment shows that RERC outperformed the comparator methods in the prediction of reflectance changes due to both gradual phenological changes and land-cover changes. Similar to the results for the CIA site, the sub-pixel scale MTF, LSUSDFM, and RERC decreased the RMSE and increased the CC and SSIM in comparison with UBDF. RERC generated the lowest RMSE and AAD and the highest CC and SSIM in Table V.

B. PlanetScope and Landsat Image Experiment
The input and RERC result images are shown in Fig. 13. The UBDF, MTF, and LSUSDFM were not compared because they are computationally inefficient. In particular, the running time of the random forest regression used in RERC is only ∼1% of the running time of the FCLS linear mixture model (on the CIA site, the running time is 506 s for the random forest regression, longer than 7000 s for the non-linear model, and longer than 50 000 s for FCLS). In Fig. 13(c), the Landsat image is too coarse to represent the detailed spatial distribution of land covers. The boundaries are jagged and the settlements are blurred in the Landsat image at tp, such as highlighted in circle I in zoomed area A, whereas they are clear in the RERC prediction image in Fig. 13(d). In zoomed area B, the bare land region highlighted with circle II had a surface reflectance change from t0 to tp, and RERC predicted this reflectance change in Fig. 13(d). Also in zoomed area B, the region highlighted with circle III experienced a drastic surface reflectance change from t0 to tp, and the RERC prediction in Fig. 13(d) is similar to the reference PlanetScope image at tp in Fig. 13(b).
The Landsat image contains more spectral bands, such as the coastal aerosol band and the SWIR bands, than the four-band PlanetScope image (NIR, red, green, and blue bands). In this study, RERC predicted the image with Landsat spectral information, including the coastal aerosol and SWIR bands, at the PlanetScope spatial resolution in Fig. 13(e). Since RERC can fuse imagery with different spectral bands in Fig. 13(f), it is more flexible than the STARFM-like and FSDAF-like STIFs, which fuse imagery with the same spectral bands. Table VI shows the quantitative metrics for RERC. RERC generated RMSE and AAD values below approximately 0.040 and 0.032, respectively, in all four spectral bands. The quantitative metrics were not assessed in the coastal aerosol and SWIR bands, which the PlanetScope image does not contain. The RMSE and AAD values for RERC usually increase with an increase in the standard deviation of the spectral band of the reference Landsat image at tp. In other words, if the standard deviation of the reflectance in a spectral band is high, the reflectance is more variable in this band, and it is more difficult for RERC to accurately predict the reflectance. In particular, RERC generated the highest RMSE and AAD values in the NIR band, which had the highest standard deviation in reflectance in the reference Landsat image at tp, and generated the lowest RMSE and AAD in the blue band, which has the lowest standard deviation in reflectance in the reference Landsat image at tp. It is also evident that there is a band difference (including the central wavelength and bandwidth) between the reference PlanetScope image and the input Landsat image at tp, and this spectral band difference would also impact the quantitative analysis of the fusion result.

C. MODIS-Landsat Image Experiment in Very Large Areas

1) Experiment in the Republic of Ireland: The results of the RERC and zoomed areas in the input, result, and reference imagery are shown in Fig. 14. The region highlighted with the ellipse in the known Landsat image is represented in pink in Fig. 14(c), and this region is represented in green in the reference Landsat image in Fig. 14(e) because the vegetation is in the growing season, showing a reflectance change due to phenological factors. This region, highlighted with the ellipse, is represented as green in the zoomed area in the MODIS image in Fig. 14(d); however, the spatial details are blurred in the MODIS image. In the RERC prediction image, the zoomed area highlighted with the ellipse is represented in green in Fig. 14(b), which is similar to that in the reference Landsat image in Fig. 14(e), showing the ability to map surface reflectance changes due to phenological changes. Compared with the MODIS image, the fused RERC image represents more of the spatial details of land cover, such as the vegetation highlighted with the ellipse and the lakes highlighted with the circle in Fig. 14(b), showing the necessity of the fusion of MODIS and Landsat in monitoring land cover. The quantitative metrics for RERC are listed in Table VII (UBDF, MTF, and LSUSDFM were not compared because they are relatively computationally inefficient when the study site is very large). In the blue, green, red, and SWIR bands, the RMSE and AAD values range from 0.0090 to 0.0273 and from 0.0062 to 0.0195, respectively. The RMSE and AAD values are relatively larger in the NIR band than in the other bands. The main reason is that the reflectance values have the largest variance in the NIR band, with a standard deviation of 0.0977. RERC predicts a CC value of 0.8243 and an SSIM value of 0.8024 in the NIR band, showing that the RERC prediction image has a high correlation with the reference Landsat image used for validation.
2) Experiment in France: The results of the RERC and zoomed areas in the input, result, and reference imagery are shown in Fig. 15. It is evident that the zoomed area experienced drastic surface reflectance changes, as can be seen by comparing the known time image [see Fig. 15(c)] with the reference image [see Fig. 15(e)], as highlighted by the circles. The drastic surface reflectance changes are obvious in the MODIS image, as highlighted by the circles in Fig. 15(d). However, the MODIS image is blurred, and its relatively coarse resolution fails to represent the spatial detail of the reflectance changes. In comparison with the MODIS image, the RERC image is generated at 30 m resolution and better demonstrates the spatial detail of the reflectance at the prediction time, such as highlighted in the circles in Fig. 15(b). The quantitative metrics of the RERC for the France experiment are listed in Table VIII. Similar to the Republic of Ireland experiment, RERC predicted the highest RMSE and AAD values in the SWIR2 band, which had the largest standard deviation in reflectance, and predicted the lowest RMSE and AAD values in the blue band, which had the smallest standard deviation in reflectance. The mean values generated from the RERC were very similar to those in the MODIS image at tp.

V. DISCUSSION
The comparison of different unmixing-based algorithms in generating the HR class fraction images, the performance of different unmixing-based STIFs in dealing with blocky effects in the output, and the limitations and future work for RERC are discussed here.

A. Comparison of Different Unmixing Algorithms Used in Generating the HR Class Fraction Images
RERC and the two comparator methods of MTF and LSUSDFM used different unmixing algorithms to generate HR class fraction images. The random forest regression used in RERC has several advantages compared with the soft-clustering algorithm used in MTF and the FCLS unmixing used in LSUSDFM.
First, in MTF, the soft-clustering algorithm depends on the softness parameter [68]. A very small softness parameter drives the class fractions of the different classes toward 0 or 1, making the result similar to a hard classification map. A very large softness parameter drives the class fractions of the different classes toward 1/n, which is rare in real scenarios [72]. Therefore, the optimal softness parameter value is dependent on the heterogeneity of land cover and the corresponding existence of mixed pixels in the HR image. In comparison with the soft-clustering algorithm, the random forest regression used in RERC is more flexible and robust in use.
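To illustrate how the softness parameter drives the fractions, a fuzzy-c-means-style membership function can be sketched as below. This is a stand-in for MTF's Mahalanobis-distance soft clustering, not the exact published formulation:

```python
import numpy as np

def soft_fractions(dists, softness):
    """Fuzzy-c-means-style memberships from pixel-to-centroid distances.

    dists: (n_clusters,) nonzero distances from one pixel to each centroid.
    softness: fuzziness exponent > 1; values near 1 yield near-hard (0/1)
    fractions, large values push all fractions toward 1/n_clusters.
    Illustrative only: MTF uses Mahalanobis distances and its own formulation.
    """
    d = np.asarray(dists, dtype=float)
    p = 2.0 / (softness - 1.0)
    # ratios[i, k] = (d_i / d_k) ** p; memberships sum to 1 by construction.
    ratios = (d[:, None] / d[None, :]) ** p
    return 1.0 / ratios.sum(axis=1)
```

With distances [1, 2, 4], a softness of 1.1 assigns nearly all the fraction to the closest cluster, while a very large softness spreads the fractions toward 1/3 each, matching the behavior described above.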
Second, in the LSUSDFM, the FCLS spectral unmixing is ill-posed if the number of endmembers is larger than the number of HR image bands, and the ill-posed problem is more severe if the HR image (such as Landsat, PlanetScope, and Chinese GF) used in the unmixing has very few spectral bands. In contrast, the random forest regression can generate accurate fractions of multiple classes when the image has a limited number of spectral bands [79], [80], and it has been used here to generate class fraction images for the PlanetScope image, which has only four spectral bands. In addition, the FCLS is an inversion problem and is especially time-consuming when the numbers of spectral bands and endmembers are large, which greatly limits its application to large areas. In contrast, the self-trained regression is not based on an inversion approach, and the regression model is more computationally efficient; its unmixing running time is only ∼1% of the running time of the FCLS linear mixture model. Third, the FCLS linear mixture model is dependent on the endmembers and may fail to consider intraclass spectral variability in the unmixing, whereas the self-trained regression does not require information from endmembers. Last, the FCLS linear mixture model cannot represent non-linear mixtures, whereas the self-trained regression is more flexible.
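The self-training idea can be sketched end-to-end on synthetic data with scikit-learn: cluster the known-time HR image, aggregate the hard labels to LR-scale class fractions, pair them with the LR-scale spectra to train a random forest regressor, and then predict soft fractions for every HR pixel. The clustering setup, feature choice, and synthetic imagery below are simplifying assumptions, not the published implementation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
s = 4                                    # scale factor between LR and HR pixels
H, W, B, n_classes = 32, 32, 3, 4        # toy HR image: 32 x 32, 3 bands

hr = rng.random((H, W, B))               # stand-in for the known-time HR image

# 1) Cluster the HR image into n classes (hard labels per HR pixel).
labels = KMeans(n_clusters=n_classes, n_init=5, random_state=0).fit_predict(
    hr.reshape(-1, B)).reshape(H, W)

# 2) Aggregate hard labels to LR-scale class fractions and average HR
#    spectra to LR-scale spectra (simulating the LR sensor).
lr_h, lr_w = H // s, W // s
fractions = np.zeros((lr_h, lr_w, n_classes))
lr_spec = np.zeros((lr_h, lr_w, B))
for i in range(lr_h):
    for j in range(lr_w):
        block = labels[i * s:(i + 1) * s, j * s:(j + 1) * s]
        fractions[i, j] = np.bincount(block.ravel(), minlength=n_classes) / s ** 2
        lr_spec[i, j] = hr[i * s:(i + 1) * s, j * s:(j + 1) * s].mean(axis=(0, 1))

# 3) Self-train: learn the mapping from LR spectra to LR class fractions.
reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(lr_spec.reshape(-1, B), fractions.reshape(-1, n_classes))

# 4) Predict soft class fractions for every HR pixel from its own spectrum.
hr_fractions = reg.predict(hr.reshape(-1, B)).reshape(H, W, n_classes)
```

Because every training target is a fraction vector summing to one, the tree-leaf averages (and hence the ensemble predictions) also sum to one, so no explicit sum-to-one constraint is needed, unlike in FCLS.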

B. Performance of Different STIFs in Dealing With Blocky Effects
The blocky effect refers to the discontinuity in reflectance between HR pixels of the same class crossing the boundary of two neighboring LR pixels. It arises because each LR pixel is unmixed independently when estimating the local endmembers: the endmembers in two neighboring LR pixels are estimated using different sets of LR pixels, so the same class in the two neighboring LR pixels may be assigned different reflectances. Fig. 16 compares the impact of the blocky effect, highlighted with circles in the figures, for the different unmixing-based fusion methods on the CIA and LGC sites using real MODIS imagery. The blocky effect is the most obvious in the pixel-scale UBDF result and in the MTF result, because MTF with a relatively small softness parameter of 1.1 generates class fractions close to 0 or 1. RERC generated the fewest blocky artifacts. RERC's advantage in reducing the blocky effect stems from two factors. First, for the pixel-based STIF, which directly assigns the estimated endmember spectra to the corresponding HR pixels, the resultant reflectances for same-class pixels located in the same LR pixel are homogenized, and same-class pixels located in neighboring LR pixels would show a blocky effect in reflectance if the estimated endmembers for the neighboring LR pixels are different. In contrast, the sub-pixel-based STIFs linearly combine the local endmembers with the HR class fractions to generate the fused image, and the resultant reflectance depends on both the endmembers and the HR class fractions. The effectiveness of this sub-pixel strategy is also demonstrated by the MTF and LSUSDFM results in comparison with the UBDF results in Fig. 16. Second, RERC selects similar pixels to postprocess and averages the pixel spectral values in the predicted image.
Since similar pixels may be selected for HR pixels located in different LR pixels, the blocky effect that occurs for similar pixels located at neighboring LR pixels could be reduced.
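The similar-pixel postprocessing described above can be sketched as follows; the Euclidean spectral distance and the unweighted mean are simplifying assumptions, as the published method may weight the selected pixels differently:

```python
import numpy as np

def similar_pixel_smooth(pred, hr_t0, m=16, L=20):
    """Average each predicted HR pixel with its L most spectrally similar
    neighbors inside an m x m window; similarity is judged in the
    known-time HR image hr_t0. Sketch only: Euclidean spectral distance
    and an unweighted mean replace any weighting of the published method.

    pred, hr_t0: (H, W, B) arrays at the HR scale.
    """
    H, W, B = pred.shape
    half = m // 2
    out = np.empty_like(pred)
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - half), min(H, i + half + 1)
            j0, j1 = max(0, j - half), min(W, j + half + 1)
            win_ref = hr_t0[i0:i1, j0:j1].reshape(-1, B)
            win_pred = pred[i0:i1, j0:j1].reshape(-1, B)
            d = np.linalg.norm(win_ref - hr_t0[i, j], axis=1)
            nearest = np.argsort(d)[:L]     # L most similar (includes self)
            out[i, j] = win_pred[nearest].mean(axis=0)
    return out
```

Because the selection window crosses LR pixel boundaries, similar pixels drawn from neighboring LR pixels are averaged together, which is exactly what suppresses the block discontinuities.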

C. Limitations and Future Works
RERC spatially interpolates the LR image at tp to the HR scale based on bicubic interpolation and then downscales the LR image at tp to the HR scale based on the spectrally similar pixels in the HR image at t0. However, if a land-cover change occurs, the HR land-cover spatial pattern changes accordingly, and the HR image at t0 may not represent the real land-cover spatial pattern at tp. The results show that, like other STIFs, the proposed RERC better mapped the reflectance change in homogeneous regions but may fail to predict the texture in areas where land cover changed. This is because RERC uses bilinear interpolation in the residual compensation approach, in which the spatial details are not reconstructed. Future studies can focus on the use of deep learning to map regions where land cover has changed [81].
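The residual compensation step discussed above can be sketched for a single band as follows; block-mean aggregation and nearest-neighbor residual upsampling (via np.kron) are simplifying stand-ins for the sensor point spread function and the bilinear interpolation used in the published method:

```python
import numpy as np

def residual_compensate(fused_hr, lr_tp, s):
    """Refine a fused HR band so that its LR-scale aggregation matches
    the observed LR band at the prediction time.

    fused_hr: (H, W) fused HR band; lr_tp: (H//s, W//s) observed LR band;
    s: scale factor. Sketch only: block averaging and nearest-neighbor
    upsampling replace the PSF and bilinear interpolation of the method.
    """
    H, W = fused_hr.shape
    # Aggregate the fused image to the LR scale by block averaging.
    lr_est = fused_hr.reshape(H // s, s, W // s, s).mean(axis=(1, 3))
    residual = lr_tp - lr_est
    # Distribute the residual back to the HR scale and add it.
    return fused_hr + np.kron(residual, np.ones((s, s)))
```

After compensation, the block means of the refined image reproduce the observed LR image exactly, which is how the step makes full use of the LR image at the prediction time.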
Although RERC can reduce the blocky effect to some extent, it does not incorporate new constraints into the unmixing. The blocky effect occurs mainly because different LR pixels are involved in unmixing spatially adjacent pixels, resulting in dissimilar spectral values even for the same class located in spatially adjacent LR pixels. Thus, an effective and direct way to reduce the blocky effect for the unmixing-based STIF is to minimize the reflectance difference of the same class in spatially adjacent LR pixels. Wang et al. [55] proposed a novel block-removal method that minimizes the residual error to ensure the spatial continuity of the endmember reflectance in unmixing, which is an effective and general solution for UBDF and other STIFs. This constrained-function strategy is applied directly to same-class pixels at the pixel scale, and its application in sub-pixel scale block removal should be explored in future studies. Third, this article fused images using only one HR image at a known time. If HR images that both predate and postdate the prediction time are available, it is suggested to fuse the prediction time image separately using the different HR images and then combine the fused images to generate the final predicted image to further improve the fusion accuracy.
In addition, although many machine learning models such as the artificial neural network and support vector machine regression can be used in the training and predicting of the sub-pixel class fractions, the random forest has been adopted in RERC for its simplicity and high precision [82], [83]. Future works can explore various simple machine-learning regressions and deep learning regression in exploring sub-pixel class fraction information from the remote sensing imagery in the unmixing-based STIF.
Last, the unmixing-based STIF requires less input than the state-of-the-art STARFM-like and FSDAF-like fusions and is thus more flexible in use, especially in image fusion in very large areas. For instance, the unmixing-based STIF can be used to generate high spatiotemporal imagery based on the LR imagery such as MODIS and Sentinel-3 which have a large span and the mosaicked medium resolution image such as Landsat and Sentinel-2 acquired at different dates to generate Landsat-like or Sentinel-2-like imagery with very high temporal repetition rates. Future studies could focus on this while continuously increasing the efficiency of the unmixing-based STIF.

VI. CONCLUSION
This study proposed a new unmixing-based STIF, RERC, based on a self-trained machine-learning regression, LR endmember estimation, HR image reconstruction, and residual compensation. The self-trained regression trains the relationship between the reflectance image and the corresponding class fractions at the coarse resolution scale and then uses this relationship to unmix the HR class fraction images. In comparison with the FCLS linear spectral unmixing and the soft clustering, the self-trained regression does not require any information about endmembers and is flexible in use. In addition, the self-trained regression does not have physical constraints like the FCLS linear spectral unmixing, which requires the number of endmembers to be no more than the number of spectral bands to generate reliable results, and does not require the information about the endmember distribution that is used in the soft clustering. Last, the self-trained regression is computationally efficient, and its computation time is approximately 1% and 10% of that of the FCLS linear spectral unmixing and the soft clustering, respectively. The proposed RERC incorporates the LR image at the prediction time and better predicts the reflectance in regions that experienced drastic reflectance changes than the comparator unmixing-based STIFs, owing to the residual compensation term that makes full use of the LR image at the prediction time. The experimental results also show that RERC not only reduced the homogenization effect compared with UBDF but also reduced the blocky effect to a great extent. RERC has been applied to fuse a 3 m four-band PlanetScope image at the known time with a seven-band Landsat image at the prediction time to generate a 3 m seven-band multispectral image, and it is more flexible than the STARFM-like and FSDAF-like STIFs, which require an additional LR image at the known time and require the LR and HR images to have similar spectral bands.
RERC has been applied to fuse 30 m imagery with MODIS spectral reflectance at the national scale for the Republic of Ireland (∼70 273 km2) and France (∼551 500 km2), showing its potential for mapping daily multispectral imagery at Landsat spatial resolution for large-area land surface monitoring.

He is also with the University of Chinese Academy of Sciences, Beijing, China. His research interests include remotely sensed image fusion and deep learning.
Xia Wang received the B.S. and M.S. degrees in land resource management from the China University of Geosciences, Wuhan, China, in 2012 and 2014, respectively, and the Ph.D. degree from Wuhan University, Wuhan, in 2018.
From 2021 to 2022, she served as a Visiting Scholar at the Lancaster Environment Centre, Faculty of Science and Technology, Lancaster University, Lancaster, U.K. Currently, she holds the position of Assistant Researcher at Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan. Her research pursuits encompass the field of ecological environment assessment and environmental remote sensing monitoring.
Yun Du received the B.S. degree in geomorphology and quaternary geology from Nanjing University, Nanjing, China, in 1989, the M.S. degree in physical geography from the Institute of Geodesy and Geophysics, Chinese Academy of Sciences, Wuhan, China, in 1992, and the Ph.D. degree in historical geography from Wuhan University, Wuhan, in 1999.
He is currently a Professor with Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences. His research interests include environment monitoring and evaluation.
Giles M. Foody (Fellow, IEEE) received the B.Sc. and Ph.D. degrees from the University of Sheffield, Sheffield, U.K., in 1983 and 1986, respectively.
He is Professor of geographical information science with the School of Geography, University of Nottingham, Nottingham, U.K. He has produced nine books and more than 260 refereed journal articles and his work has been cited more than 38 000 times (h index = 95). His main research interests lie at the interface between remote sensing, ecology, and informatics with a core focus on image classification for land surface cover mapping and monitoring applications at scales ranging from the subpixel to global.
Dr. Foody was elected a Fellow of the IEEE (FIEEE) in 2013 and a member of Academia Europaea (MAE) in 2021. He served as the founding Editor-in-Chief of Remote Sensing Letters. He holds additional editorial roles in over ten other journals including Landscape Ecology, the International Journal of Remote Sensing, Remote Sensing of Environment, Remote Sensing, and the International Journal of Applied Earth Observation and Geoinformation.