Filling Then Spatio-Temporal Fusion for All-Sky MODIS Land Surface Temperature Generation

The thermal infrared band of the moderate resolution imaging spectroradiometer (MODIS) onboard the Terra/Aqua satellite can provide daily, 1 km land surface temperature (LST) observations. However, due to the influence of cloud contamination, spatial gaps are common in the LST product, restricting its application greatly at the regional scale. In this article, to deal with the challenge of large gaps (especially complete data loss) in MODIS LST for local monitoring, a filling then spatio-temporal fusion (FSTF) method is proposed, which utilizes another type of product with all-sky coverage, but coarser spatial resolution (i.e., the 7 km China Land Data Assimilation System (CLDAS) LST product). Due to the great temporal heterogeneity of LST, temporally closer auxiliary MODIS LST images are considered to be preferable choices for spatio-temporal fusion of CLDAS and MODIS LST time-series. However, such data are always abandoned inappropriately in conventional spatio-temporal fusion if they contain gaps. Accordingly, pregap filling is performed in FSTF to make fuller use of the valid information in temporally close MODIS LST images with small gaps. Through evaluation in both the spatial and temporal domains for three regions in China, FSTF was found to be more accurate in reconstructing MODIS LST images than the original spatio-temporal fusion methods. FSTF, thus, has great potential for updating the current MODIS LST product at the global scale.


I. INTRODUCTION
L AND surface temperature (LST) is an important physical quantity that can be used for studying the interaction between the Earth's surface and the atmosphere [1], [2], [3]. Until now, LST has been applied widely in research on regional drought monitoring [4], [5], land surface evapotranspiration estimation [6], [7], and the heat island effect [8], [9]. Remote sensing provides great potential for large-scale LST monitoring, as LST can be retrieved from the thermal infrared (TIR) band of sensors onboard several satellites (e.g., the Landsat and Terra/Aqua satellites) [10], [11], [12], [13]. Although theories for LST retrieval have been developed and applied widely using various sensors, the LST estimated using satellite sensor data has an obvious defect. Existing studies demonstrate that about 67% of the Earth's surface is contaminated by cloud at any one time [14]. As TIR wavelengths cannot penetrate cloud, some level of data loss in acquired images is to be expected, caused by poor imaging environments. This characteristic restricts greatly the applications of satellite-derived LST products, such as in global climate change studies which require spatial continuity of data. For example, the moderate resolution imaging spectroradiometer (MODIS) sensor onboard Terra, as an important data source for generating global LST, provides a 1 km LST product (i.e., MOD11A1), the scale of which is appropriate for LST research missions at the regional scale. However, the unpredictable imaging environment cannot guarantee data integrity, resulting in spatial information loss in daily MODIS LST. Thus, to provide an all-sky MODIS LST product, the missing information in the daily MODIS LST needs to be reconstructed.
Generally, for small areas of information loss, satisfactory restoration results can be obtained by adopting spatial reconstruction methods in remote sensing [15], [16]. As these methods reconstruct the missing area based only on spatially complete images from the same data source, when the missing area is large the uncertainty in reconstructing the missing information can also be large. Thus, to ensure high accuracy of all-sky MODIS LST generation, other auxiliary data and reconstruction methods are required when the information loss is large (especially completely lost in a local area of interest). Apart from the above-mentioned LST products made using thermal remote sensing, land surface models (LSMs) for land data assimilation, including the China Land Data Assimilation System (CLDAS) [17] and Global Land Data Assimilation System [18], can also provide large area LST data with a relatively coarse spatial resolution. As these datasets are produced by combining observations from a variety of sources, such as ground and satellite sensor observations, these systems can also provide an all-sky spatially complete LST product. Due to this advantage, LSMs have great potential to supplement the missing information in optical image-derived products, and assist the reconstruction of MODIS LST [19]. Thus, in this article when reconstructing MODIS LST with large areas of data loss, a typical LSM is considered, such as the CLDAS data which can provide 7 km spatially seamless hourly LST products covering the Asian area. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ Specifically, spatio-temporal fusion can help to be used to reconstruct the missing MODIS LST based on spatially complete MODIS-CLDAS LST image pairs at other times, and a CLDAS LST image at the prediction time.
Over the past decades, various categories of spatio-temporal fusion approaches were developed [20], [21], [22], [23], [24], [25]. Generally, the spatial weighting-based and the spatial unmixing-based methods are the two earliest types of spatiotemporal fusion methods. Specifically, the common practice for spatial weighting-based methods is to predict the center fine pixels by quantifying the contribution of the neighboring spectrally similar pixels using a weighting function. The spatial and temporal adaptive reflectance fusion model (STARFM) [26] is regarded as the classic example of this type. Based on STARFM, the enhanced STARFM (ESTARFM) algorithm [27], the Fit-FC method [28], the spatial temporal adaptive algorithm for mapping reflectance change [29], and the spatial weighting-based virtual image pair-based spatio-temporal fusion (VIPSTF-SW) [30] approaches were further developed.
Spatial unmixing-based methods estimate the value of the fine pixels by solving the reflectance of all object classes through different kinds of unmixing models. This category of method was developed on the basis of the multisensory, multiresolution technique proposed by Zhukov [31]. By applying different constraints and auxiliary data to the unmixing model, various spatial unmixing-based algorithms were proposed [32], [33], [34], [35]. Additionally, recently developed methods include hybrid methods integrating the advantages of the spatial weighting and the spatial unmixing-based methods [36], [37], [38], [39], together with learning-based methods such as the sparserepresentation-based spatiotemporal reflectance fusion model (SPSTFM) [40] and the wavelet-artificial intelligence fusion approach [41]. Except for conventional machine learning methods, deep learning-based models were also developed due to their advantages in describing the complex relationship between coarse and fine spatial resolution images. So far, spatio-temporal fusion methods based on different improved versions of convolutional neural networks [42], [43], [44], [45] and generative adversarial networks (GANs) [46], [47], [48] were proposed. Some examples are the multistage remote sensing image spatiotemporal fusion network (MSFusion) [45] and the GAN-based spatiotemporal fusion model (GAN-STFM) [46].
Although spatio-temporal fusion provides practical means for reconstructing large areas of information loss in MODIS LST, their reliability is limited by the availability of auxiliary data. It is acknowledged that due to strong temporal heterogeneity, LST changes greatly over time, and images with closer acquisition times tend to have greater similarity. Thus, in spatio-temporal fusion, fine spatial resolution images with a shorter time interval are considered as a preferable choice for the auxiliary data. Nevertheless, due to frequent cloud cover, there exists great difficulty in finding temporally close images with spatially complete coverage. Although the common practice to search for cloud-free images with a longer time interval can avoid the impact of data quality, the prediction may contain large uncertainties as there can be great LST changes between the prediction and known times. Actually, the impacts of cloud on remote sensing images vary, and temporally close images with a small patch of data missing have the potential to provide significant auxiliary information to the prediction. Thus, for the reconstruction of MODIS LST with large areas of data loss, there is a great need to develop spatio-temporal fusion methods to take full advantage of temporally close LST images with gaps.
To take fuller advantage of the available MODIS LST data, a filling then spatio-temporal fusion (FSTF) method is proposed. Instead of searching for spatially complete, but temporally far MODIS LST as auxiliary data, FSTF considers temporal proximity also, that is, it utilizes the MODIS LST data temporally closest to the prediction time for spatio-temporal fusion. Considering that there probably exist missing spatial data (but with small gaps) in the temporally closest MODIS LST images, gap filling is first applied to keep the integrity of the auxiliary data. Many gap filling methods are available for this task. Amongst the existing filling methods, the most commonly used are hybrid methods that integrate the spatial information of the remaining valid data and the temporal information of auxiliary images acquired at other times (i.e., temporally neighboring image with complete coverage) [49], [50]. Some examples are the neighborhood similar pixel interpolator (NSPI) [51] and the modified NSPI (MNSPI) [52] methods. Additionally, learning-based methods were also developed recently [53], [54], [55], with the unique advantage of characterizing the nonlinear relation between the images with gaps and the auxiliary data.
After gap filling the temporally close MODIS LST data, spatio-temporal fusion is implemented using the gap filled MODIS LST, along with the CLDAS LST at the known and prediction times. FSTF provides a new solution to the reconstruction of MODIS LST with large gaps when there exists difficulty in finding spatially complete auxiliary data and LST changes greatly over time, thus solving the key issue in all-sky MODIS LST generation. Generally, this article makes the following three main contributions.
1) The FSTF method makes fuller use of the valid information in the images acquired temporally close to the prediction time, thus, breaking through the limitation on data integrity for traditional spatio-temporal fusion and enhancing the flexibility of data selection for the case with strong temporal heterogeneity (e.g., the MODIS LST studies in this article). 2) By integrating gap filling and the spatio-temporal fusion techniques effectively, more accurate all-sky MODIS LST can be reconstructed.
3) The generated all-sky LST has the potential to provide hourly MODIS-like LST by inheriting the hourly temporal resolution of CLDAS LST, facilitating fine temporal resolution (e.g., diurnal) LST change monitoring. The hourly, 1 km LST has great potential for various studies based on the need for dynamic LST at the regional scale. The remainder of this article is organized into four sections. In Section II, the data and the proposed FSTF method are introduced. Section II-A and B present the three study areas and the two categories of data utilized in this research. Section II-C introduces the principles of the gap filling and spatio-temporal fusion methods employed in the experiments. Section II-D, E, and F present the principle, theoretical basis, and implementation of the proposed FSTF method, respectively. In Section III, the effectiveness of FSTF is validated in both the spatial and temporal domains. Section Ⅳ further discusses the potential and limitations of FSTF. Section Ⅴ concludes this article.

A. Study Area
The experiments for this research were implemented in three regions in North China, each covering a spatial extent of 140 km × 140 km. Each area is covered by one meteorological station providing ground-based measurements. The location of the three regions and the corresponding meteorological stations are shown in Fig. 1 [56] located in Gansu province (Region 3). The automatic meteorological stations are all installed on towers, providing datasets including air temperature, wind speed, precipitation, and four-component radiation every 10 min. For the three towers in Regions 1-3, the underlying surfaces are reed wetland, corn belt, and piedmont desert, respectively.

B. Data
The aim of this research was to reconstruct MODIS LST images, with the assistance of the CLDAS LST product. The datasets in this experiment, therefore, included: the MODIS LST product (MOD11A1), CLDAS LST product, and ground-based LST.
1) MODIS and CLDAS LST data: MODIS LST is provided by the MOD11A1 daily surface temperature product (version 6) (MODIS/Terra LST/emissivity daily L3 global 1 km SIN grid product), which can be obtained from https://search.earthdata. nasa.gov/. This product was generated by applying the split window algorithm to bands 31 and 32 of MODIS onboard Terra. The MOD11A1 product provides 1 km observations, including daytime and nighttime LST, quality indicators, and observation times. When MODIS data are acquired, they are first reprojected to the same coordinate system as that of CLDAS (WGS84) by the MODIS Reprojection Tool (MRT). Then, to obtain effective MODIS data, the acquired MODIS LST is filtered according to the 8-bit byte quality control (QC) flag of pixels in the QC layer. Specifically, pixels with the flag "cloud" and "average LST error > 3 K" are considered to be invalid data. Thus, the MODIS LST pixels with fine data quality can be selected.
CLDAS is a land surface data assimilation system, which can provide a large number of land surface observation products, such as air temperature, air pressure, soil moisture, and LST. Different from MODIS, the CLDAS product integrates ground observations provided by automatic meteorological stations, numerical analysis/forecast products provided by the European Centre for Medium-Range Weather Forecasting, and many other products. The LST product of CLDAS can provide 7 km × 7 km observations covering the Asian area. In terms of temporal resolution, CLDAS provides hourly LST observations, which have the potential to match accurately the acquisition times of MODIS LST images. By examining the acquisition times of the MODIS LST time-series in the study period, CLDAS LST at universal time coordinated (UTC) 4 a.m., 3 a.m., and 4 a.m. were selected for Regions 1-3, respectively. Also, the fused LST was evaluated using the ground-based LST at the corresponding time. The CLDAS LST can be obtained from http://data.cma.cn/data.
2) Ground-based measurements: Considering that no real reference exists for the reconstructed MODIS LST, groundbased measurements (i.e., in situ data) were used for evaluating the accuracy of the LST time-series data, which is a common strategy [58], [59]. The ground measurements were acquired from the automatic meteorological station data provided by the Institute of Tibetan Plateau Research (http://data.tpdc.ac.cn). The three meteorological stations collect four-component radiation every 10 min, providing important variables for the acquisition of ground-based LST. Generally, the ground-based LST can be calculated according to the Stefan-Boltzmann law based on the surface upwelling and atmospheric downwelling longwave radiation where T s is the derived ground-based LST, and σ is the Stefan- . L ↑ and L ↓ are surface upwelling and atmospheric downwelling longwave radiation, respectively, and ε b is the broadband emissivity, which can be estimated by where ε 29 , ε 31 and ε 32 are narrowband emissivities of MODIS bands 29, 31, and 32, which can be obtained from the MODIS/Aqua LST/3-band emissivity daily L3 global 1 km product (MYD21A1) [60], [61]. For the three study areas, the reconstructed MODIS LST was evaluated using the in situ LST at the same acquisition time.

C. Gap Filling and Spatio-Temporal Fusion Methods
FSTF implements gap filling first and then applies spatiotemporal fusion using CLDAS LST. This section introduces the gap filling and spatio-temporal fusion methods used in this research.

1) Gap filling (MNSPI):
Gap filling aims to reconstruct the spatial information loss caused by cloud or other factors. Until now, methods utilizing both spatial and temporal correlations together are considered to be more reliable compared with methods using either of the two. Typically, for the MNSPI approach, a spatial-based prediction is first estimated according to spatially neighboring information of the data with gaps, and then a temporal-based prediction is made with the assistance of the information in the auxiliary image. For final prediction, the spatial-based and temporal-based predictions are weighted according to the spatial and temporal heterogeneity.
2) Spatio-temporal fusion (STARFM, ESTARFM, VIPSTF-SW, and SPSTFM): Spatio-temporal fusion aims at obtaining fine spatial and temporal resolution images by combining the advantages of images with different resolutions. Generally, the basic mechanisms for predicting the fine spatial resolution image F p by spatio-temporal fusion can be summarized using the following framework [30], [37]: where the prediction of F p includes two terms: the known fine image F k and the fine increment ΔF k→p (i.e., temporal change) to be estimated. The first term makes use of the available fine spatial resolution information directly, while the second term predicts fine spatial resolution change information from the available coarse spatial resolution data. The estimation of ΔF k→p is the core of spatio-temporal fusion, which is calculated by applying different downscaling operators f to the coarse spatial resolution increment ΔC k→p . For STARFM, f is a weighting function considering together the spatial, temporal, and spectral differences of neighboring pixels [26]. Based on STARFM, a conversion coefficient is used in ESTARFM to describe this relationship more explicitly [27]. For VIPSTF-SW, spatial weighting-based fusion is performed using a virtual image pair that is generated by applying a linear transformation to the original image pair [30]. For SPSTFM, ΔF k→p is estimated by training a dictionary-pair of patches between two coarse and fine difference image pairs [40].

D. Proposed FSTF Method
In conventional spatio-temporal fusion, when there exist spatial gaps in temporally adjacent MODIS pixels, the image is completely discarded and cloud-free auxiliary images at a more distant time are further adopted. However, considering the great heterogeneity of LST in the temporal dimension, this practice can amplify the uncertainty of spatio-temporal fusion as greater LST changes may occur when there is a large time interval between the known and prediction times. Thus, it is necessary to make fuller use of the important effective information in temporally adjacent images, even when they are contaminated by cloud. Actually, the area of influence of cloud cover in remote sensing images varies greatly. For images with small cloud coverage, abandoning the whole image leads to a significant waste of valuable auxiliary data. Thus, to address this problem, the FSTF algorithm is developed here. Note that this method aims to reconstruct MODIS LST images with a large area of information loss (especially for completely missing data). The process of FSTF (green line) is shown in Fig. 2, presented with a comparison to traditional spatio-temporal fusion (red line).
For FSTF, to predict the MODIS LST at t p , auxiliary image pairs on other dates need to be selected. Suppose that there are four image pairs acquired before and after the acquisition time of t p . More precisely, t m and t n are temporally close to t p , while the MODIS LST images contain data loss to some extent. Image pairs acquired at t k and t l , however, are temporally further from the predicted MODIS LST, with complete spatial coverage. For traditional spatio-temporal fusion, image pairs acquired at t k and t l are applied directly. Due to the great LST change occurring from the known to prediction times, however, the reliability of prediction remains to be examined. Alternatively, for FSTF, MODIS LST images acquired at t m and t n are first reconstructed using gap filling methods. Then, the complete MODIS and CLDAS LST image pairs are included in the spatio-temporal fusion to make a final prediction. It is noted that the proposed FSTF is a class of method that can be implemented by applying different gap filling and spatio-temporal methods.
When reconstructing the MODIS LST time-series, we divided the acquired MODIS LST images into three classes according to the missing area: no gaps, small gaps (missing area less than 40%), and large gaps (missing area more than 40%). For the small gap case, a gap filling method (i.e., MNSPI) was applied to reconstruct the missing information with the temporally closest, spatially complete MODIS LST images. For the large gaps case, spatio-temporal fusion was implemented based on the temporally closest MODIS LST image, which refers to either the observed MODIS LST image with no gaps or filled data for the small gaps case. By applying the above-mentioned process, all-sky MODIS LST time-series images can be reconstructed.

E. Comparison Between FSTF and Traditional Spatio-Temporal Fusion
For further examination of the rationale of FSTF, a theoretical comparison between the FSTF and traditional spatio-temporal fusion is presented. Suppose that there is a MODIS LST image M m acquired at a time close to the MODIS LST M p to be predicted, but with a small area of spatial information loss. Also, the temporally further, spatially complete MODIS LST M k is available. Traditionally, spatio-temporal fusion requires spatially complete auxiliary data. Accordingly, the prediction for M p isM where f 1 is the downscaling function in spatio-temporal fusion, and ΔC k→p is the CLDAS LST increment from t k to t p . FSTF conducts spatio-temporal fusion based on the temporally close MODIS LST image M m , even if there exists data loss. Thus, the prediction for M p iŝ where M m_valid denotes the data for the valid area of M m andM m_mis sin g is the missing area of M m required to be estimated. It is noted that there is no overlap between M m_valid andM m_mis sin g . By applying gap filling based on M k , (5) can be updated as follows: In (6), M k_mis sin g are the valid data in M k that share the same geographical location with the missing area in M m , and ΔM k→m_valid is the MODIS LST increment from t k to t m for the valid area. f 2 is a spatial interpolation algorithm.
For comparison between traditional spatio-temporal fusion and FSTF, the traditional version in (4) is altered tô where ΔC k→m and ΔC m→p are the CLDAS LST increments from t k to t m and t m to t p , respectively. It is noted that f 1 here should be a linear function, which is in accordance with the four spatio-temporal fusion methods applied in this research. Through comparison between (7) with (6), it is found that there are two terms differing, that is, M k_valid + f 1 (ΔC k→m ) in (7) and M m_valid + f 2 (ΔM k→m_valid ) in (6). For the two constant parts M k_valid and M m_valid , they cover the same spatial area. However, it is clear that compared with M k_valid , M m_valid is temporally closer to the prediction time and, thus, can provide more reliable auxiliary information. The core is to compare f 1 (ΔC k→m ) with f 2 (ΔM k→m_valid ). First, from the perspective of spatial scale, f 1 (ΔC k→m ) involves a downscaling process with great uncertainty. However, f 2 (ΔM k→m_valid ) in FSTF is a spatial interpolation algorithm performed at the same fine spatial resolution with the original data, which tends to involve less uncertainty. Second, from the amount of data for prediction, f 2 (ΔM k→m_valid ) in FSTF needs only to predict the data for pixels in the missing area, but f 1 (ΔC k→m ) needs to predict the data for all pixels in the entire region. It is widely acknowledged that uncertainty normally exists in any prediction process. In summary, we can conclude that FSTF tends to involve less certainty than traditional spatio-temporal fusion.

F. Implementation of the FSTF Method
The specific implementation of FSTF is as follows: Step 1: The obtained MODIS LST time-series was classified into three categories: complete LST image, LST image with small gaps, and LST image with large gaps.
Step 2: The MNSPI method was applied to reconstruct the information in the MODIS LST images with small gaps based on the temporally closest, complete LST image.
Step 3: For MODIS LST images with large gaps, there are two parts. First, the MNSPI method was applied to reconstruct the missing information in the two MODIS LST images acquired temporally closest, before and after the prediction time.
Second, based on the reconstructed MODIS LST images, spatio-temporal fusion was applied to reconstruct the missing MODIS LST image, with the assistance of the corresponding CLDAS LST images at the three times (the prediction time and the two known times of reconstructed MODIS LST data).

III. EXPERIMENTS
The experiments for this research are divided into two parts. In the first experiment in Section Ⅲ-A, the proposed FSTF method was examined in the spatial dimension. That is, the reconstructed MODIS LST image was evaluated (predicting one LST image at each time) by comparison with the reference image with complete spatial coverage. According to the process of FSTF, Section Ⅲ-A presents the results of gap filling and spatiotemporal fusion, followed by a comparison between different spatio-temporal fusion methods to provide the FSTF version with the greatest performance for Section Ⅲ-B. In the second experiment in Section Ⅲ-B, the performance of FSTF was tested in the temporal dimension. That is, the reconstructed MODIS LST time-series was evaluated based on the temporal profile of each pixel, by referring to the in situ LST measurements. Table I lists the acquisition times of the MODIS and CLDAS data utilized in Experiments 1 and 2.

A. Experiment 1: Evaluation in the Spatial Dimension
To examine the feasibility of FSTF, a comparison experiment was conducted between spatio-temporal fusion (simplified to STF in the experiments) and FSTF. The research areas included in this experiment were selected amongst the three regions introduced in Section Ⅲ-A. The data are shown in Fig. 3. For Cases 1 and 2, experimental data were selected from Region 1. While for Case 3, data were selected from Region 3. To Similarly, the available MODIS LST images (with no gaps) on 31 July 2018 and 17 May 2020 were used as references for evaluation in Cases 2 and 3, respectively. For the three cases, the STARFM, ESTARFM, SPSTFM, and VIPSTF-SW methods were applied to STF, while the four corresponding FSTF versions were also tested. It is noted that FSTF is a class of methods composed of a gap filling method and a spatiotemporal fusion method. In this research, FSTF was specified by integrating the MNSPI method and four different spatiotemporal fusion methods (i.e., STARFM, ESTARFM, SPSTFM, and VIPSTF-SW), which are named as STARFM-based FSTF, ESTARFM-based FSTF, SPSTFM-based FSTF, and VIPSTF-SW-based FSTF, respectively. 1) Gap filling: Gap filling is the first step for FSTF, the performance of which can affect the final prediction. The gap filling results for three cases are shown in Fig. 4. Amongst the three cases, quantitative evaluation can be conducted for the simulated experiment (i.e., Case 1), as the spatially complete MODIS LST images at the corresponding time are available. For Case 1, the correlation coefficients (CCs) for the reconstructed MODIS LST images on 27 April 2018 and 3 May 2018 are 0.929 and 0.820, respectively. Moreover, the root mean square errors (RMSEs) are 1.221 K and 1.142 K, respectively. As the performance of MNSPI depends on the size and position of missing area greatly, the accuracy varies in different gap filling cases. Generally, the results of MNSPI for the three cases have a satisfactory performance considering the visual continuity and relatively harmonious hue.
2) Spatio-temporal fusion: The results of FSTF and STF are presented in Fig. 5. Checking the fusion results, the predictions of FSTF are visually more similar to the reference than those for STF in most cases. Specifically, for Case 1, the values of the STF  predictions tend to be larger than those in the reference, which incorrectly visually present as red, especially in the middle and upper left of the image. Generally, the ESTARFM-based and VIPSTF-SW-based results have a color similar to the reference. For Case 2, almost all the predictions present an overprediction of LST. Compared with the results of STF, the predictions of all three FSTF versions are visually more accurate and they present more similar colors to the reference. Amongst all four versions, the prediction of the VIPSTF-SW-based methods appears to be the closest to the reference visually, and other predictions overestimate the range of high-temperature area. For Case 3, the prediction using FSTF is challenged, as the missing area overlaps greatly in the two auxiliary MODIS LST images. Moreover, although the MODIS LST image pairs involved in STF are temporally further from the prediction date, the MODIS LST image acquired on 1 May 2020 is quite similar to the predicted MODIS LST image due to the irregular variation of LST. Thus, in this case, FSTF may fail to produce more satisfactory results than STF. From visual inspection, however, FSTF presents more satisfactory results for all four versions in the prediction of the upper half of the image, but fails to predict the correct color in the middle of image corresponding to the missing area in the auxiliary images.
Quantitative evaluation for spatio-temporal fusion is implemented based on RMSE and CC, as exhibited in Table II. For Cases 1 and 2, the accuracies of FSTF are obviously greater than for STF. For Case 1, the RMSEs of STARFM-based, ESTARFM-based, SPSTFM-based, and VIPSTF-SW-based  VIPSTF-SW-based FSTF produces the smallest RMSE of 4.890 K and largest CC of 0.784. For Case 3, FSTF produces less accurate predictions than the original methods, which corresponds to the visual inspection. Actually, in most cases, LST acquired within a few days tends to be more similar, while LST acquired more than half a month differs more according to the natural variation in temperature. Objectively, when the MODIS LST images were acquired temporally further away, but are more similar in value to the prediction date, FSTF may fail to produce more accurate prediction. This kind of extreme situation, however, is rare in practice, which requires subjective inspection in the auxiliary data selection process. Thus, in spatio-temporal fusion, temporally closer image pairs are still a preferable choice for fusion. To avoid the contingency of the experiment, examination of the temporal dimension will be conducted in Section Ⅲ-B, with the mission of reconstructing MODIS LST time-series.
3) Comparison between different spatio-temporal fusion methods: In this research, four spatio-temporal fusion methods (i.e., STARFM, ESTARFM, SPSTFM, and VIPSTF-SW) were applied to both STF and FSTF. From visual inspection, the results of the four methods seem to be similar, for both STF and FSTF versions. Checking the quantitative evaluation results, it is found that the ESTARFM-based and VIPSTF-SW-based methods tend to be more accurate for both the STF and FSTF versions. It is noted that although the predictions of SPSTFM-based FSTF have the largest CC and smallest RMSE for Case 1, its accuracy declines greatly for Cases 2 and 3. While VIPSTF-SW-based FSTF produces the second largest CC of 0.868 for Case 1, and the largest CC and the smallest RMSE for Case 2. Specifically, the CC of VIPSTF-SW-based FSTF for Case 2 is 0.040, 0.001, and 0.050 larger than for STARFM-based, ESTARFM-based, and SPSTFM-based FSTF, respectively. For RMSE, VIPSTF-SWbased FSTF is 1.835 K, 1.456 K, and 1.365 K smaller than for STARFM-based, ESTARFM-based, and SPSTFM-based FSTF, respectively. For Case 3, ESTARFM-based STF produces the largest CC of 0.965 and the smallest RMSE of 2.601 K, while the performance of the four methods in FSTF is relatively similar. Considering the relative performances of the four methods in this examination of the spatial dimension, the VIPSTF-SW-based method will be applied to the following experiments in the temporal dimension.

B. Experiment 2: Evaluation in the Temporal Dimension
To examine the performance of FSTF in the temporal dimension, MODIS LST images covering a period of a few months were selected, as shown in Fig. 6. Specifically, images from 1 March 2018 to 31 August 2018, 1 March 2018 to 30 June 2018 and 1 May 2020 to 31 August 2020 were considered for Regions 1-3, respectively. Checking the LST image types defined in Section Ⅱ-D for the three regions, there are 13, 5, and 4 images with no gaps in the three regions, with a proportion of 7.07%, 4.1%, and 3.25% amongst all available MODIS LST images for Regions 1-3, respectively. Obviously, as cloud contamination tends to be a normal phenomenon, the number of spatially complete MODIS LST is limited. Amongst all three regions, the number of images with large gaps tends to be the largest, occupying 53.26%, 51.64%, and 63.41% for Regions 1-3, respectively. Considering the composition of the three types of images for the three regions, reconstruction of the MODIS LST time-series in Region 3 is the most challenging. Partial reconstruction results of the VIPSTF-SW-based STF and FSTF methods are shown in Fig. 7. Generally, the reconstruction results have a fine spatial continuity and accord with the law of the natural change of LST. For several days, the reconstruction results of STF and FSTF differ greatly, such as the results on 20 August for Region 1, 4 April for Region 2, and 21 July for Region 3.
To quantify the reconstruction accuracy for the three regions, the in situ LST measurements for Zhangye wetland station, Huailai station, and Huazhaizi desert station were applied for Regions 1-3, respectively, as shown in Fig. 8. The left column presents the predictions for STF and FSTF at the location of the stations and the in situ measurements. To present the differences between the predictions of STF and FSTF more clearly, the absolute difference between the prediction and the in situ measurements is shown in the right column (Fig. 8). Generally, the prediction of FSTF is closer to the measured LST than that for STF for the three regions. For Region 1, as the number of spatially complete MODIS LST images is relatively large, the predictions of STF and FSTF appear to be similar on most dates. As can be seen from Fig. 8(b), however, the absolute difference for FSTF on the first half of the dates is smaller than that for STF, and the performances of these two methods in the second    half of the dates are close to each other. For Region 2, it can be noted in Fig. 8(c) that the green line is closer to the in situ LST measurements, indicating the greater performance of FSTF. Again, the prediction of FSTF also presents a smaller absolute difference in Fig. 8(d). For Region 3, both Fig. 8(e) and (f) indicate that the predicted LST of FSTF is closer to the measured LST on most dates.
To examine the overall performance, the error distributions and scatterplots between the predictions and in situ data are shown in Figs. 9 and 10, respectively. In Fig. 9, it is noted that for all three regions, the error of FSTF is closer to zero compared with STF. In Fig. 10, for all three regions, the scatter plots of FSTF against the in situ data are closer to the y = x line and are more aggregated compared with those for STF and the in situ data. To further evaluate the overall accuracy, the mean absolute error (MAE), RMSE, and coefficient of determination (R 2 ) were calculated, as shown in Table III. Checking the three indices, the FSTF predictions present greater accuracy generally. More precisely, the MAEs of FSTF are 0.683 K, 1.273 K, and 2.447 K smaller than those for STF for Regions 1 to 3, respectively. Furthermore, FSTF produces RMSEs that are 0.666, 0.900, and 2.189 K smaller than for STF for the three regions. Thus, when reconstructing LST time-series with large spatial gaps, FSTF can produce greater accuracy.

A. Prediction of 1 Km Hourly MODIS LST Data
As presented in the Introduction, the CLDAS product can provide 7 km hourly LST. To reconstruct all-sky LST, this article utilized the CLDAS LST at the same acquisition time of MODIS LST for reconstruction. Ultimately, MODIS LST time-series were generated by FSTF. Although this research produces 1 km spatial resolution daily LST, the temporal resolution may be coarse for studies on diurnal variation in LST. Actually, with the reconstructed daily 1 km MODIS LST and 7 km hourly CLDAS LST, there exists a great possibility to obtain hourly MODIS-like LST by inheriting the spatial resolution of MODIS LST and temporal resolution of CLDAS. This process can be realized directly by employing spatio-temporal fusion. Once the spatially complete MODIS-CLDAS LST image pair at one time point in a day is acquired, it can be regarded as the known fine-coarse image pair. Thus, by fusing with CLDAS at other times during the day, 1 km LST for the other 23 h in a day can be predicted.
Taking the reconstruction of hourly 1 km LST on May 1 to May 10 for Region 1 as an example, the reconstruction results are shown in Fig. 11. To obtain a more reliable prediction, the former and latter temporally closest MODIS-CLDAS LST image pairs were applied for spatio-temporal fusion. For example, when reconstructing the 1 km LST at UTC 7:00 on May 2, the MODIS-CLDAS LST image pairs at UTC 4:00 on May 2 and May 3, together with the CLDAS at UTC 7:00 on May 2 were included. Checking the reconstruction results, the variation of LST within a day is reconstructed, as the LST increases from UTC 0:00 to 6:00 and decreases from UTC 6:00 to 23:00. Also, the variation of LST presents temporal continuity, which is in accordance with the common expectation that LST changes gradually over time. Generally, the reconstruction of 1 km hourly LST is feasible from visual inspection. The reconstructed 1 km hourly LST images have great potential for research on diurnal variation in LST across different land cover types. Note that when generating 1 km hourly LST, the errors may accumulate and, thus, it is important to ensure a high accuracy of any previous daily LST reconstruction based on FSTF.

B. Flexibility of FSTF
In this article, a FSTF method is proposed to reconstruct all-sky MODIS LST images with the assistance of CLDAS data. FSTF includes two steps: gap filling and spatio-temporal fusion. By integrating a typical gap filling method (i.e., MNSPI) and one of the four spatio-temporal fusion methods (i.e., STARFM, ESTARFM, SPSTFM, and VIPSTF-SW), four specific forms of FSTF were developed in this article. As the two parts together determine the accuracy of FSTF, to further increase the reconstruction accuracy, it is of great need to explore more forms of FSTF by combining more powerful gap filling and spatio-temporal fusion methods. For gap filling, more spatialtemporal information-based methods can be considered, such as deep learning-based methods. The key issue would be to collect sufficient reliable training data based on the platform of a high performance computer. For spatio-temporal fusion, the four algorithms included in this article are spatial weightingbased methods and machine learning-based methods. In future research, the performance of FSTF may be improved by developing other spatio-temporal fusion methods, such as hybrid methods and deep learning-based methods. In potential models, it would be crucial to account for the change pattern of LST over time. Moreover, it is noted that the FSTF method proposed in this article implements gap filling and spatio-temporal fusion with images at just one or two time points. Actually, there exists great spatial and temporal correlation between the image time-series to be reconstructed. In future research, an extended FSTF version integrating the information of the image time-series deserves to be developed for more reliable reconstruction.

C. Potential of FSTF
The FSTF method proposed in this article aims at reconstructing large areas of data loss in MODIS LST, thus, contributing to the generation of all-sky MODIS LST products. It has great potential for updating the current MODIS LST product at the global scale. Furthermore, FSTF has the potential to be applied to more situations. First, FSTF can help to generate all-sky LST with finer spatial resolution by blending the predicted all-sky MODIS LST product with a finer spatial resolution, but coarser temporal resolution product. For example, the Landsat-8 TIR band can provide 100 m spatial resolution LST every 16 days, but also encounters the problem of spatial information loss. In this case, FSTF can be applied to reconstruct Landsat LST with data loss by fusing with all-sky 1 km MODIS LST, thus, generating all-sky 100 m LST. Second, other than LST, FSTF has the potential to support the reconstruction of other surface observation data. Generally, FSTF has the ability to reconstruct fine spatial resolution products with data loss by fusing with spatially complete products of the same type, but with coarser spatial resolution and finer temporal resolution. Actually, missing data is a common problem in surface observation products, as many products are generated from optical remote sensing images, which always face the issue of cloud contamination.

D. Uncertainty in FSTF and Validation
The FSTF method proposed in this article provides a new approach for reconstructing MODIS LST images. However, it is a method composed of multiple steps and, importantly, its performance relies heavily on the pregap filling process. As MNSPI cannot produce a perfect prediction, the error caused by gap filling may propagate to the postspatio-temporal fusion step. Thus, the impact of the error caused by the pregap filling process should be considered when applying spatio-temporal fusion. Moreover, it would also be worthwhile to explore a one-stage method that can realize gap filling and spatio-temporal fusion in a unified framework, where the uncertainty can be handled jointly.
In situ time-series data were employed for accuracy validation of FSTF in the temporal domain, due to a lack of real, spatially complete reference data. In practice, due to the lack of alternative data, in situ measurements serve as a common choice for evaluation of predicted LST time-series [58], [59]. However, there may exist errors in the reference data themselves, not least since the automatic meteorological stations are installed on towers instead of on the ground. Thus, when using in situ LST data to evaluate the accuracy of FSTF, uncertainties remain and these should be further investigated. In future research, ground measurements with greater reliability are expected to be explored to provide a more critical validation system. FSTF provides a new means for making full utilization of the available data, and in this research was demonstrated to be more accurate than the conventional spatio-temporal fusion methods used widely for LST reconstruction in recent studies [8], [12], [19]. Moreover, the MODIS LST time-series reconstructed by FSTF tends to be closer to the in situ time-series data in the temporal trend [see Fig. 8(a), (c), and (e)]. Future research should explore ways to increase the accuracy of FSTF further to meet various research demands. In particular, in future research it would be interesting to further increase the accuracy of FSTF by involving multisource data and optimizing the integration of gap filling and spatio-temporal fusion methods.

V. CONCLUSION
As an important sensor for global monitoring, MODIS can provide 1 km LST every day, but is affected by different degrees of spatial information loss. For generation of an all-sky MODIS LST product, it is both necessary, and a great challenge, to reconstruct images with large gaps, especially for completely missing data in a local area of interest. This article proposes a FSTF method for reconstructing MODIS LST time-series with the assistance of CLDAS LST data. By integrating effectively gap filling and spatio-temporal fusion methods, FSTF provides a practical solution for all-sky MODIS LST time-series generation. Based on the experiments conducted in three regions, the following main conclusions are made.
1) FSTF is able to take full advantage of temporally close MODIS LST data with small gaps and can produce greater reconstruction accuracy than the original spatio-temporal fusion. 2) CLDAS is a type of effective auxiliary data for reconstructing MODIS LST with large gaps. 3) VIPSTF-SW-based FSTF is generally superior to STARFM-based and ESTARFM-based FSTF. He is currently a Distinguished Professor of Spatial Data Science and Dean of the Faculty of Science and Technology, Lancaster University, Lancaster, U.K. He was previously a Professor of Geography with the University of Southampton, where he is currently a Visiting Professor. He is also a Visiting Professor with the Chinese Academy of Sciences, Beijing, China. He previously held the Belle van Zuylen Chair with Utrecht University, The Netherlands. He has authored or coauthored more than 300 peer-reviewed articles in international scientific journals and around 50 refereed book chapters. He has also edited nine journal special issues and eight books. The main focus of his research is in remote sensing, geographical information science and spatial (and space-time) statistics applied to a range of environmental science and socio-economic problems.
Prof. Atkinson is the recipient of the Peter Burrough Award of the International Spatial Accuracy Research Association and is a Fellow of the Learned Society of Wales. He is the Editor-in-Chief for Science of Remote Sensing, a sister journal of Remote Sensing of Environment. He also sits on the editorial boards of several further journals including Geographical Analysis, Spatial Statistics, International Journal of Applied Earth Observation and Geoinformation, and Environmental Informatics.