
Object-Level Hybrid Spatiotemporal Fusion: Reaching a Better Tradeoff Among Spectral Accuracy, Spatial Accuracy, and Efficiency

Dizhou Guo and Wenzhong Shi

Dizhou Guo is with the Jiangsu Key Laboratory of Resources and Environmental Information Engineering, China University of Mining and Technology, Xuzhou 221116, China (e-mail: dizhou_guo@cumt.edu.cn).

Wenzhong Shi is with the Department of Land Surveying and Geo-Informatics, Otto Poon Charitable Foundation Smart Cities Research Institute, The Hong Kong Polytechnic University, Hong Kong, SAR, China (e-mail: john.wz.shi@polyu.edu.hk).

Digital Object Identifier 10.1109/JSTARS.2023.3310195
Abstract-Spatiotemporal fusion (STF) is a cost-effective way to complement the spatiotemporal resolution of multisource images, which has been employed in various applications requiring image sequences. In real-world applications, the spectral accuracy, spatial accuracy, and efficiency of STF play a critical role. Despite this, most STF methods focus on improving spectral accuracy, whereas the challenges of spatial information loss and low efficiency have received limited attention. In addition, the improvements in spectral accuracy, spatial accuracy, and efficiency in STF are contradictory, and existing STF methods cannot balance them well, which limits their reliability and applicability for various STF tasks. To solve the above-mentioned issues, this study proposes an object-level hybrid STF method (OL-HSTFM), which incorporates the efficiency advantage of the object-level fusion strategy, the spectral accuracy advantage of the three-step method (Fit-FC), and the spatial accuracy advantage of the spatial and temporal adaptive reflectance fusion model. The performance of OL-HSTFM was compared with two classic STF methods and eight state-of-the-art STF methods at two sites. The experimental results indicate that OL-HSTFM outperforms the other ten methods in overall performance and has excellent efficiency. Furthermore, this study proposes a new metric that can assess the accuracy of both spatial and spectral domains in STF, which provides a more comprehensive and intuitive measurement of the quality of fused images compared to commonly used metrics.

I. INTRODUCTION
HIGH spatiotemporal resolution remote sensing images are in high demand for numerous applications, such as crop yield estimation [1], [2], wetland observation [3], [4], and disaster assessment [5], [6]. However, due to the contradiction between satellite scanning swath and pixel size, no single satellite can provide continuous high-resolution surface monitoring [7], [8]. Spatiotemporal fusion (STF) is regarded as a promising solution to resolve the contradiction between the spatial and temporal resolutions of satellite images and has been utilized in various applications requiring image sequences [9], [10], [11], [12], [13], [14]. Typically, STF blends images from two types of satellites: one with high spatial resolution but low revisit frequency (referred to as "fine images" and their pixels as "fine pixels"), and another with high revisit frequency but low spatial resolution (referred to as "coarse images" and their pixels as "coarse pixels").
Over the past two decades, the STF technique has undergone rapid development, and more than one hundred STF methods have been proposed [7]. The majority of the existing methods can be divided into four groups [15]: filter-based, unmixing-based, learning-based, and hybrid methods. The spatial and temporal adaptive reflectance fusion model (STARFM) proposed by Gao et al. [16] is the first filter-based method, in which a semiempirical weighting function model is constructed to combine the change information of similar pixels for reconstructing the fine image. It has since inspired many other filter-based methods, such as the enhanced STARFM (ESTARFM) [17], the spatial and temporal nonlocal filter-based fusion model [18], the rigorously-weighted STF model [19], and the three-step method (Fit-FC) [20]. The unmixing-based method exploits mixed pixels in a coarse image based on the spectral linear mixing theory [21] to obtain a finer image. The multisensor multiresolution technique [22] is the first unmixing-based method for STF. Since then, new methods including the spatial-temporal data fusion approach (STDFA) [23], the unmixing-based spatiotemporal reflectance fusion model [24], the blocks-removed spatial unmixing method [25], and the geographically weighted spatial unmixing method [26] have been proposed to improve the flexibility of the unmixing process. The first learning-based STF method is based on the sparse representation model [27]. With advances in hardware and deep learning technology, numerous deep learning networks for STF have been developed [28], such as the STF method using deep convolutional neural networks [29], the deep convolutional STF network [30] and its enhanced version [31], the STF method using a generative adversarial network (GAN) [32], the GAN-based STF model [33], the hybrid convolutional neural network [34], and the robust STF network [35]. To incorporate the strengths of the various models, numerous hybrid methods that combine multiple models have been developed, such as the flexible spatiotemporal data fusion (FSDAF) method [36], the robust adaptive spatial and temporal fusion model [37], the FSDAF 2.0 method [38], the enhanced FSDAF method that incorporates subpixel class fraction change information (SFSDAF) [39], the object-based STF model (OBSTFM) [40], and the reliable and adaptive spatiotemporal data fusion (RASDF) method [41].

The STF accuracy evaluation system can guide the development direction of the algorithms. However, the metrics commonly used in previous studies [8], [42], [43], [44] for evaluating accuracy, including root-mean-square error (RMSE), average difference (AD), correlation coefficient (r), and structural similarity index measure (SSIM), primarily reflect the spectral similarity with the reference fine image rather than the spatial similarity [45]. To assess the all-round performances of STF models, Zhu et al. [45] recently proposed a novel framework for evaluating STF accuracy that encompasses both spectral and spatial domains. Specifically, RMSE and AD are used to describe spectral accuracy, whereas the normalized difference of Robert's edge detector result (EDGE) and the normalized difference of the local binary patterns detector result (LBP) are employed to measure spatial accuracy. Zhu et al. [45] found that the improvement of spectral accuracy generally requires a decrease of spatial accuracy as a cost in existing STF methods: optimizing spectral accuracy through the best utilization of coarse images may produce smooth results, whereas maximizing spatial accuracy through the best utilization of fine images can reduce the flexibility of restoring spectral information. This suggests that there is a difficult-to-reconcile contradiction between spectral and spatial accuracy in existing STF methods, hindering the improvement of overall accuracy.
In addition to the tension between spectral accuracy and spatial accuracy, the field of STF also faces a contradiction between efficiency and accuracy. For example, in the STARFM-like methods [16], [17], [46], [47], reducing the size of the moving window can improve efficiency but may result in a decrease in spectral accuracy. Training a network with more data can typically enhance the performance of STF, but it increases the training time. Hybrid STF methods merge the benefits of multiple models to achieve a better balance between spectral accuracy and spatial accuracy, but this generally leads to more complex computation. Two recent studies have attempted to resolve this issue: 1) Gao et al. [48] used the compute unified device architecture (CUDA) to parallelize FSDAF (cuFSDAF); the experiments showed that cuFSDAF can obtain similar accuracies to FSDAF while achieving speed-ups of 140.3-182.2 times over the original FSDAF program. This parallelization strategy can greatly increase computational efficiency, but it also places higher demands on the computing platform. 2) Our team proposed an object-level processing strategy [49] to make the STF model lighter, and the experimental results indicated that the object-level fusion versions of STARFM, ESTARFM, and Fit-FC can obtain similar spectral accuracy to their original methods while achieving speed-ups of 102.89-113.71, 92.77-115.73, and 30.51-36.15 times over their original programs. The object-level processing strategy provides a low-cost solution to the tension between efficiency and accuracy. However, the experimental results showed that the object-level fusion version of STARFM (OL-STARFM) and the object-level fusion version of Fit-FC (OL-Fit-FC) have distinct tendencies in spatial accuracy and spectral accuracy: OL-STARFM is advantageous for preserving spatial details but generally obtains low spectral accuracy, whereas OL-Fit-FC can obtain excellent spectral accuracy but generally produces smooth results. Despite a satisfactory balance between efficiency and accuracy, these two methods still could not reach a satisfactory tradeoff between spectral accuracy and spatial accuracy.
Spectral accuracy, spatial accuracy, and efficiency of STF are all critical in real-world applications: high computational efficiency enhances the feasibility and applicability of STF methods for large-scale and long-term tasks, and high accuracy in both spectral and spatial domains is vital for accurate information extraction (e.g., land cover classification). However, to the best of our knowledge, most STF methods only focus on improving spectral accuracy [45], a few STF methods aim to improve both efficiency and spectral accuracy [48], [49], none of the methods have considered all three factors simultaneously, and few previous studies [45] have sufficiently analyzed the reason for the contradiction between spectral accuracy and spatial accuracy, which limits the reliability and applicability of existing methods for various STF tasks. Furthermore, most users are concerned with the overall accuracy of the spatial and spectral aspects of the fusion results, as both are crucial for real-world applications. The framework developed by Zhu et al. [45] can assess the spectral accuracy and spatial accuracy of fusion results by using multiple metrics (RMSE, AD, EDGE, and LBP). However, there is still no single metric that can accurately assess the comprehensive quality of the fused image, making it difficult for users to choose the optimal method for their needs.
In this study, to cope with the above-mentioned issues, two lightweight methods (OL-STARFM and OL-Fit-FC) are adopted as representatives to sufficiently analyze the reasons for their distinct tendencies in spectral accuracy and spatial accuracy. Based on these findings, an object-level hybrid STF method (OL-HSTFM) is developed based on the multimodel strategy to incorporate the spatial accuracy advantage of the OL-STARFM model and the spectral accuracy advantage of the OL-Fit-FC model. In addition, a novel spatiospectral accuracy metric (SSAM), which can assess the accuracy of both spatial and spectral domains, is proposed to provide a comprehensive and intuitive evaluation of the quality of the fused image. The rest of this article is organized as follows. In Section II, we describe the principles of OL-STARFM, OL-Fit-FC, the adaptive weighting module, and SSAM, and analyze how to combine the advantages of OL-STARFM and OL-Fit-FC. In Section III, we introduce the study sites and experimental designs. In Section IV, we test and compare the proposed OL-HSTFM with ten popular methods. In Section V, we discuss the advantages of OL-HSTFM and the limitations of this study and conclude this article.

II. METHODOLOGY
OL-HSTFM consists of three modules: the OL-STARFM module, the OL-Fit-FC module, and the adaptive weighting module. The OL-Fit-FC module is used to recover most of the spectral information, whereas the OL-STARFM module mainly aims to supplement the lost spatial information for the prediction of OL-Fit-FC. Specific principles of OL-STARFM and OL-Fit-FC can be found in [49]. OL-HSTFM uses an adaptive weighting module to flexibly combine the strengths of OL-STARFM and OL-Fit-FC. A list of important notations and definitions is given in Table I for convenience. The flowchart of OL-HSTFM is shown in Fig. 1.

A. OL-STARFM Module
STARFM is a classical STF method and is still widely used by many institutions, such as the United States Department of Agriculture. OL-STARFM is the object-level fusion version of STARFM and offers two advantages over its predecessor: 1) OL-STARFM operates on object-level processing, resulting in a speed-up of 102.89-113.71 times compared to STARFM [49]; and 2) it provides better spatial accuracy.
OL-STARFM consists of two stages: temporal change estimation and residual compensation. Before the fusion process, the auxiliary fine image is divided into homogeneous regions by the multiresolution segmentation algorithm of the eCognition software. The parameters of segmentation are the same as those in [40] and [49]: the smoothness weight and spectrum weight are set to 0.5 and 0.6, respectively, and the scale value is set to 150. The temporal change value of each segmented object is then estimated to obtain the preliminary prediction of OL-STARFM (F_s) as follows:

F_s(p, B) = F(b, B) + M[C(p, B) - C(b, B)]

where M[•] means taking the median of all pixel values in the segmented object, B denotes the index of band, p is the prediction phase, b is the base phase, the term s indicates the index of the segmented object, F indicates the fine image, and C indicates the coarse image. OL-STARFM uses the median value instead of the expected value to reduce the impact of poor-quality pixels on the prediction. However, this strategy may bring deviation to the prediction. Thus, a step of pixel-level residual compensation is added to enhance the preliminary prediction; the distribution of residuals takes the pixel as the basic unit, as follows:

R_s(B) = [C(p, B) - C(b, B)] - Scale↑[F_s(p, B) - F(b, B)]

F_S(p, B) = F_s(p, B) + Scale↓[R_s(B)]

where R_s is the difference between the temporal changes of the coarse images and the fine images, Scale↑ and Scale↓ mean the upscaling function (pixel aggregation) and the downscaling function (bicubic interpolation), respectively, and F_S is the final prediction of OL-STARFM. Unlike STARFM, the neighboring fine pixels in OL-STARFM do not participate in the weighting calculation; the prediction of OL-STARFM is obtained by a simple addition between the value of the auxiliary fine pixel and the estimated temporal change value from the auxiliary coarse pixels. While this implementation is lightweight and conservative, allowing the model to inherit most of the texture information from the auxiliary fine image and preserve more spatial information than STARFM, it also limits the model's flexibility in capturing strong temporal changes, leading to its disadvantage in predicting spectral information.
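To make the two stages concrete, the following is a minimal single-band sketch (our illustration, not the authors' released OL-STARFM code). It assumes co-registered inputs on the fine grid, image sizes divisible by the scale ratio, and cubic spline interpolation standing in for bicubic; the function name ol_starfm_band is ours.

```python
# A minimal sketch (not the released code) of the two OL-STARFM stages, for a
# single band. `objects` holds an integer segment label per fine pixel; C_b
# and C_p are the coarse images already resampled to the fine grid; `ratio`
# is the coarse/fine pixel-size ratio (480 m / 30 m = 16).
import numpy as np
from scipy.ndimage import median, zoom

def ol_starfm_band(F_b, C_b, C_p, objects, ratio=16):
    labels = np.unique(objects)
    dC = C_p - C_b
    # Stage 1: per-object temporal change (the median suppresses poor-quality
    # pixels), added to the auxiliary fine image.
    change = np.asarray(median(dC, labels=objects, index=labels))
    F_s = F_b + np.take(change, np.searchsorted(labels, objects))
    # Stage 2: pixel-level residual compensation. R_s is the gap between the
    # coarse-scale temporal change and the upscaled predicted fine change.
    up = lambda x: x.reshape(x.shape[0] // ratio, ratio, -1, ratio).mean(axis=(1, 3))
    R_s = up(dC) - up(F_s - F_b)
    return F_s + zoom(R_s, ratio, order=3)  # cubic interpolation of residuals
```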

B. OL-Fit-FC Module
Numerous studies have reported the effectiveness of Fit-FC in capturing strong phenological changes [45], [50] and its satisfactory ability to achieve high spectral accuracy [49], [51], [52]. OL-Fit-FC is the object-level fusion version of Fit-FC and has improved efficiency over Fit-FC: the latest version of OL-Fit-FC is approximately 50 times faster than Fit-FC and can achieve better spectral accuracy.
OL-Fit-FC consists of two stages: regression model fitting and residual compensation. In the regression model fitting stage, OL-Fit-FC fits a local linear regression model in each segmented object to relate the observations acquired at the two phases, and then migrates the model to the auxiliary fine image to obtain the preliminary prediction (F_f) as follows:

F_f(p, B) = a · F(b, B) + b

where a and b are the regression coefficients estimated by using a guided filter [53] within the segmented object. Specifically, the coarse image at the prediction phase is employed as the filtering input and the coarse image at the base phase is employed as the guide. The two regression coefficient values vary greatly in different segmented objects and are not constrained to minimize the difference in observations acquired at different phases.
The regression model inevitably has a residual error and, to avoid spectral distortion in the final prediction, these residuals are eliminated in the subsequent residual compensation stage. The distribution of residuals takes the object as the basic unit to combine adjacent similar information, as follows:

R_F(B) = C(p, B) - [a · C(b, B) + b]

F_F(p, B) = F_f(p, B) + E{Scale↓[R_F(B)]}

where R_F is the residual of the regression model in relating the coarse images, E[•] indicates taking the expected value within the segmented object, and F_F is the final prediction of OL-Fit-FC. Fit-FC and OL-Fit-FC employ a local linear regression model to relate the auxiliary coarse images and then migrate it to the auxiliary fine image to predict the temporal change. This strategy effectively leverages the information of the auxiliary coarse images. However, the regression model fitting step can cause a loss of spatial detail, as the absolute value of the slope of the local linear regression model (a) may be less than 1. Large losses of spatial information may also lead to a decline in spectral accuracy. As shown in Fig. 2, the average value of all pixels in the red band image (B_3) is changed from 975 to 1200, and different linear models are used to simulate the regression processing in Fit-FC and OL-Fit-FC. Obviously, plenty of spatial detail is lost when the value of a approaches zero.
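For illustration, a simplified single-band sketch of the two stages follows. It fits the per-object linear model by ordinary least squares for clarity, whereas the actual method estimates a and b with a guided filter [53]; it also assumes coarse images already resampled to the fine grid, so Scale↓ is implicit.

```python
# A simplified sketch of the two OL-Fit-FC stages (our illustration, not the
# released code). F_b: fine image at the base phase; C_b, C_p: coarse images
# resampled to the fine grid; `objects`: integer segment label per pixel.
import numpy as np

def ol_fit_fc_band(F_b, C_b, C_p, objects):
    pred = np.empty_like(F_b, dtype=float)
    slope = np.empty_like(F_b, dtype=float)
    for lab in np.unique(objects):
        m = objects == lab
        # Stage 1: relate the coarse observations at the two phases within the
        # object (C_p ~ a*C_b + b), then migrate the model to the fine image.
        a, b = np.polyfit(C_b[m], C_p[m], deg=1)
        pred[m] = a * F_b[m] + b
        slope[m] = a
        # Stage 2: object-level residual compensation -- distribute the mean
        # regression residual E[R_F] back within the object.
        pred[m] += np.mean(C_p[m] - (a * C_b[m] + b))
    return pred, slope  # |slope| is reused by the adaptive weighting module
```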

C. Adaptive Weighting Module
OL-STARFM and OL-Fit-FC have advantages in preserving spatial information and predicting spectral information, respectively, and both are efficient. Incorporating the advantages of the two methods can achieve a better tradeoff among efficiency, spatial accuracy, and spectral accuracy in STF. However, the contradiction between spectral accuracy and spatial accuracy is sharp, and it is quite difficult to make both of them reach the state-of-the-art level. For most long-term tasks, such as vegetation phenology monitoring, spectral accuracy is generally more important than spatial accuracy. Therefore, the prediction of the OL-Fit-FC module is preferred, and the prediction of the OL-STARFM module is employed to supplement the lost spatial information for the prediction of OL-Fit-FC. Theoretically, two conditions need to be met for a significant loss of spatial information to occur in the OL-Fit-FC module: 1) the area contains abundant spatial information; and 2) the absolute value of the slope of the local linear regression model (a) in the area is less than 1. Thus, in areas that meet the above two conditions, the predictions of OL-Fit-FC and OL-STARFM are weighted according to the richness of spatial information and the value of the slope. In areas that do not meet the above two conditions, only the prediction from OL-Fit-FC is used to ensure high spectral accuracy.
The fine images at the base phase and the prediction phase generally contain similar spatial information [38]. Therefore, the richness of spatial information can be quantified by filtering the auxiliary fine image with an edge detector, and Robert's edge detector is adopted for its high efficiency. Pixels with values higher than the 70th percentile in the spatial feature map produced by Robert's edge detector are considered to contain spatial information. These values are normalized to a spatial information richness (SIR) index under the assumption that they follow a Gaussian distribution: the Robert's edge filtering result of the auxiliary fine image F_b (denoted RE) is standardized by its standard deviation (Stddev[•]), and the normalized values are capped at 1 if they are greater than 1. Generally, a larger value of SIR indicates the presence of more abundant spatial information around the pixel.
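A sketch of one plausible reading of this computation is given below; the 3-sigma scaling and the exact handling of the 70th-percentile threshold are our assumptions.

```python
# A sketch of the SIR computation under our reading of the text: Robert's edge
# response on the auxiliary fine image, thresholded at the 70th percentile,
# normalized under a Gaussian assumption (3-sigma scaling is our choice), and
# capped at 1.
import numpy as np

def spatial_information_richness(F_b):
    # Robert's cross operator: the two diagonal differences.
    gx = F_b[:-1, :-1] - F_b[1:, 1:]
    gy = F_b[:-1, 1:] - F_b[1:, :-1]
    RE = np.zeros(F_b.shape, dtype=float)
    RE[:-1, :-1] = np.hypot(gx, gy)
    # Only pixels above the 70th percentile are taken to contain spatial
    # information; normalize and cap at 1.
    SIR = (RE - np.percentile(RE, 70)) / (3.0 * np.std(RE))
    return np.clip(SIR, 0.0, 1.0)
```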
For a pixel with an absolute value of the slope less than 1 and an SIR value greater than a preset empirical threshold T (which is set to 0.1 in this study), the prediction of OL-STARFM is introduced to participate in the weighting. The weight of the prediction of OL-STARFM (W) is calculated from the SIR value and the slope a, and the predictions of OL-Fit-FC and OL-STARFM are then combined to produce the final prediction of OL-HSTFM (F_H) as follows:

F_H(p, B) = W · F_S(p, B) + (1 - W) · F_F(p, B)

Note that this study gives priority to ensuring high spectral accuracy in the final prediction. However, for tasks where spatial accuracy is more important than spectral accuracy, such as urban monitoring, the opposite strategy can be employed to attain a prediction with high spatial accuracy but compromised spectral accuracy: the prediction of OL-STARFM is favored, whereas, for pixels where the value of SIR is below a preset empirical threshold T, the prediction of OL-Fit-FC is introduced to participate in the weighting.
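A compact sketch of the weighting step under this description follows; the product form of W is our assumption, chosen so that the weight grows with SIR and with the loss of contrast (1 - |a|).

```python
# A sketch of the adaptive weighting module under our reading: where |a| < 1
# and SIR > T, the OL-STARFM prediction is blended in; elsewhere the OL-Fit-FC
# prediction is used alone. The product form of W is our assumption.
import numpy as np

def combine_predictions(pred_fit_fc, pred_starfm, slope, SIR, T=0.1):
    W = np.zeros_like(pred_fit_fc)
    mask = (np.abs(slope) < 1.0) & (SIR > T)           # conditions 1) and 2)
    W[mask] = SIR[mask] * (1.0 - np.abs(slope[mask]))  # assumed weight form
    return W * pred_starfm + (1.0 - W) * pred_fit_fc
```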

D. Spatiospectral Accuracy Metric
The accuracy of STF can be decomposed into a spectral domain and a spatial domain [45]. However, because the change amplitudes of spectral accuracy metrics (e.g., RMSE and AD) and spatial accuracy metrics (e.g., EDGE and LBP) are different, the overall accuracy cannot be described by simply adding or multiplying the two kinds of accuracy metrics. To quantify the change amplitudes of the two kinds of accuracy metrics, the acceptable results with overestimated temporal change information (ATF) and overestimated spatial information (ASF) are first calculated following [45], where F_b is the fine image of the base phase, and C_b and C_p are the coarse images of the base phase and the prediction phase, respectively. ATF and ASF, respectively, represent the overestimation results of the two most common fusion models (i.e., the temporal predicting-based fusion model and the spatial sharpening-based fusion model) [45]. The differences in spectral accuracy and spatial accuracy between ATF and ASF can then reflect the variation range of these two accuracy measures in most STF models. In this study, RMSE and EDGE (the calculation of EDGE is presented in Section III-B) are used to quantify spectral accuracy and spatial accuracy, respectively. The variation ranges of these two metrics are calculated as

V_spe = |RMSE(ATF) - RMSE(ASF)|, V_spa = |EDGE(ATF) - EDGE(ASF)|

where V_spe and V_spa indicate the variation ranges of spectral accuracy and spatial accuracy, respectively. Then, a consolidated accuracy index (CA) can be obtained after eliminating the difference between the change amplitudes of the two metrics as

CA(P) = α · RMSE(P) / V_spe + β · |EDGE(P)| / V_spa

where P is the prediction of STF, and α and β represent the weights of spectral accuracy and spatial accuracy, which are set to 0.7 and 0.3, respectively. The value of CA can measure the overall quality of the fused image. However, CA has different change amplitudes in different experiments, making it unsuitable for comparison between experiments and unable to tell the user how much more useful information the fused image provides than the available images (F_b and C_p). To address this issue, the CA values of F_b, C_p, ATF, and ASF are calculated, and the minimum of them is defined as the bottom line (BL) of consolidated accuracy. The decline rate of the CA of the fused image relative to BL is defined as the spatiospectral accuracy metric (SSAM):

SSAM = (BL - CA(P)) / BL

The ideal value of SSAM is 1; a larger value of SSAM indicates a more accurate fusion result. When the SSAM value is less than 0, the fused image cannot provide more useful information than the available images.
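The bookkeeping behind CA and SSAM can be summarized in a few lines; the sketch below follows the same reading of the definitions as above, with each image represented by its precomputed (RMSE, EDGE) pair.

```python
# A sketch of CA and SSAM under our reading of the definitions above. Each
# image is represented by its precomputed (rmse, edge) pair; alpha and beta
# follow the paper (0.7 and 0.3).

def consolidated_accuracy(rmse, edge, v_spe, v_spa, alpha=0.7, beta=0.3):
    # Rescale each metric by its variation range so the amplitudes match.
    return alpha * rmse / v_spe + beta * abs(edge) / v_spa

def ssam(pred, f_b, c_p, atf, asf, alpha=0.7, beta=0.3):
    v_spe = abs(atf[0] - asf[0])   # variation range of spectral accuracy
    v_spa = abs(atf[1] - asf[1])   # variation range of spatial accuracy
    ca = lambda m: consolidated_accuracy(m[0], m[1], v_spe, v_spa, alpha, beta)
    BL = min(ca(f_b), ca(c_p), ca(atf), ca(asf))   # bottom line
    return (BL - ca(pred)) / BL    # decline rate of CA relative to BL

# Dummy-number example (not measured values):
# ssam(pred=(0.02, -0.1), f_b=(0.05, -0.4), c_p=(0.06, 0.2),
#      atf=(0.03, -0.5), asf=(0.07, 0.3))  # -> about 0.38
```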

III. EXPERIMENTS

A. Study Area and Dataset
The performance of OL-HSTFM was evaluated using two distinct landscapes from the recently released STF dataset [49] (available online at https://github.com/Andy-cumt/Spatiotemporal-fusiondata). The first study site is located in the irrigation area of Belle Glade ("Belle" herein), southern Florida, USA (80°53′32.12″W, 26°40′49.38″N). The Landsat 8 OLI surface reflectance Tier 1 product and the MODIS Nadir BRDF-Adjusted Reflectance product MCD43A4.006 were adopted as the fine images (800 × 800 pixels, with a spatial resolution of 30 m) and the coarse images (with a spatial resolution resampled to 480 m), respectively. The two image pairs were obtained on January 18, 2014 (auxiliary image pair) and October 22, 2016, respectively (as shown in Fig. 3). This site is covered by regular and rectangular farmland and has undergone significant phenological change between the two time periods. The second site is located in the south of the Poyang Lake wetland ("PY" herein), Jiangxi Province, China (116°11′37.35″E, 28°57′57.39″N). The Landsat 8 OLI surface reflectance Tier 1 products and the Terra surface reflectance products MOD09A1.006 were adopted as the fine images (960 × 960 pixels, with a spatial resolution of 30 m) and the coarse images (with a spatial resolution resampled to 480 m), respectively. The two image pairs were obtained on October 24, 2014 (auxiliary image pair) and December 19, 2017, respectively (as shown in Fig. 4). The landscape of this site is quite complex and heterogeneous; it is covered by seasonal wetland and fragmented farmland and has undergone strong phenological changes and shape changes between the two time periods. The images at both sites have six bands (blue, green, red, Nir, Swir1, and Swir2).

B. Experimental Design
To verify the effectiveness of the adaptive weighting module in incorporating the advantages of OL-STARFM and OL-Fit-FC, the interim results of OL-HSTFM are first qualitatively compared and analyzed. Then, two classic STF methods (STARFM and STDFA) and eight state-of-the-art STF methods (Fit-FC, OL-Fit-FC, FSDAF, FSDAF 2.0, SFSDAF, OBSTFM, RASDF, and OL-STARFM) were involved in both qualitative and quantitative comparisons to thoroughly assess the performance of OL-HSTFM. The all-round performance assessment (APA) diagram designed by Zhu et al. [45] was adopted for intuitive quantitative evaluation. Four APA metrics were employed: AD, RMSE, EDGE, and LBP. Their ideal value is zero. Generally, predictions achieve higher spectral accuracy when the values of RMSE and AD are closer to zero, and higher spatial accuracy when the values of EDGE and LBP are closer to zero. EDGE and LBP are both calculated as the expected value of the normalized difference of spatial features (S_nd) between the prediction and the auxiliary fine image:

EDGE, LBP = E(S_nd)

where S_nd is obtained by filtering the prediction and the auxiliary fine image with an edge detector (EE) and taking the normalized difference of the two results. When calculating EDGE, Robert's edge detector is used, and the pixels with edge values higher than the 90th percentile in S_nd are used to calculate EDGE. When calculating LBP, the local binary pattern detector is used. For a more detailed explanation, please refer to [45].
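As a concrete reference, a sketch of the EDGE computation under our reading follows; the selection of "edge pixels" (here, the strongest responses in the auxiliary fine image) and the stabilizing constant are our assumptions, and the exact formulation is in [45].

```python
# A sketch of the EDGE metric as we read it: filter both images with Robert's
# edge detector, take the normalized difference of the responses, and average
# it over the strongest edge pixels.
import numpy as np

def roberts(img):
    gx = img[:-1, :-1] - img[1:, 1:]
    gy = img[:-1, 1:] - img[1:, :-1]
    return np.hypot(gx, gy)

def edge_metric(pred, fine_aux, top_percent=90):
    e_p, e_f = roberts(pred), roberts(fine_aux)
    s_nd = (e_p - e_f) / (e_p + e_f + 1e-12)  # normalized difference of spatial features
    # "Edge pixels" are taken here as those with the strongest response in the
    # auxiliary fine image (our reading of the 90th-percentile rule).
    strong = e_f >= np.percentile(e_f, top_percent)
    return float(np.mean(s_nd[strong]))       # ideal value is zero
```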
To evaluate the comprehensive accuracy of the fused images, the reference fine images and all predictions were classified using the K-means method, with the classification result of the reference fine image serving as the benchmark. The overall classification accuracy (OCA) was used to evaluate the comprehensive accuracy of the predictions and to verify the effectiveness of the proposed SSAM. In the Belle area and the PY area, the number of classes was set to 5 and 3, respectively, and the maximum number of iterations was set to 10 in each experiment. The running time of each method was recorded. All methods were run on a computer with an i7-10875H CPU (2.30 GHz), 16 GB RAM, and a GeForce RTX 2060 GPU. Except for FSDAF, which was developed on the IDL platform, all other methods were run on the MATLAB 2020b platform.
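A sketch of this evaluation step is given below (our illustration in Python; the study itself ran on MATLAB). Assigning the prediction's pixels to the cluster centers fitted on the reference sidesteps cluster-label matching; the study's exact protocol may differ.

```python
# A sketch of the OCA check: cluster the reference image with K-means, assign
# the prediction's pixels to the same cluster centers, and score the agreement
# of the two classification maps.
import numpy as np
from sklearn.cluster import KMeans

def overall_classification_accuracy(reference, prediction, k, max_iter=10):
    """reference, prediction: (H, W, bands) float arrays."""
    H, W, B = reference.shape
    km = KMeans(n_clusters=k, max_iter=max_iter, n_init=1, random_state=0)
    ref_map = km.fit_predict(reference.reshape(-1, B))
    pred_map = km.predict(prediction.reshape(-1, B))
    return float(np.mean(ref_map == pred_map))
```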

IV. RESULTS

A. Qualitative Comparison
Figs. 5 and 6 present the maps of the absolute value of the slope (|a|) in the Nir band, the maps of SIR in the Nir band, the maps of the weight of OL-STARFM (W) in the Nir band, the predictions of OL-Fit-FC, OL-STARFM, and OL-HSTFM, and the reference fine images in the two experiments. In Fig. 5, it can be seen that the SIR map accurately reflects the richness of spatial information. For instance, the SIR values of roads are notably high in the zoomed area. The prediction of OL-Fit-FC is highly similar in terms of spectral information to the reference fine image, but it loses much of the spatial texture inside the farms. The prediction of OL-STARFM preserves much richer spatial information than that of OL-Fit-FC, but there are slight deviations in the spectral information, such as the area marked by yellow dotted lines in the zoomed area. It can be found that the prediction of OL-HSTFM contains noticeably more spatial information than that of OL-Fit-FC and has fewer spectral gross errors than that of OL-STARFM. The same conclusion can be drawn from the experiment shown in Fig. 6. The prediction of OL-Fit-FC has a color more similar to the reference image than that of OL-STARFM, but it loses a significant amount of spatial detail, including fragmented farmland, rivers, and small lakes (e.g., the areas marked by the yellow dotted box lines and white dotted box lines). The prediction of OL-STARFM rigidly inherits the spatial information of the auxiliary fine image but underestimates the spectral values. It can be found that the prediction of OL-HSTFM is closest to the actual image. Consequently, the weighting module of OL-HSTFM can effectively incorporate the advantages of OL-STARFM and OL-Fit-FC.
Figs. 7 and 8 present the reference fine images and the predictions of all 11 methods in Belle and PY, respectively. In Fig. 7, the predictions of OL-HSTFM, OBSTFM, and RASDF appear closer to the reference fine image than the other predictions. The predictions of Fit-FC and OL-Fit-FC are both noticeably blurry, and their blurred areas are distinct: Fit-FC tended to blur the boundaries between farms (e.g., the area marked by yellow lines), whereas OL-Fit-FC tended to blur the texture information inside farms. The three FSDAF-like methods (FSDAF, FSDAF 2.0, and SFSDAF) tended to blur the areas that experienced strong temporal changes. STARFM, OL-STARFM, and STDFA inherited rich spatial information from the auxiliary fine image, but they underestimated the brightness values in the near-infrared band. In Fig. 8, the prediction of OL-HSTFM appears to be closest to the reference fine image. Fit-FC and OL-Fit-FC produced smooth results, in which the texture information of many fragmented farms and small lakes is lost (e.g., the areas marked by yellow lines). The loss of a large amount of spatial information also resulted in spectral deviation (e.g., the areas marked by white lines in the predictions of Fit-FC and OL-Fit-FC). FSDAF, FSDAF 2.0, SFSDAF, and RASDF accurately predicted the spectral values of the wetlands, but they noticeably underestimated the spectral values of the fragmented farmland (e.g., the area in the lower-left corner of the image). The predictions of STARFM, OL-STARFM, and STDFA contain abundant spatial information, but their spectral deviation is greater than that of the other predictions. Accordingly, OL-HSTFM demonstrated better comprehensive performance than the other ten methods.

B. Quantitative Comparison
The APA diagrams describing the spectral accuracy and spatial accuracy of the 11 methods are presented in Fig. 9. Based on their performance characteristics, the 11 methods can be categorized into three groups: 1) methods that attain high spectral accuracy but low spatial accuracy (Fit-FC and OL-Fit-FC); 2) methods that achieve low spectral accuracy but high spatial accuracy (STARFM, OL-STARFM, and STDFA); and 3) methods that reach a balance between spectral accuracy and spatial accuracy (FSDAF, FSDAF 2.0, SFSDAF, OBSTFM, RASDF, and OL-HSTFM). Generally, the methods of the first and second categories are composed of a single model, whereas those in the third category are made up of multiple models. The yellow circle in Fig. 9 marks the location of OL-HSTFM, which obtained the best spectral accuracy and a compromised spatial accuracy (much better than Fit-FC and OL-Fit-FC) in the two experiments. To visually explain the tradeoff between spectral accuracy and spatial accuracy, linear regression was conducted on the two APA metrics (RMSE and EDGE), excluding OL-HSTFM, as shown in Fig. 10. It can be found that an increase of 0.1 in EDGE requires an average cost of 0.0013 and 0.0011 in the increase of RMSE in the Belle area and the PY area, respectively. Compared to OL-Fit-FC, OL-HSTFM achieved increases of 0.1822 and 0.1806 in EDGE in the Belle area and the PY area, respectively, without sacrificing its spectral accuracy.
The classification results and SSAM values of the 11 predictions in Belle and PY are shown in Figs. 11 and 12, respectively. Note that this comparison is rough, as the classification result of the reference fine image is not necessarily accurate, and the K-means method may not be suitable for all predictions. For example, RASDF and FSDAF 2.0 employ the fuzzy C-means method to classify the auxiliary fine image before the unmixing process, and thus their OCA may be underestimated in these experiments; SFSDAF employs the K-means method to classify the auxiliary fine image before the unmixing process, and thus its OCA may be overestimated. In Figs. 11 and 12, it can be found that the methods that achieved high spectral accuracy but low spatial accuracy, such as Fit-FC and OL-Fit-FC, did not obtain high OCA, because it is difficult to determine the boundaries of ground features that belong to different categories but have similar spectral attributes in the blurred images. This suggests that the accuracy of information extraction depends on both spectral accuracy and spatial accuracy. A significant positive correlation can be seen between OCA and SSAM; for example, OL-HSTFM achieved the highest OCA and SSAM in both experiments, whereas STDFA obtained the lowest OCA and SSAM in both experiments. To validate whether the proposed SSAM can measure the overall quality of the fused image, the correlation coefficients between OCA and different metrics (RMSE, EDGE, RMSE+|EDGE|, RMSE×|EDGE|, SSAM_(1,1), and SSAM_(0.7,0.3), where the subscripts of SSAM represent the weights of spectral accuracy (α) and spatial accuracy (β)) were calculated, as shown in Table II. Evidently, OCA and SSAM_(0.7,0.3) have the strongest correlation, indicating that the proposed SSAM_(0.7,0.3) can most accurately measure the overall accuracy of the fused image. More validations are listed in the Supplementary Material. Note that the results of RASDF and FSDAF 2.0 were not involved in the calculation to minimize the impact of inappropriate classification methods.
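The validation in Table II reduces to one Pearson correlation per candidate metric, as in the following minimal sketch (ours):

```python
# A minimal sketch of the metric-validation step behind Table II: Pearson's r
# between OCA and one candidate quality metric across the compared methods.
import numpy as np

def metric_vs_oca_correlation(oca_values, metric_values):
    """Both arguments: one value per STF method, in the same order."""
    return float(np.corrcoef(oca_values, metric_values)[0, 1])
```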
The specific quantitative results of the two experiments are presented in Tables III and IV. In terms of spectral accuracy, the prediction of OL-HSTFM achieved the best spectral accuracy in both experiments; more overall spectral quality assessments are listed in the Supplementary Material. In terms of spatial accuracy, OL-HSTFM outperformed Fit-FC, OL-Fit-FC, and OBSTFM. In terms of overall accuracy, OL-HSTFM outperformed the other ten methods. In terms of efficiency, Fig. 13 intuitively presents the computing times of each STF method in the two experiments: OL-HSTFM is less efficient than OL-STARFM and OL-Fit-FC, but it was 7.09-48.47 times faster than the remaining eight methods. Fig. 14 presents the all-round performance of the 11 methods using radar charts. To put the three aspects (spectral accuracy, spatial accuracy, and efficiency), which have vastly different change amplitudes, into a unified assessment framework, the quantitative indicators are simplified: spectral accuracy is roughly measured by subtracting the average rank of AD and RMSE from the number of methods (i.e., 11), spatial accuracy is roughly measured by subtracting the average rank of EDGE and LBP from the number of methods, and efficiency is roughly measured by subtracting the rank of time consumption from the number of methods. It appears that OL-HSTFM has the largest triangle in both experiments.
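The rank-based simplification can be sketched as follows; treating |AD| as the ranked quantity for AD is our assumption.

```python
# A sketch of the rank-based simplification behind the radar charts in Fig. 14:
# each aspect's score is (number of methods - rank), averaged over the metrics
# belonging to that aspect, so larger scores (and triangles) are better.
import numpy as np
from scipy.stats import rankdata

def aspect_scores(metric_columns):
    """metric_columns: list of 1-D arrays, one metric per array (one value per
    method), where values closer to zero are better (e.g., |AD| and RMSE)."""
    ranks = np.mean([rankdata(np.abs(col)) for col in metric_columns], axis=0)
    return len(metric_columns[0]) - ranks
```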
V. DISCUSSION AND CONCLUSION

The STF technique has undergone rapid development over the past two decades and has promoted a variety of applications. The spectral accuracy, spatial accuracy, and efficiency of STF are all critical in real-world applications. However, most existing STF methods concentrate only on enhancing spectral accuracy, neglecting the problems of spatial information loss and low efficiency. Furthermore, improving spectral accuracy, spatial accuracy, and efficiency in STF can be contradictory, and current STF methods cannot balance them well: methods based on a single principle, such as STARFM and STDFA, typically can obtain high accuracy in only one domain. Hybrid methods, such as FSDAF and OBSTFM, incorporate the benefits of multiple models, but they tend to settle for a compromise between spectral and spatial accuracy and lack a comprehensive analysis of the pros and cons of the different models; the incorporation of multiple complex models can also lead to heavy computation. In addition, there is no single metric that can accurately assess the comprehensive quality of the fused image, making it difficult for users to choose an appropriate method for their application.
To address the aforementioned issues, this study conducts a thorough analysis of the reasons for the distinct tendencies of two promising STF methods (OL-STARFM and OL-Fit-FC) in terms of spectral accuracy and spatial accuracy. Then, a new object-level hybrid STF method (OL-HSTFM) is developed, which is based on the multimodel strategy to incorporate the spectral accuracy advantage of OL-Fit-FC and the spatial accuracy advantage of OL-STARFM. The performance of OL-HSTFM was compared with two classic STF methods (STARFM and STDFA) and eight state-of-the-art STF methods (Fit-FC, OL-Fit-FC, FSDAF, FSDAF 2.0, SFSDAF, OBSTFM, RASDF, and OL-STARFM) at two sites. Both qualitative and quantitative verifications showed that OL-HSTFM can achieve the best overall performance compared with the other ten methods and has efficiency comparable to that of OL-Fit-FC. These results indicate that OL-HSTFM can reach a better tradeoff among spectral accuracy, spatial accuracy, and efficiency than other STF methods. Furthermore, this study proposes a new metric, SSAM, to assess the accuracy of both spatial and spectral domains in STF. The experiments showed that SSAM is more comprehensive and intuitive in assessing the quality of the fused image compared to existing metrics, and it has the potential to promote cross-comparisons of various STF methods and assist users in selecting the optimal method for their particular application.
Despite the favorable results obtained by OL-HSTFM and the thorough comparison and analysis carried out in this study, there are still some limitations. For example, while the adaptive weighting module of OL-HSTFM can alleviate the conflict between spatial accuracy and spectral accuracy, achieving high accuracy in both domains remains challenging. In addition, while this study provides a rough comparison (see Fig. 14) to evaluate the overall performance in terms of spectral accuracy, spatial accuracy, and efficiency, a more comprehensive and rigorous assessment framework is needed. Further exploration is required in the future.
OL-HSTFM may be the first STF method that takes into account spectral accuracy, spatial accuracy, and efficiency simultaneously.Given OL-HSTFM's noteworthy applicability in various STF tasks, the program of OL-HSTFM is openly available at https://github.com/Andy-cumt/Object-level-spatiotemporalfusion-models.

Fig. 2. Schematic diagrams of using different linear models to make the average value of the single red band image (B_3) change from 975 to 1200.

Fig. 5. Fusion results of the Belle dataset. (a) Map of the absolute value of the slope (|a|) in the regression model in the Nir band. (b) Map of the result of Robert's edge filtering applied to the auxiliary fine image (SIR) in the Nir band. (c) Map of the weight of OL-STARFM (W) in the Nir band. (d) Prediction of OL-Fit-FC. (e) Prediction of OL-STARFM. (f) Prediction of OL-HSTFM. (g) Reference fine image.

Fig. 6. Fusion results of the PY dataset. (a) Map of the absolute value of the slope (|a|) in the regression model in the Nir band. (b) Map of the result of Robert's edge filtering applied to the auxiliary fine image (SIR) in the Nir band. (c) Map of the weight of OL-STARFM (W) in the Nir band. (d) Prediction of OL-Fit-FC. (e) Prediction of OL-STARFM. (f) Prediction of OL-HSTFM. (g) Reference fine image.

Fig. 9. All-round performance assessment diagrams displaying the accuracy of the 11 STF methods for fusing images of (a) Belle and (b) PY.

Fig. 13. Computing time of each STF method in the two experiments.

Fig. 14. Radar charts showing the ranks of spectral accuracy, spatial accuracy, and efficiency of the 11 STF methods in the two experiments.
Manuscript received 8 March 2023; revised 20 May 2023; accepted 19 August 2023. Date of publication 30 August 2023; date of current version 12 September 2023. This work was supported in part by the Otto Poon Charitable Foundation Smart Cities Research Institute, Hong Kong Polytechnic University (Work Program: CD03), and in part by the Urban Informatics for Smart Cities,

TABLE I. LIST OF IMPORTANT NOTATIONS AND DEFINITIONS

TABLE II. CORRELATION COEFFICIENT BETWEEN OVERALL CLASSIFICATION ACCURACY AND DIFFERENT METRICS

TABLE III. SPECTRAL ACCURACY, SPATIAL ACCURACY, OVERALL ACCURACY, EFFICIENCY, AND THEIR RANKS OF ALL 11 PREDICTIONS IN BELLE

TABLE IV. SPECTRAL ACCURACY, SPATIAL ACCURACY, OVERALL ACCURACY, EFFICIENCY, AND THEIR RANKS OF ALL 11 PREDICTIONS IN PY