Accuracy Evaluation of the Landsat 9 Land Surface Temperature Product

Having a good knowledge of the uncertainty in the land surface temperature (LST) product will help to encourage its use in a wide number of applications, including urban heat islands, geothermal detection, and surface energy balance. Landsat 9 was launched on 27 September 2021 and provides an LST product, which is generated by the radiative transfer equation algorithm and has a spatial resolution of 30 m. In this article, we evaluated the performance of the Landsat 9 LST product by using a temperature-based (T-based) method and cross-validation. The T-based validation results showed that the average bias at the surface radiation budget network and baseline surface radiation network sites was 0.24 K and that the corresponding root mean square error (RMSE) was 3.42 K. The Landsat 9 LST product was in good agreement with the Landsat 7/8 LSTs, with an average bias of 0.25/0.08 K, an RMSE of 0.51/1.04 K, and a mean absolute error of 0.38/0.64 K. The comparable performance of the Landsat 7/8/9 LST products can be explained by the consistent LST retrieval algorithm. The absolute differences in the LST between Landsat 9 LST and MOD11 (MOD21) LST images were between 0.01 (0.65) and 2.50 K (1.76 K), whereas the RMSE values were between 1.40 (1.80) and 3.65 K (3.26 K). The specific heat capacity and thermal inertia of the different land surface covers can explain the significant biases. The above evaluation results are consistent with the initial performance testing of thermal infrared sensor-2 (TIRS-2) by the National Aeronautics and Space Administration and the U.S. Geological Survey. Although the released Landsat 9 LST product showed good performance in the preliminary evaluation, the split-window algorithm may be a better option for Landsat 9 LST retrieval, as the TIRS-2 data addressed stray light incursion. Since there are no official validation results that have been published, this article provides a third-party performance evaluation of the Landsat 9 LST product and will benefit research fields that require Landsat series LST products.


I. INTRODUCTION
L AND surface temperature (LST) is one of the key parameters in land-surface physical processes on the regional and global scales, integrating the interactions between the surface and atmosphere and all of the energy exchanges between the atmosphere and the land [1], [2], [3], [4], [5], [6], [7]. It has been widely applied to hydrology, meteorology, and the surface energy balance [8], [9], [10], [11]. LST and emissivity have been identified as a critical Earth system data record and essential climate variable by the National Aeronautics and Space Administration (NASA) and many other international organizations [12], [13], [14].
Landsat series satellites provide long-term time-series thermal infrared (TIR) observations and are an invaluable data source for obtaining LST records [15], [16]. The successful launch of Landsat 9 on 27 September 2021 enables continuous records of the Earth to be collected for the next 5 years. The design of Landsat 9 is similar to that of Landsat 8 and has two instruments, the operational land imager 2 (OLI-2) and the thermal infrared sensor 2 (TIRS-2), which can observe the Earth's surface at an altitude of 705 km. The TIRS-2 continuously measures the TIR radiation of the Earth's surface in two TIR bands at a spatial resolution of 100 m without the stray light issues that plagued Landsat 8 [17]. On-orbit lunar scans indicated that the new design changes to TIRS-2 have reduced the total scattering to 1% or less [18]. The published Landsat 9 LST product is generated using the radiative transfer equation (RTE) algorithm at a spatial resolution of 30 m, which has significant potential for exploring urban heat islands, geothermal detection, and volcanology studies [19].
Having a good knowledge of product uncertainty can help users make better use of the LST product [14]. Currently, the temperature-based (T-based) method, radiance-based (R-based) method, and cross-validation [20], [21], [22] have been widely employed to validate the released LST products. With the validation results, the project team can improve the accuracy of the LST products by correcting the deficiencies that are raised in the validation. For example, validations have indicated that the Collection collection 5 MXD11 LST products have large negative biases over barren surfaces [23]. As a result, the released Collection 6 MXD11 LST products have been improved by refining the production algorithm over barren surfaces [21]. As official validation results have not been published, evaluating the accuracy of the Landsat 9 LST product will help to encourage This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ its use to explore urban heat islands, geothermal detection, and volcanology research.
This article aims to evaluate the Landsat 9 LST product through the use of a T-based method and cross-validation. This article is organized as follows: Section II introduces the used Landsat and Moderate Resolution Imaging Spectroradiometer (MODIS) products and ground measurements. Sections III and IV provide the evaluation results and discussion. Section V discusses the conclusions of this article.

A. Landsat LST Retrieval Algorithm
To maintain LST algorithmic continuity across Landsat series TIR records, Landsat LST products were estimated using the RTE algorithm proposed by Malakar et al. [24] where L i is the at-sensor radiance of channel i, T s is the LST, B i (T s ) is the blackbody radiance of channel i, ε i is the land surface emissivity (LSE) of channel i, L ↑ i , L ↓ i , and τ i are the atmospheric upward radiance, downward radiance, and transmittance of channel i, respectively.
LST can be calculated using (1) by the providing LSE and three atmospheric parameters (L ↑ i , L ↓ i , and τ i ). The atmospheric profiles of geopotential height, specific humidity, and air temperature extracted from two reanalysis products are imported into the radiative transfer code MODTRAN 5.2 to calculate three atmospheric parameters. Atmospheric correction was performed using the Goddard Earth Observing System Model Version 5 Forward Processing Instrument Teams for Landsat TIR data acquired in 2000 and later, while Modern Era Retrospective analysis for Research and Applications Version 2 was used for those data acquired from 1982 to 1999 [17].
For LSE estimation, the Advanced Spaceborne Thermal Emission Radiometer (ASTER) Global Emissivity Database (GED) was adopted as the cornerstone of surface emissivity and was modified to account for spectral discrepancies, vegetation phenology, and snow cover [24] ε where ε Landsat is the retrieved Landsat 9 LSE, ε Aster is the ASTER GED LSE that has been spectrally adjusted for the Landsat 9 TIR band, ε bare is the bare soil component emissivity, ε veg is the vegetation component emissivity, and f v,Landsat or f v,Aster is the fractional vegetation cover that corresponds to the normalized difference vegetation index of Landsat or ASTER.

B. Landsat 9 Collection 2 Level-2 Product
To analyze the performance of the Landsat 9 LST product, Landsat 9 Collection 2 Level-2 surface temperature products were downloaded from the Earth Explorer website. A total of 298 images were used in this article, including 63 images that were used for validation using the T-based method, 189 that were images used for cross-validation based on the Landsat 8 LST, 30 images that were used for cross-validation based on the Landsat 7 LST, and 16 images that were used for crossvalidation based on the MODIS LST. The distribution of the Landsat 7/8/9 images used for cross-validation and in-situ sites used for T-based validation is shown in Fig. 1.

C. MODIS LST Product
Terra MODIS LST products (MOD11_L2 and MOD21) Version 6.1 are used in this article. MOD11_L2 and MOD21 products are retrieved from the generalized split-window algorithm [25] and temperature emissivity separation (TES) algorithm [26], respectively. Both products were downloaded from the website (https://earthdata.nasa.gov/) and were reprojected from the sinusoidal projection to the Universal Transverse Mercator projection at a 1000-m spatial resolution.

D. LST Validation Sites
Ground measurements from the surface radiation budget network (SURFRAD) and baseline surface radiation network (BSRN) sites, which have been widely used to validate LST products [27], [28], [29], were used to validate the accuracy of the Landsat 9 LSTs. The detailed ground measurement information is shown in Table I. The in situ LSTs from the site-measured surface upward longwave radiance and downward longwave radiance observations were calculated as where T s is the ground LST, F ↑ is the measured surface longwave upward radiance, F ↓ is the measured surface longwave downward radiation, σ is the Stefan-Boltzmann constant (5.67×10 −8 W/m 2 /K 4 ), and ε b is the broadband emissivity (BBE), which is calculated from the five ASTER narrowband emissivity using [30]: where ε 10 -ε 14 are the ASTER narrowband emissivity of five channels.
The uncertainty of the ground LST comes from two parts: one is the uncertainty of the measured longwave upward/downward radiance, and the other is the uncertainty of the assumed BBE. The accuracy (precision) of the Eppleypyrgeometer is about 4.2 (2.0) W/m 2 for daytime measurements, which gives rise to LST uncertainty of less than 1 K [31]. Considering that the measurement error in the CNR1/CNR4 net radiometers is −8 W/m 2 [32], the LST uncertainty was determined to be 1.62 K in the daytime [6]. However, there are various tower-based instrument pyranometers installed at different BSRN sites, and the LST uncertainty is still unknown and needs to be resolved. Following the method used by Wang et al. [33], we calculated the sensitivity of the LST to the BBE at each site. First, the multiyear-average surface longwave upward/downward radiation at each site was calculated and is shown in Table II. The multiyear-average surface longwave upward/downward radiation at most sites was calculated using the surface longwave upward/downward radiance for the period of 2017-2021, with the exception of the BUD site, which used data collected in 2019-2021. Then, assuming that the BBE varies between 0.92 and 0.99 with an interval of 0.01, the LST can be calculated using (5). As shown in Fig. 2, the greatest sensitivity is 0.37 K/0.01 BBE for the IZA site, followed by the DRA, TBL, GOB, BND, and FPK site, with a sensitivity greater than 0.20 K/0.01 BBE. The lowest sensitivity is 0.13 K/0.01 BBE for the CAB site. The result shows that the LST sensitivity derived by the BBE depends on the contrast in the longwave upward radiance and longwave downward radiance, which also confirms Wang's inference that the greater the contrast, the larger the sensitivity [33].

A. T-Based Validation Results
Since Landsat 9 was first launched on 27 September 2021 and has a 16-day revisit cycle, only part of the Landsat 9 LST product can be matched with ground measurements. In total, 63 matched images were obtained at the SURFRAD and BSRN sites, with 35 (28) images being obtained at the SURFRAD (BSRN) sites. The Landsat 9 LST validation results are shown in Fig. 3. Compared to the ground measurements at the SURFRAD sites, the Landsat LST retrieved from the RTE algorithm is overestimated by about 0.27 K but has a root mean square error (RMSE) of 4.70 K. The average difference (bias) between the Landsat LST and ground-measured LST is −0.86 K at the BSRN sites, while the corresponding RMSE is 3.70 K. As shown in Fig. 3, two outlier points in the red box have a bias larger than 10 K. After filtering the outliers with the "3σ-Hampel identifier" [34], [35], the bias (RMSE) at the SURFRAD sites is 1.18 K (3.16 K).
To analyze the possible reason for the outliers, false color composite images (bands 5, 4, and 3) and the corresponding Level 2 Pixel Quality Assessment band are shown in Fig. 4. As shown in Fig. 4, the larger bias is due to the effect of the cirrus. Although the QA value on 11/04/2022 indicates the pixel where the TBL site is located is clear, we can see the suspected cirrus on the false color composite image. Combined with the validation results, we can infer that the large bias is due to the uncertainty in cloud detection. As shown in Fig. 3, the overestimated LST are mainly located at the TBL site. As discussed in [14] and [24], the influence of local orography may affect the inconsistency between the measured LST and the LST estimated at the satellite pixel scale. After excluding the matched data at the TBL site, the bias (RMSE) is 0.14 K (1.52 K) for the other SURFRAD sites. As for the validation results at the BSRN sites, the larger biases are  mainly located at the GOB site and the IZA site. For the IZA site, the larger biases may be explained by the spatial heterogeneity, which is more significant than that at other BSRN sites [19]. For the GOB site, previous article also found the Landsat LST estimated by various algorithms was overestimated about 2-3 K [28], [29]. As discussed in Ermida et al. [28], the upward and downward radiance sensors at the GOB site were deployed in different locations, with a distance of approximately 7000 m. The collocation of site measurements and satellite LST images may introduce additional uncertainty into the evaluation result. When extracting the Landsat 9 LST values at the location of the upward radiance sensor for evaluation, the bias and RMSE were equal to 0.98 and 2.35 K, but when using the location information provided on the PANGAEA website, the bias and RMSE were equal to 1.16 and 2.82 K.

B. Cross-Validation Results Based on Landsat 8 LST
To evaluate the accuracy of the Landsat 9 LST product, the Landsat 8 LST was used for cross-validation. From November 12-16, 2021, Landsat 9 flew in the "underfly" mode in position with Landsat 8 to allow both to collect data simultaneously. Although the observation areas of the pair of Earth-observing satellites did not match exactly, substantial overlapping observation areas are available for comparison. The performance of the Landsat 9 LST product was assessed through four error metrics: bias, RMSE, mean absolute error (MAE), and correlation coefficient (R). As shown in Fig. 5, the estimated Landsat 9 LST is similar to that of the Landsat 8 LST. The percentage of absolute bias less than 0.5 K is approximately 95.8%, with an average bias of 0.08 K. The MAEs (RMSEs) between the Landsat 9 and Landsat 8 LST images are less than 1.50 K (3.00 K) for most of the images, with a ratio of about 90.5% (93.7%) and an average value of 0.64 K (1.04 K). The small difference that exists in the LST images can be explained by the following two reasons: first, according to the statistics on satellite transit time, the time delay between the image acquisition of Landsat 9 and Landsat 8 is typically between 0 and 3 min. Second, the Landsat 8/9 LSTs were retrieved using a consistent LST retrieval algorithm combined with the same atmospheric compensation and LSE correction methods. Moreover, the evaluation result is consistent with the initial performance testing of TIRS-2 by NASA and the U.S. geological survey that OLI-2 and TIRS-2 are of excellent quality and expected to match or improve on Landsat 8 data quality.
To analyze the possible reasons for large RMSEs (>3.0 K) in the evaluation results, the spatial distribution of the Landsat 8/9 LST images is shown in Fig. 6. By carefully analyzing those images, we found that 1) the distribution of the Landsat 9 LST is similar to that of the Landsat 8 LST, and the LST values are basically the same in most areas; and 2) the red color and green color in the LST difference images are due to the undetected cloud and cloud shadow. When the Landsat 9 (Landsat 8) LST pixels are clear (affected by clouds), there is red in the LST difference images; otherwise, green appears in the LST difference images. Due to the time delay between Landsat 9 and Landsat 8 image acquisitions, the positions of clouds and cloud shadows do not coincide identically, so red and green appear in pairs in the LST difference images.
The results of the above evaluation show that the newly released Landsat 9 LST product is in good agreement with the Landsat 8 LST product over different land surface types and that it has an average bias of less than 0.10 K.

C. Cross-Validation Results Based on Landsat 7 LST
After on-orbit testing, Landsat 9 moved to the orbit of Landsat 7; partnered with Landsat 8, Landsat 9 measured global land and near-shore information and operated on an 8-day cycle. Therefore, the Landsat 9 LST can be evaluated using the Landsat 7 LST acquired on the same day. Although the orbit of the Landsat 9 satellite is the same as that of the Landsat 7 satellite, the time delay between the image acquisitions of Landsat 9 and Landsat 7 is more than 1 h. Compared to other surface types, such as trees and soil, water is more homogeneous and is less dependent on surface heating/cooling, so only images mostly comprising water-covered areas are retained for cross-validation. A total of 30 Landsat 9 LST images were used for cross-validation based on the Landsat 7 LST. The evaluation results of the Landsat 9 LST images are shown in Fig. 7.
As indicated in Fig. 7, the average differences between the Landsat 9 LST and Landsat 7 LST are positive in most images, with the absolute LST differences ranging from 0.06 to 0.72 K. The RMSEs and MAEs of the LST differences between Landsat 9 and Landsat 7 are between 0.26 and 1.23 K and 0.20 and 1.01 K, respectively. Since Landsat 9 images are taken approximately 1 h later than Landsat 7 images, water warming can explain the positive bias. To analyze the possible reasons why large RMSEs (>0.8 K) or small Rs (<0.5) exist in the evaluation results, the spatial distributions of the Landsat 7/9 LST images are shown in Fig. 8. Those evaluation results can be explained as follows: 1) By comparing the visible-near infrared images, we can infer those negative biases are mainly affected by the undetected cirrus [e.g., Fig. 8(a), (f), (g), and (h)]. 2) As the time delay between the image acquisitions of Landsat 9 and Landsat 7 is more than 1 h, the positive biases are mainly due to the top layer being heated by the sun. Moreover, the near-shore positive biases may be affected by thermal pollution, which can modify the thermal distribution of the water [e.g., Fig. 8(a), (c), (e), and (f)].
3) The low correlation is mainly due to the fact that the LST values are too concentrated and with a small variation range [e.g., Fig. 8(e) and (f), <2.0 K]. In addition, the outliers existing around the cloud and the land surface will also reduce the correlation. The above results show that the newly released Landsat 9 LST product has good consistency with the Landsat 7 LST product on images of water surfaces, with an average bias of less than 0.25 K despite a time delay of 1 h or more.

D. Cross-Validation Results Based on MODIS LST
The first Landsat 9 mosaic image used for cross-validation was acquired on the 30th of January 2022; is centered at around 35°N, 116.5°E; and is mainly covered by the cropland land surface type mixed with bare land, impervious surfaces, and water bodies. The second mosaic image was acquired on the 8th   differences in the specific heat capacity and thermal inertia of bare land, grassland, and water bodies, and the LST of bare land increased more rapidly in the daytime compared to the grassland and water body land types. Therefore, the absolute bias in the second image is larger than that of the first image.
As shown in Fig. 12, the bias (RMSE) between the Landsat 9 LST and MOD11 LST is 2.50 (3.65) and 1.10 K (1.91 K), respectively, for the third and fourth images. For the evaluation results based on the MOD21 LST, the bias (RMSE) is 1.76 (3.26) and 1.08 K (2.03 K), respectively. Although the LST values of the MOD21 product over barren surfaces in winter are higher than those of the MOD11 product, the performance of the two products is comparable in summer. This phenomenon can be attributed to the performance of the two LST products in different regions. Although the cold bias existing in the MOD11/MYD11 product has been corrected in the MOD21/MYD21 LST retrieved from the TES algorithm [36], the LST difference between the MYD11 and MYD21 products varied across land cover types [37].

A. Improvement of Landsat 9 Instruments
First, the OLI-2 and TIRS-2 aboard Landsat 9 are recorded at 14-bits compared to the 12-bits used for the Landsat 8 OLI and TIRS data. The radiometric resolution of an imaging system describes its ability to discriminate between the very slight energy differences. The finer the radiometric resolution of a sensor, the more sensitive it is to detecting small differences in reflected or emitted energy [38]. Target detection and classification capabilities may be increased due to the improved radiometric resolution of Landsat 9 data. Second, the TIRS-2 has addressed known issues, including stray light incursion and a malfunction in the scene select mirror [17]. Fig. 13 shows brightness temperature (BT) images from Landsat 8/9 bands 10 and 11. As displayed in Fig. 13(a) (the red rectangle box) and Fig. 13(c) (the red ellipse box), the overlap region at the boundary between the two sensor chip assemblies (SCAs) is visible in Landsat 8 TIRS bands 10 and 11. This observed boundary in the Landsat 8 images is due to the stray light, which changes the calibration of two adjacent SCAs, making the overlapping area visible. As for Landsat 9, the boundary is invisible, which indicates that stray light has little effect on Landsat 9 TIRS-2 data.
To clarify the effect of stray light on the Landsat 8 and Landsat 9 BTs more clearly, horizontal profiles of the BTs for Landsat 8/9 bands 10 and 11 were randomly selected from near the middle of the image for pixel-by-pixel analysis. As shown in Fig. 14, the  difference in the BT between Landsat 8 and Landsat 9 band 10 (band 11) is between −1.01 K (−0.67 K) and 0.25 K (1.15 K). Significant BT differences between Landsat 8 and Landsat 9 are located around the boundary between the two SCAs. From the above analysis, Landsat 9 TIRS-2 data is not affected by stray light and will improve the usability of split-window thermal data.

B. Shortcoming of Landsat 9 LST Product and This Article
According to the evaluation results above, the Landsat 9 LST product has good consistency with the Landsat 7/8 LST products, but some of the phenomena in the Landsat 9 LST product need to be strengthened. First, the blank banding was still very noticeable in the Landsat 9 LST images [ Fig. 15(b)]. This is due to the effect of the invalid pixels in the ASTER GED dataset that were used in the RTE algorithm. This phenomenon can be eliminated by using the gap-filling approach [39] or another algorithm to estimate the LSE [40]. Second, as mentioned in Fig. 3, clouds and cloud shadows introduce uncertainty into the evaluation results of the LST products, which can be reduced by adopting high-precision cloud detection and cloud shadow detection methods [41], [42].
Although the Landsat 9 LST product was evaluated with a T-based method and cross-validation, the evaluation is insufficient due to the lack of site-matched Landsat 9 LST products. We will try to collect more Landsat 9 LST images and in situ measurements to continue the evaluation in future article. Furthermore, without the need for a high-precision atmospheric profile, the split-window algorithm performs well under global conditions and is theoretically superior to the single-channel algorithm [43]. The research of Ye et al. [44] also indicated the performance of Landsat 9 LST retrieved from the split-window algorithm outperformed Landsat 9 LST product. We will also try to develop the split-window algorithm to improve the accuracy of Landsat 9 LST retrieval in the next stage of this article.

V. CONCLUSION
This article analyzes the newly released Landsat 9 LST product via T-based validation using ground measurements collected from SURFRAD and BSRN sites and cross-validation with Landsat 7/8 and MODIS LST products. The major findings can be summarized as follows: 1) The T-based validation results indicate that the Landsat 9 LST product overestimates the LSTs at the SURFRAD sites, with an average bias of 1.18 K and an RMSE of 3.16 K. In contrast, the Landsat 9 LST product underestimates the LSTs at the BSRN sites, with a bias of −0.86 K and an RMSE of 3.70 K. As discussed in previous article, the TBL, GOB, and IZA sites may not be suitable for validating LSTs with a 30-m spatial resolution, and in most cases, the biases are larger than 3.0 K. 2) When Landsat 9 flew in "underfly" mode in the same position as Landsat 8, the Landsat 9 LST product exhibited a strong correlation and good agreement with the Landsat 8 LST product, with an average correlation value of 0.953 and an average bias of 0.08 K. The slight time delay between image acquisitions and the consistent LST retrieval algorithm combined with the same atmospheric compensation and LSE correction methods can explain the good agreement between the Landsat 9 and Landsat 8 LSTs. When Landsat 9 moved to the same orbit as Landsat 7, the Landsat 9 LST product also showed good agreement with the Landsat 7 LST product over water surfaces, with RMSEs and MAEs of less than 1.23 and 1.01 K, respectively, despite the time delay being 1 h or more. 3) As for the cross-validation results using the MOD11 and MOD21 LST products, the biases between the two Landsat 9 LST images and the corresponding MOD11 (MOD21) LST images were 0.009 (−0.652), −0. . This article mainly focuses on evaluating the performance of the Landsat 9 LST product derived from the RTE algorithm. If the long-term time-series Landsat 9 LST product becomes available, we will conduct a comprehensive evaluation of the Landsat 9 LST product. Moreover, as described in Section IV, stray light has little effect on Landsat 9 TIRS-2 data, and the split-window algorithm may be superior to the single-channel algorithm.

ACKNOWLEDGMENT
The authors wish to thank the Landsat 9 Project Teams at NASA and USGS. We also thank the USGS EROS Landsat ground system team for producing Landsat Collection 2 data products. The Landsat and MODIS products are downloaded from https://earthexplorer.usgs.gov/ and https:// search.earthdata.nasa.gov/. in situ measurements can be downloaded from https://gml.noaa.gov/grad/surfrad/sitepage.html/ and https://www.pangaea.de/.