Image-to-Image Training for Spatially Seamless Air Temperature Estimation With Satellite Images and Station Data

Air temperature at approximately 2 m above the ground (<inline-formula><tex-math notation="LaTeX">$T_{a}$</tex-math></inline-formula>) is one of the most important environmental and biophysical parameters to study various earth surface processes. <inline-formula><tex-math notation="LaTeX">$T_{a}$</tex-math></inline-formula> measured from meteorological stations is inadequate to study its spatio-temporal patterns since the stations are unevenly and sparsely distributed. Satellite-derived land surface temperature (LST) provides global coverage, and is generally utilized to estimate <inline-formula><tex-math notation="LaTeX">$T_{a}$</tex-math></inline-formula> due to the close relationship between LST and <inline-formula><tex-math notation="LaTeX">$T_{a}$</tex-math></inline-formula>. However, LST products are sensitive to cloud contamination, resulting in missing values in LST and leading to the estimated <inline-formula><tex-math notation="LaTeX">$T_{a}$</tex-math></inline-formula> being spatially incomplete. To solve the missing data problem, we propose a deep learning method to estimate spatially seamless <inline-formula><tex-math notation="LaTeX">$T_{a}$</tex-math></inline-formula> from LST that contains missing values. Experimental results on 5-year data of mainland China illustrate that the image-to-image training strategy alleviates the missing data problem and fills the gaps in LST implicitly. Plus, the strong linear relationships between observed daily mean <inline-formula><tex-math notation="LaTeX">$T_{a}$</tex-math></inline-formula> (<inline-formula><tex-math notation="LaTeX">$T_{\rm{mean}}$</tex-math></inline-formula>), daily minimum <inline-formula><tex-math notation="LaTeX">$T_{a}$</tex-math></inline-formula> (<inline-formula><tex-math notation="LaTeX">$T_{\min}$</tex-math></inline-formula>), and daily maximum <inline-formula><tex-math notation="LaTeX">$T_{a}$</tex-math></inline-formula> (<inline-formula><tex-math notation="LaTeX">$T_{\max}$</tex-math></inline-formula>) make the estimation of <inline-formula><tex-math notation="LaTeX">$T_{\rm{mean}}$</tex-math></inline-formula>, <inline-formula><tex-math notation="LaTeX">$T_{\min}$</tex-math></inline-formula>, and <inline-formula><tex-math notation="LaTeX">$T_{\max}$</tex-math></inline-formula> simultaneously possible. For mainland China, the proposed method achieves results with <inline-formula><tex-math notation="LaTeX">$R^{2}$</tex-math></inline-formula> of 0.962, 0.953, 0.944, mean absolute error (MAE) of 1.793 <inline-formula><tex-math notation="LaTeX">$^{\circ }$</tex-math></inline-formula>C, 2.143 <inline-formula><tex-math notation="LaTeX">$^{\circ }$</tex-math></inline-formula>C, and 2.125 <inline-formula><tex-math notation="LaTeX">$^{\circ }$</tex-math></inline-formula>C, and root-mean-square error (RMSE) of 2.376 <inline-formula><tex-math notation="LaTeX">$^{\circ }$</tex-math></inline-formula>C, 2.808 <inline-formula><tex-math notation="LaTeX">$^{\circ }$</tex-math></inline-formula>C, and 2.823 <inline-formula><tex-math notation="LaTeX">$^{\circ }$</tex-math></inline-formula>C for <inline-formula><tex-math notation="LaTeX">$T_{\rm{\rm{mean}}}$</tex-math></inline-formula>, <inline-formula><tex-math notation="LaTeX">$T_{\min}$</tex-math></inline-formula>, and <inline-formula><tex-math notation="LaTeX">$T_{\max}$</tex-math></inline-formula>, respectively. Our study provides a new paradigm for estimating spatially seamless ground-level parameters from satellite products. Code and more results are available at <uri>https://github.com/cvvsu/LSTa</uri>.

important parameters for a wide range of research fields such as climatology [1], [2], hydrology [3], human health [4], and climate change [5]. The patterns of spatio-temporal variations of T a are critical to understand several near-surface environmental and biotic processes as most terrestrial life lives within the near-surface of the earth [1].
The collected T a from the meteorological stations is sparsely and unevenly distributed in the spatial dimension, leading to inadequate or biased comprehension of the spatio-temporal patterns of T a that are strongly determined by the surface properties varying in both space and time [6], [7].
Spatially seamless T a can be generated from weather station data by leveraging spatial interpolation methods such as Kriging and inverse distance weighting interpolations, while the interpolation accuracy may not be satisfying if the density of the weather stations is low or the topography is complex [8].
Microwave remote sensing and climate reanalysis datasets can also be applied to estimate spatially seamless T a since they are insensitive to clouds. Nevertheless, the spatial resolutions of these datasets are typically rather coarse [19].
Daily mean T a (T mean ), daily minimum T a (T min ), and daily maximum T a (T max ) are three attributes of T a . Previous studies usually estimate T mean , T min , and T max separately, which may omit the relationships between T mean , T min , and T max . Additionally, estimating T mean , T min , and T max separately makes the estimation process complex. In practice, estimating T mean , T min , and T max one at a time will make data preparation laborious and error-prone, especially when different weather conditions are also considered [9], [18].
Since clouds cover a large area of the earth's surface [19], [20], a method that can directly use LST products that contain missing values to predict T a is necessary. Inspired by image completion [21] and the image-to-image translation studies [22], we adopt an image-to-image training strategy with the UNet 1) We find that T mean , T min , and T max have strong linear relationships, illustrating the potential of estimating them simultaneously.
2) The image-to-image training strategy can grasp the spatial relationships in LST images and T a images, which fills the gaps in LST images implicitly and obtains spatially seamless T a . 3) Our study provides a new prototype to estimate spatially seamless ground-level parameters from satellite products, which is simple and straightforward. Experimental results validate that the MODIS LST products can also be used to estimate near-surface relative humidity (RH).

A. Study Area and Observed T a
Mainland China is selected as the study area due to its vast and versatile land area. China covers approximately 9.6 million km 2 , between 3 • N to 54 • N latitudes, and 73 • E to 136 • E longitudes (see Fig. 1). China has a variety of climate types, such as tropical, subtropical, warm-temperate, temperate, cold-temperate, and highland climates. The elevation in China, ranging from -154 to 8848 m, is generally high in the west and low in the east.
In addition, China also has different land cover types, such as deserts, grasslands, croplands, forests, and urban areas [23]. The land cover types and topographical characteristics make the landscape in China versatile. Numbers of subcategories exist for each land cover type. For example, in China, forests include subtropical forests, boreal forests, subalpine forests, and so on.
There are 836 weather stations that offer the observed T a (see Fig. 1), which is available from the China Meteorological Data Service Center (CMDC). 1 For the observed T mean , T min , and T max , we discover that they statistically have strong linear relationships (see Fig. 2), 1 CMDC: http://data.cma.cn/en   illustrating the potential of estimating them simultaneously. In [18], a model is trained separately several times to estimate T mean , T min , and T max . In [9], different models are trained to estimate T mean under three different sky conditions, which is rather complex. Therefore, estimating T mean , T min , and T max concurrently by leveraging only one model will be efficient. The strong linear relationships can not only be found for T a measured in mainland China, but also worldwide. 2 We display T a collected from stations worldwide in 2012 and 2015 (see Fig. 3 and Fig. 4). To see the strong linear relationships in other years, please refer to our code. Thus the strong linear relationships seem to be a statistical property of T a . Though the underlying mechanism is not clear, the linear property makes the estimation of T mean , T min , and T max possible. The proof is given below.
Assuming that we are estimating T mean y utilizing the satellite products x. That is, we find a model f where y means the observed T mean andŷ means the estimated T mean . We use coefficient of determination (R 2 ), mean absolute error (MAE), and root-mean-square error (RMSE) to evaluate the model f , and then whereȳ is the mean of y. Taking T min (or T max ) z as an example, since z and y have a strong linear relationship, we have a linear model h, and where a and b are the parameters in the linear model h, and a > 0. Then To illustrate that it is possible to estimate T minẑ p according to the strong linear relationships between z and y, we can suppose thatẑ and According to Minkowski inequality [24] As a result Similarly, The above proof illustrates that within some error range (R 2 , MAE, and RMSE), we can estimate T mean , T min , and T max simultaneously just by applying the strong linear relationships.

B. Remotely Sensed Data
We use LST data from the MYD11A1 version 6.1 daily product from MODIS Aqua satellite as the overpass times of MODIS Aqua are quite near the times that T min and T max occur [25].
The daytime LST (LSTD) and nighttime LST (LSTN) products are downloaded from Google Earth Engine [26] with a Python package geemap [27]. Since we try to estimate T mean , T min , and T max simultaneously, both the Aqua LSTD and LSTN products are used. The daily LSTD and LSTN are stacked to obtain LST images. Suppose that the size of the LSTD image is m × n pixels, then the size of an LST image in this article is m × n × 2 pixels. Because the study area is quite large and if 1 km spatial resolution were applied, the image size of an LST image would be 3492×6862×2 pixels. Thus, we reproject the LST products to the World Geodetic System 1984 (EPSG: 4326) and resample LST images to 5 km spatial resolution (789 × 1373 × 2 pixels).
Only the good-quality pixels whose average LST error is less than or equal to 1 K are kept, and other pixels are seen as pixels with missing values. Since southeastern China is with high humid, most of the days for LSTD and LSTN have no valid value (see Fig. 5). In [9] and [14], the missing values in LST products are replaced using temporal gap-filling methods. But for the study area utilized in this article, some missing data cannot be replaced since valid temporal adjacent LST values are not available (see Fig. 5).
Contrary to [9] and [14], we replace the missing values in LST images with a constant value rather than a temporal nearby value. The filled value in this study has no physical meaning and is only used for numeric computation. Since this investigation is not an LST gap-filling study, the filled value has no effect on the real LST products. The effects on estimation results caused by the filled value will be discussed in Section III-A.

C. Previous Statistical Methods
There are many other auxiliary datasets, such as normalized difference vegetation index (NDVI), that can be used together with LST for T a estimation. For simplicity, we only use LST to illustrate the estimation steps of previous statistical methods.
To estimate spatially seamless T a , no matter which machine or deep learning model is applied, there should be a method to impute the missing values in LST in advance [9], [14]. After imputing the missing values, then the data pairs between LST and T a were established (green pixels in LST and T mean images as shown in Fig. 6). A model was trained based on the data pairs and then applied to other pixels with valid values (blue pixels in the LST image, Fig. 6). However, as shown in Figs. 5 and 6, there may be pixels that cannot be filled by nearby clear-sky values when there is an area where most pixels are with missing values. Furthermore, if a model was trained on the data pairs, it may lose the spatial relationships between pixels in LST images and the spatial relationships in T a images.
Therefore, previous methods have to use land cover or elevation information to explicitly illustrate the spatial relationships to the model. In fact, the information about NDVI, land cover, and elevation has already been implicitly encoded into the LST products. That is, different land types or elevation leads to different LST values.

D. Daily T a Images
The T a (T mean , T min , T max ) measured in meteorological stations are data samples. Since T a are not independent in space, we convert the observed T a samples to a T a image for each day (see Fig. 7) to keep the spatial relationships between the observed T a samples. The detailed steps are as follows.
1) An image with the size of m × n × 3 is created (stacked T mean , T min , and T max images, respectively), and all pixels are given a constant value.
2) The values of those pixels whose geographical locations coincided with the location of meteorological stations are reallocated by the observed T mean , T min , and T max . The constant value in T a images will not affect the estimation results as the specific value will not participate in the loss calculation (illustrated in Section II-G).

E. Image Completion
Matrix completion is a method to predict the missing entries in a matrix by assuming that the matrix is highly sparse [28], [29], [30], and it has been widely used in many fields, such as computer vision and recommendation systems [31].
The sparsity property is also known as a prior for images [32], [33], [34]. According to [35], convolutional neural networks (CNNs) are popular tools for image generation and restoration as they can learn realistic image priors. For instance, CNNs are able to restore an image with a complex degradation [35], such as denoising, super-resolution, and inpainting.
Both studies on image inpainting and matrix completion demonstrate that there are priors in matrices and images, and according to these priors, it is possible to recover the missing entries in matrices and images. That is where x is a matrix or an image, x 0 is x with missing entries, and f 0 is a matrix or image completion method. Though LST and RGB images have different imaging mechanisms, they share some common priors, such as sparsity and locality. In this case, we can use CNNs to complete the missing Fig. 7. Illustration of the image-to-image training strategy with UNet. Given an input LST image, the UNet model will output the estimated (a) T mean , (b) T min , and (c) T max simultaneously. During the training phase (585 stations), the loss is calculated between the data pairs, and thus the filled value in T a images (d, e, and f) has no effect on estimation results. entries in LST images and then use the gap-filled results to estimate spatially seamless T a .
Since we only focus on T a estimation, a composite function f can be applied to estimate T a from LST images with missing values directlyT where x 0 ∈ R m×n×2 in this study is an input LST image, f 1 is a mapping function that represents the relationships between spatially seamless LST images and T a images, andT a ∈ R m×n×3 is an estimated T a image. We need to make sure that the estimated T a images are as close as possible to the observed T a for paired pixels. Thus, the inpainting processing for input LST images is implicitly performed. It should be noted that, unlike image and matrix completion methods, the composite function f should also contain the priors in observed T a .

F. UNet and Image-to-Image Training Strategy
As the relationship between LST and T a is complex and nonlinear, advanced machine learning or deep learning models can achieve better estimation results than linear models [7], [14], [17].
Deep learning techniques have powered various study areas, such as remote sensing, computational physics, and environment [36], [37], [38], [39], [40], [41]. UNet [42] has been widely applied to remote sensing and medical image processing according to its capability on pixel-level tasks such as semantic segmentation [43]. UNet is a specific encoder-decoder architecture with skip connections, and through the skip connections between mirrored layers, both the low-level and semantic features are available for the final estimation [22]. Therefore, we adopt the UNet model to perform the image-to-image training for spatially seamless T a estimation (see Fig. 7). In other words, UNet is utilized to learn the composite mapping function f .
For each day in 2012-2016 (1827 days in total), the input is an LST image with the size of 789 × 1373 × 2 pixels and the output is a T a image with the size of 789 × 1373 × 3 pixels.

G. Training Details
We randomly split the 836 stations into training, validation, and test sets with a ratio of 70% (585): 10% (83): 20% (168). To avoid overfitting, during the training phase, we pad the input images and then randomly crop the desired size out (789×1373). As a result, the UNet model only "see" a portion of the whole input image during the training phase. Horizontal random flip transformation is also applied during training.
There are 200 epochs and the batch size is 8. The initial learning rate is 1.6 ×10 −4 , and the cosine annealing schedule for the learning rate is utilized [44]. Please refer to the code 3 to see more training details and results.
Only the data pairs are utilized for loss computation. The smooth L 1 loss l smooth [45] is used as the loss function where y is the observed value andŷ is the estimated value. Thus, the filled value in T a images have no effect on estimation results (see Fig. 7). R 2 , MAE, and RMSE are applied to evaluate the estimation results. Only the test results are reported if not specified.

A. Effects Caused by Filled Values
The filled value in LST images is a hyperparameter, and the validation set is utilized to determine the value. To make the filled value distinguishable from the valid value ranges in LST images, we use −100, −200, and −500 • C to fill the gaps. According to Table I, the filled values do not have a significant effect on the estimation results, in terms of R 2 , MAE, and RMSE values. We use -100 • C to fill all the missing data in LST images since it achieves the best estimation results.

B. Overall Estimation Results
As shown in Table II .808 • C, and 2.823 • C for T mean , T min , and T max , respectively. Though T mean , T min , and T max are estimated concurrently, the estimation results of T mean is the better than that of T min and T max .
We select four special days in 2015 to visualize the estimation results (see Fig. 8) and videos that display the daily changes of T mean , T min , and T max during 2012-2016 is available online. 4 Note that the Summer solstice is the hottest day in a year, and the figure cannot display distinguish difference for T max in most regions (please zoom in to see the differences in different areas, Fig. 8). The estimated results display elevation information and clear seasonal changes and daily changes of T a (please see Fig. 8 and online videos). Fig. 8 and the online videos verify that the proposed method can implicitly fill the missing data in LST images and estimate spatially seamless T mean , T min , and T max , simultaneously. The filling process is similar to image completion studies [21], while we do not intend to fill the gaps in LST images but T a images. Though we only apply the LST images as the input and no other auxiliary datasets, such as normalized difference vegetation index and land cover, are used, the estimated T a displays the spatial heterogeneity in the study area. Southwestern China has a high elevation (the Qinghai-Tibet Plateau, Fig. 1), and the estimated T a in this area shows relatively lower T a compared with its nearby regions (see Fig. 8), indicating that the proposed method also captures the elevation information automatically. The possible reason is that the image-to-image training strategy grasps the spatial relationships in LST images and T a images.

C. Effects of Missing Values
To show the effects caused by the missing values, we split T a into three parts: the paired pixels in LST are recorded under clear-sky conditions (part A); the paired pixels in LSTD or LSTN are with missing values (part B); the paired pixels in LST are with missing values (part C). Note that LST images have two channels (see Fig. 7).
According to Fig. 9

D. Estimation Results Analysis
To analyze the estimate results, MAE values over space (see Fig. 10), time (see Fig. 11), and elevation (see Fig. 11) are calculated.
The MAE values of most stations are less than 3.000 • C (see Fig. 10), illustrating that the estimation results are stable in most regions. The stations with larger MAE values (≥ 3.000 • C) are mostly located in western China (see Fig. 10) where the density of stations is low, and stations with smaller MAE values (1.000 • C-2.000 • C) are located in eastern China where the density of stations is high. However, since the topography is much more complex in western China (see Fig. 1), we cannot state that the density of stations is the main reason accounting for the estimation accuracy with 100% confidence. For instance, in a flat region, the representativeness of a station should be better than the station in a region whose topography is complex. Thus, the representativeness of a station is not only determined  by the number of stations nearby, but also by the topographic complexity.
The MAE values are similar between different years, following the same pattern: larger MAE values in winter and lower MAE values in summer (see Fig. 11). The reason may be that in winter the variation of T a is higher compared with the variation in summer, making T a collected in winter more difficult to be estimated. The seasonal pattern that T a changes more significantly in winter than summer has also been reported in the United States. 5 Another possible reason is that the relationship between LST and T a is strong in late summer and fall, and weak in winter and early spring [46].
The MAE values rise first and then fall with the increase of elevation (see Fig. 11). However, the estimation accuracy is relatively stable at different elevations as the MAE values are generally less than 3.000 • C.

E. Estimation of Other Parameters
Since relative humidity (RH, whose unit is %) is highly related to T a , we use the LST images to estimate RH by leveraging UNet and the image-to-image training method. On the test set, the estimation results are with R 2 of 0.713, MAE of 6.725%, and RMSE of 8.713% (see Fig. 12). To the best of our knowledge, we are the first ones who try to estimate spatially seamless RH from MODIS LST products. 6 The estimation results of RH show the potential of using other satellite products that contain missing data to estimate spatially seamless near-surface parameters. For instance, aerosol optical depth (AOD) products, which can be used to estimate PM 2.5 [47], [48], [49], are also sensitive to cloud contamination. By leveraging the proposed method, it should be possible to obtain spatially seamless PM 2.5 from the AOD products that contain missing data.

F. Comparison With Other Methods
In 2022, a method is proposed in [18] to estimate T mean , T min , and T max by training a model several times. The RMSE values reported in [18] are in the range of 2.15 • C-3.20 • C for T max , 1.68 • C-2.79 • C for T min , and 0.86 • C-1.60 • C for T mean for five different parts of mainland China. We use only one model and we do not use other auxiliary products, while the error level is similar for the proposed method and the method proposed in [18]. The climate reanalysis dataset ERA5 [50] has the RMSE values of 2.543 • C, 3.324 • C, and 3.393 • C for T mean , T min , and T max , respectively. Compared with ERA5 and Fang's results, our method is simple yet efficient. In addition, the spatial resolution for the proposed method is 5 km, while it is around 27.8 km for ERA5 and 10 km for Fang's results.
In [7] and [9], 1 km spatial resolution T max and T mean are estimated, respectively. However, they directly split the data pairs to train models, and maybe independent stations are needed to verify the spatial estimation results of their methods. Plus, the estimation results in [7] are spatially incomplete, and in [9], there is no value for water-body regions.
Despite that our method is simple and straightforward, our method considers the spatial relationships between pixels in LST images and the spatial relationships between observed T a . As a result, we do not have to explicitly provide other variables, such as land cover and elevation, to the model to obtain estimated T a .
We only consider the worst case for estimation. For instance, we only use MODIS Aqua products, and according to [14], the combination of MODIS Terra and Aqua products will obtain better estimation results. We use the LST error flag ≤ 1 K, while we find that the 3 K error flag will provide better estimation results. This is not surprising since these pixels within the 3 K error flag are much better than the pixels filled with the constant value (−100 • C). In this case, we can say that our method would obtain better results if more LST products and ERA5 datasets were utilized as the input variables. However, to obtain the best numeric accuracy is not our purpose, and we try to provide a new paradigm for the estimation of T a and other surface parameters.
Our method also works well on fine-scale satellite products, and please refer to our code to check the estimation results for Finland with a 1 km spatial resolution.

IV. CONCLUSION
In this article, we propose a deep learning method to estimate spatially seamless T a (T mean , T min , and T max simultaneously) with the presence of missing data in LST products. Encouraging estimation results are obtained with an image-to-image training strategy and the UNet model. The estimation results of RH illustrate that the proposed method provides a new paradigm for estimating near-surface parameters from satellite products that contain missing values.
In the future, the question of whether the topographic complexity or the density of weather stations contributes most to the estimation accuracy should be further investigated. Additionally, the underlying mechanisms of why T mean , T min , and T max statistically have strong linear relationships need to be explored and the image-to-image training strategy calls for lightweight convolutional neural networks to deal with large-scale remote sensing images. Peifeng Su received the Ph.D. degree in doctoral programme in atmospheric sciences from the University of Helsinki, Helsinki, Finland, in 2022.
His main research interests include deep learning, remote sensing, and computer vision.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Temesgen Abera receives the Ph.D. degree doctoral