Multicascaded Feature Fusion-Based Deep Learning Network for Local Climate Zone Classification Based on the So2Sat LCZ42 Benchmark Dataset

A detailed investigation of the microclimate is beneficial for optimizing the urban inner/spatial pattern to enhance thermal comfort or even reduce heatwave disasters, whereas the difficulty of accurately classifying local climate zones (LCZs) severely restricts the analysis of thermal characteristics. Generally, deep learning-based approaches are effective for adaptive LCZ mapping, yet they often have poor accuracy because inadequate cascade feature extraction patterns may not adapt to the fuzzy LCZ boundaries produced by intricate urban textures, especially when using large-scale datasets. To address these issues, we propose a novel CNN model in which we design a strategy that incorporates a triple feature fusion pattern to enhance LCZ recognition based on the So2Sat LCZ42 large-scale annotated dataset. The approach connects multilayer cascaded information to participate in the category judgment, which avoids, as much as possible, the loss of effective feature information through continuous cascade transformations. The results show that the overall accuracy and kappa coefficient of the proposed model reach 0.70 and 0.68, respectively, an improvement of approximately 4.47% and 6.25% over advanced LCZ classification approaches. In particular, the proposed approach largely maintains its accuracy even after the fusion structure is partially removed or the layer depth is reduced. Finally, we discuss several items, including the effectiveness of different parameters and cascaded feature fusion patterns, the superiority of multilayer cascade feature fusion, the mapping impact of seasons and cloud cover, and several other issues in LCZ mapping. This article will facilitate improvements in the research precision of urban thermal environments.


I. INTRODUCTION
The classification criteria of local climate zones (LCZs) [1] are crucial for bridging the gap between urban texture and intracity climate distribution. In general, in accordance with the conventional concept of urban heat islands, the intensity of urban heat has been determined by quantifying the temperature increment between the urban area and the surrounding countryside. Nevertheless, this measurement method has often been doubted for various reasons. For example, 1) the urban-rural boundary is hard to define, especially since many of the adjacent rural areas are subject to variation via urban expansion; 2) the "absolute" increment of urban temperatures over suburban regions cannot be accurately quantified; and 3) there is insufficient evidence that the variation in urban temperature significantly depends on suburban substrates [2]. Hence, 17 categories of LCZs were proposed by Oke et al. [1] to describe the urban underlying land in terms of different physical properties (as shown in Fig. 1). Generally, LCZs can describe urban form and function [3], the understanding of which is significant for exploring the factors affecting thermal variation.
Remote sensing technology provides a key data basis for current LCZ classification. To date, two types of data sources, geographic information systems (GIS) and remote sensing (RS), have been commonly adopted to classify LCZs. Nevertheless, the GIS-based approach is limited by three factors: 1) inadequate updating, accuracy, and accessibility of data; 2) a lack of available GIS data in many global cities; and 3) quality disparities in GIS data between cities. In contrast, RS-based techniques can often provide large volumes of observation data for implementing LCZ mapping [4], [5]. Numerous studies are currently being carried out to classify LCZs using remote sensing image products acquired by the Landsat [6], [7] and Sentinel [8], [9] series of satellites. Additionally, some open-access RS platforms, including the World Urban Database and Portal (WUDAPT) [10] and Google Earth Engine [11], have already become common tools for classifying LCZs [12]. Users can use WUDAPT to annotate training sets on online high-resolution RS images and achieve mapping with a random forest algorithm. In addition, some studies have been carried out on large-scale LCZ mapping supported by remote sensing imagery [13], [14], [15].
The rapid development of deep learning techniques in remote sensing image processing has made intelligent LCZ classification possible. Procedurally, LCZ mapping is similar to semantic segmentation, which fits the relationship between pixels and targets. In recent years, deep learning-based models that fit nonlinear relationships via ensembles of multiple layers have been proven efficient in RS-based semantic segmentation [16], [17], [18]. Convolutional neural networks (CNNs), which are efficient approaches for realizing semantic segmentation in deep learning, have gradually become available approaches for LCZ mapping [19], [20], [21].
Data coupling is critical for CNNs to gain the appropriate parameters to identify local climate zones. Typically, Qiu et al. [19] analyzed the feature importance of Sentinel-2 imagery to facilitate a residual convolutional neural network (ResNet) in identifying LCZs. The work of Zhou et al. [22] and Fung et al. [23] showed that adding two types of land property information, namely, building height and imperviousness, can also boost the accuracy of recognizing LCZs. Nonetheless, data accessibility and annotation variability often limit the acceptance of these methods. Consistent dimensionality between training and prediction data is a requirement for CNN recognition, whereas the restricted land property description data and differences in expert annotation knowledge make such approaches difficult to reproduce. To address this issue, Zhu et al. [24] from the German Aerospace Center proposed the So2Sat LCZ42 dataset, with tens of thousands of images obtained by multiband Sentinel-1 and Sentinel-2 sensors from 42 global cities.
CNN-based approaches must explore the effect pattern of cascade feature extraction to discriminate LCZs. In general, classifying LCZs often requires that they be jointly identified by the image pixels that describe climate-associated surface properties [22], which may exhibit a combination of multiple land components rather than the clear feature boundaries of objects, as in semantic segmentation. Retaining valid object information is a challenge for CNNs. According to existing studies, most CNN-based approaches have mapped LCZs by migrating or building semantic segmentation models, such as [6], [7], [20], [22]. However, these techniques still adopt the basic pattern of a continuous cascade of semantic segmentation CNNs; i.e., they build networks from multiple convolution-pooling connections. Such CNNs may be effective in classifying LCZs for a local region or with a limited amount of data, but their ability to maintain performance still needs to be verified, because the common continuous cascade model of CNNs loses much texture information as the network deepens while processing the input; this loss is detrimental to discriminating LCZ boundaries. Additionally, the typical CNNs used for RS image classification exhibit poor accuracy on large-scale datasets; So2Sat LCZ42 also demonstrates that insufficient or mismatched feature extraction abilities are unsuitable for global LCZ classification [24]. In fact, we believe that retaining the maximum amount of referenceable information from the input to the category discriminant, by mutually fusing multiple cascade features, is crucial for making CNNs more appropriate for classifying LCZs.
Inspired by the structure of UNet, this article designs an end-to-end network architecture for LCZ classification (MCFUNet-LCZ). The proposed CNN incorporates triple cascaded feature extraction, which contains two feature concatenation fusions and one global cascaded feature fusion module. Additionally, we adopt the So2Sat LCZ42 large-scale annotated dataset as the data basis to compare the classification performance of the proposed model with that of baseline approaches. Moreover, our approach offers more options for derivative structures, because it can maintain its performance when the layer depth or feature fusion components are reduced. The experiments show that the proposed model is more accurate than other cutting-edge CNN-based LCZ classifiers.
The contribution of this article is to propose a novel CNN structure that can obtain more accurate LCZ classification results, thereby further enhancing the exploration accuracy of urban thermal patterns. The main advantage of this structure is that reducing the number of network components keeps the accuracy within a controlled interval, which provides the flexibility to adapt the model size to the computing conditions. In addition, the So2Sat LCZ42 dataset is one of the few large-scale, multisensor datasets produced by a professional institution. We validate the approach on this dataset so as to assist the application of LCZ mapping. Additionally, our approach may provide a technical reference for other scholars or beginners. The rest of this article is organized as follows. First, we emphasize the feature fusion strategy and the entire structure of the proposed approach and then describe the compared baseline models in Section II. Section III mainly illustrates the So2Sat LCZ42 dataset and the experimental setup. The LCZ classification results are demonstrated in Section IV, and we subsequently discuss the effects of setting the parameters and using different feature fusion patterns in Section V. Additionally, we especially address prominent issues that arise in LCZ classification in Section V. Finally, Section VI concludes this article.

II. METHODS
This section focuses on the three feature fusion modules used in this article. It then outlines the benchmark CNNs used for comparison in the experiments.

A. MCFUNet-LCZ for LCZ Classification
Enhancing the ability to extract refined features should be a core strategy in designing neural networks. Nevertheless, how to design the cascade mapping for capturing the outline characteristics of LCZs is still an issue that should be further probed.
This article proposes a novel CNN that fuses multilayer and multilevel features to fit suitable parameters (as shown in Fig. 2). We design three cascaded feature fusion patterns in this approach: a dual-pooling fusion (DPF) module, a coding-decoding fusion (CDF) module, and a global probability fusion (GPF) module. Of them, the design strategy for the DPF module is inspired by the work of Qiu et al. [25]. In the following sections, we discuss these modules in detail.
1) The DPF strategy involves jointly performing multiconvolution (MC) and multilevel abstraction. Here, we adopt a combination of two convolution blocks to obtain the feature maps. Each convolution block incorporates a convolution layer, a normalization process, and a rectified linear unit (ReLU) activation function. In general, in a CNN structure, the function of max pooling is to abstract the object features obtained from the convolution operations. Max pooling can capture the outline of the land surface elements in RS images by using the extreme value of the image digital number (DN) while simultaneously weakening the background information. However, for LCZs, urban background information is critical for improving contextual textures. Hence, we implement a concatenation mode that contains a max-pooling layer and an average-pooling layer; this mode retains diverse feature information by concatenating max-pooled coherent textures and average-pooled smooth background features (as shown in Fig. 3). First, we employ a two-layer convolutional network to characterize the feature information. To avoid vanishing gradients, we add a batch normalization (BN) layer after each convolution operation to return the discrete values to a normalized distribution, thereby speeding up the convergence of the algorithm. The formula for the convolutional layer passing through the BN layer can be written as

BN(L_n(x)) = γ · (L_n(x) − θ̂_β) / α̂_β + β

where L_n(x) is the result of the nth convolutional layer, α̂_β represents the standard deviation of the mini-batch during training, θ̂_β is the mean of the mini-batch, and γ and β represent the stretch scale and offset shift, respectively, which are obtained by network training; the shapes of both are consistent with the input L_n(x). To facilitate information extraction for the pooling layers, we add an activation layer to map all values to the first quadrant.
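As a concrete illustration (not the actual network code, which uses Keras layers), the dual-pooling idea can be sketched in a few lines of NumPy; this sketch assumes non-overlapping 2×2 pooling windows and a channel-last (H, W, C) feature map:

```python
import numpy as np

def dual_pooling_fusion(feature_map, stride=2):
    """DPF sketch: pool the same feature map with both max and average
    pooling over non-overlapping stride x stride windows, then concatenate
    the two results along the channel axis."""
    h, w, c = feature_map.shape  # H and W assumed divisible by stride
    windows = feature_map.reshape(h // stride, stride, w // stride, stride, c)
    max_pooled = windows.max(axis=(1, 3))   # coherent texture (outlines)
    avg_pooled = windows.mean(axis=(1, 3))  # smooth background information
    return np.concatenate([max_pooled, avg_pooled], axis=-1)  # (H/2, W/2, 2C)
```

A (4, 4, 1) input thus becomes a (2, 2, 2) map whose two channels carry the max-pooled and average-pooled views of each window.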
The pooling functions are as follows. Assume M_kij represents the max-pooling output of area F_ij from the kth feature map, ij represents the pixel index of the feature map, and X_kab is the pixel value of the subarea within a stride. The max pooling, average pooling, and dual-pooling fusion are, respectively,

M_kij = max_{(a,b)∈F_ij} X_kab
A_kij = (1 / |F_ij|) Σ_{(a,b)∈F_ij} X_kab
D^n_kij = concat(M_kij, A_kij)

where n is the layer index of the DPF. Since the main goal in the encoding path is to characterize the object features, we use the global maximum pooling method to reduce dimensionality, which can weaken the influence of the backgrounds:

G^max_k = max_{(a,b)} X_kab.

2) Regarding the CDF, upsampling is a decoding operation that is used to further expand the convolutional feature domain. Because the convolution operations cause the upsampling to lose boundary information, we concatenate both layers at each network depth to retain as much descriptive information as possible. Additionally, we add a dropout operation (with rate 0.1) to avoid overfitting and the hindrance of gradient descent caused by transmission between plentiful neurons. All concatenated layers after upsampling are sent to global average pooling to reduce the dimensionality of the massive feature maps, thus facilitating classification. Assuming that each concatenation between coding and decoding is C, the final result of the kth feature map after global average pooling can be expressed as

G_k = (1 / |C_ij|) Σ_{(a,b)∈C_ij} X_kab

where |C_ij| is the total number of pixels and Σ_{(a,b)∈C_ij} X_kab denotes the sum over each image pixel. Eventually, all features form a one-dimensional (1-D) vector.
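The decoding-side fusion can be sketched similarly; the following minimal NumPy illustration (dropout omitted) assumes nearest-neighbour upsampling and a channel-last layout:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour upsampling by 2 along both spatial axes of (H, W, C)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def cdf_step(decoder_map, encoder_map):
    """One CDF step: upsample the decoder feature map, concatenate it with
    the same-depth encoder map (the skip connection), and reduce the result
    to a 1-D vector via global average pooling."""
    up = upsample2x(decoder_map)                         # (2H, 2W, C_dec)
    concat = np.concatenate([up, encoder_map], axis=-1)  # (2H, 2W, C_dec + C_enc)
    return concat.mean(axis=(0, 1))                      # length C_dec + C_enc
```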
3) We adopt the GPF module to transform the feature space from the global pooling operations into voting probabilities, which are used to make the LCZ class decision (as shown in Fig. 4). The purpose of the global pooling layers is to reduce the number of neurons brought by the block operations, thus releasing the computing costs caused by the massive feature maps. After the global pooling layers, all feature maps are transformed into 1-D feature vectors to join the classification. A dense-layer classifier is then employed to compute the voting probabilities of each feature vector for each LCZ class. In this article, we use the softmax classifier for this calculation:

softmax(p_i) = e^{p_i} / Σ_{j=1}^{N} e^{p_j}, for i = 1, . . . , N   (8)

where p_i is the result of the pooling layers and N represents the 17 LCZ classes. Ultimately, the LCZ to which each feature vector belongs is determined by averaging the voting probabilities from all dense layers in MCFUNet-LCZ. Once training is complete, the model records the network parameters so as to make predictions on new data.
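The voting scheme of the GPF can be sketched as follows; the softmax matches (8), and the class decision averages the per-dense-layer votes:

```python
import numpy as np

def softmax(p):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(p - p.max())
    return e / e.sum()

def gpf_decision(dense_outputs):
    """GPF class decision: softmax each dense-layer output into a vote over
    the N LCZ classes, average the votes, and return the winning index."""
    votes = np.stack([softmax(p) for p in dense_outputs])
    return int(np.argmax(votes.mean(axis=0)))
```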
With regard to the scaling of the feature maps, the convolutional operations change only the channel magnification of the input feature map, and the main function of the pooling layers is to reduce the size of the feature map by powers of 2. The explicit description of the model structure is shown in Fig. 4. For instance, during the second-layer processing, the convolution layers expand the feature map to w/2 × h/2 × 2f, which is then further refined to w/4 × h/4 × 2f via max pooling. Consequently, the input images are successively downsampled from w × h × f to w/32 × h/32 × 16f after undergoing the operations of the convolution and pooling blocks (as shown in Fig. 5). The extracted feature contours are then progressively mapped back to their original dimensions by upsampling, thereby enhancing the eigenvalues that the model has learned. As such, the concatenate operation is used to connect the downsampling and upsampling features to help determine the category judgment, instead of referencing only the results acquired from the information delivery when the layer computation ends. Specifically, the structure of our proposed method is a comprehensive estimation mechanism for the local climate zone to which each feature belongs.
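The stated shape progression can be traced with a small helper; this sketch assumes the first block keeps f channels while every later block doubles them, which reproduces the w/4 × h/4 × 2f and w/32 × h/32 × 16f shapes mentioned above:

```python
def encoder_shapes(w, h, f, depth=5):
    """Trace (width, height, channels) through the encoder: each block
    halves the spatial size; blocks after the first double the channels."""
    shapes = [(w, h, f)]
    for n in range(1, depth + 1):
        shapes.append((w // 2 ** n, h // 2 ** n, f * 2 ** (n - 1)))
    return shapes
```

For a 32 × 32 patch with f = 16, the final encoder shape is (1, 1, 256), i.e., w/32 × h/32 × 16f.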

B. Baseline Models
To test the adaptability of cutting-edge deep learning-assisted semantic segmentation techniques in LCZ classification, we selected typical and recent models that are often applied to remote sensing image processing or image-based pattern recognition problems, implemented experiments with them, and compared them with our proposed approach. The detailed information of the selected models is listed in Table I.

III. EXPERIMENTAL SETUP
In this section, we describe the image dataset that we used for LCZ classification as well as the data processing stream. Next, we explicitly illustrate the experimental flow in which the parameters are determined, the baseline model training, and the performance estimation indicators.

A. So2Sat LCZ42 Dataset
The So2Sat LCZ42 dataset [24] is a collaborative effort of experts who meticulously designed a method for annotating the LCZ regions in 42 cities selected across the world (plus small-scale regions in 10 additional cities: Guangzhou, Moscow, Jakarta, Mumbai, San Francisco, Munich, Nairobi, Santiago de Chile, Sydney, and Tehran). In the dataset, each image patch has a size of 32×32 and was acquired from Sentinel-1 and Sentinel-2 satellite images (as shown in Fig. 6). For the annotations, the authors relied on WUDAPT [41] to make labels and then divided all of the obtained images into training, test, and validation sets; the sets are completely independent and nonoverlapping. The So2Sat LCZ42 dataset is one of the few open datasets available to assist machine learning algorithms in implementing LCZ classification. Additionally, the dataset, which was rigorously quality-assessed by professional staff, achieves an overall confidence level of 85% for the 400,673 LCZ labels. The dataset also provides an interface to the TensorFlow framework (https://www.tensorflow.org/datasets/catalog/so2sat) and has been released in three versions. In the first version, the training set comprises samples from cities around the world, while the validation set comprises samples from the western half of the 10 additional cities. Version 2 adds a test set to the training and validation sets; the test set consists mainly of the eastern half of the annotations for the 10 additional cities. For version 3, the designers mainly provide three dataset divisions.
1) The random split, in which the samples are divided into 80% training and 20% testing, randomly sampled within every city.
2) The block split, which splits each city geospatially in an 80%/20% manner.
3) The cultural-10 split, in which 10 culturally diverse cities are reserved for testing.

Additionally, using the So2Sat LCZ42 dataset provides two advantages that are conducive to extending the approach. First, it has been shown that image sizes from 32×32 to 64×64 are more beneficial for training deep learning-based models to identify LCZ regions [20]. The size of each image in the So2Sat LCZ42 dataset is 32×32 and thus satisfies this inference. Second, deep learning models often demand large numbers of training samples, and the quality of self-labeling would seriously restrict the usability of a model. In contrast, the So2Sat LCZ42 dataset is a large-scale dataset that was annotated with tens of thousands of samples by a professional team. Such traits are critical for quantitative model testing and automated LCZ classification.
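The random split described in item 1) can be sketched as follows; `samples_by_city`, a mapping from city names to lists of patch identifiers, is a hypothetical structure for illustration and not part of the dataset API:

```python
import random

def per_city_random_split(samples_by_city, train_frac=0.8, seed=0):
    """Within every city, shuffle the patches and assign 80% to training
    and 20% to testing, mirroring the 'random split' of version 3."""
    rng = random.Random(seed)
    train, test = [], []
    for city in sorted(samples_by_city):
        patches = list(samples_by_city[city])
        rng.shuffle(patches)
        cut = int(len(patches) * train_frac)
        train.extend(patches[:cut])
        test.extend(patches[cut:])
    return train, test
```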
In this article, we use the second version of the So2Sat LCZ42 dataset, adopting only the Sentinel-2 data samples, which have 10 real-valued bands with spatial resolutions between 10 m and 20 m, as listed in Table II. All bands that originally had a spatial resolution of 20 m were resampled to a 10 m ground sampling distance by upsampling to facilitate the unified loading of the data.
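The 20 m to 10 m harmonization can be sketched with a nearest-neighbour upsampler (the dataset authors' exact resampling method may differ; bilinear or cubic interpolation are common alternatives):

```python
import numpy as np

def to_10m_grid(band_20m):
    """Duplicate each 20 m pixel into a 2x2 block so the band aligns with
    the 10 m grid; a band of shape (H, W) becomes (2H, 2W)."""
    return np.repeat(np.repeat(band_20m, 2, axis=0), 2, axis=1)
```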

B. Experimental Setting and Assessment Metrics
1) Experimental Setting: To control the variables, we adopted a uniform combination of parameter settings, optimizers, and output patterns for all models in the experiments. We build the network layers with the Keras framework (https://keras.io/guides/), a well-received programming toolbox that is often applied to implement deep learning models. All of the models incorporate a Nesterov Adam optimizer, which adds gradient-accelerated convergence on the basis of Adam during the training iterations. A batch size of 32 patches is used as the data capacity for both the input images and LCZ labels in each loading. To avoid overfitting as much as possible, the initial learning rate is set to 0.02, and the rate is halved after every five epochs. Each training run is given 100 epochs, and training ends early when the validation loss does not improve for 40 epochs. Meanwhile, the model dynamically keeps the parameters with the smaller validation loss for making predictions on new data.

2) Metrics: To assess the performance of each model, we adopted several evaluation metrics, including the overall accuracy (OA), kappa coefficient, and confusion matrix. Moreover, the machine learning test metrics integrated into the scikit-learn library (https://scikit-learn.org/stable/) have also been used in the experiments to measure the classification accuracy of the models; these metrics include precision, recall, the F1-score, the macro-average accuracy (MA), and the weighted-average accuracy (WA). The MA is the average of the precision, recall, and F1-score computed separately for each class, and the WA is an average accuracy calculated by weighting each class according to the proportion of its sample size. In addition, we employed three metrics from CNN segmentation, namely precision, recall, and the F1-score, to analyze performance differences on each LCZ class.
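The learning-rate schedule and the two headline metrics can be sketched directly; the schedule below reproduces the halve-every-five-epochs rule (in Keras it would typically be wrapped in a `LearningRateScheduler` callback), and the metric functions operate on a confusion matrix:

```python
import numpy as np

def learning_rate(epoch, initial_lr=0.02):
    """Start at 0.02 and halve the rate after every five epochs."""
    return initial_lr * 0.5 ** (epoch // 5)

def overall_accuracy(conf):
    """OA: correctly classified samples (the diagonal) over the total."""
    return np.trace(conf) / conf.sum()

def kappa(conf):
    """Cohen's kappa, (p_o - p_e) / (1 - p_e), where p_e is the chance
    agreement implied by the row and column marginals."""
    total = conf.sum()
    p_o = np.trace(conf) / total
    p_e = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    return (p_o - p_e) / (1 - p_e)
```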

IV. RESULTS
In this section, we first show the LCZ classification and mapping results of the proposed CNN. Then, we analyze the classification accuracy of the proposed approach on each LCZ. Finally, we compare the proposed model with several semantic segmentation CNNs with outstanding performance as well as several state-of-the-art CNNs for LCZ classification.

A. Metric Results of MCFUNet-LCZ
The So2Sat LCZ42 dataset is one of the few large datasets that have been annotated by professionals and that include multiple remote sensing information sources, covering not only the visible light bands but also the near-infrared and shortwave infrared bands. Exploring a better-adapted algorithm on such a dataset will greatly benefit the implementation of other applications and the promotion of LCZ research. After experimental validation, the MCFUNet-LCZ proposed in this article achieves an OA of 0.70 and a kappa coefficient of 0.68 on this dataset; both results are better than those obtained by advanced CNN-based LCZ classification methods.
Designing a high-efficiency structure based on the So2Sat LCZ42 dataset is significant for promoting the development of the LCZ domain. As mentioned previously, the annotation quality of a dataset is crucial for CNN segmentation. The production of So2Sat LCZ42 conforms to three requirements: unified annotation knowledge, broad urban locations, and a multisensor data source. Such traits address the reasons that public datasets may not be adopted for other applications, including limited amounts of data, local-only locations, and difficult-to-invoke interfaces. Nonetheless, the OA obtained by current CNN-based LCZ models tends to be approximately 0.6, which is not conducive to applying large-range LCZ mapping. Consequently, the contribution of this article is to propose a CNN structure that improves the OA to 0.7. Inspired by this article, follow-up research can further explore methods with better accuracy and ultimately achieve high-quality global LCZ mapping products.
An intricate urban context often results in confusion among the categories to which LCZs belong. Examining the confusion matrix for the model, as shown in Fig. 7(a), reveals many confusions both among the categories containing buildings and among the suburban context categories. These confused classification results tend to occur mainly between urban-urban or suburban-suburban categories. Sometimes, confusion may occur when the classification relies only on the ability of the method to recognize local climate zones in large cities. At present, some studies have tried to use road network data for guidance in deep learning [22]. However, road networks only delineate possible urban functional areas, providing the classification algorithm with more data for fitting the classification surface rather than directly identifying LCZs. The real need for auxiliary data in LCZ classification is perhaps urban canopy texture description information, such as the fusion of urban building height data with vegetation distribution, especially for trees scattered among buildings. Otherwise, it will be hard for an approach to discern the fuzzy boundaries between LCZ classes, leading to inevitable confusion and misclassification.
In this article, the LCZ mapping performance of the proposed method is verified using the urban built-up area of Beijing, the capital city of China, on August 10, 2021 (as shown in Fig. 8). The identification results show that MCFUNet-LCZ can identify analogs of all 17 local climate zone classes in place, and its accuracy is sufficient to support analyses of the relevant problems. In Fig. 8, relying on the macroscopic map, we further select four typical regions to demonstrate the effectiveness of the method, including water regions with vegetation spectral features, regions characterized by a mix of compact buildings and water bodies, and regions characterized by a mix of buildings and vegetation. The results show that the approach proposed in this article can predict the LCZ regions to which an area belongs across a variety of ground cover objects.
Additionally, we found a special case in which dense vegetation withers in winter in high-latitude areas, losing its spectral information and producing inevitable variation in LCZs over the course of a year. For this case, we further discuss how to balance the time scale to address such seasonal differences in the discussion sections.

B. Accuracy Analysis of Each LCZ
Each LCZ also exhibits prominent differences among the classification results. Fig. 7(b) shows the detailed results for each LCZ in terms of the evaluation metrics. In contrast to semantic segmentation, local climate zones that consist of multiple types of ground objects create certain challenges for CNN fitting. The results show that the accuracy differs across LCZ types and has no obvious positive correlation with the number of test samples (support), indicating that MCFUNet-LCZ still has shortcomings in fitting some categories. Even so, MCFUNet-LCZ still achieves better accuracy than the existing CNNs, indicating that the proposed structure is more effective for LCZ classification.
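The per-class metrics reported in Fig. 7(b) rest on standard confusion-matrix arithmetic; a minimal NumPy sketch is given below (scikit-learn's `classification_report` computes the same quantities, with support being the row sums):

```python
import numpy as np

def per_class_prf(conf):
    """Per-class precision, recall, and F1 from a confusion matrix whose
    rows are true classes and columns are predicted classes."""
    tp = np.diag(conf).astype(float)
    precision = tp / np.maximum(conf.sum(axis=0), 1)  # guard empty columns
    recall = tp / np.maximum(conf.sum(axis=1), 1)     # guard empty rows
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision, recall, f1
```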
Furthermore, the results clearly show a drop in accuracy for some classes. For instance, the prediction accuracy of MCFUNet-LCZ decreases significantly for LCZ 1, LCZ 7, LCZ C, and LCZ E. There are two reasons for this decrease, as follows.
1) Image Resolution: The pixel features of greenery are lost because the resolution of remote sensing imagery compresses much land information about the objects. This situation makes it too difficult for the CNN to extract subtle textures. Unfortunately, the scarcity of openly available remote sensing data for producing large-scale datasets from global cities makes resolution limitations difficult to overcome; consequently, using 10-30 m resolution data to construct the LCZ sample library remains the most convenient way for CNNs.
2) Boundary Confusion: Diverse backgrounds often create great challenges for CNNs in identifying the boundaries of LCZs. In fact, the most important aspect of CNNs is obtaining the best classification parameters for the neural network layers through training. Therefore, making CNN processing more conducive to digesting the input data is necessary to improve accuracy. Here, some deep learning techniques, such as broad (width) learning and knowledge graphs, are still unused and unproven in LCZ classification even though they have the potential to assist in recognition. These techniques still need to be investigated to verify whether they can help increase CNN performance. Additionally, enhancing the receptive field of a CNN is a way to extract effective image context information. Furthermore, it is necessary to analyze the image characteristics of LCZs to find a comprehensive technique for high-accuracy classification, because each LCZ perhaps exhibits a unique rule of composition.

C. Comparison With Baseline Models
We select 14 typical semantic segmentation CNNs and 4 cutting-edge LCZ CNNs as benchmark methods for comparison. The experimental results show that the proposed approach can obtain better accuracy on the So2Sat LCZ42 dataset (as shown in Fig. 9). Meanwhile, the outstanding results also demonstrate that the structure of MCFUNet-LCZ is appropriate for training and predicting LCZs.
Concerning model comparisons, SegNet, PSPNet, DeepLabv3plus, and DenseUNet all outperformed the other semantic segmentation models; this finding may provide a valid reference for exploring other effective deep learning-based models for determining the classification boundaries among LCZs. In terms of the precision, recall, and F1-scores obtained across all LCZ predictions, the proposed method converges the accuracy to a better range than both the advanced semantic segmentation models and the LCZ-CNN, LCZ-CNNWP, and MFCNN models dedicated to LCZ classification, indicating that the proposed model exhibits superior predictive ability in LCZ classification. In addition, the multilevel feature fusion-based CNN proposed by Qiu et al. [25] also showed excellent performance, indicating that strengthening the cascaded feature fusion between network layers is an effective design pattern for LCZ classification.
Some semantic segmentation models also perform well in this classification. SegNet and PSPNet show good control of the floating range of the results across the three metrics; this finding offers insight for analyzing the structural superiority of these models to improve the fitting capability of deep learning approaches for local climate zones. These two models also use a multifeature fusion strategy: SegNet employs the feature combination of the encoding network and the decoding network, while PSPNet adopts a fusion of the results of multifilter-scale segmentation. Additionally, some approaches with high computing costs, such as DeepLabv3plus and PSPNet, demonstrate excellent accuracy, yet their large parameter counts may lead to severe hardware requirements or time costs; therefore, users must select a suitable approach according to their equipment conditions.
Apparently, some LCZ categories with architectural and natural attributes show a low-accuracy trend across multiple CNNs, possibly because the feature extraction mechanism of a CNN does not successfully characterize the classification features or boundaries of these categories. These categories include LCZ 1, LCZ 7, LCZ 10, LCZ C, and LCZ E (as shown in Fig. 10). In particular, the DeepLabv3plus model is characterized by embedded atrous convolution modules that expand the receptive field of the CNN. In the results of DeepLabv3plus, the accuracy of LCZ C and LCZ E increases significantly; this finding indicates that expanding the receptive field may help improve the resolution accuracy of these two categories.
Overall, compared with the other benchmark CNNs, our proposed method achieves the best classification performance, indicating that the strategy of multicascaded feature fusion is effective for LCZ discrimination. In the future, scholars can adopt this strategy to explore better CNNs.

V. DISCUSSION
In this section, we explore several core issues in detail: the sensitivity of the model to different parameters, the effectiveness of the feature fusion modes, the performance variation under different GPF configurations, the superiority of multicascaded feature fusion, the differences in MCFUNet-LCZ mapping under seasonal and atmospheric conditions, and our perspectives on several key issues in LCZ mapping. We divide the discussion of the effectiveness of multicascaded feature fusion into two parts to examine the role of each feature fusion pattern in a targeted manner. The first part covers DPF and CDF, which form concatenation relationships in the encoding-decoding path, and the second part covers GPF, which integrates information from the entire network. Details are described in the following sections.

A. Sensitivity of Initial Parameters
The initial filter for feature extraction and the dropout ratio are two essential parameters that control the performance of the overall network structure. In this article, the detailed performance differences of the method are revealed by testing different values of these parameters. Typically, the initial filter is set to a power of 2. The initial filter controls the range of perceptual fields for feature extraction across the network layers, and each subsequent convolution block refines the feature map using exponential multiples of the initial filter as its new filter parameter.
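The filter schedule described above can be sketched as follows. This is a minimal illustration, not the authors' actual code; the function name `block_filters` and the five-block depth are assumptions for demonstration.

```python
# Hypothetical sketch: how an initial filter count expands by powers of two
# across successive convolution blocks, as described for MCFUNet-LCZ.
def block_filters(initial_filter: int, depth: int) -> list[int]:
    """Return the convolution filter count for each of `depth` blocks,
    doubling the initial filter at every level."""
    return [initial_filter * (2 ** level) for level in range(depth)]

# With an initial filter of 32 and five blocks, the per-block filters are
# 32, 64, 128, 256, and 512, matching the range discussed later.
print(block_filters(32, 5))  # [32, 64, 128, 256, 512]
```

With an initial filter of 8 or 16, the same doubling rule yields the shallower filter ranges tested in the experiments.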
According to the results in Fig. 11, the performance of the proposed method improves with the exponential growth of the initial filter. The experiments tested three initial filters, set to 8, 16, and 32, which are commonly used in CNN-based LCZ classification. Larger initial filter parameters incur greater computational and training time costs. In particular, with the 8 to 16 GB of memory configured on conventional computers, larger initial parameters tend to cause memory overflow, hindering the training of the CNN.
After the experiments, the proposed method balances computing power and time well when the initial filter is set to 32; this setting also keeps each iteration within 7 min. Additionally, the algorithm converges to the best validation loss at the eighth epoch, and training is terminated after 40 iterations once the loss has converged satisfactorily. Moreover, the overall training time can be kept to approximately 6 h to obtain a usable classification model. Dropout ratios are effective in reducing the time consumption and overfitting that can occur in a deep learning model during classification. In general, deep learning models remain vulnerable to overfitting, and once this happens, the model fails. To address this problem, many scholars have adopted ensemble approaches that collaboratively investigate and evaluate classification objects, alleviating the pressure on a single model to fit the data space. Nevertheless, the time consumption of an ensemble model increases the computational cost of the method. In contrast, the dropout operation randomly masks neurons at a certain rate in each operation, thereby transforming the fitting of the sample space by the overall network into the collaborative work of multiple mininetworks. Fig. 12 shows the results of the experiments conducted in this article with dropout ratios from 0 to 0.9. The figure shows that the model attains its best precision when the dropout is 0.1 or 0.9, with an overall accuracy of 0.70. Most of the other ratios obtain an overall accuracy of 0.68 and a kappa coefficient of 0.65, with the worst result occurring at a dropout of 0.5.
A change in the dropout ratio does not increase or decrease the overall parameters of the algorithm, but it does adjust the adaptability of the model structure to the data type and prediction target. In fact, the essence of training is to allow the artificial neuron nodes, which mimic the nerves of the human brain, to obtain the appropriate coefficients and parameters in the function, thereby allowing the algorithm to compute the best class for each image element. Adding the dropout operation ensures that this process is not disturbed by factors such as noise and thus does not fall into a local optimum.
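The random masking described above can be sketched with inverted dropout, the common formulation used at training time. This is an illustrative NumPy implementation under our own assumptions, not the framework code used in the experiments.

```python
import numpy as np

# Minimal sketch of inverted dropout: each activation survives with
# probability (1 - rate) and the survivors are rescaled so the expected
# activation magnitude is unchanged between training and inference.
def dropout(activations: np.ndarray, rate: float,
            rng: np.random.Generator) -> np.ndarray:
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = rng.random(activations.shape) >= rate  # mask of surviving neurons
    return activations * keep / (1.0 - rate)      # rescale kept activations

rng = np.random.default_rng(0)
x = np.ones((4, 4))
y = dropout(x, rate=0.5, rng=rng)
# Each entry of y is either 0.0 (masked) or 2.0 (kept and rescaled by 1/0.5).
```

Each forward pass draws a fresh mask, so the full network is effectively trained as an ensemble of overlapping subnetworks, which is the "multiple mininetworks" effect described above.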

B. Effectiveness of DPF and CDF
DPF and CDF enable CNNs to obtain more information by combining feature maps from multiple layers of the network. This section mainly discusses the accuracy variation produced by the designed feature fusion patterns, which comprise the DPF direction and the CDF fusion direction. We separately explore the performance differences in the classification results produced by the presence or absence of the two fusion parts in each layer. Additionally, we change the depth of the network layers to explore whether the proposed model remains usable under various abridged structures; this exploration is beneficial for 1) systematically examining the deformability of the model and 2) helping users select the appropriate form according to their operating conditions. We divide MCFUNet-LCZ into four depths; the highest depth is 5N+2, which incorporates a five-layer convolutional network plus global pooling and a dense layer. The depth of the network layers is then reduced in turn to 4N+2, 3N+2, and 2N+2. Whenever one layer is removed, we explicitly explore the performance variation caused by adding or deleting the feature fusion modes. All the results and total parameters are shown in Table III.
The results show that even when the depth of the feature convolution layers is reduced to the lightest 2N, the model still holds classification precision at an overall accuracy of 0.68 and a kappa coefficient of 0.65 with fewer parameters. This performance is outstanding compared with the LCZ classification performance of the other semantic segmentation models and deep learning approaches, indicating that the proposed deep learning architecture has better adaptation and algorithmic stability for segmenting the underlying surfaces that pertain to each urban local climate. Additionally, irrespective of network depth, the variants that include both feature fusion patterns obtain the best accuracy, which again demonstrates the structural stability of the model. The significant improvement in the overall accuracy and kappa coefficient of the MCFUNet-LCZ structure at the fifth-level depth of the feature convolution layers demonstrates that refining the convolution filters to 512 is more effective for extracting local climate zones. Furthermore, based on the previous results with different filter parameters, the range of convolution filters that is most beneficial for the neural network to detect LCZs should lie between 32 and 512, specifically 32, 64, 128, 256, and 512.
Different LCZs exhibit distinct fluctuations in the classification indicators. In essence, the LCZ scheme is a criterion established according to the physical properties of urban texture; it divides the local climate in order to explore the reasons behind urban heat distribution and thermal anomalies. Unlike common land cover classification, which relies on particular object characteristics, distinguishing LCZs sometimes cannot rely on remarkable boundaries because the regions intersect. Accordingly, LCZ classification often blurs the feature extraction in the recognition algorithm.
Concerning the proposed approach, we survey the accuracy for each LCZ, as shown in Fig. 13. The results for LCZ 4, 8, A, D, and G were highly accurate; in these cases, the F1-score remained almost entirely above 0.7, and for water (LCZ G) the F1-score tended toward 1.0. Three LCZs often obtained weak indicator results, with F1-scores lower than 0.4: LCZ 7, C, and E. Such results indicate that the proposed approach possesses poor judgment for lightweight low-rise regions, regions with bush and scrub, and regions with bare rock or paving. In these cases, the objects generally share characteristics, such as being small and staggered, that may require micro-object extraction from the data and algorithm. Hence, we can attempt to improve this situation in the future by adopting high-resolution remote sensing images and strengthening the small-object extraction ability of the classification models. Moreover, we have clarified the performance of possible derivative models on each LCZ; this clarity helps scholars select appropriate methods to classify specific LCZs.
Overall, the advantage of the proposed classification model is that reducing the number of layers or using a different fusion method does not cause a rapid decline in accuracy. This robustness arises because we maintained a stereoscopic concept in designing the model, which makes full use of the feature maps obtained from each layer, describing the output via global pooling instead of relying only on the individual or partial features remaining after all operations. Overall, the proposed model framework will be more conducive to practical applications and to subsequent development by other scholars.

C. Performance Differences of Each GPF Condition
The GPF module utilizes global pooling to link each layer and fuse the feature information of the entire CNN. The GPF also carries out the probabilistic transformation of LCZ categories and the classification in the MCFUNet-LCZ model. Accordingly, this section focuses on the efficacy that the presence or absence of GPF at different network layer depths brings to the CNN. Global max pooling (GMP) and global average pooling (GAP) are the two core operations of GPF, and this section discusses the effectiveness of GPF through these two operations. We do not change the structure of MCFUNet-LCZ; we only replace the global pooling operation with a Flatten operation, whose output is sent to the Dense layer for probability conversion.
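The combination of the two pooling operations can be sketched in a few lines. This is a simplified NumPy illustration of the GMP/GAP fusion idea under our own shape assumptions; the actual MCFUNet-LCZ implementation operates on batched tensors inside a deep learning framework.

```python
import numpy as np

# Sketch of the GPF idea: global max pooling (GMP) and global average
# pooling (GAP) each squeeze an (H, W, C) feature map to a C-vector,
# and the two vectors are concatenated before the dense classification
# layer. The fusion order (GMP first, then GAP) is an assumption.
def global_pool_fusion(feature_map: np.ndarray) -> np.ndarray:
    gmp = feature_map.max(axis=(0, 1))    # (C,) strongest response per channel
    gap = feature_map.mean(axis=(0, 1))   # (C,) average response per channel
    return np.concatenate([gmp, gap])     # (2C,) fused descriptor

# A toy 2x2 feature map with 3 channels.
fmap = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
vec = global_pool_fusion(fmap)
print(vec)  # [ 9.  10.  11.   4.5  5.5  6.5]
```

Replacing this fusion with a Flatten operation, as in the ablation described above, preserves every spatial position instead of summarizing each channel, which greatly enlarges the dense layer's input.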
From the results in Table IV, it can be seen that the accuracy of MCFUNet-LCZ decreases slowly as the network layer depth is reduced. With the smallest network structure of 2N-BP, MCFUNet-LCZ still achieves an OA of 0.68 and a kappa of 0.64, which remains better than most of the baseline CNNs in Fig. 9. In addition, the effect of GMP is significantly stronger than that of GAP as the layers are reduced. The reason for this phenomenon may be that the GAP incorporated in MCFUNet-LCZ is designed to extract texture features from the concatenated matrix composed of encoding-decoding feature maps, which provides additional reference information to correct the object outline description extracted by the GMP. Therefore, performance degradation tends to occur when only GAP is present in the structure. Moreover, in the absence of GPF, MCFUNet-LCZ drops sharply in the accuracy metrics as the layer depth is cut down, which also demonstrates that the collaboration between GMP and GAP is an effective pattern for LCZ classification.

D. Superiority of Multilayer Cascaded Feature Fusion
Continuous cascade transfer (as shown in Fig. 14) is the structure commonly used in LCZ classification, yet the feature maps it produces often lose much of the original feature information through successive layer operations. In this section, we compare the continuous cascade mode with the proposed multilayer cascade mode. From Table V, it can be seen that the multilayer cascade information integrated by the global pooling operations not only improves the prediction accuracy of LCZs but also reduces the FLOPS and the total number of parameters of the whole CNN. Such results indicate that the proposed mode, which jointly votes probabilities from multilayer cascade information in the final LCZ discrimination, is an effective approach for a CNN to identify LCZs.
In particular, the results of the continuous cascade mode show a decrease in both OA and Kappa as the layer depth increases. The reason may be that the loss of valid feature information caused by excessive convolution and pooling operations in the deeper layers leaves the CNN unable to extract the appropriate regions of LCZs. As mentioned above, LCZ boundaries may be blurred because each LCZ is composed of multiple land cover types. In the continuous cascade mode, the classification depends heavily on the probability transformed from the endmost-block feature map. Increasing the layer depth adds operations that compress the features too much to support LCZ segmentation. Unlike the continuous cascade, the multilayer cascade mode determines the final LCZ category from the feature maps of all layers, completing the classification by comprehensively voting on probabilities rather than relying on a single output. Consequently, the deeper the network, the more feature maps participate in the classification, which leads to accuracy improvement.
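The voting idea above can be sketched as averaging per-layer class probabilities. This is an illustrative simplification under our own assumptions (uniform averaging over layer-wise logits); the actual model fuses pooled descriptors before the dense layer rather than literally averaging softmax outputs.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Sketch of the multilayer cascade idea: every layer contributes a class
# score, and the final LCZ is chosen from the averaged probabilities,
# not from the endmost block alone.
def multilayer_vote(layer_logits: list[np.ndarray]) -> int:
    probs = np.mean([softmax(z) for z in layer_logits], axis=0)
    return int(np.argmax(probs))

# Three hypothetical layers produce logits over 4 LCZ classes; two layers
# favor class 0 and one favors class 2, so the joint vote settles on 0.
logits = [np.array([2.0, 0.1, 0.1, 0.1]),
          np.array([1.5, 0.2, 0.1, 0.1]),
          np.array([0.1, 0.2, 1.8, 0.1])]
print(multilayer_vote(logits))  # 0
```

In the continuous cascade mode, by contrast, only the last element of the list would decide the class, so information from the earlier layers is discarded.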

E. Comparison of UNet and MCFUNet-LCZ
In this section, we implement UNet for comparison with the proposed CNN, since the structural design of MCFUNet-LCZ is inspired by UNet. The metric results are shown in Table VI.
Obviously, the OA and Kappa obtained by MCFUNet-LCZ are significantly improved compared with those of UNet. The FLOPS and total number of CNN parameters further demonstrate that MCFUNet-LCZ achieves higher accuracy with lower computing requirements than UNet.
Different from the continuous cascade mode, UNet concatenates the feature maps from the encoding path to the decoding path. The final output is then obtained via probability conversion of the result of the endmost block. In Table VI, the UNet results also exhibit the phenomenon in which accuracy decreases as the layer depth increases. Even though UNet employs a feature extraction mode that concatenates feature maps between the encoding and decoding paths, it still does not exhibit effectiveness toward LCZs in the experiments. This phenomenon indicates that the pattern in which the output depends only on the result of the endmost CNN block may not be suitable for LCZ classification, because excessive layer operations weaken the feature information in the training set, making it difficult for the CNN to fit the LCZ boundaries. In contrast, the proposed structure shows accuracy improvement in the results, which again demonstrates the effectiveness of the multilayer cascade mode for LCZ classification.

F. Sensitivity of Seasons and Cloud Cover
Different atmospheric states and surface-cover phenology can change the model's LCZ classification mapping. Usually, the classification of LCZs changes with seasonally driven changes in phenology. For example, the significance of the spectral features of vegetation in remote sensing images of mid- and high-latitude cities decreases continuously from summer to winter. Consequently, on the basis of the weak vegetation representation alone, the model may confuse the LCZ categories of the underlying surface in winter and classify categories that should contain vegetation into categories without vegetation, such as LCZ 4, 5, 6, and 9 into LCZ 1, 2, 3, 7, and 8. Similarly, the seven categories of the suburban environment also exhibit cases of class shift. Whether clouds in the atmosphere affect LCZ mapping, and which categories are often misclassified in the areas they cover, are questions that need further exploration. To address these problems, this article selects Sentinel-2 remote sensing images (10 m) for each month of 2021 over the urban area of Beijing, China, to explore in detail the differences in LCZ mapping under the above observation conditions; the results are shown in Fig. 15.
According to the LCZ mapping results, the best intra-annual month range for LCZ mapping in the built-up area of Beijing is from May to November, while all other months show varying degrees of omission or confusion of LCZ categories. The vegetation cover decreases significantly with the onset of winter, especially in January, February, March, and December. Generally, the reduction in green vegetation between buildings also affects the determination of building height, so the classification of the ground surface tends more toward low buildings only, bare soil, and rocky ground. This situation is often aggravated in February and March and then recovers with the gradual return of vegetation in spring.
Moreover, the cloud cover appearing in the February, July, and October images of the collected intra-annual time series clearly shows that areas covered by these atmospheric states are more liable to be extracted as large low-rise or heavy industry, and in February even partially as water. For more accurate LCZ mapping, the acquisition period of remote sensing images should be restricted to the summer and autumn months as much as possible, while controlling the percentage of pixels occupied by clouds or fog.

G. Several Issues for LCZ Classification
LCZ classification has made certain progress over many years of development but remains confronted with several issues that deserve to be explored. This section will illustrate some views on these issues.
1) How will deep learning develop for LCZ classification in the future? Deep learning-based classification methods have made great progress in computer vision and digital image processing, but they still encounter interpretability bottlenecks, in LCZ classification as in other fields, which seriously hampers their development and invites questions. Even so, it is undeniable that deep learning has revolutionized many application areas during the past decade. In the LCZ domain, the goal of LCZ mapping is to classify land cover according to canopy parameters, thus offering the possibility of answering questions about urban temperature evolution or thermal anomalies. Within each LCZ class, a mass of nonlinear relationships is often present across diverse distributions of urban building textures; classification under these conditions may be better accomplished via empirical fitting, such as deep learning approaches. Nevertheless, such methods often lack interpretability in terms of physical properties. Currently, many studies have used attention modules to explain the feature contribution vectors processed by deep learning models. Nevertheless, this method may not settle the interpretation problem very well, because different model structures may present diverse adaptabilities across data dimensions. Therefore, we need to analyze what information is valid for LCZ classification from a physical perspective, rather than working backwards from visualizations of the contributions of deep learning classifiers. In reality, deep learning technology is only a classifier, and applying the pattern of "one sample library - one approach - LCZ mapping" to global LCZ classification may produce extraction differences, because factors such as latitude, topography, and season severely affect LCZ mapping.
Consequently, future deep learning-based LCZ methods will be more inclined toward integration with the physical attributes of the ground surface, such as dividing cities by degree according to their topography or latitude to build a comprehensive sample database system following the pattern of "multilevel sample library - multimodel integration strategy - appropriate LCZ mapping". Moreover, analyzing the characteristics of different data types to build more reasonable deep learning-based LCZ mapping approaches may be one of the future trends.
2) What is the significance of LCZ classification? The taxonomy of local climate zones provides only parametric criteria to describe the type and texture of the substratum corresponding to urban microclimates; these criteria help scholars focus on exploring geothermal differences and comparing related methods. Nonetheless, it should be recognized that LCZ types change constantly with geographic location, latitude and longitude, and season, making it difficult to compare each climate zone across cities. On this basis, the typical significance of LCZ classification should be to 1) explore the incremental relationship between land surface cover and urban temperature; 2) find the correlated effect of the temperature representation of one LCZ on surrounding climate zones under controlled spatial and temporal variables; and 3) further analyze other factors that influence the ground temperature over certain underlying land surfaces; for instance, if the temperature of one LCZ type at several locations shows diverse heat differences, we could further probe incremental factors such as air conditioning and CO2 emissions.
3) What deep learning structure may be more suitable for LCZ classification? An efficient network structure provides a solid foundation for accurate LCZ classification. Comprehensively incorporating feature maps from multiple neural network layers is an effective way to improve the accuracy of LCZ classification, because the collaboration of multilayer cascaded features helps retain the background information that is crucial for judging LCZ categories.

4) How can the impact of seasonal changes on LCZ classification be balanced? The LCZ categories may change continuously with the seasons at different latitudes, which may result in intra- and interannual instability of LCZs. To address this issue, whether the original LCZ categories are maintained across seasons needs to be determined based on the purposes, objects, and time scales of the study. For example, if the objective is to investigate the influence of LCZs on urban heat islands during a particular season, the analysis should be based on the mapping results for the corresponding seasonal period. Similarly, if the task is to anchor the influence of LCZ variation on annual urban geothermal temperature, the study should retain a consistent LCZ map throughout the year.

VI. CONCLUSION
Achieving accurate LCZ classification is a vital step in moving beyond the conventional "urban-suburban" description used in urban heat island studies and in facilitating the scientific and rational optimization of urban patterns. Deep learning-based techniques have already become an efficient means of adaptive LCZ mapping, avoiding much manual work and subjective bias. Even so, the classification accuracy of such approaches imposes serious constraints on local climate mapping because a deficient cascade feature process may not fit the unclear boundaries between LCZs. Accordingly, the purpose of our work is to propose a novel network framework for LCZ classification that simultaneously attends to stability, tunability, and robustness. The proposed method maintains stable accuracy regardless of the derived submodels, which partially reduce the number of components to save computing costs. Moreover, incorporating more auxiliary information, such as road network data, point-of-interest data, and nighttime light data, is perhaps also an effective way to strengthen the discernability of deep learning-based classifiers for LCZs; this is one of the directions our future work will take. The results of this article provide an efficient deep learning structure for LCZ mapping, which may also assist other scholars or beginners.