Multi-level Feature Fusion-based CNN for Local Climate Zone Classification from Sentinel-2 Images: Benchmark Results on the So2Sat LCZ42 Dataset

As a unique classification scheme for urban forms and functions, the local climate zone (LCZ) system provides essential general information for any studies related to urban environments, especially on a large scale. Remote sensing data-based classification approaches are the key to large-scale mapping and monitoring of LCZs. The potential of deep learning-based approaches is not yet fully explored, even though advanced convolutional neural networks (CNNs) continue to push the frontiers for various computer vision tasks. One reason is that published studies are based on different datasets, usually at a regional scale, which makes it impossible to fairly and consistently compare the potential of different CNNs for real-world scenarios. This study is based on the big So2Sat LCZ42 benchmark dataset dedicated to LCZ classification. Using this dataset, we studied a range of CNNs of varying sizes. In addition, we proposed a CNN to classify LCZs from Sentinel-2 images, Sen2LCZ-Net. Using this base network, we propose fusing multi-level features using the extended Sen2LCZ-Net-MF. With this proposed simple network architecture and the highly competitive benchmark dataset, we obtain results that are better than those obtained by the state-of-the-art CNNs, while requiring less computation with fewer layers and parameters. Large-scale LCZ classification examples of completely unseen areas are presented, demonstrating the potential of our proposed Sen2LCZ-Net-MF as well as the So2Sat LCZ42 dataset. We also intensively investigated the influence of network depth and width and the effectiveness of the design choices made for Sen2LCZ-Net-MF. Our work will provide important baselines for future CNN-based algorithm developments for both LCZ classification and other urban land cover land use classification.


I. INTRODUCTION
The Local Climate Zone (LCZ) scheme is a classification system that provides a standardization framework for the characteristics of urban forms and functions. Illustrations of the LCZ classes and corresponding remote sensing image patches are shown in Fig. 1. Originally proposed for urban heat island (UHI) research, this scheme has shown an increasing impact on various climatological studies, such as the cooling effect of green infrastructure and micro-climatic effects on town peripheries [1], [2], [3], [4], [5], [6], [7]. Furthermore, the LCZ scheme can also be used to describe the internal structure of urban areas, providing significant information for various applications such as infrastructure planning and population assessment [8], [9].
C. Qiu and M. Schmitt are with Signal Processing in Earth Observation (SiPEO), Technical University of Munich (TUM), Germany. X. Tong is with Information Engineering University, Zhengzhou, China. Benjamin Bechtel is with the Institute of Geography, Ruhr-University Bochum, Germany. X. Zhu is with the Remote Sensing Technology Institute (IMF), German Aerospace Center (DLR), as well as TUM-SiPEO, Germany. (Correspondence: Xiao Xiang Zhu; E-mail: xiaoxiang.zhu@dlr.de)
An important part of the existing development of LCZ classification is community-based global LCZ mapping using openly available Landsat data and software [10], [11]. An example is the World Urban Database and Access Portal Tools (WUDAPT) [12], a community-driven initiative organized by researchers to produce high-quality LCZ maps worldwide. Within WUDAPT, almost 100 cities located across the globe have currently been mapped at moderate quality, providing sufficient details for certain model applications [13]. LCZ maps of tens of cities, after quality assessment, are now openly available in the WUDAPT portal. More recently, an LCZ map of Europe was published [14].
The key to efficient large-scale LCZ classification is developing advanced machine learning models with high generalization ability [14], [15]. In this regard, tailoring deep learning-based approaches to the peculiarities of remote sensing data is one important strategy that has gained much attention recently [16], [17], [18], [19], [20], [21], [22]. A review of these published studies shows that deep learning, specifically in the form of convolutional neural networks (CNNs), is indeed able to enhance LCZ classification accuracy compared to random forest approaches, given a proper dataset, due to its powerful feature representation capacity [23]. Specifically, some LCZs, such as open built-up areas and scattered trees, can benefit from learned features that incorporate larger neighborhood information, as found in [20].
However, while providing general meaningful insights into the methodology, most of the existing studies are carried out and assessed separately for individual case study scenes. This makes it difficult or impossible to compare different network architectures and fairly evaluate their potential for subsequent applications. More importantly, it hinders the development of more advanced approaches due to limited standard baselines. This dilemma is rooted in the fact that there exist only a few open datasets that are dedicated to large-scale LCZ classification. Taking the rapidly developing field of classification in computer vision as an example, based on benchmark datasets such as ImageNet [24] and CIFAR [25], research in CNNs has proliferated toward enhanced performance, simpler design, and higher efficiency, with new progress being achieved every year. Exemplary architectures include VGGNet [26], residual neural networks (ResNet) [27], densely connected convolutional networks (DenseNet) [28], Inception [29], and neural architecture search net (NASNet) [30]. Aiming at improved performance, researchers have put significant effort into investigating the impact of pushing the depth [31], width [32], and cardinality [33] of networks. In addition to higher accuracy, strategies have also been developed in other dimensions, such as model simplicity and efficiency (e.g., efficient use of model parameters and the trade-off between accuracy and latency) [34], [35], model scaling (with respect to depth/width/resolution) [36], and training strategy (e.g., hyper-parameter tuning and model parallelism) [37]. In particular, a great deal of recent effort has been devoted to design efficiency, i.e., improving accuracy without hitting the hardware memory limit, e.g., by the increasingly popular neural architecture search (NAS) approach [38]. However, it is still an open research question whether those conclusions also hold for tasks in remote sensing, the confirmation of which requires a sequence of studies on well-designed benchmark datasets.
Unfortunately, only a few benchmark datasets exist for satellite remote sensing, especially when it comes to medium-resolution images [39], [40], [41]. All three existing datasets employ Sentinel-2 images and focus on applications such as land cover and land use classification. One dataset [40] also includes Sentinel-1 images, providing the potential for multi-sensor fusion.
In the case of the LCZ classification scheme, the main reason there are so few standard datasets is that it is costly, time-consuming, and challenging to collect ground truth for classification schemes with so many difficult-to-distinguish classes. This problem is partly solved by the recently published, openly available So2Sat LCZ42 dataset, which contains both image patches and LCZ labels from 42 cities distributed across the world. Focusing on the specific task of LCZ classification under a specific setup with this dataset, a simple CNN architecture is proposed in this study. The influence of its depth, width, and pooling layers is extensively investigated. The best results are presented along with a wide range of baseline models of varying sizes, enabling an understanding of the correlation between model size and accuracy, and supporting further development toward higher classification accuracy.
The remainder of this paper proceeds as follows: Section II elaborates on the proposed CNN architecture, Sen2LCZ-Net, and the strategy of multi-level feature fusion. Section III describes the So2Sat LCZ42 dataset, the baseline CNNs to be compared, and the experimental setup. Section IV evaluates the LCZ classification accuracy and visualizes the classified LCZ results for several sample test scenes at both a city and province scale. Section V extensively investigates the influence and effect of the design choices of Sen2LCZ-Net-MF, and discusses multiple feasible approaches for further improvement based on this study. Finally, Section VI summarizes and concludes the work.

II. LCZ CLASSIFICATION VIA CNNS
A. An adapted CNN architecture and multi-level feature fusion
Our priorities are model simplicity and the use of fewer parameters, which bring the following two advantages. First, a simple, small model is more feasible for up-scaling LCZ classification, since in big-data scenarios a huge amount of data generally needs to be processed at a reasonable cost. In addition, a small network with fewer trainable parameters is encouraged to learn better and more discriminative features [42], [43], thus decreasing the chance of overfitting and enabling high generalization ability for LCZ classification on a large scale.
Following the philosophy used by state-of-the-art models, especially those aiming at simplicity, our design follows the template described in [44], [45]:
• Convolutional layers use filters of a fixed small size (3 × 3) for an efficient use of parameters;
• Feature maps are down-sampled to half the input resolution by using pooling layers, and the number of computed feature maps is doubled to 2f after each downsampling operation to enable hierarchical representation learning;
• Homogeneous layers are grouped into blocks so that the network topology can be easily managed.
The proposed simple network architecture, Sen2LCZ-Net-MF, which maps LCZs from Sentinel-2 images by taking multi-level features into account, is illustrated in Fig. 2, where detailed information (layer names and sizes of feature maps) about the input, the intermediate learned features, and the output is also shown. Sen2LCZ-Net-MF consists of a simple end-to-end CNN, Sen2LCZ-Net, and connections to fuse multi-level features. In Sen2LCZ-Net, there are four sequential blocks that extract features via convolutional layers from the input patch or the output of the previous block; they then abstract the learned features via average and maximum pooling layers, providing input for the subsequent block. The use of both average and maximum pooling layers within Sen2LCZ-Net, hereinafter referred to as the "double-pooling layer," ensures that more of the learned features or information within the input data passes through the network for learning in a later stage. This is especially important when the input patch is small or has a coarse resolution, e.g., a 32 × 32 Sentinel-2 patch. At the end of the last block, a global average pooling is performed, followed by a softmax classifier for the final prediction. No additional fully connected layers are used, in order to maintain a small model size.
The final prediction is then used for loss calculation and optimization, along with the reference label input as ground truth. When fusion of multi-level features is considered, four predictions are made independently from the outputs of the four blocks, and their sum is used for loss calculation and optimization, as illustrated by the blue lines in Fig. 2. Similar to the strategy of using the double-pooling layer, this multi-level fusion design is intended to better exploit the information in input patches without introducing many more parameters. Low-level features from an early stage of the network can be valuable for distinguishing LCZs such as sparsely built areas, while this information is not guaranteed to be available in the final learned, or high-level, features. It is worth mentioning that this fusion idea can be implemented together with any state-of-the-art CNN, including VGG and ResNet. Furthermore, some state-of-the-art design ideas, such as attention mechanisms and skip connections, can be further integrated into Sen2LCZ-Net-MF; however, for the sake of simplicity, this is not recommended before its potential has been investigated.
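The fusion step described above can be illustrated with a small, framework-free sketch. It assumes one possible reading of the text: each of the four blocks yields a softmax prediction, the four predictions are summed, and the sum is renormalized before a cross-entropy loss is computed. The renormalization, the function names, and the toy logits are assumptions made for illustration and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fused_prediction(level_logits):
    """Sum the per-level softmax outputs and renormalize (assumed step)."""
    probs = [softmax(logits) for logits in level_logits]
    n_classes = len(probs[0])
    summed = [sum(p[c] for p in probs) for c in range(n_classes)]
    total = sum(summed)
    return [s / total for s in summed]

def cross_entropy(pred, label):
    """Cross-entropy loss for a single sample with an integer label."""
    return -math.log(pred[label])

# Toy example: four levels (one per block), three classes.
level_logits = [
    [2.0, 0.5, 0.1],  # block 1 (low-level features)
    [1.5, 1.0, 0.2],  # block 2
    [0.3, 0.2, 0.1],  # block 3
    [3.0, 0.1, 0.1],  # block 4 (high-level features)
]
fused = fused_prediction(level_logits)
loss = cross_entropy(fused, label=0)
```

In a deep learning framework, the same idea would typically be expressed as four classification heads whose outputs are added before the loss, so that gradients reach the early blocks directly.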
Implementation Details. Filter weights are initialized using the algorithm proposed by [46]. The kernel sizes of all convolutional layers are 3 × 3, and during convolutions each side of the inputs is zero-padded by one pixel to keep the feature maps at a fixed size. The number of output filters for the first convolutional layer, f, is set as 16, and the number of convolutional layers in each of the four blocks, N, is set as 4; the experiments investigating the influence of network depth and width in Section V vary the value of N. Changing the values of f and N results in different topologies of Sen2LCZ-Net. The depth, D, and width, W, of Sen2LCZ-Net-MF can be adjusted with N and f, respectively. Specifically, the depth is D = 4N + 1, and the width W depends on the filter number of the first block, f, which is doubled for each subsequent block. The pooling layers use a kernel size of 2 × 2 with a stride of 2, decreasing the size of the feature maps by half. As a result, the sizes of the learned feature maps from the four blocks are h × w, h/2 × w/2, h/4 × w/4, and h/8 × w/8. Specifically, on the So2Sat LCZ42 dataset, the sizes are 32 × 32, 16 × 16, 8 × 8, and 4 × 4. To avoid overfitting during training, we add a dropout layer [47] at the end of the second and third blocks and set the dropout rate as 0.2.
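The relations between f, N, depth, width, and feature-map size stated above can be checked with a few lines of arithmetic. The sketch below only encodes D = 4N + 1, the doubling of filters per block, and the halving of the spatial size per block; the function name `sen2lcz_topology` is a hypothetical helper introduced for illustration.

```python
def sen2lcz_topology(f=16, n=4, h=32, w=32, blocks=4):
    """Return depth, per-block filter counts, and per-block feature-map sizes."""
    depth = blocks * n + 1                      # D = 4N + 1 for four blocks
    filters = [f * 2 ** b for b in range(blocks)]   # f is doubled per block
    sizes = [(h // 2 ** b, w // 2 ** b) for b in range(blocks)]  # halved per block
    return depth, filters, sizes

depth, filters, sizes = sen2lcz_topology(f=16, n=4, h=32, w=32)
print(depth)    # 17
print(filters)  # [16, 32, 64, 128]
print(sizes)    # [(32, 32), (16, 16), (8, 8), (4, 4)]
```

With f = 16 and N = 4 this reproduces the f16D17 configuration and the 32 × 32 to 4 × 4 feature-map sizes quoted for the So2Sat LCZ42 patches.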

B. Baseline CNNs
To provide comparisons among baseline results, the following standard CNNs and modules were studied for LCZ classification. The baselines were selected to have a varying number of layers and parameters, to represent a wide range of cases.
• VGG. VGG is composed of convolutional layers with a fixed small size of 3 × 3, max-pooling layers with a size of 2 × 2 and a stride of 2, and three fully connected layers at the end [26]. When it was proposed, the authors showed that increasing the depth of the network is able to improve classification accuracy significantly (e.g., when pushing the depth from 16 to 19). Inspired by an earlier work [48], it used only 3 × 3 convolution filters to [...]
• DenseNet. [...] further connects the layers within a network [28]. In DenseNet, each layer takes all preceding feature maps as input, and the feature maps of each layer are used as inputs into all subsequent layers. In this way, the vanishing-gradient problem is alleviated, feature propagation is strengthened, and feature maps are better reused. The DenseNet used in our experiments has three dense blocks with an equal number of layers (7). The sizes of the feature maps from each block are 32 × 32, 16 × 16, and 8 × 8. Following [28], we also use a 1 × 1 convolution followed by 2 × 2 average pooling as transition layers between two contiguous dense blocks. The growth rate hyper-parameter is set as 12.
• Xception. The Xception network is an extreme version of the Inception models, which assumes that cross-channel and spatial correlations can be mapped completely separately. It is built by stacking depthwise separable filters (a replacement of the Inception modules in Inception models) with residual connections. This results in an efficient use of model parameters, as shown in [35]. We used Xception as implemented in Keras, with no adaptation. The final size of the feature map for prediction is 1 × 2048.
• Attention mechanism and Convolutional Block Attention Module (CBAM). Attention modules in the context of deep learning, originally popularized in the field of machine translation [51], are capable of boosting the representation power of CNNs by integrating global contextual dependencies in both the spatial and the spectral dimension. They work in a way similar to adaptive feature refinement and feature selection or recalibration in one [52] or more [53] dimensions of the feature maps. The implementation of the attention mechanism used in this study, CBAM, consists of a channel attention module and a spatial attention module in a sequential arrangement, with channel attention first [49]. CBAM was used with ResNet-20 and ResNeXt in our experiments by appending one CBAM to each convolutional layer.
In addition to the aforementioned baselines for comparison, we also investigated the use of CBAM and skip connections within the proposed Sen2LCZ-Net-MF. When using CBAM, it is appended to each convolutional layer without further adjustments. When skip connections are adopted, each block is treated independently with one shortcut connection. Full preactivation is used for the skip connection in each block, i.e., output feature maps of the first convolutional layer are added to outputs of the last convolutional layer [31].

C. LCZ classification procedure
After being trained, CNNs can be used for LCZ classification through a sliding window approach, as illustrated in Fig. 3. For each pixel (location) to be predicted, one image patch is extracted from the whole image based on the corresponding geo-location, with the predefined size that was used during model training. Feeding the patch into the trained CNN outputs a label corresponding to one of the 17 LCZs. The predicted label for this patch is assigned to this location before continuing to process the subsequent pixel (location). The ground sampling distance (GSD) of the final predictions can be controlled via the step of the sliding window, which is often 100 m in LCZ-related studies.
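The sliding window procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the scene is a single-band 2-D list for brevity (the real input has multiple Sentinel-2 bands), a step of 10 pixels stands for 100 m at a 10 m GSD, and `dummy_predict` is a hypothetical stand-in for a trained CNN, not the authors' model.

```python
def classify_scene(image, predict_patch, patch=32, step=10):
    """Slide a patch-sized window over the image and label each window center."""
    rows, cols = len(image), len(image[0])
    half = patch // 2
    label_map = []
    for r in range(half, rows - half + 1, step):
        row_labels = []
        for c in range(half, cols - half + 1, step):
            # Extract the patch centered on (r, c).
            window = [line[c - half:c + half] for line in image[r - half:r + half]]
            row_labels.append(predict_patch(window))
        label_map.append(row_labels)
    return label_map

def dummy_predict(window):
    """Hypothetical classifier: thresholds the mean pixel value of the patch."""
    mean = sum(sum(line) for line in window) / (len(window) * len(window[0]))
    return 1 if mean > 0.5 else 0

# Toy scene of 52 x 62 pixels: dark on the left, bright on the right.
scene = [[1.0 if c >= 31 else 0.0 for c in range(62)] for _ in range(52)]
lcz_map = classify_scene(scene, dummy_predict)
```

The output grid is coarser than the input image by the factor `step`, which is how the step of the sliding window controls the GSD of the resulting LCZ map.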

III. EXPERIMENTAL SETUP
In this section, we describe the reference and image data and how they were processed, the baseline models, the experimental setup, and accuracy assessment.

A. So2Sat LCZ42 dataset
The So2Sat LCZ42 dataset consists of LCZ labels of 400,673 Sentinel-1 and Sentinel-2 image patches (with a size of 32 × 32) in 42 urban agglomerations (plus 10 additional smaller areas) across the world. It was labeled by a group of domain experts in remote sensing, following a carefully designed labeling workflow similar to that in WUDAPT [13]. Afterwards, rigorous quality assessment was conducted with independent label voting by domain experts who had not labeled the areas in the labeling stage. The overall confidence of the labeling is 85%. Further details about the So2Sat LCZ42 dataset can be found in [54].
In this study, only Sentinel-2 data was used. For the large-scale LCZ classification examples, the satellite imagery used was downloaded from Google Earth Engine after cloud removal processing, as described in [55]. Ten of the Sentinel-2 bands were used in this study: specifically, the channels with a GSD of 10 m and 20 m. In order to create composites with a consistent image size, the second group of bands was upsampled to a GSD of 10 m using cubic resampling. To summarize, the Sentinel-2 data used in this study contains 10 real-valued bands, as listed in Table I.
The whole So2Sat LCZ42 dataset is split into training, validation, and test sets, all of which are spatially separated. Specifically, the training set consists of all the patches of 32 cities and the 10 add-on areas (see [54] for the full list of cities). The remaining 10 cities, distributed across different continents over the world, are: Guangzhou, Jakarta, Moscow, Mumbai, Munich, Nairobi, San Francisco, Santiago de Chile, Sydney, and Tehran. For each of them, we split the labels

B. Experimental settings and metrics for accuracy assessment
Training. For all the CNNs studied in our experiments, the input images and their corresponding reference labels are used to train the network with the Nesterov Adam optimizer implementation of Keras [56]. All CNNs were trained from scratch following the same experimental settings in order to make meaningful comparisons. We used a mini-batch size of 32 patches. The initial learning rate is 2 × 10^−2 and is decreased by half after every fifth epoch. To control the training time and avoid overfitting, early stopping was used; the monitored metric is the validation loss with a patience of 40 epochs, which means that the training stops if the validation loss does not decrease for 40 epochs. After the training, we report the test accuracy from the saved weights with the highest validation accuracy.
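The learning rate schedule and the early stopping rule can be written down as two small functions. This is a sketch, assuming the initial learning rate reads as 2e-2, that epochs are counted from zero, and that "no decrease" means no strict improvement over the best validation loss so far; the function names are hypothetical and the code is independent of Keras.

```python
def learning_rate(epoch, initial=2e-2, drop=0.5, every=5):
    """Step decay: halve the learning rate after every fifth epoch (0-indexed)."""
    return initial * drop ** (epoch // every)

def early_stopping_epoch(val_losses, patience=40):
    """Return the epoch at which training stops under the patience rule."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0   # improvement: reset the counter
        else:
            wait += 1              # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses) - 1     # patience never exhausted

# Toy validation curve that stops improving after the second epoch.
stop = early_stopping_epoch([1.0, 0.9, 0.95, 0.95, 0.95], patience=3)
```

In Keras, the same behavior would usually be configured through a learning rate scheduler callback and an early stopping callback rather than hand-written loops.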
Metrics. Metrics used for performance assessment include overall accuracy (OA), Kappa, and average accuracy (AA), which is chosen considering the unbalanced number of samples of different LCZs [57]. Additionally, we used weighted accuracy (WA), in which different weights are given to different types of mistakes on the basis of a systematic analysis of the consequent climate impact of those misclassifications, considering such properties as openness, height, cover, and thermal inertia [58]. As in [20], overall accuracies for LCZ types in built-up areas (i.e., LCZ 1-10; OA_b) and LCZ types in non-built-up areas (i.e., LCZ A-G; OA_nb) are also used as auxiliary metrics.
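For reference, OA, AA, and Kappa can all be derived from a confusion matrix using their standard definitions, as in the sketch below. It assumes that rows hold the reference classes and columns the predicted classes; the toy matrix is made up for illustration. WA is omitted because it depends on the weight matrix defined in [58], which is not reproduced here.

```python
def accuracy_metrics(cm):
    """Compute OA, AA, and Cohen's Kappa from a square confusion matrix.

    cm[i][j] is the number of samples of reference class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    oa = sum(cm[i][i] for i in range(n)) / total
    row_sums = [sum(cm[i]) for i in range(n)]
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    # AA: mean of the per-class accuracies over classes that have samples.
    recalls = [cm[i][i] / row_sums[i] for i in range(n) if row_sums[i] > 0]
    aa = sum(recalls) / len(recalls)
    # Kappa: agreement corrected for the agreement expected by chance.
    pe = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

oa, aa, kappa = accuracy_metrics([[8, 2], [1, 9]])
```

Restricting the matrix to the built-up or non-built-up classes before calling the function gives the corresponding auxiliary overall accuracies.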

IV. EXPERIMENTAL RESULTS
Results of the experimental assessment of the proposed Sen2LCZ-Net-MF are given in this section. First, sensitivity analyses are carried out by comparing the performance of Sen2LCZ-Net-MF with different configurations, in order to back up the design choice and also to search for the best

A. Influence of network depth and width
It is well known that the depth and width of a CNN affect its performance. Specifically, bigger models tend to achieve higher accuracy, but the accuracy gain quickly saturates [36]. There are also contradictory observations where a deeper CNN does not achieve as good an accuracy as a shallower counterpart with the same number of parameters [43]. To study the influence of the depth and width of Sen2LCZ-Net on the So2Sat LCZ42 dataset, we carried out a series of experiments following the same setup. The results are presented in Table II, where we also list the number of trainable parameters for each Sen2LCZ-Net.
From Table II, we can observe the following phenomena:
• When the number of feature maps in the first block, f, is set to 16, better classification results can be achieved as the network depth increases from 5 to 9, from 9 to 13, and from 13 to 17. The improvement from 13 to 17 is smaller than that from 5 to 9 and from 9 to 13. No further improvement is observed when the depth continues to increase from 17 to 21.
• When f is 32, unexpectedly, for the same depth (5, 9, and 17), a wider Sen2LCZ-Net does not provide obvious benefit, even though many more parameters are used. One explanation is that the bigger CNNs with more parameters tend to overfit on the training data, resulting in low test accuracy.
• The correlation between model performance and size on the So2Sat LCZ42 dataset, i.e., larger models tend to demonstrate better performance up to a certain threshold, is consistent with the literature [36].
• With a similar number of parameters, e.g., f16D17 and f32D5, a deeper network provides better performance, which is probably due to the oversaturation of the parameters in the shallow network (a phenomenon called processing level saturation) [43].
• A deeper network with fewer parameters (e.g., f16D13 and f16D9) can perform better than its shallower counterparts (e.g., f32D5), probably due to its developed composition of more general simple functions.

B. Effectiveness of multi-level feature fusion
To demonstrate the effectiveness of multi-level feature fusion for LCZ classification, we carried out 12 experiments with six CNNs with and without multi-level feature fusion. The resulting model performance is shown in Table III. Improvement from multi-level fusion can be consistently observed for all six CNNs of varying size. This improvement most likely results from the representation ability of the additionally employed features, since only a few parameters are additionally introduced, i.e., those from the additional three dense and softmax layers, as shown in Fig. 2. Another related explanation is that the additionally utilized feature maps from the early layers are larger and provide valuable information, enabling an efficient use of the training samples, as analyzed in [42].

C. Effectiveness of the double-pooling layer
It is known that (max-)pooling plays an important role in reducing the sensitivity of learned features to shift and distortions and that it also enables translation invariance [59]. Also, it makes possible a pyramid-shaped model and a larger receptive field in a later stage of the network. However, it hinders the maximum utilization of information, which is crucial, especially for a dataset of insufficient size. Therefore, both average pooling and maximum pooling layers are used for downsampling within Sen2LCZ-Net. To show the effectiveness of this choice, we compared the results to those from models using only maximum pooling layers for six configurations. The comparisons are presented in Table IV, where it can be seen that it is beneficial to use average pooling in addition to maximum pooling layers. This is probably because more features and information can be exploited in a later stage of the network by simply adding average pooling. This is important for the medium-resolution Sentinel-2 images used here, because each pixel value represents a rather large ground area and might be crucial in distinguishing certain LCZs that often depend on neighborhood morphologies. When only maximum pooling layers are used, certain information can be lost during the feature extraction and abstraction process, and cannot be recovered.
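A toy version of the double-pooling layer helps to show what is kept by each branch. The sketch below applies 2 × 2 pooling with a stride of 2 twice, once with the maximum and once with the average, and then concatenates the results along the channel axis. The concatenation is an assumption made for illustration, since the text only states that both pooling types are used; the function names are hypothetical.

```python
def pool2x2(feature_map, reduce):
    """Apply 2 x 2 pooling with stride 2 to a 2-D list using `reduce`."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            reduce([
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]

def mean(values):
    return sum(values) / len(values)

def double_pooling(channels):
    """Return max-pooled maps followed by average-pooled maps (assumed concat)."""
    max_maps = [pool2x2(ch, max) for ch in channels]
    avg_maps = [pool2x2(ch, mean) for ch in channels]
    return max_maps + avg_maps

fmap = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = double_pooling([fmap])
print(pooled[0])  # [[6, 8], [14, 16]]
print(pooled[1])  # [[3.5, 5.5], [11.5, 13.5]]
```

The maximum branch keeps only the strongest response in each window, while the average branch retains a summary of all four values, which is the extra information the section argues is useful for small, medium-resolution patches.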

D. Impact of LCZ imbalance in the training dataset
Samples in the training dataset are unbalanced with respect to each LCZ, as can be seen in Fig. 4. To understand the effect of this imbalance on classification performance, we carried out eight experiments with four CNNs and compared the results with and without considering the class imbalance problem. Class weights were used when considering the class imbalance problem, and the weights were calculated based on the sample frequency of each LCZ. Specifically, the weight of each LCZ was calculated as the inverse of its sample fraction in the whole training set. The resulting differences in classification accuracy can be seen in Table V. Counterintuitively, we do not observe obvious benefits from using class weights. One reason might be that the imbalance problem in the So2Sat LCZ42 dataset is not serious, so that no weighting strategy is needed, as the imbalance problem was addressed to some extent during the data preparation process [54]. Another reason is that the class weights introduced during training make it difficult to learn generalized features for the major classes, leading to overall worse results.
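The weighting rule described above, the inverse of each class's sample fraction, reduces to the total number of samples divided by the class count. A minimal sketch with made-up labels follows; the function name is hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight of each class = 1 / (class count / total) = total / class count."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / n for cls, n in counts.items()}

# Toy label set: class "A" is six times more frequent than class "C".
labels = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
weights = inverse_frequency_weights(labels)
```

A dictionary of this form, keyed by integer class index, is the kind of input that training routines commonly accept as class weights, so rare LCZs contribute more to the loss per sample.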

E. Comparison among state-of-the-art CNNs for LCZ classification
Table VI presents classification results from Sen2LCZ-Net-MF with selected configurations based on the above analyses, as well as several baseline CNNs, as described in Section II-B. The proposed Sen2LCZ-Net-MF(f16D17), corresponding to a configuration of f = 16 and D = 4N + 1, N = 4, as described in Section II-A and Fig. 2, provides the best results for all metrics except OA_nb, with Kappa and AA being 0.664 and 0.587, respectively. A smaller version of Sen2LCZ-Net-MF(f9D17) also provides top results for all metrics, with Kappa and AA being 0.644 and 0.562, respectively. CNNs providing comparable results include ResNet-11, DenseNet, and CBAM-based ResNet-20. Considering that the top results of OA_b and OA_nb are all close and high, it can be concluded that the best results are from the proposed Sen2LCZ-Net-MF. The bigger models, such as ResNet-50 and Xception, provide worse results. One reason is that they are probably not well adapted to this task and dataset. Specifically, the information loss in the first layers of these CNNs, which is due to maximum pooling layers and the larger-than-1 stride of convolutional layers, might be harmful for feature representations. Further exploiting CBAM or skip connections provides little improvement, as presented in the last two rows in Table VI. This is possibly due to the sufficient exploitation of the input data by Sen2LCZ-Net-MF and the challenges of distinguishing different LCZs.
It should be mentioned that the comparison is mainly to provide a reference for a preliminary interpretation of the relation between the models and their performance for the specific task of LCZ classification. It becomes clear that optimal performance for LCZ classification is not guaranteed when relying on the models proposed for datasets in computer vision, such as ImageNet.
The confusion matrix resulting from Sen2LCZ-Net-MF(f16D17) is presented in Fig. 5. It can be seen that LCZs with a low producer's accuracy (lower than 50%) include LCZ 5, 7, 10, B, C, and E. The main misclassifications (higher than 30%) are between LCZ 7 and 3, LCZ 9 and 6, LCZ 10 and 8, LCZ C and D, and LCZ C and F, which are all comparably similar LCZ types. The LCZs with a high producer's accuracy (higher than 80%) are LCZ 4, 8, A, D, and G.
The seemingly disappointing classification results are due to the challenging setup in our experiments, where the test samples are from completely unseen areas that are spatially disjoint from the data in the training set. We expect that multiple approaches can be effectively employed for further improvement, which will be discussed in Section V.

F. LCZ classification examples on a large scale
As city-scale LCZ classification examples, we present results from Sen2LCZ-Net-MF(f16D17) for Munich (Germany, Europe) and Nairobi (Kenya, Africa) in Figs. 6 and 7, respectively. As an example of province-scale LCZ classification, we present the result for Henan province in China in Fig. 8. A zoomed-in subregion of a rural area is visualized in Fig. 9. For comparative interpretation, we also present high-resolution image data and references from existing products, i.e., Global Urban Footprint (GUF) [60], [61], Global Human Settlement Layer (GHSL) [62], and finer resolution observation and monitoring of global land cover with 10 m GSD (FROM-GLC10) [63].

V. DISCUSSION
The design choice of Sen2LCZ-Net (the suitable depth and width), the effectiveness of multi-level feature fusion and the double-pooling layer, as well as the effect of class imbalance have been shown by the extensive experimental results in Section IV. While having fewer parameters and not relying [...] VI) provide comparable results, they come with unnecessary overhead and high computation complexity. Simply relying on state-of-the-art CNNs from the computer vision field might be practically fast and effective for certain tasks on a small scale, but it is harmful not only for large-scale or even global mapping but also for comprehensive understanding and development of methodologies. As explained in Section I, this is not possible without benchmark datasets, which is addressed by relying on the open So2Sat LCZ42 dataset in this study. The focus of this study is to provide effective baselines on the open So2Sat LCZ42 dataset. While promising results have been achieved even in completely unseen areas, demonstrating the potential for further applications, as shown in Fig. 9, our employed CNN, Sen2LCZ-Net-MF, is comparably simple, and no sophisticated hyper-parameter tuning is applied to the training process either. As a result, we expect further improvement from a range of feasible approaches based on the benchmark results achieved in this study. We categorize these possibilities in the following three directions.
• Data-driven approach. It has been shown that the performance of a CNN increases with the amount of data available [26]. Therefore, a simple way to enhance the benchmark results is to extend the So2Sat LCZ42 dataset by including more data. A straightforward approach is to resort to data augmentation, such as the multi-scale and horizontal flip approach, and test-time augmentation, which has been used in state-of-the-art research [26]. One step further, we can include more data from even more diverse areas. In addition to the amount of data, the quality of samples can also be improved to enable more accurate results. For instance, a more balanced distribution among all classes can help train a more robust model. Another example is to introduce more hard samples such as LCZ 7 (lightweight low-rise) and LCZ 9 (sparsely built). In this way, the learned features for hard examples can be more representative and the accuracy for difficult classes can be improved, resulting in a higher overall accuracy [64]. Apart from the dataset used to train the network, in the prediction phase, multi-source multi-temporal data fusion is also a straightforward and effective approach to further improve the obtained benchmark LCZ classification results. The effectiveness of data fusion has been shown in [18], [65], and [21]. Specifically, a final robust result can be achieved via a decision-level fusion of multiple predictions that are obtained from multi-source data, such as SAR and hyperspectral images, with the same or different classifiers [66], [67], [68], [69], [70]. It should be mentioned that an ensemble does not have to increase the cost of computation time (training a CNN multiple times) [71]. Instead, the spirit of the ensemble can be realized at different levels, for instance, by multi-column and multi-branch architectures, where the training only needs to be carried out once [35], [72], [73]. Even with a single-branch architecture, an ensemble can be performed with a special training strategy that passes a range of local minima during the training process [74]. More interestingly, advanced algorithms in transfer learning, active learning, and meta learning can also be adapted for the LCZ classification task on the So2Sat LCZ42 dataset. For example, one challenge in large-scale LCZ classification is to achieve reasonable results in areas where little or no reference data is available. In this case, zero-shot and few-shot learning based on meta learning principles are very promising directions to explore [75], [76].
• Application-oriented approach. Depending on the specific cases of application, this study can be extended further. For instance, when an LCZ map of a certain region is required for a surface urban heat island (SUHI) study, pre-trained models can be fine-tuned on the available reference data of this area. In this way, better results can be obtained to satisfy the application, even though the model might overfit to this specific area. Another use case is the monitoring of urbanization, which is increasingly attracting attention. In this case, the urban-related LCZs can be combined and multiple predictions from time series remote sensing images can be obtained for a post-classification change detection and analysis [77].
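The decision-level fusion of multiple predictions mentioned under the data-driven direction can be illustrated with a minimal sketch. It assumes that each source or classifier delivers a vector of class probabilities for the same patch and that the fusion rule is a (weighted) average followed by an argmax. This rule, the function name `fuse_predictions`, and the toy numbers are illustrative assumptions, not the fusion scheme of any of the cited works.

```python
def fuse_predictions(prob_sets, weights=None):
    """Fuse class-probability vectors from several sources for one patch.

    prob_sets: list of probability vectors, one per source/classifier.
    weights:   optional per-source weights (assumption: equal if omitted).
    Returns (fused_label_index, fused_probabilities).
    """
    if not prob_sets:
        raise ValueError("at least one prediction is required")
    n_classes = len(prob_sets[0])
    if any(len(p) != n_classes for p in prob_sets):
        raise ValueError("all predictions must cover the same classes")
    if weights is None:
        weights = [1.0] * len(prob_sets)
    total = sum(weights)
    # Weighted average of the probabilities, class by class.
    fused = [
        sum(w * p[c] for w, p in zip(weights, prob_sets)) / total
        for c in range(n_classes)
    ]
    # The fused label is the class with the highest averaged probability.
    label = max(range(n_classes), key=fused.__getitem__)
    return label, fused


# Hypothetical predictions for one patch from three sources over three classes.
predictions = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.7, 0.2],
]
fused_label, fused_probs = fuse_predictions(predictions)
print(fused_label, fused_probs)
```

The same function could also serve for test-time augmentation, where the probability vectors come from flipped or rescaled versions of one patch instead of from different sensors.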

VI. CONCLUSIONS AND OUTLOOK
A range of benchmark results on the open So2Sat LCZ42 dataset were presented in this study. Because of the consistent experimental setup, this work enables a thorough understanding of the performance of CNNs on a large-scale LCZ classification task. We show that a properly designed simple CNN considering multi-level feature fusion can perform better than bigger and more complex models that have been proposed on non-remote sensing datasets. Since the proposed model is simple and lightweight, further accuracy improvement within a certain budget on the model size or memory can be expected. Ultimately, it is a matter of finding a well-balanced compromise between performance and the overhead imposed for the specific use case. Furthermore, our work will facilitate the development of more advanced models for the challenging task of LCZ classification on a large scale. Our trained models can be used as pre-trained ones, either using the fixed So2Sat LCZ42 features or fine-tuning from the So2Sat LCZ42 initialization, for related studies such as land cover land use classification from Sentinel-2 or Landsat data.

Fig. 1: Illustration of LCZs and corresponding Sentinel-2 and high resolution (HR) data patches. HR data source: Esri. Map image is the intellectual property of Esri and is used herein under license. Copyright © 2019 Esri and its licensors. All rights reserved.

Fig. 2: Architecture of Sen2LCZ-Net-MF. The three light blue lines correspond to the part of multi-level feature fusion. Note that all convolutional layers are followed by batch normalization. The depth is D = 4N + 1, and the width depends on the filter number of the first block, which is doubled for each subsequent block.
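The relation between depth and width stated in the caption of Fig. 2 can be checked with a few lines of arithmetic. The sketch below inverts D = 4N + 1 and lists the filter counts obtained by doubling from block to block. The function names and the number of blocks passed to `block_filters` are hypothetical; the caption itself does not state how many blocks the network has.

```python
def n_from_depth(depth):
    """Recover N from the depth D = 4N + 1 (raises if D does not fit the formula)."""
    if depth < 5 or (depth - 1) % 4 != 0:
        raise ValueError("depth must be of the form 4N + 1 with N >= 1")
    return (depth - 1) // 4


def block_filters(first_filters, num_blocks):
    """Filter count per block, doubling from one block to the next.

    num_blocks is an assumption of this sketch, not a value given in the caption.
    """
    return [first_filters * 2 ** i for i in range(num_blocks)]


# Configuration label "f16D17": first-block filter number 16, depth 17.
print(n_from_depth(17))
print(block_filters(16, num_blocks=4))
```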

Fig. 3: Process of LCZ classification from image data by a sliding window approach with trained CNNs.
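The sliding window procedure of Fig. 3 can be sketched as follows. The patch size, the stride, and the `classify_patch` callable, which stands in for a trained CNN, are hypothetical placeholders. A single-band toy image is used here in place of multi-band Sentinel-2 data.

```python
def sliding_window_classify(image, classify_patch, patch_size, stride):
    """Slide a square window over a 2-D image and label every patch.

    image:          list of rows, each a list of pixel values.
    classify_patch: callable mapping a patch (list of rows) to a class label;
                    in practice this would wrap a trained CNN.
    Returns a 2-D list with one label per window position.
    """
    n_rows, n_cols = len(image), len(image[0])
    label_map = []
    for top in range(0, n_rows - patch_size + 1, stride):
        row_labels = []
        for left in range(0, n_cols - patch_size + 1, stride):
            # Cut the current window out of the image.
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            row_labels.append(classify_patch(patch))
        label_map.append(row_labels)
    return label_map


def toy_classifier(patch):
    """Stand-in for a CNN: label 1 if the mean pixel value exceeds 0.5."""
    values = [v for row in patch for v in row]
    return 1 if sum(values) / len(values) > 0.5 else 0


toy_image = [[0, 0, 1, 1, 0, 0] for _ in range(4)]
print(sliding_window_classify(toy_image, toy_classifier, patch_size=2, stride=2))
```

With a stride equal to the patch size, the windows do not overlap and each label covers one tile of the output map; a smaller stride yields a denser map at a higher computational cost.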

Fig. 5: Confusion matrix of classification results from Sen2LCZ-Net-MF(f16D17), normalized by the number of total samples of each LCZ.
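The normalization mentioned in the caption of Fig. 5 divides each entry of the confusion matrix by the total number of samples of the corresponding LCZ class. A minimal sketch is given below, assuming that each row holds the samples of one reference class; the function name and the toy counts are illustrative only.

```python
def normalize_confusion_rows(counts):
    """Divide every row of a confusion matrix by its row sum.

    counts: square list of lists; row i is assumed to hold the samples whose
            reference class is i, column j the predicted class.
    Rows without samples are returned as zeros.
    """
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


# Toy two-class example with hypothetical counts.
toy_counts = [[8, 2],
              [1, 3]]
print(normalize_confusion_rows(toy_counts))
```

After this normalization the diagonal entries are the per-class accuracies, so classes with few samples are not visually dominated by the frequent ones.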

Fig. 8: An example of large-scale LCZ classification in Henan province, China, with a total area of about 167,000 km².

Fig. 9: Closer view of the LCZ result of the large-scale classification example in Fig. 8, showing a rural area around longitude 113.2072° east and latitude 32.6849° north. Legend: Water, Bare soil or sand, Bare rock or paved, Low plants, Bush (scrub), Scattered trees, Dense trees, Heavy industry, Sparsely built, Large low-rise, Lightweight low-rise, Open low-rise, Open mid-rise, Open high-rise, Compact low-rise, Compact mid-rise, Compact high-rise.

TABLE I: Basic information of Sentinel-2 bands used in this study. VNIR: Visible and Near Infrared, SWIR: Short Wave Infrared.

each LCZ class into the west and east halves of the city, comprising the validation and test sets. Therefore, all three sub-datasets are geographically separated from each other, even though the test and validation sets are drawn from the same list of cities. The number and distribution of the individual LCZ classes in those three subsets are visualized in Fig. 4.

TABLE II: Test results from Sen2LCZ-Net of different depths (number of layers, D = 4N + 1) and widths, where the width is determined by the filter number of the first block, f.

TABLE III: Testing performance of six CNNs with ( ) and without multi-level fusion (MF).

TABLE IV: Testing performance of six CNNs with ( ) and without double-pooling layer. AM indicates the configuration where both pooling layers, average and maximum pooling, are used. Without double-pooling layer means only maximum pooling is used.

TABLE V: Testing performance of four CNNs trained with and without () class weights.

TABLE VI: Performance comparison among various CNNs on the So2Sat LCZ42 dataset, following the same experimental setup. The top five results are indicated in bold.