CSE-HRNet: A Context and Semantic Enhanced High-Resolution Network for Semantic Segmentation of Aerial Imagery

Semantic segmentation of high-resolution aerial images is a key task in remote sensing applications. To address the issues of intra-class heterogeneity and inter-class homogeneity, a novel end-to-end semantic segmentation network, namely the Context and Semantic Enhanced High-Resolution Network (CSE-HRNet), is proposed in this paper. Two components are considered jointly: a multi-scale contextual feature extractor and a multi-level semantic feature producer. First, a Nested Dilated Residual Block (NDRB) is designed to enhance the representational power of multi-scale contexts and tackle the issue of intra-class heterogeneity. Second, a pyramidal feature hierarchy is introduced, by which multi-level feature fusions enlarge inter-class semantic differences. Experimental results on the Potsdam and Vaihingen benchmarks verify that the proposed CSE-HRNet achieves competitive performance compared with other state-of-the-art methods.

With the rapid development of aerial remote sensing technologies, the increased accessibility of high-resolution aerial images has opened up new horizons in the remote sensing community for various application fields, such as traffic monitoring [1], urban planning [2], intelligent agriculture [3], and disaster management [4]. Towards the automated interpretation of aerial images, semantic segmentation (i.e., semantic labeling) is a crucial step for extracting valuable information from the regions of interest, assigning every pixel in an image a label that indicates the category of the ground object.
Aerial images acquired at high spatial resolutions are expected to exhibit a great diversity of objects and to provide very detailed geometric information. Specifically, objects of the same class may show various shapes, scales, colors, and structures. Meanwhile, objects of different categories having similar colors or affected by cast shadows can present very similar visual characteristics. Therefore, these confusing objects lead to the issues of intra-class heterogeneity and inter-class homogeneity, both of which pose extreme challenges for accurate and coherent segmentation [5]-[7].
Intra-class heterogeneity, on the one hand, refers to the fact that objects grouped into the same category but with different visual characteristics must be assigned the identical semantic label. Fig. 1 illustrates some typical examples of the intra-class heterogeneity issue. As shown in Fig. 1(a), cars have different colors, but they all belong to the car category. Similarly, the buildings in Fig. 1(b), all of the same category, vary in texture and shape.
For the case of inter-class homogeneity, on the other hand, objects that are similar in appearance but allocated different labels must be categorized into separate semantic classes. Examples of inter-class homogeneity are illustrated in Fig. 2. Areas of trees and low vegetation, which belong to two separate categories, have very similar appearances, as shown in Fig. 2(a). Meanwhile, buildings and impervious surfaces in Fig. 2(b) are quite similar in appearance, which intensifies the difficulty of semantic segmentation.

FIGURE 1. Intra-class heterogeneity found in typical high-resolution aerial images selected from the Potsdam and Vaihingen datasets.
Semantic segmentation of high-resolution aerial images suffers from the problems of intra-class heterogeneity and inter-class homogeneity. It is claimed that the intra-class heterogeneity issue mainly derives from the lack of contextual information [7]-[9]. Consequently, multi-scale contextual information is essential for categorizing objects with significant intra-class variances, such as vehicles with various colors and shapes, into the same semantic class [7]-[10]. Moreover, multi-level semantic information can be exploited to discriminate targets with similar appearances but belonging to distinct semantic categories [11]-[17], given that the problem of inter-class homogeneity results from inadequate semantic information [14]-[16]. In summary, multi-scale contexts and multi-level semantics are both beneficial to improving segmentation performance for aerial imagery.
Over the past few years, the breakthrough of Deep Convolutional Neural Network (DCNN) based methods has shown remarkable advantages in semantic segmentation. Unlike traditional machine learning algorithms, which mostly rely on manually designed features, DCNN-based approaches learn features automatically through the training process. In [18], the Fully Convolutional Network (FCN) was introduced as the first end-to-end, pixel-to-pixel DCNN-based segmentation network. FCN [18]-[20] converts the fully-connected layers into convolutional layers and combines the intermediate score maps for detailed pixel-wise labeling. Subsequently, a variety of deep convolutional encoder-decoder networks have been studied, and these methods have made their way into the domain of aerial scene understanding. For instance, U-Net [21]-[23] utilizes a symmetric U-shaped encoder-decoder structure, concatenating the encoded feature maps with the upsampled ones in the decoder part via skip connections. SegNet [24]-[26] records the pooling index information in the encoder part and performs non-linear upsampling in the decoder part based on the stored index information. Global Convolutional Network (GCN) [27], [28] applies large convolution kernels and global convolution operations to tackle both the classification and localization issues.
Note that directly applying these DCNN-based encoder-decoder networks to semantic segmentation of high-resolution aerial images could be inappropriate. The encoder-decoder models suffer from loss of spatial information caused by a series of downsampling operations, and they also generate blurred object boundaries after recovering high-resolution representations from low-resolution ones by upsampling. To these ends, a High-Resolution Network (HRNet) is constructed in [29]. As shown in Fig. 3(a), HRNet maintains high-resolution representations in the main branch throughout the network, instead of recovering the resolution through a low-to-high process. It also contains parallel convolution layers of different resolutions, with repeated information exchanges or feature fusions across the parallel branches. However, two shortcomings remain that keep the issues of intra-class heterogeneity and inter-class homogeneity challenging. Firstly, HRNet encodes insufficient multi-scale contextual information and therefore generates less discriminative features for decreasing intra-class variances. Secondly, HRNet is incapable of fully exploring multi-level semantic information, which leads to weak class-dependent features.

FIGURE 3. Comparison of HRNet and CSE-HRNet: CSE-HRNet uses NDRB as the contextual feature extractor and introduces the feature pyramid to produce additionally strong semantic information. The multi-level fusions of feature maps are implemented between the pyramid and the parallel branches of the network. The main high-resolution branch of the network maintains the high-resolution feature map and keeps receiving information from the other 3 parallel lower-resolution branches on all of the 4 stages. The network finally integrates abundant multi-scale contexts and multi-level semantics to improve segmentation accuracy.
In this paper, a novel deep convolutional architecture is presented, namely Context and Semantic Enhanced High-Resolution Network (CSE-HRNet). To strengthen the representational capacity of the network for multi-scale contextual features, we develop Nested Dilated Residual Block (NDRB) functioning as a generic feature extractor, where dilated residual convolutions with different dilation rates are organized in a nesting scheme. Moreover, we equip HRNet with the pyramidal hierarchy of multi-level features to tackle the issue of inter-class homogeneity. The hierarchy leverages the pyramidal shape of features and multi-level feature fusions at different spatial resolutions, both of which lead to more reliable feature representations containing abundant and valuable semantic information at different levels.
To summarize, the main contributions of the article are as follows.
1) Propose a novel end-to-end semantic segmentation architecture for high-resolution aerial imagery, termed CSE-HRNet.
2) Design an NDRB structure as the generic extractor for multi-scale contextual features to increase the representational ability of the network, which can effectively mitigate the problem of intra-class heterogeneity.
3) Introduce the pyramidal feature hierarchy to enhance semantic information of different levels, which is of great benefit to tackling the inter-class homogeneity issue.
The remainder of the paper is organized as follows. Section II presents a concise overview of the related semantic segmentation methods, and Section III provides the details of our proposed CSE-HRNet. Then, the settings and results of the experiments are detailed and discussed in Sections IV and V, respectively. Finally, we conclude this paper in Section VI.

II. RELATED WORK
In this section, a short overview of general semantic segmentation methods is first presented. Then, a concise summary focusing on semantic segmentation models for high-resolution aerial imagery follows.

A. GENERAL SEMANTIC SEGMENTATION
Nowadays, DCNN-based approaches have emerged as the leading modeling tools for semantic image segmentation. FCN [18] is the first DCNN-based semantic segmentation model that is trained directly in an end-to-end manner. Subsequently, plenty of works augment FCN with explicit encoder-decoder structures. The encoder-decoder architecture exploits multi-scale features generated through consecutive downsampling operations in the encoder part, and recovers the reduced spatial information in the decoder part by means of upsampling. U-Net [21] concatenates the downsampled feature maps from the encoder with the upsampled ones from the decoder via skip connections to form a U-shaped structure. SegNet [24] saves pooling index information and performs non-linear upsampling to recover the spatial details. GCN [27] utilizes large-sized convolutional kernels and a global pooling layer to harvest context information for global representations. Notice that the encoder-decoder convolutional networks fail to perfectly prevent the loss of spatial information caused by a series of downsampling operations, owing to the fact that heavy upsampling leads to blurred edges and the loss of location details. One effective solution is to generate and maintain high-resolution representations throughout the whole process. Recently, HRNet [29] has been developed in the form of interconnected high-resolution and low-resolution parallel convolution layers with repeated information exchanges or feature fusions across the parallel convolutions, which is beneficial to improving the segmentation accuracy.
Besides, to better aggregate multi-scale contextual information and enlarge receptive fields, the dilated convolution, also known as the atrous convolution, is used in DeepLab variants [30]-[32] and PSPNet [33]. DeepLabv2 [30] employs Atrous Spatial Pyramid Pooling (ASPP), which consists of multiple parallel branches with different dilation rates to embed the multi-scale contexts. DeepLabv3 [31] augments ASPP with image-level features, and DeepLabv3+ [32] adopts the encoder-decoder structure by employing DeepLabv3 as the encoder module and adding a simple decoder module. PSPNet [33] introduces a pyramid pooling module in which large-kernel pooling layers are applied to perform multi-scale context aggregation. In addition, DenseASPP [34] consists of a cascade of dilated convolution layers with dense connections, which feeds the output of each dilated convolution layer to all subsequent dilated convolution layers. Dilated Residual Networks (DRN) [35] implement dilated convolutions with residual connections and outperform their non-dilated counterparts in the task of semantic image segmentation. The Waterfall Atrous Spatial Pooling (WASP) module [36] combines the benefits of a cascade of dilated convolution layers with multiple fields-of-view in a parallel configuration. KSAC [37] uses multiple dilated convolution branches with different rates as well as a single shared kernel to extract both local detailed and global contextual information.

B. SEMANTIC SEGMENTATION OF AERIAL IMAGERY
Due to their exceptional accuracy, DCNN-based semantic segmentation methods are quickly dominating remote sensing applications, and encoder-decoder architectures are becoming state-of-the-art. Specifically, the Hourglass-shaped network [25] uses a composed inception module and skip connections with residual units to provide the network with multi-scale receptive fields and rich contexts. The model in [26] modifies SegNet [24] by adding multi-rate dilated convolutions and integrating a CRF as the post-processing module. MLP [38] extracts and combines features at different resolutions based on FCN [18] to generate fine-grained pixel-wise classification maps. SCNN [39] proposes two shuffling convolutional neural networks by introducing the shuffling operator. SNFCN and SDFCN, proposed in [19], contain deep fully convolutional networks with shortcut blocks and adopt an overlay strategy as the post-processing method. ScasNet [40] proposes a self-cascaded encoder-decoder network to improve the segmentation coherence with sequential global-to-local context aggregation and object refinement sub-networks. The model in [41] utilizes ResNet [42] followed by ASPP [30] as the encoder and combines two scales of the high-level features with the corresponding low-level features within the decoder. In [28], GCN [27] is enhanced with a channel attention module and a domain-specific transfer learning technique. CAN [43] is an encoder-decoder-like architecture with the aggregation of affluent context information and the fusion of attention-based multi-level features. DCCN [44] introduces a CoordConv module that puts coordinate information into the feature maps to reduce the loss of spatial features and strengthen the object boundaries.
In [45], four stacked fully convolutional networks and one feature alignment framework are designed to calculate an alignment loss of the features encoded from the four basic models, balancing their similarity and variety for multi-label land-cover segmentation. Relation-augmented FCN (RA-FCN) [20] proposes a spatial relation module and a channel relation module to model global relationships between any two spatial positions of feature maps and produce relation-augmented feature representations. TreeUNet [22] adaptively constructs Tree-shaped convolutional blocks through the Tree-Cutting algorithm to fuse the multi-scale features and learn the best weights. DDCM-Net [46] consists of combinations of dilated convolutions merged with varying dilation rates to enlarge the network's receptive fields. Moreover, ENRU-Net [23] and DRR-Net [47] adopt encoder-decoder structures to automatically segment diversified buildings and roads, respectively, from aerial imagery data.
Combining semantic segmentation with semantically informed edge detection is also an effective and promising solution, which can make class boundaries more explicit. The model in [48] proposes to explicitly represent class boundaries in the form of pixel-wise contour likelihoods and include them in SegNet [24]. A multi-task DCNN-based architecture is presented in [49] to collectively model the semantic class likelihoods, semantic boundaries across classes, and shallow-to-deep visual features. ERN [50] involves multiple weighted edge supervisions to retain the spatial boundary information and reduce the semantic ambiguity. The model in [7] adopts a dual-path architecture, where a holistically-nested edge detection path is employed to extract the semantic boundaries for deep supervision of the network.
Different from, but not in contradiction with, the above methods, our proposed CSE-HRNet constructs NDRB as the feature extractor, which is based on dilated residual convolutions, to strengthen the representational power for multi-scale contexts. Also, the pyramidal feature hierarchy is introduced in the network to provide strong multi-level semantics. Based on these two components, our CSE-HRNet is capable of resolving the issues of intra-class heterogeneity and inter-class homogeneity.

III. METHOD
In this section, we first elaborate on the overall network architecture. Then, we describe the concept of our proposed NDRB. Lastly, we present detailed information regarding the introduction of the feature pyramid in our network.

A. NETWORK ARCHITECTURE
The model architecture of our proposed CSE-HRNet is shown in Fig. 3(b). The CSE-HRNet approach is based on the HRNet-W32 backbone network. The network starts with a stem sub-network consisting of two strided 3 × 3 convolutions that decrease the feature resolution to 1/4. "W32" indicates the number of channels (i.e., feature dimensions) of the high-resolution representations throughout the main high-resolution branch. The channel numbers of the other three parallel branches are accordingly set to 64, 128, and 256.
In high-resolution aerial images, multi-scale contextual information is vital for resolving the issue of intra-class heterogeneity. As a result, we propose NDRB, a multi-scale contextual feature extractor block, to replace the cascaded basic residual blocks which are initially used in HRNet. In our NDRB, three dilated residual blocks with the dilation rates of 1, 2 and 3 respectively are arranged in a nesting fashion to encode sufficient multi-scale contextual information.
Semantic cues are the critical elements for recognizing similar object instances with different semantic labels. To enhance the semantic representation of the network, we introduce a multi-level feature pyramid with a top-down architecture. The pyramidal hierarchy is capable of building additionally strong semantic features at both high and low levels to increase the inter-class differences between multiple objects.
Our CSE-HRNet leverages the outputs of each level of the feature pyramid and exploits lateral connections to fuse these feature maps with their counterparts at the start of the parallel branches across different semantic levels. Element-wise addition is adopted as the fusion approach, and the multi-level feature fusion is formulated as:

F_k = P_k + B_{k-1}, k = 2, 3, 4, (1)

where F_k denotes the output of the element-wise addition fusion between the feature map P_k from the feature pyramid at level k and its counterpart B_{k-1} at stage k - 1 from the (k - 1)th parallel branch in the network. Specially, the 1st branch (when k - 1 = 1, i.e., k = 2) refers to the main high-resolution branch of the network, and the first-level feature map of the pyramid is fed into the main high-resolution branch directly. After the multi-level feature fusion, F_k contains enhanced semantics from the feature pyramid and becomes the new input feature map at the start of the kth parallel branch. It then passes through the branch, which is composed of NDRBs as well as multi-scale information fusions from the other multi-resolution parallel branches at different stages. For the main high-resolution branch of the network, because repeated multi-scale feature fusions are performed on stages 2 to 4, the main branch keeps receiving information from the other three parallel lower-resolution branches while maintaining the high-resolution feature map, and it finally integrates multi-scale contexts and multi-level semantics simultaneously to improve the segmentation accuracy.
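As a minimal sketch, the fusion in Equation (1) amounts to element-wise addition of two feature maps of identical shape; the tensor shapes below are illustrative assumptions, not the network's actual dimensions.

```python
import numpy as np

# Sketch of Equation (1), F_k = P_k + B_{k-1}: plain element-wise addition
# of a pyramid feature map and its branch counterpart. Shapes follow the
# (batch, channels, height, width) convention and are purely illustrative.
def fuse(p_k: np.ndarray, b_prev: np.ndarray) -> np.ndarray:
    # Both maps must match in spatial resolution and channel count.
    assert p_k.shape == b_prev.shape, "maps must match in resolution and channels"
    return p_k + b_prev

p2 = np.random.rand(1, 64, 64, 64)  # pyramid feature map at level k = 2
b1 = np.random.rand(1, 64, 64, 64)  # counterpart from the preceding branch/stage
f2 = fuse(p2, b1)                   # new input of the 2nd parallel branch
```

Element-wise addition keeps the fused map the same size as its inputs, which is why no extra projection layer is needed when the channel counts already agree.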
In summary, CSE-HRNet is competent in enhancing the representational power of multi-scale contexts as well as multi-level semantics, owing to the sufficient contextual features along with the additionally strong semantic information at different levels. Our proposed model is thus beneficial to resolving the issues of intra-class heterogeneity and inter-class homogeneity for semantic segmentation of high-resolution aerial images.

B. NESTED DILATED RESIDUAL BLOCK
In high-resolution aerial images, it is challenging to segment objects that have large intra-class variances under the influence of complex scenes, scale variations, illumination changes, and different colors. To overcome this intra-class heterogeneity problem, the model's representational capacity for multi-scale contextual features is essential for resolving the ambiguous cases. Therefore, enlarging the receptive field to acquire more contextual information can effectively increase segmentation performance. Nevertheless, as shown in Fig. 4(a), only a cascade of 4 basic residual blocks is used as the feature extractor in the parallel branches of different resolutions in HRNet, which is unable to perceive sufficiently large and diverse receptive fields for multi-scale context.
To mitigate intra-class variances in high-resolution aerial images and obtain sufficient multi-scale contextual information, inspired by DRN [35] and WASP [36], we integrate the benefits of dilated residual convolutions arranged in both cascading and parallel strategies, and present our NDRB to generate richer multi-scale contexts and larger receptive fields while preserving the feature map resolutions. The architecture of NDRB is depicted in Fig. 4(b). Two dilated convolutional layers, both with a kernel size of 3 × 3, are organized in a pair with a residual connection to form a dilated residual block. There are a total of 3 dilated residual blocks within NDRB, with dilation rates set to 1, 2, and 3, respectively. Then, we arrange these 3 dilated residual blocks in a nesting fashion, which can be viewed as a hybrid of the cascading and parallel manners, as shown in Fig. 4(c), to obtain multiple effective fields-of-view and accordingly capture multi-scale context features. A 1 × 1 or 3 × 3 convolution layer can be used in NDRB to calibrate the number of channels if necessary.
In Fig. 4(c), the output of NDRB can be approximated as the fusion of the outputs from four parallel branches within an NDRB-equivalent architecture, and in each branch the dilated residual blocks are organized in the cascading mode. The first branch consists of three cascading dilated residual blocks with dilation rates of 1, 2, and 3, respectively, and its output is calculated as:

y_1 = f_3(f_2(f_1(X))), (2)

where f_i(X) denotes the output of the dilated residual block whose dilation rate is set to i, and X denotes the input feature map. Next, the second branch consists of two serial dilated residual blocks with dilation rates set to 1 and 2, and its output is formulated as:

y_2 = f_2(f_1(X)). (3)

There is only one dilated residual block in the third branch, and the fourth branch is the identity mapping; the outputs of these two branches are f_1(X) and X, respectively. We adopt element-wise addition as the efficient and reasonable fusion method for the multi-scale contextual features from all four parallel branches. Hence, the final output of NDRB can be expressed as:

output = f_3(f_2(f_1(X))) + f_2(f_1(X)) + f_1(X) + X. (4)

Convolutional kernels with small dilation rates are used to learn detailed information. In contrast, kernels with large dilation rates are capable of extracting features with large receptive fields, but detailed information may be lost [37]. In Fig. 4(c), for the cascading fashion, since the upper dilated residual block accepts the output of the lower dilated residual block in each parallel branch, multiple receptive fields and larger contexts can be effectively perceived without losing spatial information. For the parallel mode, since the four branches are fed with the same input and their outputs are fused, the output of NDRB is able to embed features with different contexts over a wider range.
Therefore, our proposed NDRB has the merits of both parallel and cascading schemes of the dilated residual blocks, and sufficient multi-scale contextual information is encoded in the network to minimize the intra-class variances.
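An illustrative PyTorch sketch of the nesting scheme follows; the layer ordering, ReLU activations, and the absence of normalization are our assumptions, so the paper's exact block may differ. The nested sum f_3(f_2(f_1(X))) + f_2(f_1(X)) + f_1(X) + X is obtained by chaining three dilated residual blocks and accumulating the intermediate outputs.

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """Two 3x3 dilated convolutions wrapped in a residual connection."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        # padding = dilation keeps the spatial resolution unchanged.
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)

class NDRB(nn.Module):
    """Nested arrangement: f3(f2(f1(x))) + f2(f1(x)) + f1(x) + x."""
    def __init__(self, channels: int):
        super().__init__()
        self.f1 = DilatedResidualBlock(channels, dilation=1)
        self.f2 = DilatedResidualBlock(channels, dilation=2)
        self.f3 = DilatedResidualBlock(channels, dilation=3)

    def forward(self, x):
        y1 = self.f1(x)        # reused by both the cascade and the sum
        y2 = self.f2(y1)
        y3 = self.f3(y2)
        return y3 + y2 + y1 + x  # fuse all four "branches" by addition

x = torch.randn(1, 32, 64, 64)
out = NDRB(32)(x)  # resolution and channel count are preserved
```

Note that the cascade computes each intermediate result once and reuses it, so the four-branch parallel view in Fig. 4(c) costs no extra convolutions.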

C. PYRAMIDAL FEATURE HIERARCHY
Recent works [11]-[17] have suggested that multi-level semantic information is beneficial to recognizing visually similar objects that belong to different semantic categories, especially in the fields of image segmentation [11]-[15], action recognition [16], and road extraction [17]. Specifically, on the one hand, high-level features from deeper layers of DCNN-based segmentation models encode high-level semantic information, including object- or category-level evidence and knowledge [52], [53], which is very useful for recognizing the semantic categories of objects with similar appearances [14].
On the other hand, since high-level semantic information is not adequate for recovering detailed spatial information, feature maps from shallower layers, which carry lower-level semantics, are required to encode low-level details and spatial information, refining the coarse high-level semantic features for accurate spatial localization [53]-[55].
As a result, multi-level semantic features, combining high- and low-level semantic information, can solve the problem of inter-class homogeneity. However, the original HRNet maintains high-resolution representations in the main branch throughout the whole network, and its lower-resolution parallel branches, generated by downsampling operations, are unable to provide sufficient multi-level semantic information, which limits the network's ability to resolve the problem of inter-class homogeneity when segmenting aerial images.
The feature pyramid is commonly used in object detection architectures derived from Feature Pyramid Networks (FPN) [51]. The pyramid structure can exploit the inherent multi-level features and, accordingly, provide strong and adequate semantic knowledge at all levels. In order to enhance the multi-level semantic representations of the model, the pyramidal feature hierarchy, as depicted in Fig. 5, is introduced in our method. The hierarchy adopts a 4-level top-down architecture in which strided convolution, with a stride of 2, is applied as the downsampling method. The heights and widths (i.e., spatial resolutions) of the feature maps are thus halved after each downsampling, whereas the numbers of channels (i.e., feature dimensions) are doubled. The top-level (first-level) feature map of the pyramid is directly fed into the main high-resolution branch of the network. The second-, third-, and fourth-level feature maps are fused with their counterparts from the multi-resolution branches via element-wise addition, as shown in Equation (1).
Consequently, benefiting from the features outputted from both shallow and deep layers altogether, the pyramidal hierarchy is capable of exploring high-and low-level semantic features simultaneously, and propagating abundant multi-level semantic information into the model, to solve the problem of inter-class homogeneity.
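A minimal sketch of such a hierarchy, assuming stride-2 3 × 3 convolutions and the channel counts mentioned earlier in the text (32, 64, 128, 256), could look as follows; the freshly constructed, untrained layers serve only to illustrate the shape progression.

```python
import torch
import torch.nn as nn

# Sketch: a 4-level pyramid built by stride-2 convolutions that halve the
# spatial size and double the channel count at every level. Channel counts
# here mirror the W32 configuration (32, 64, 128, 256) as an assumption.
def build_pyramid(x: torch.Tensor, channels=(32, 64, 128, 256)):
    levels = [x]
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        down = nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)
        levels.append(down(levels[-1]))  # half the resolution, double channels
    return levels

p = build_pyramid(torch.randn(1, 32, 64, 64))
# p[0] keeps the input resolution; each deeper level is half the size.
```

Each level p[k] can then be added element-wise to the branch feature map of matching shape, per Equation (1).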

IV. DATA AND EXPERIMENTAL SETUP
In order to verify the effectiveness of our proposed CSE-HRNet, intensive experiments have been conducted using the publicly available Potsdam and Vaihingen datasets. In this section, we first describe these two datasets. Then, we present the implementation details and the evaluation metrics used in our experiments. Finally, we briefly introduce the six state-of-the-art deep learning networks selected as comparison methods, before the experimental results and discussion are presented in the next section.

A. DATASETS
The proposed method was evaluated with two public benchmarks, i.e., the Potsdam dataset and the Vaihingen dataset, both of which are provided by International Society for Photogrammetry and Remote Sensing (ISPRS). These two datasets contain the high-resolution true ortho photo (TOP), the digital surface model (DSM), and the normalized DSM (nDSM) data, with the corresponding ground truth labels. While both DSM and nDSM data are included in the two datasets, we only focused on the raw TOP images in this work, following [40], [43], [46].
There are six object categories in both of the datasets, comprising impervious surfaces, buildings, low vegetation, trees, cars, and clutter/background. The clutter category includes ground objects like water bodies, containers, tennis courts, and swimming pools. Following [20], [25], [38], [39], we only predicted five classes, ignoring clutter/background in the case of Vaihingen, due to the lack of training data for that class. As for the Potsdam dataset, we predicted all of the six categories.
The Potsdam dataset contains 38 TOP patches of size 6000 × 6000 at a spatial resolution of 5 cm. 24 of the 38 patches are provided with ground truth, and only these 24 TOP patches were used for training and testing the proposed method in this paper. The image tiles consist of four spectral bands: red (R), green (G), blue (B), and near infrared (IR); however, we only used the three-channel RGB tiles in this work, as in [46]. The Vaihingen dataset consists of 33 TOP image tiles of varying sizes (on average approximately 2100 × 2100 pixels) with a spatial resolution of 9 cm. 16 of the 33 tiles include labeled ground truth, and these 16 tiles were used to train and test the method. Different from Potsdam, the TOP tiles contain only three spectral bands (IR, R, and G channels), and we used the IRRG data in our experiments.
Furthermore, we utilized the same dataset setting as several previous works [20], [38], [39] to facilitate the comparison of the experimental results. For the Potsdam dataset, we used 17 of the 24 labeled tiles for training and the other 7 tiles (ID: 2-11, 2-12, 4-10, 5-11, 6-7, 7-8, and 7-10) as the test set. For Vaihingen, 11 tiles were selected as the training set, and the remaining 5 tiles (ID: 11, 15, 28, 30, and 34) were used as the test set. In addition, we conducted statistical significance tests, as well as 4-fold cross-validation tests on both datasets, to assess the performance of the proposed method.

B. EVALUATION METRICS
In our experiments, we employed three metrics according to the ISPRS dataset guidelines, including the F1 score (F1), the mean F1 (mF1) and the overall pixel accuracy (OA), to assess the model performance comprehensively.
The F1 metric can be interpreted as the weighted average of precision and recall, where the contributions of precision and recall are set to be equal. Thus, the F1 metric is defined as:

F1 = 2 × (precision × recall) / (precision + recall), (5)

and precision and recall are expressed as:

precision = TP / (TP + FP), (6)
recall = TP / (TP + FN), (7)

where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively. Moreover, the mF1 metric is defined as the average value of the F1 metric over all predicted categories. The OA metric is the ratio of the number of correctly predicted pixels to the total number of pixels in the entire image, and can be calculated as:

OA = (TP + TN) / (TP + TN + FP + FN). (8)
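These metrics can be computed directly from the confusion counts; the helper names below are our own sketch, not part of the official ISPRS evaluation toolkit.

```python
# Sketch of the evaluation metrics from per-class confusion counts.
def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mean_f1(per_class_counts) -> float:
    """mF1: average of the per-class F1 scores over all predicted categories."""
    scores = [f1_score(tp, fp, fn) for tp, fp, fn in per_class_counts]
    return sum(scores) / len(scores)

def overall_accuracy(correct_pixels: int, total_pixels: int) -> float:
    """OA: fraction of correctly predicted pixels over the entire image."""
    return correct_pixels / total_pixels

# Toy example: two classes with (TP, FP, FN) = (8, 2, 2) and (9, 1, 1).
m = mean_f1([(8, 2, 2), (9, 1, 1)])  # per-class F1 of 0.8 and 0.9 -> mF1 = 0.85
```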

C. IMPLEMENTATION DETAILS
All the experiments were performed on a server equipped with eight 11-GB NVIDIA GeForce RTX 2080 GPUs. Our model was built on the PyTorch deep learning framework, and the main required libraries included Python 3.7, CUDA 9.0, and cuDNN 7.6. Since the spatial resolution of the Potsdam dataset is approximately twice that of the Vaihingen dataset, we cropped image patches of size 512 × 512 from Potsdam and 256 × 256 from Vaihingen, so that they roughly cover the same geographical area, according to the difference in resolution [38]. To avoid over-fitting, two data augmentation operations were considered in this paper. Firstly, a random sampling (cropping) procedure was applied to the labeled image patches, generating new training data of the same dimensions at random positions to further expand the size of the training set. Secondly, we randomly transformed the training images in each iteration by random flipping (horizontal, vertical, or both). For the training details, we selected Stochastic Gradient Descent (SGD) as the optimizer with an initial learning rate of 0.1, a momentum of 0.9, and a weight decay of 0.0005. We also employed the poly learning rate decay policy, in which the initial learning rate is multiplied by (1 − iter/max_iter)^power at each iteration, with the power set to 0.9. Each training procedure in our experiments contained 100 epochs. Additionally, owing to the GPU capacity and the different input patch sizes for the two datasets, the batch size was set to 32 for the Potsdam dataset and 128 for Vaihingen.
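The poly decay policy described above can be sketched in a few lines; the iteration counts in the example are illustrative.

```python
# Poly learning-rate policy: lr = lr0 * (1 - iter / max_iter) ** power,
# with lr0 = 0.1 and power = 0.9 as given in the text.
def poly_lr(initial_lr: float, it: int, max_it: int, power: float = 0.9) -> float:
    return initial_lr * (1 - it / max_it) ** power

lr_start = poly_lr(0.1, 0, 1000)     # full learning rate at the first iteration
lr_end = poly_lr(0.1, 1000, 1000)    # decays to 0 at the final iteration
```

The schedule decays smoothly and reaches zero exactly at the last iteration, which is why max_iter must equal the total number of training iterations.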
When making inferences on the test set, we adopted a sliding window to traverse the whole images to yield seamless prediction results. Following [19], [25], [46], in order to obtain well-smoothed inference results, consecutive overlapping patches were retained in the test data (50% overlapping for Potsdam, and 75% for Vaihingen) in our experiments, and then multiple predictions were averaged to smooth the prediction of boundaries and eliminate the possible discontinuities.
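The overlapped sliding-window inference described above can be sketched in NumPy as follows (a simplified illustration; the function name is ours, and the real predictor is the trained network producing per-pixel class scores for each patch):

```python
import numpy as np

def sliding_window_predict(image, predict, num_classes, patch, stride):
    """Average overlapping patch predictions into one seamless map.

    `image` has shape (H, W, C); `predict` maps a patch to per-pixel
    class scores of shape (patch, patch, num_classes). 50% overlap
    corresponds to stride = patch // 2, 75% to stride = patch // 4.
    """
    H, W, _ = image.shape
    scores = np.zeros((H, W, num_classes))
    counts = np.zeros((H, W, 1))
    ys = list(range(0, H - patch + 1, stride))
    xs = list(range(0, W - patch + 1, stride))
    # Ensure the bottom/right border of the image is also covered.
    if ys[-1] != H - patch:
        ys.append(H - patch)
    if xs[-1] != W - patch:
        xs.append(W - patch)
    for y in ys:
        for x in xs:
            scores[y:y + patch, x:x + patch] += predict(
                image[y:y + patch, x:x + patch])
            counts[y:y + patch, x:x + patch] += 1
    return scores / counts  # argmax over the last axis gives labels
```

Averaging the accumulated scores before taking the argmax is what smooths the patch boundaries and removes the seams that non-overlapping tiling would produce.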
Specifically, MLP, EDeeper-SCNN and RA-FCN are three segmentation methods for high-resolution aerial imagery, whose results have been published by other authors using the same training and test sets. We refer to these official results as baselines for the proposed work. In addition, we trained and tested SegNet (the typical encoder-decoder method), Deeplabv3 (the dilation-based model) and HRNet (the high-resolution architecture) respectively under the same experimental setup. We adopted VGG-16 [56] as the base network for SegNet, and selected ResNet-50 [42] as the backbone for Deeplabv3. The proposed CSE-HRNet and HRNet were both based on the HRNet-W32 architecture. All of the models were trained from scratch in our experiments.

V. RESULTS AND DISCUSSION
In this section, we present and analyze the experimental results of our method on both datasets. Firstly, we report and discuss the results of the ablation studies to verify the effectiveness of the proposed NDRB and feature pyramid. Secondly, we compare our method with the other deep learning models for the accuracy evaluation, all of which have been described in Section IV-D. The numerical results of the accuracy evaluation, statistical significance analysis and multi-fold cross-validation are listed to demonstrate that the improvement in model performance brought by our proposed method is significant. Thirdly, the visual comparisons are presented to further show the superior performance of CSE-HRNet.

A. ABLATION STUDY
In this subsection, we decomposed the network step by step and evaluated the influence of the proposed components, namely the NDRB and the feature pyramid, both separately and collectively. Accordingly, we performed three ablation studies by training the following three models and comparing them with the original HRNet: 1) HRNet with the cascaded residual blocks replaced by the proposed NDRB, 2) HRNet equipped with the feature pyramid, and 3) the proposed CSE-HRNet configured with both the NDRB and the feature pyramid. Experimental results of the ablation studies are shown in Table 1.
The table shows that HRNet with NDRB outperforms the original HRNet on the Potsdam dataset with respect to the overall metrics of mF1 and OA. For Vaihingen, HRNet with NDRB obtains a better value for the OA metric, but its mF1 value is about 0.6% lower than that of HRNet. On the other hand, HRNet with the feature pyramid enhances the overall segmentation performance on both ISPRS datasets with respect to both metrics. By combining the NDRB and feature pyramid simultaneously, the proposed CSE-HRNet is generally superior to the other three models. CSE-HRNet achieves the best results in terms of mF1 and OA for the Potsdam dataset. Meanwhile, it attains the best OA value, as well as the second best mF1 value, for Vaihingen. To conclude, both the proposed NDRB and the feature pyramid reinforce the performance of HRNet for semantic segmentation in aerial scenes, and the effectiveness of these two components is solidly verified.

B. QUANTITATIVE RESULTS ON THE POTSDAM DATASET
In this subsection, the benchmark tests of the Potsdam dataset are performed, and the quantitative results are presented and discussed.

1) ACCURACY EVALUATION UNDER GIVEN DATASET SETTING
In comparison with the other six baseline models, numerical results on the Potsdam dataset are shown in Table 2, under the same dataset setting described in Section IV-A.
Based on the given dataset setting, it is demonstrated that CSE-HRNet outperforms other methods in terms of the overall metrics of mF1 and OA, as well as the per-class F1 of low vegetation, trees, cars and clutter/background, without using any DSM data. Although RA-FCN performs better for the per-class F1 of impervious surfaces and buildings, the proposed method still achieves competitive performance regarding these two categories.

2) TEST OF SIGNIFICANT DIFFERENCE
The same dataset setting, which has been outlined in Section IV-A, was used in the assessment of the segmentation accuracy. Hence, the samples are not independent and not necessarily normally distributed [57]. For such related samples, the statistical significance of the difference between two measures in a remote sensing study can be evaluated using a Wilcoxon signed-rank test [58]. Consequently, we performed two-tailed Wilcoxon signed-rank tests to establish the superiority of the proposed method.
As a null hypothesis, it is assumed that there is no significant difference between the two measures, regarding our proposed CSE-HRNet and one of the other baseline methods. According to the alternative hypothesis, there is a significant difference between the evaluation metrics of the two methods. A p-value is produced by the Wilcoxon signed-rank test to verify the significance of the obtained results. The difference between two sets of performance metrics derived with the proposed CSE-HRNet and the baseline model would be regarded as being statistically significant, if the p-value is less than 0.05 (at the 5% significance level).
In this study, 10 repeated runs of experiments were carried out based on the aforementioned dataset setting, and the Wilcoxon signed-rank tests were undertaken on a pairwise basis between our CSE-HRNet and three other models (SegNet, Deeplabv3 and HRNet) respectively, in terms of the metrics of per-class F1, mF1 and OA. The experimental results are listed in Table 3.
From Table 3, we can determine that there are statistically significant differences between CSE-HRNet and the other three baselines in terms of mF1 and OA, which is consistent with the corresponding results in Table 2. Regarding the per-class F1 of the car and clutter categories, there is no significant difference between CSE-HRNet and HRNet. Nevertheless, the proposed CSE-HRNet still performs the best among all of the networks.

3) MULTI-FOLD CROSS-VALIDATION
To further demonstrate the robustness and generality of the proposed method, a 4-fold cross-validation was conducted to assess its performance. In this strategy, 4 runs were executed, where at each run, 3 disjoint subsets of the dataset were used for training, while the remaining one was used for validation. The results are shown in Table 4, where each average metric of the 4 runs is followed by its corresponding standard deviation.
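The 4-fold protocol above can be sketched as follows (a minimal illustration; the function name and the stand-in tile IDs are ours):

```python
def k_fold_splits(items, k=4):
    """Yield (train, val) lists: each of the k disjoint subsets
    serves as the validation set exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, val

tiles = list(range(8))  # stand-in IDs for labelled image tiles
splits = list(k_fold_splits(tiles, k=4))  # 4 runs: 3 folds train, 1 fold val
```

Reporting the mean and standard deviation over the 4 runs, as in Table 4, then summarizes both the accuracy and the stability of each model.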
It is observed that among all of these techniques, the CSE-HRNet approach performs the best on almost all evaluation indicators except the per-class F1 of the clutter class. Compared with SegNet and Deeplabv3, the lower standard deviation of CSE-HRNet implies that the method produces more stable results. The difference between the standard deviations of CSE-HRNet and HRNet is small, and the variations would have no serious impact on their final outcomes. Besides, for all the methods, the standard deviations of the per-class F1 of clutter/background are much larger than those of the other five categories. The reason would be that the clutter class is neither semantically nor visually coherent, since it is merely a collection of different semantic classes, which may lead to varying training and test samples across the cross-validation runs.

C. QUANTITATIVE RESULTS ON THE VAIHINGEN DATASET
For the Vaihingen dataset, we repeated the comparison experiments, including 1) the accuracy evaluation under the aforementioned dataset setting, 2) the test of significant difference, and 3) the test of multi-fold cross-validation, by training the same network structures under the same experimental setting as presented in Section V-B, except that we ignored the segmentation accuracy of clutter/background because of the lack of corresponding training data. However, the OA metric in this study was calculated with all the pixels in the image, including the ones belonging to the clutter class. Table 5 reports the comparison of the experimental results based on the given dataset setting, to show the superiority of the proposed CSE-HRNet.

1) ACCURACY EVALUATION UNDER GIVEN DATASET SETTING
From the table, it is noteworthy that the CSE-HRNet approach also exhibits the best overall performance. However, in contrast to the results in Table 2, Table 5 shows that the proposed method only obtains the third best performance for the categories of low vegetation, trees and cars, whereas it achieves the best per-class F1 for impervious surfaces and buildings. This might be explained by the observation that urban areas in Potsdam tend to have more complicated interactions between objects than those in Vaihingen, due to its higher spatial resolution (5 cm for Potsdam, and 9 cm for Vaihingen) and the larger training patches used in the experiments. In this paper, the dilation rates of NDRB are fixed to {1, 2, 3}. As a result, the network is more competent at extracting sufficient contextual information for the medium and small sized objects (e.g., trees and cars) in the higher-resolution image with a large patch size, as well as for the large objects (e.g., impervious surfaces and buildings) in the lower-resolution image with a small patch size. The influence of the spatial resolution of aerial imagery, the training patch size, and the selection of the dilation rates on the per-class accuracy will be studied in our future research.
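To illustrate why the fixed rates {1, 2, 3} matter, the receptive field of a cascade of stride-1 3 × 3 dilated convolutions can be computed with the back-of-the-envelope sketch below (the helper function is ours; NDRB's exact internal structure is described earlier in the paper):

```python
def receptive_field(dilations, kernel=3):
    """Receptive field (in pixels) of a cascade of stride-1
    kernel x kernel convolutions: each layer with dilation d
    enlarges the field by (kernel - 1) * d."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

rf = receptive_field([1, 2, 3])  # 1 + 2 + 4 + 6 = 13 pixels
```

Because the rates are fixed, the context captured per block is the same number of pixels regardless of the dataset's spatial resolution or patch size, which is consistent with the per-class behaviour discussed above.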

2) TEST OF SIGNIFICANT DIFFERENCE
The Wilcoxon signed-rank tests were implemented on Vaihingen in the same manner as on the Potsdam dataset, to verify the effectiveness of the proposed method. The experimental results are listed in Table 6.
All the p-values reported in the table are much less than 0.05 when comparing CSE-HRNet with SegNet and Deeplabv3, respectively, indicating that the better values of the performance metrics produced by our proposed method are statistically significant. For the comparison of CSE-HRNet and HRNet, it is observed that with regard to mF1, OA, and the per-class F1 of impervious surfaces and buildings, the improvement is also statistically significant, which is consistent with the results in Table 5. Overall, the results of the statistical significance tests verify that the proposed method provides a statistically significant improvement in segmentation accuracy on the Vaihingen dataset.

3) MULTI-FOLD CROSS-VALIDATION
Likewise, the 4-fold cross-validation was also performed, and the same strategy was adopted on both of the ISPRS datasets. Experimental results for the 4-fold cross-validation are presented in Table 7.
We can confirm that the proposed method achieves the best validation performance under all evaluation metrics. Interestingly, the results in Table 7 are not fully consistent with those in Table 5 and Table 6 in terms of the per-class metrics of low vegetation, trees and cars: under cross-validation, our model offers higher mean accuracy with lower standard deviations for these three categories, which can be attributed to the larger performance variations of HRNet across folds. This supports the robustness and generality of the CSE-HRNet approach, as well as the effectiveness of the proposed NDRB and feature pyramid.
Note that there exist slight differences between the results of the 4-fold cross-validation and those of the test of significant difference. We believe that such differences result from the varied distributions or splitting strategies of the training and test data, which may have some influence on the model assessment. In the next step, we will continue to investigate how different distributions or splitting strategies of training and test data affect the model performance.

D. VISUAL COMPARISONS
The aerial scenes in the figure are complicated, and the issues of intra-class heterogeneity and inter-class homogeneity are displayed in all four image patches in the first column. For example, in Row (A) from the Potsdam dataset, impervious surfaces appear similar to low vegetation or clutter/background, and in Row (C) from Vaihingen, buildings have different colors and shapes.
As expected, CSE-HRNet provides the best preservation of boundaries and geometry for the segmented objects. For instance, the impervious surfaces at the lower left corner of Row (A) are only correctly segmented by CSE-HRNet, whereas the other three models misclassify them as clutter/background, due to the inter-class homogeneity issue. In Row (B), various cars with different colors and models are parked along the street. We note that only our proposed method is able to segment the two white cars, which are sheltered by other objects, in a geometrically accurate manner. This effect could be a natural consequence of the proposed NDRB alleviating the intra-class heterogeneity issue. In addition, we observe that, despite the intra-class variance caused by the shadow, the edges and corners of the building at the right corner of Row (C) are better outlined by CSE-HRNet. In Row (D), the boundaries between areas of trees and low vegetation in the segmentation result tend to be more precise and smoother for CSE-HRNet, due to the introduction of the feature pyramid, which enhances the multi-level semantic features to distinguish similar objects of different semantic categories. To summarize, despite the complex scenes with ambiguous objects, the visual comparisons show that our CSE-HRNet yields much more accurate labeling and object boundaries than the other baseline methods.

VI. CONCLUSION AND FUTURE WORK
In this paper, a novel CSE-HRNet model is presented for the semantic segmentation of high-resolution aerial images. With the aid of the NDRB combined with the pyramidal multi-level feature hierarchy, CSE-HRNet is able to resolve the issues of intra-class heterogeneity and inter-class homogeneity simultaneously. Through extensive experiments, we have confirmed that the proposed CSE-HRNet can achieve superior semantic segmentation performance on the ISPRS Potsdam and Vaihingen benchmarks.
Inspired by methods based on domain adaptation theories, a possible future direction of this work is to apply new or existing domain adaptation algorithms in the model design to reduce the domain shift between the two ISPRS datasets, and thus increase the generality of the method. This new domain adaptation based method will be evaluated on the combined dataset, rather than on the two datasets separately. We will concentrate on this line of research in our future work. Furthermore, we will study the impacts of different spatial resolutions of aerial imagery, various training patch sizes, the selection of dilation rates, and different distributions or splitting strategies of training and test data on the model performance. In addition, we will design a lightweight framework based on the proposed CSE-HRNet for real-time semantic segmentation of high-resolution aerial imagery, to improve the model performance in terms of both accuracy and efficiency.