CCENet: Cascade Class-Aware Enhanced Network for High-Resolution Aerial Imagery Semantic Segmentation

Semantic segmentation of high-resolution aerial images is a challenging task on account of the interclass homogeneity and intraclass heterogeneity of land cover. Recent works have sought to mitigate this issue by exploiting pixelwise global contextual information using the self-attention mechanism. However, existing attention-based methods usually generate inaccurate object boundary segmentation results, because the self-attention model is embedded in high-level features with low resolution owing to its prohibitive computational complexity. Moreover, existing attention-based models ignore classwise contextual information from intermediate results, which leads to undesirable feature separability. To obtain discriminative features as well as generate accurate segmentation boundaries, we present a novel segmentation framework, named cascade class-aware enhanced network (CCENet), for high-resolution aerial imagery. The proposed CCENet predicts segmentation results at multiple stages, and the result of the previous stage is used to refine object boundary details for the latter stage. To exploit the class-aware prior information of the previous stage, we propose a lightweight class-aware enhanced module (CaEM) to capture class-aware contextual dependencies. Specifically, CaEM first extracts a set of class representations of the land covers with a global class pooling block and then reconstructs enhanced features using class relation measurement, which alleviates the interclass homogeneity and intraclass heterogeneity of ground objects in feature space. Quantitative and qualitative experimental results on three publicly available datasets demonstrate the superiority of our CCENet over other state-of-the-art methods in terms of labeling accuracy and computational efficiency.

reach less than 10 cm. Based on the fine description offered by high-resolution aerial images, automatic interpretation plays an important role in a wide range of applications, such as road extraction [1], traffic monitoring [2], urban planning [3], intelligent agriculture [4], and disaster management [5]. To aid scene understanding, semantic segmentation of aerial images (i.e., semantic labeling), which infers pixelwise semantic class labels, is a crucial step.
In semantic segmentation of high-resolution aerial images, interclass homogeneity and intraclass heterogeneity are two key issues [6], [7]. On the one hand, aerial images are acquired from a bird's-eye view with less common structural information, which aggravates the interclass homogeneity of land covers. Fig. 1 illustrates an example of this issue: the roofs of buildings present very similar visual characteristics to the road surface. On the other hand, high spatial resolution aerial images contain only a few spectral channels, which is insufficient to narrow the intraclass heterogeneity caused by varied appearances, such as the different roofs in Fig. 1. In recent years, semantic segmentation of high-resolution aerial images has achieved remarkable progress, owing to the emergence of deep convolutional neural networks (CNNs) and massive training data [8], [9], [10]. To adaptively discover patterns in high-resolution aerial images instead of relying on traditional hand-crafted features [11], the fully convolutional network (FCN) [10] automatically learns features by sliding filters (convolutional kernels) over all locations. However, due to the lack of contextual information caused by the limited receptive fields of convolutional kernels [12], [13], [14], the features learned by CNNs are usually not discriminative enough to solve the interclass homogeneity and intraclass heterogeneity problems.
To exploit richer contextual information, some research works in computer vision focus on enlarging the network receptive field. Early studies mainly increase the spatial scale of contexts, including large kernel convolution [12], atrous spatial pyramid pooling (ASPP) [13], and the pyramid pooling module (PPM) [14], to exploit multiscale contexts. These multiscale context improvements have proved effective for high-resolution aerial images [15], [16], [17], [18]. Nevertheless, these strategies ignore the long-range relational dependencies between objects and scenes. In order to aggregate long-range relational context, the self-attention mechanism [19], [20] has been widely adopted in semantic segmentation. Specifically, OCNet [21], DANet [22], and relation-augmented FCN (RAFCN) [23] calculate a weighted point-by-point attention map and reconstruct the feature of each pixel by fusing information from all positions. However, when processing high-resolution aerial images, point-by-point attention-based methods incur heavy computation. For instance, given an input feature map of size C × H × W, generating the similarity map between all pairs of locations requires a matrix multiplication of computational complexity O(CH²W²), which is prohibitive for high-resolution aerial images.
Generally, most existing works simply embed the self-attention model in high-level features with low resolution because of the huge computational complexity, which leads to inaccurate object boundary segmentation results. It can be seen in Fig. 1(c) that the feature from RAFCN [23] has defective boundary details. Moreover, the above methods have limited discriminative ability in feature learning. For example, Fig. 1(c) shows the cosine similarity between the RAFCN features of all pixels and that of the pixel marked by a yellow cross. It can be seen that RAFCN fails to distinguish the confused building and impervious surface. The main reason is that the features learned by the self-attention module aggregate the representations of all pixels, which can introduce background interference during feature learning. Moreover, the intermediate features of the network lack global class-aware prior knowledge, i.e., no direct supervision from groundtruth labels, thus leading to undesirable feature separability.
To promote the discriminative ability and generate accurate segmentation boundaries, we present a novel segmentation framework, named cascade class-aware enhanced network (CCENet), for high-resolution aerial imagery. The key insight of CCENet is to refine object boundary details recursively under the guidance of the previous output. Specifically, CCENet predicts segmentation results at multiple stages, and the result of the previous stage is used to refine object boundary details for the latter stage. To exploit the class-aware prior information of the previous stage, a lightweight class-aware enhanced module (CaEM) is proposed to capture class-aware contextual dependencies. Specifically, the previous segmentation result and feature map are first used to extract class representations. Then the class representations are utilized to measure the similarity with the feature map to obtain a refined result, which is supervised by our proposed similarity loss. Last, the enhanced feature is reconstructed from the class representations and the refined result. The proposed CaEM has a structure similar to OCR [24] and reduces the attention computational complexity from O(CH²W²) to O(CHWK) for a dataset with K classes, so that CaEM can be applied to low-level features with high-resolution details.
To summarize, our main contributions are as follows.
1) We design a coarse-to-fine CCENet to deliver the global class-aware prior information from the deep layer to the following portion, which refines object boundary details in a recursive paradigm for high-resolution aerial imagery labeling. It can be seen in Fig. 1(c) and (d) that our CCENet learns more distinguishable object boundaries than RAFCN [23].
2) We propose a novel lightweight attention module, CaEM, to boost the representational ability of the network by introducing metric learning. The proposed CaEM is similar to OCR to some degree but has a lower computational load, as it directly measures the pixel-class relation with feature cosine similarity instead of matrix multiplication, which simultaneously excludes the interference of other class representations. As shown in Fig. 1(d), our method can exclude the interference of irrelevant land covers to learn more discriminative features.
3) We design a series of experiments to demonstrate the effectiveness of the proposed CCENet on the challenging International Society for Photogrammetry and Remote Sensing (ISPRS) Potsdam and Vaihingen datasets. Quantitative and qualitative results show that our model outperforms other state-of-the-art methods.
The remainder of this article is structured as follows. Related work is briefly reviewed in Section II. Section III introduces the proposed approach for high-resolution aerial imagery segmentation in detail. The effectiveness of the proposed method and ablation studies are demonstrated in Section IV by results on three real high-resolution aerial imagery datasets. Finally, Section V concludes this article and suggests future research directions.

A. Architecture of Semantic Segmentation
Over the past few years, the breakthrough of deep CNNs has led to remarkable progress in semantic segmentation. FCN [10] is the first end-to-end segmentation network; it converts the fully connected layers of a classification network [8] into convolutional layers for dense pixelwise labeling. Subsequently, plenty of variants were proposed to improve the performance of FCN. To prevent the loss of spatial detail caused by downsampling operations, SegNet [25] saves pooling index information and conducts nonlinear upsampling to recover spatial details. U-Net [26] concatenates the downsampled features in the encoder with the upsampled ones in the decoder via a U-shaped structure. HRNet [27] generates and maintains high-resolution representations throughout the whole process and has become a widely used backbone in semantic segmentation. To refine segmentation results, the coarse-to-fine concept has been adopted in segmentation frameworks. Early researchers tried to address this issue with graphical models such as CRFs [28]. However, these methods rely on low-level color boundaries without leveraging high-level semantic information and cannot fix large error regions. In order to recover precise boundaries, low-level texture features are fed via skip-connections into the deeper layers. For example, RefineNet [29] merges all the information available along the downsampling process to enable high-resolution prediction. CascadePSP [30] refines boundaries at different resolutions using PPM. However, these methods ignore global class-aware prior information from the deep layer when refining object boundaries. In contrast, our cascade architecture outputs multiple intermediate results, which are supervised by the groundtruth during end-to-end training.
This concept is similar to deep supervision [31], [32]; the difference is that our CCENet transfers the global class prior information from the previous stage to the following portion, which further improves feature separability and refines object boundaries.

B. Context Aggregation Model
Other than detail refinement achieved by structural innovation, many researchers try to improve segmentation results by designing plug-and-play models for aggregating more contextual information. For the purpose of exploiting multiscale context, GCN [12] adopts a global convolutional module and global pooling to capture global context information. DeepLabv3+ [13] employs ASPP, consisting of parallel convolutions with different dilation rates, to increase the receptive field. PSPNet [14] introduces a PPM in which pooling layers with different kernels are applied to aggregate multiscale contextual information. DMNet [33] utilizes multiple dynamic convolutional modules arranged in parallel, each of which explores context-aware filters for a specific scale. Different from multiscale context aggregation models, the self-attention mechanism [19], [20] is more efficient at aggregating long-range relational context information. Typically, OCNet [21], DANet [22], and RAFCN [23] calculate a weighted point-by-point attention map, i.e., the relation between pixels, and augment the feature of each pixel by fusing information from all positions. However, generating the weighted attention map consumes tremendous computing and memory resources, which hinders its usage in real-time operations on an aerial platform.
Several works have been proposed to reduce the computational load and memory usage of attention-based methods. Specifically, CCNet [45] collects the relational context of all positions by stacking two serial criss-cross attention modules. EMANet [34] and the asymmetric nonlocal neural network [35] explore a group of global descriptors to reconstruct the feature maps instead of treating all pixels themselves as the reconstruction descriptors. ACFNet [31] and OCR [24] are related to our method; they further improve the global descriptors with class information. Our proposed CaEM is inspired by the above relational context approaches [24], [35]. The main differences between CaEM and other relational context models lie in the similarity calculation and the explicit similarity loss supervising the intermediate results.
Specifically, our CaEM measures the pixel-class relation by feature cosine similarity without interference from other classes. Besides, inspired by metric learning [36] and deep supervision [32], we propose a similarity loss to explicitly supervise the intermediate result, which alleviates the interclass homogeneity and intraclass heterogeneity problems.

C. Semantic Segmentation of Aerial Imagery
Compared with natural images, semantic segmentation of aerial images is more challenging, as aerial images are captured from a bird's-eye view with less common structural information and few spectral channels. This aggravates the interclass homogeneity and intraclass heterogeneity problems [6], [7]. Several works have been proposed to learn more discriminative features for semantic segmentation of high-resolution aerial images. For example, Tree-UNet [37] adaptively constructs tree-shaped convolutional blocks through a tree-cutting algorithm to fuse multiscale features and learn the best weights. CSE-HRNet [7] adopts nested dilated residual blocks to enhance the representational power of multiscale contexts. ScasNet [17] proposes a self-cascaded encoder-decoder network to improve segmentation by sequential global-to-local context aggregation and object refinement subnetworks. DDCM-Net [38] combines dilated convolutions with varying dilation rates to enlarge the network's receptive fields. CAM-DFCN [39] automatically weights the channels of feature maps to perform feature selection.
To capture global spatial contextual information, the attention mechanism has been introduced into semantic segmentation of aerial imagery. Typically, RAFCN [23] proposes a spatial relation module and a channel relation module to learn relationships between any two positions. SSAtNet [40] proposes a pyramid attention pooling module to introduce the attention mechanism into the multiscale module for adaptive feature refinement. In order to reduce the large time and space demands of the self-attention operation, MSCA [41] adopts a multibranch spatial-channel attention model to efficiently extract global dependencies and combines it with multiscale and channel-attention methods. HMANet [42] adaptively captures global correlations from the perspectives of space, channel, and category in an efficient manner. MANet [19] proposes a novel kernel attention mechanism with linear complexity and adopts multiple attention modules to extract contextual dependencies.
However, the above models aim to learn the distribution across the entire dataset, so they balance all classes over different aerial images. Therefore, some categories could be suppressed in one specific aerial image. Inspired by the dynamic prototype extraction of few-shot semantic segmentation [43], our work introduces class-aware prior information to ensure the discrimination of each category in each specific image in a recurrent manner, which is helpful for pixel classification in that image.

A. Overall Framework
In high-resolution aerial images, both high-level contextual information and low-level texture features are vital for the pixelwise segmentation task. Thus, we propose the CaEM module to enhance high-level contextual information integrated with class-label prior information. To further preserve low-level texture features, we construct a novel cascade segmentation framework, CCENet, based on multiple levels of recursive CaEM. The overall structure of our proposed CCENet is shown in Fig. 2, which follows a coarse-to-fine framework similar to [17], [29], [30].
To generate feature maps with different resolutions and an initial segmentation, we first use the dilated ResNet-101 [13], [22] as the encoder. Specifically, dilated convolutions replace part of the convolutional layers and the last downsampling operation is removed, which preserves the details of land covers as well as enlarges the receptive fields of the feature map. With this modified encoder, the initial segmentation map is 1/8 the size of the input image instead of 1/16 as in the original ResNet-101. Subsequently, we propose a cascade class-aware enhanced decoder to transfer the global class-aware prior information from the bottom up. The decoder reutilizes the low-level features from shallow layers by long-range connections, so that the low-level details can be recovered. For stage t, the enhancement process can be formulated as

(P_t, Ê_t) = f_CaEM(E_t, P_{t−1}),   E_{t+1} = δ(Ê_t, F_{t+1})

where E_t and F_t denote the enhanced feature map and the raw feature map, respectively, Ê_t is the feature output by CaEM, and P_t is the segmentation result. Notably, E_1 is initialized as F_1 at stage 1, and F_4 = 0 as there is no corresponding feature map for the last stage. f_CaEM(·) represents the mapping function of our proposed CaEM, which is described in Section III-B. δ(·) is a transformation function for multilevel feature fusion, which applies bilinear interpolation, 3 × 3 convolution, ReLU [44], and batch normalization step by step. After three refinement stages, the last-stage feature is passed through bilinear interpolation, a 3 × 3 convolution layer, and a softmax layer to obtain the final segmentation result. The detailed configuration of CCENet is given in Table I.
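As a shape-level illustration of the cascade update, the following NumPy sketch runs the three refinement stages with toy dimensions; `caem` and `fuse` are hypothetical stand-ins for f_CaEM and δ(·) (identity and nearest-neighbor upsampling plus addition, respectively), not the trained modules.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K = 8, 6  # toy channel and class counts (the paper uses ResNet features, K = 6)

def caem(e, p):
    """Stand-in for f_CaEM: in CCENet it returns a refined segmentation map
    and an enhanced feature; here it is an identity used only for shape flow."""
    return p, e

def fuse(e_hat, f_next):
    """Stand-in for delta(.): 2x upsampling plus fusion with the shallower raw
    feature (the paper uses bilinear interpolation, 3x3 conv, ReLU, and BN)."""
    return e_hat.repeat(2, axis=1).repeat(2, axis=2) + f_next

# raw encoder features F_1..F_3 from deep (low-res) to shallow (high-res);
# F_4 = 0 because the last stage has no corresponding encoder feature
feats = [rng.random((C, 8, 8)), rng.random((C, 16, 16)),
         rng.random((C, 32, 32)), 0.0]

e = feats[0]                   # E_1 is initialized as F_1
p = rng.random((K, 8, 8))      # initial segmentation from the encoder head
for t in range(3):             # three cascade refinement stages
    p, e_hat = caem(e, p)
    e = fuse(e_hat, feats[t + 1])
    p = p.repeat(2, axis=1).repeat(2, axis=2)  # keep P at E's resolution
```

The loop makes the bottom-up information flow explicit: each stage consumes the previous prediction, and the resolution doubles at every fusion step.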
It is worth noting that the embedded CaEM generates an intermediate segmentation result at each stage, which delivers the global class-aware prior information from the deep layer to the following portion to refine object boundaries in a recursive manner.

B. Class-Aware Enhanced Module (CaEM)
Inspired by the self-attention mechanism [19], [20] and its variants [31], [34], we propose CaEM to introduce classwise information and boost the representational ability of the network. The detailed structure of CaEM is shown in Fig. 3. It consists of two sequential blocks: a global class pooling (GCP) block and a class relation measurement (CRM) block. Given an input feature map E_t ∈ R^(C×H×W) at stage t with channel number C, height H, and width W, and the coarse segmentation result P_{t−1} ∈ R^(K×H×W) with K classes, the GCP block extracts a set of class representations c_t ∈ R^(K×C). The similarity between the feature map E_t and the class representations c_t is then measured by the CRM block, which not only produces the refined segmentation result P_t ∈ R^(K×H×W) but also reconstructs the enhanced feature E_{t+1} ∈ R^(C×H×W).

1) Global Class Pooling (GCP):
It has been shown in [12], [13], and [45] that global context features are advantageous in segmentation tasks, and they can easily be obtained by global average pooling (GAP). However, GAP aggregates the features of all pixels without considering categorical information, which can result in an interclass indistinction problem. To improve GAP from a categorical view, we propose a GCP operator that generates global descriptors with class information. Therefore, GCP provides strong global class prior information for a specific image. As shown in Fig. 3(a), the inputs at stage t are the feature map E_t ∈ R^(C×H×W) and the coarse segmentation map P_{t−1} ∈ R^(K×H×W). To simplify the following explanation, we omit the stage subscript t. Denoting the pixel number by N = H × W, the class representation c ∈ R^(K×C) is computed by the GCP operator as

c_k = Σ_{i=1}^{N} [ exp(P_ki / T) / Σ_{j=1}^{N} exp(P_kj / T) ] E_i

where E_i ∈ R^(C×1) denotes the feature of pixel i, and P_ki ∈ [0, 1] denotes the confidence of pixel i belonging to class k. To control the influence of other classes on a specific category, we use a temperature T in the softmax, referring to knowledge distillation [46]. We note that GCP reduces to two special cases as the temperature T is adjusted. When T → +∞, GCP degenerates to

c_k = (1/N) Σ_{i=1}^{N} E_i

which has the same formulation as GAP. Equally, we can view GAP as the extreme case of GCP with maximal T. Therefore, with smaller T, GCP takes class information into consideration compared with GAP, so that irrelevant land covers can be excluded to learn more discriminative features. The other extreme is T → 0⁺, where the class representation computed by GCP becomes

c_k = E_{i*},   i* = arg max_i P_ki

i.e., the feature of the most representative pixel [47]. However, preserving only one representative feature can lead to exceptional situations when some ground object categories do not exist in the image.
Instead, GCP with larger T aggregates global context features over the whole feature map in a soft form, and is thereby easier to optimize.
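The GCP operator and its two limiting cases can be sketched as follows (a NumPy toy, assuming the temperature softmax is taken over the pixel dimension per class, consistent with the GAP limit; shapes and seeds are illustrative only):

```python
import numpy as np

def gcp(E, P, T=0.5):
    """Global class pooling: class representations c in R^(K x C) computed as
    a temperature-softmax (over pixels) weighted average of pixel features.
    E: (C, H, W) feature map; P: (K, H, W) class score map; T: temperature."""
    C, K = E.shape[0], P.shape[0]
    E_flat = E.reshape(C, -1)                    # (C, N)
    logits = P.reshape(K, -1) / T                # (K, N)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    W = np.exp(logits)
    W /= W.sum(axis=1, keepdims=True)            # per-class weights over pixels
    return W @ E_flat.T                          # (K, C)

rng = np.random.default_rng(1)
E = rng.random((16, 8, 8))
P = rng.random((6, 8, 8))
c = gcp(E, P, T=0.5)
```

With very large T the weights become uniform and every row of `c` collapses to the GAP vector; with T near zero each class keeps only its most confident pixel.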

2) Class Relation Measurement (CRM):
To draw the input features closer to the class representations extracted by the GCP block, we design the CRM block to measure their similarity, which enhances the representation capacity of the network. As shown in Fig. 3(b), CRM inherits the structure of the self-attention mechanism [20], [24]. Different from the pixelwise matrix multiplication of the self-attention mechanism, CRM introduces metric learning to measure the similarity between pixel features and class representations. In particular, the reshaped feature map E_reshape ∈ R^(N×C) serves as the input of the query branch, and the class representation c ∈ R^(K×C) serves as the input of the key branch and the value branch:

X_Query = E_reshape W_Q,   X_Key = (c W_K)^T,   X_Value = (c W_V)^T

where W_Q, W_K, W_V ∈ R^(C×C) are three linear transformations implemented by 1 × 1 convolutions, and X_Query ∈ R^(N×C), X_Key ∈ R^(C×K), and X_Value ∈ R^(C×K) are the outputs of the three branches. Next, X_Query and X_Key are used to conduct metric learning, which computes the cosine similarity matrix S ∈ R^(N×K) as

S_ij = (X_Query,i · X_Key,j) / (‖X_Query,i‖ ‖X_Key,j‖)

where X_Query,i denotes the ith row of X_Query, X_Key,j denotes the jth column of X_Key, and S_ij ∈ [0, 1] represents the cosine similarity between the ith pixel and the jth class representation. Different from the pixelwise matrix multiplication of the self-attention mechanism, e.g., DANet [22] and OCR [24], CRM treats pixels and class representations separately without interclass comparison. This improvement filters background interference during feature learning. For stage t, the similarity matrix S_t is obtained from the above processes, and the enhanced feature E_{t+1} is reconstructed by

E_{t+1} = S_t X_Value^T.

Finally, the similarity matrix S_t ∈ R^(N×K) is transposed and reshaped into R^(K×H×W) as the intermediate segmentation result P_t of stage t, which is supervised by our proposed similarity loss. The enhanced feature E_{t+1} ∈ R^(N×C) is transposed and reshaped into R^(C×H×W) as the output feature of CaEM.
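A minimal NumPy sketch of CRM, with the 1 × 1-conv maps W_Q, W_K, W_V taken as identity for brevity (an assumption made for illustration, not the trained layers):

```python
import numpy as np

def crm(E, c):
    """Class relation measurement: cosine similarity between pixel features
    (query) and class representations (key/value), followed by feature
    reconstruction. E: (C, H, W) feature map; c: (K, C) class representations."""
    C, H, W = E.shape
    q = E.reshape(C, -1).T                           # (N, C) pixel queries
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    kn = c / np.linalg.norm(c, axis=1, keepdims=True)
    S = qn @ kn.T                                    # (N, K) cosine similarities
    E_next = (S @ c).T.reshape(C, H, W)              # enhanced feature
    P = S.T.reshape(-1, H, W)                        # intermediate result
    return E_next, P, S

rng = np.random.default_rng(2)
E = rng.random((4, 2, 2))
c = rng.random((3, 4))
E[:, 0, 0] = c[1]            # make pixel (0, 0) coincide with class 1
E_next, P, S = crm(E, c)
```

Because a pixel is compared against each class representation independently, a pixel identical to one class prototype scores similarity 1 for that class regardless of the other prototypes.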

C. Loss Function
Following the idea of deep supervision [34], [51], we combine the losses on the results of multiple stages. Specifically, both the initial (stage 0) and the final prediction map (stage 4) are supervised with the cross-entropy (CE) loss, since they are obtained by a softmax layer:

L_CE = −(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} I(y_i = k) log P_ki

where P_ki denotes the output probability of pixel i belonging to class k, y_i is the groundtruth label of pixel i, and I(·) is the indicator function, equal to 1 if its argument holds and 0 otherwise. Different from a segmentation result predicted by softmax, CaEM measures the cosine similarity between pixels and class representations. Therefore, it is inappropriate to supervise the intermediate results with the CE loss, which would drive all elements of the similarity map toward 1. To avoid this situation, we adopt the mean squared error loss from metric learning [36] to constrain the intermediate results during training:

L_sim = (1/(NK)) Σ_{i=1}^{N} Σ_{k=1}^{K} (S_ik − I(y_i = k))²

Intuitively, if a pixel feature F_i has the same label as the class representation c_k, the similarity S_ik should be close to 1. On the contrary, if F_i and c_k do not belong to the same category, S_ik should be close to 0. This ensures that pixels with different category labels obtain lower similarity. To sum up, the overall loss function is

L = λ_init L_init + λ_sim L_sim + λ_final L_final

where the hyperparameters λ_init, λ_sim, and λ_final balance the initial prediction loss L_init, the similarity loss L_sim, and the final prediction loss L_final.
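The three loss terms can be sketched as follows (NumPy toy; `P` holds softmax probabilities and `S` a similarity map, and the default weights are the best hyperparameters reported in the ablation study):

```python
import numpy as np

def ce_loss(P, y):
    """Cross-entropy over softmax probabilities.
    P: (K, N) predicted probabilities; y: (N,) integer groundtruth labels."""
    N = y.size
    return -np.mean(np.log(P[y, np.arange(N)] + 1e-12))

def sim_loss(S, y, K):
    """Similarity loss: mean squared error between the similarity map
    S (N, K) and the one-hot indicator I(y_i = k)."""
    onehot = np.eye(K)[y]                        # (N, K)
    return np.mean((S - onehot) ** 2)

def total_loss(l_init, l_sim, l_final,
               lam_init=0.4, lam_sim=0.6, lam_final=1.0):
    """Weighted sum of the three supervision terms."""
    return lam_init * l_init + lam_sim * l_sim + lam_final * l_final
```

A perfect similarity map (1 on the groundtruth class, 0 elsewhere) incurs zero similarity loss, which is exactly the target behavior described above.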

D. Computation Analysis
It has been reported that the computational complexity of the self-attention module is prohibitive [31], [34], which conflicts with the limited computational capability and real-time requirements of an aerial platform. In our proposed CCENet, similarity is used to depict the relationship between features and class information, so it has much lower computational complexity even though we apply CaEM to low-level features with high-resolution details. Specifically, previous self-attention-based methods [20], [22], [23] have quadratic computational complexity O(CN²) with respect to the input image size, as self-attention is computed globally. In contrast, the proposed CaEM has linear computational complexity with respect to image size: since we only use the class representations as the input of the key and value branches instead of the features from all positions, the computational complexity and memory occupation of generating the similarity matrix S is reduced from O(CN²) to O(CNK) compared with the original nonlocal block [20], where K is the number of classes and N = H × W denotes the number of pixels in the feature map; therefore, K ≪ N holds. For example, if the input image size is 256 × 256, the feature map sizes used in the three-stage CaEM are 128 × 128 × 64, 64 × 64 × 256, and 32 × 32 × 2048, respectively. With six classes, CCENet spends merely 0.024 times the computation and memory on matrix multiplication compared with the self-attention operation.
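The complexity argument can be checked with a small counting sketch (single-stage only: at one feature map the saving reduces to the ratio K/N; the paper's 0.024 figure aggregates all three stages, which this toy does not reproduce):

```python
def attn_cost(C, N):
    """Self-attention similarity map: (N x C) @ (C x N) -> O(C N^2) MACs."""
    return C * N * N

def caem_cost(C, N, K=6):
    """CaEM similarity map: (N x C) @ (C x K) -> O(C N K) MACs."""
    return C * N * K

# deepest stage in the paper: a 32 x 32 x 2048 feature map, K = 6 classes
N, C = 32 * 32, 2048
ratio = caem_cost(C, N) / attn_cost(C, N)  # reduces to K / N
```

Because the ratio is K/N, the saving grows with resolution, which is why CaEM remains affordable on the high-resolution shallow stages where full self-attention would not.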

IV. EXPERIMENTS
To verify the effectiveness of our proposed CCENet, extensive experiments, including ablation analyses and comparison experiments, have been conducted on three publicly available high-resolution aerial imagery datasets: the ISPRS Vaihingen and Potsdam datasets and the iSAID dataset [48].
A. Datasets

1) Vaihingen: The ISPRS Vaihingen dataset contains 33 TOP tiles and DSMs with an average size of 2494 × 2064 and a spatial resolution of 9 cm, including three spectral bands: near-infrared, red (R), and green (G). Notably, 16 of the 33 tiles are fully annotated and used in the experiments, of which 11 images are used for training and the other 5 images (with image IDs 11, 15, 28, 30, and 34) for validation, following [17], [23], [41], [49]. Referring to [15], [17], and [38], we only use the raw TOP images in this work, for fair comparison and to keep the network general. The groundtruth labels contain six fully annotated classes: impervious surfaces, building, low vegetation, high vegetation, car, and clutter. The clutter category includes ground objects such as water bodies, swimming pools, and containers. Following [23], [41], [49], and [50], we only predict five classes for Vaihingen, ignoring the clutter class due to the lack of training data (less than 1%) for that class.
2) Potsdam: The ISPRS Potsdam dataset consists of 38 TOP tiles and DSMs with a size of 6000 × 6000 and a spatial resolution of 5 cm. Different from Vaihingen, these TOP tiles contain four bands: infrared (IR), red (R), green (G), and blue (B). Following [23], [41], and [50], we only use the IR-R-G channels in our work. As 24 of the 38 images are carefully annotated, we use these 24 images in our experiments. Seven TOP tiles (with image IDs 2_11, 2_12, 4_10, 5_11, 6_7, 7_8, and 7_10) are selected as validation data, while the other 17 annotated images comprise the training data. Different from the Vaihingen dataset, we predict all six classes on the Potsdam dataset.
3) iSAID: The iSAID [48] dataset is the largest instance segmentation dataset for aerial imagery and consists of 2806 high spatial resolution RS images. Concretely, the training, validation, and test sets contain 1411, 458, and 937 images, respectively. Besides instance-level annotation, iSAID also provides semantic mask annotation for segmentation, which is used in our experiments. The images of iSAID are captured by multiple sensors and platforms at multiple resolutions. The original image scale ranges from 800 × 800 to 4000 × 13 000 pixels. The iSAID dataset contains 15 categories: ship (Ship), storage tank (ST), baseball diamond (BD), tennis court (TC), basketball court (BC), ground track field (GTF), bridge (Bridge), large vehicle (LV), small vehicle (SV), helicopter (HC), swimming pool (SP), roundabout (RA), soccer-ball field (SBF), plane (Plane), and harbor (Harbor). In our experiments, we only use the training and validation sets for training and evaluation, as the groundtruth of the test set is unavailable.

B. Evaluation Metrics
To assess quantitative performance, two metrics are employed in our experiments according to the ISPRS dataset guidelines: the overall pixel accuracy (OA) and the mean F1 score (mF1), following [17], [23], [41], and [49]:

OA = (TP + TN) / (TP + TN + FP + FN)
F1 = 2 · precision · recall / (precision + recall),   precision = TP / (TP + FP),   recall = TP / (TP + FN)

where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively. For the iSAID dataset, we use the mean intersection over union (mIoU), i.e., the per-class IoU = TP / (TP + FP + FN) averaged over all classes, as the metric, following [51] and [52].
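Under the standard definitions above, the metrics can be computed from a confusion matrix, as in this NumPy sketch (a generic implementation, not the official ISPRS evaluation script):

```python
import numpy as np

def metrics(conf):
    """OA, per-class F1, and per-class IoU from a K x K confusion matrix
    (rows: groundtruth class, columns: predicted class)."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp        # predicted as k but labeled otherwise
    fn = conf.sum(axis=1) - tp        # labeled k but predicted otherwise
    oa = tp.sum() / conf.sum()
    precision = tp / np.maximum(tp + fp, 1e-12)
    recall = tp / np.maximum(tp + fn, 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    iou = tp / np.maximum(tp + fp + fn, 1e-12)
    return oa, f1, iou

conf = np.array([[3, 1],
                 [0, 4]])
oa, f1, iou = metrics(conf)
```

Averaging `f1` and `iou` over classes yields mF1 and mIoU, respectively.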

C. Implementation Details
The implementation is based on PyTorch 1.7, and all experiments are conducted on a server with eight NVIDIA GeForce GPUs, with the training configuration following [51] and [52]. To avoid overfitting during training, two data augmentation methods are used in this article. On the one hand, a random sliding cropping strategy is employed instead of a fixed sliding window to crop the labeled tiles, which generates new training images at random positions and expands the training set. On the other hand, we randomly flip training images (horizontally, vertically, or both) with a probability of 0.5.
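The two augmentations can be sketched as follows (NumPy toy; function names and the fixed seed are illustrative, not taken from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(3)

def random_crop(img, label, size):
    """Random sliding crop: sample the window position uniformly instead of
    using a fixed sliding grid. img: (C, H, W); label: (H, W)."""
    h, w = img.shape[-2:]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return (img[..., top:top + size, left:left + size],
            label[..., top:top + size, left:left + size])

def random_flip(img, label, p=0.5):
    """Horizontal and/or vertical flip, each applied with probability p."""
    if rng.random() < p:
        img, label = img[..., ::-1], label[..., ::-1]        # horizontal
    if rng.random() < p:
        img, label = img[..., ::-1, :], label[..., ::-1, :]  # vertical
    return img, label

img = np.arange(3 * 16 * 16, dtype=float).reshape(3, 16, 16)
lab = np.arange(16 * 16).reshape(16, 16)
ci, cl = random_crop(img, lab, 8)
fi, fl = random_flip(img, lab)
```

Applying the same geometric transform to image and label jointly keeps the pixelwise supervision aligned.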
For training, we adopt the stochastic gradient descent optimizer with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 0.0005. Each training process contains 50 epochs in total, and we use a multistep learning-rate schedule that decays the learning rate by a factor of 0.1 at the 20th, 30th, and 40th epochs. In addition, due to GPU memory capacity, the batch size is set to 24 for the Vaihingen and Potsdam datasets and 16 for iSAID, according to the different input patch sizes of the datasets. For inference, sliding windows with 50% overlap are used to traverse the whole image to obtain seamless predictions, and the multiple overlapping predictions are averaged to produce the final result.
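The overlapping-window inference can be sketched as follows (NumPy; `predict` is a placeholder for the trained network, and the border handling shown is one reasonable choice rather than the paper's exact scheme):

```python
import numpy as np

def sliding_window_predict(image, predict, win, stride):
    """Traverse the image with overlapping windows (50% overlap means
    stride = win // 2) and average the per-window class scores.
    image: (C, H, W); predict: (C, win, win) -> (K, win, win)."""
    _, H, W = image.shape
    K = predict(image[:, :win, :win]).shape[0]
    scores = np.zeros((K, H, W))
    counts = np.zeros((1, H, W))
    tops = list(range(0, H - win + 1, stride))
    lefts = list(range(0, W - win + 1, stride))
    if tops[-1] != H - win:
        tops.append(H - win)      # make sure the bottom border is covered
    if lefts[-1] != W - win:
        lefts.append(W - win)     # make sure the right border is covered
    for t in tops:
        for l in lefts:
            scores[:, t:t + win, l:l + win] += predict(image[:, t:t + win, l:l + win])
            counts[:, t:t + win, l:l + win] += 1
    return scores / counts        # averaged seamless prediction

image = np.zeros((3, 16, 16))
constant = lambda patch: np.ones((6, patch.shape[1], patch.shape[2]))
out = sliding_window_predict(image, constant, win=8, stride=4)
```

Averaging the overlapping scores suppresses the window-border artifacts that a non-overlapping tiling would produce.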

D. Ablation Study
In this subsection, we conduct a series of experiments to reveal the effect of each component of our proposed method. Notably, we use the dilated ResNet-101 as the baseline, whose final segmentation result is obtained by directly upsampling the output. To reduce the influence of training variance, we run each group of settings five times and report the average value and standard deviation as the final results.
1) Impact of Temperature in GCP: As shown in Section III-B1, the temperature T in the GCP block controls the influence of other categories on a given category. To investigate its impact, we vary T from 0.1 to 10, and the corresponding accuracy results are recorded in Table II. Specifically, when T is set to 0.5, our model achieves the best performance of 90.09% in overall accuracy. As T increases to 10, the OA drops to 88.93%. This could be because more irrelevant land cover is introduced into the class representations, which leads to less discriminative features for segmentation. When T decreases to 0.1, the overall accuracy descends to 88.61%. The reason could be that GCP then preserves only a few representative features and filters out too much global information, which causes exceptional situations and hinders the optimizer when some ground object categories are absent from an image. Therefore, the temperature T should be set neither too large nor too small, and we fix T to 0.5 in the later experiments.
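The role of T can be illustrated with a generic temperature-scaled softmax (a sketch of the mechanism only; the GCP block's exact formulation is given in Section III-B1, not reproduced here). Small T makes the weighting nearly one-hot, so only the most similar class representation contributes; large T flattens the weights, letting irrelevant classes leak in:

```python
import numpy as np

def softmax_T(logits, T):
    """Temperature-scaled softmax: small T sharpens, large T flattens."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical similarities of one pixel to three class representations.
sims = [2.0, 1.0, 0.5]
sharp = softmax_T(sims, 0.1)        # nearly one-hot: top class dominates
flat = softmax_T(sims, 10.0)        # nearly uniform: all classes mixed in
```

This mirrors the observed trade-off: T = 10 mixes in irrelevant land cover, while T = 0.1 keeps almost no global information.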
2) Impact of Different Losses and Hyperparameters: As mentioned in Section III-C, the proposed CCENet is trained with three loss terms, i.e., the initial prediction loss L_init, the similarity loss L_sim, and the final prediction loss L_final, which are weighted by the hyperparameters λ_init, λ_sim, and λ_final. Here, we fix λ_final to 1 and conduct extensive experiments to analyze the influence of the weights λ_init and λ_sim on the segmentation results separately, i.e., setting the other parameter to 0. As shown in Tables III and IV, the model is not sensitive to changes in these hyperparameters. Moreover, the best results are achieved under the setting of λ_init = 0.4 and λ_sim = 0.6. Based on these observations, we adopt the optimal settings in the following experiments.
To further evaluate the impact of the auxiliary supervision on the initial and intermediate results, we perform experiments with different loss settings. As shown in Table V, both auxiliary loss functions L_sim and L_init have a certain improvement effect on model training. Notably, if neither L_sim nor L_init is applied, our CCENet does not consider class information and degenerates to EMANet [29] with the number of bases equal to the number of classes; it then achieves only 89.04% in mean F1 score and 89.25% in overall accuracy. When we only add the supervision L_sim, the labeling result merely increases to 89.27% in mean F1 score and 89.44% in overall accuracy, owing to the lack of guidance from the initial segmentation. In contrast, the supervision L_init alone leads to a notable improvement, to 89.63% in mean F1 score and 89.78% in overall accuracy, which indicates that class-aware prior information is beneficial for segmentation. When both L_init and L_sim are applied, our model achieves the best performance, with 89.91% in mean F1 score and 90.09% in OA. This proves that the proposed similarity supervision L_sim improves the separability of ground objects in the feature space when class prior information is provided, thus boosting the labeling results. Based on the above comparative analysis, both L_init and L_sim are adopted in our method.
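The overall training objective described above can be sketched as a weighted sum; the three terms are represented here as precomputed scalars, since their individual definitions (Section III-C) are outside this excerpt:

```python
def total_loss(l_init, l_sim, l_final,
               lam_init=0.4, lam_sim=0.6, lam_final=1.0):
    """Total training loss: weighted sum of the initial prediction loss,
    the similarity loss, and the final prediction loss. Default weights
    are the best settings found in Tables III and IV."""
    return lam_init * l_init + lam_sim * l_sim + lam_final * l_final
```

Setting `lam_init` or `lam_sim` to 0 recovers the ablated variants compared in Table V.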
3) Effectiveness of CaEM: To demonstrate the advantage of our proposed CaEM, we design comparison experiments on the dilated ResNet-101 baseline, its improved variant with self-attention SRM [28] (ResNet-101 + SRM), a single CaEM (ResNet-101 + single CaEM), and cascade CaEM (ResNet-101 + cascade CaEM) in terms of performance and efficiency. For a fair comparison, we first apply only a single CaEM in the decoder and simply output the result by bilinear interpolation, which is the same setting as self-attention SRM [28]. The efficiency indexes, i.e., GPU memory, computational complexity (measured by the number of FLOPs), and inference time, are obtained on a single GPU with a high-resolution aerial image of size 1 × 3 × 1024 × 1024; for all three indexes, smaller is better. Notably, the results contain the computational overhead of the entire model, which differs from the theoretical analysis in Section III-D that considers only matrix multiplication. From the experimental results in Table VI, ResNet-101 + SRM brings 1.22% and 0.97% improvement in mean F1 and OA over the dilated ResNet-101 baseline, respectively, by considering global attention information. However, introducing attention is a double-edged sword: it incurs about two times additional memory (+1919.83 MB), one time additional computational complexity (+813.94 GFLOPs), and about one fifth extra inference time (+1.97 ms) compared with the baseline. Similarly, ResNet-101 + single CaEM improves the baseline performance to 89.56% (↑1.54%) in mean F1 score and 89.68% (↑1.34%) in overall accuracy, surpassing SRM by 0.32% in mean F1 score and 0.37% in OA, while bringing hardly any additional computation over the baseline. Furthermore, our ResNet-101 + cascade CaEM requires about three times the resources of ResNet-101 + single CaEM, but it still consumes far fewer resources and less running time than self-attention SRM, with only 497.34 MB of additional memory, 362.61 GFLOPs, and 1.42 ms of inference time.
In terms of performance, ResNet-101 + cascade CaEM exceeds SRM by 0.65% in mean F1 score and 0.76% in OA, which demonstrates the superiority of our method in both labeling accuracy and computational efficiency.
To better understand the superiority of CaEM, we further conduct feature visualization. As shown in Fig. 1, given a pixel belonging to a building [marked by the yellow cross in Fig. 1(a)], we visualize the feature cosine similarity for RAFCN equipped with self-attention SRM [23] and for CCENet equipped with CaEM. It can be seen that the features of the building region are more consistent for CCENet. We also visualize the pixel features using t-SNE [53]. As shown in Fig. 4, after the process of CaEM, pixels of different classes cluster into several distinct groups. We can also observe that the intraclass features are more consistent and the interclass features are more distinguishable, which alleviates the interclass homogeneity and intraclass heterogeneity of ground objects in the feature space. This is mainly because the class representation introduces class-aware global prior information for a specific image.
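The cosine-similarity visualization in Fig. 1 can be sketched as follows (a numpy sketch of the standard computation, not the authors' code): given a query pixel, compute the cosine similarity between its feature vector and the feature at every spatial location.

```python
import numpy as np

def cosine_similarity_map(features, y, x, eps=1e-8):
    """Cosine similarity between the feature at (y, x) and all locations.

    features: (C, H, W) feature map
    returns:  (H, W) similarity map in [-1, 1]; the query pixel maps to ~1
    """
    c, h, w = features.shape
    flat = features.reshape(c, -1)          # (C, H*W), row-major layout
    q = features[:, y, x]                   # query pixel's feature vector
    norms = np.linalg.norm(flat, axis=0) * (np.linalg.norm(q) + eps)
    sim = (q @ flat) / (norms + eps)        # dot products over the channel axis
    return sim.reshape(h, w)
```

Rendering this map as a heatmap for the yellow-cross pixel reproduces the kind of comparison shown in Fig. 1: with CaEM, pixels of the same class yield uniformly high similarity.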

4) Intermediate Results of Cascade CaEM:
In order to validate the effectiveness of our proposed cascade strategy, we report the quantitative results of the intermediate stages. As shown in Table VII, our cascade architecture brings stable improvement over the initial result, benefiting from object boundary refinement with class prior information from the deep layers. Another interesting phenomenon is that the earlier stages bring the larger improvements: stage 1 yields the largest gain, 0.72% in OA and 0.84% in mean F1 score, while the final convolution layer boosts only 0.22% in OA and 0.31% in mean F1 score. This might be because the class-aware global information is first introduced into the model at the early stage, while the latter stages with high-resolution feature maps mainly play a refinement role.

1) Comparisons on the ISPRS Vaihingen:
Numerical results on the Vaihingen dataset are shown in Table VIII, and the per-class F1 scores are also enumerated to assess the performance of each method. Although some competitors [49] take extra DSMs as input, our CCENet outperforms them by a large margin. Moreover, the comparison with the widely used Deeplabv3+ [13] validates the effectiveness of class-level global information, where CCENet contributes increments of 3.52% and 1.22% in mean F1 score and OA, respectively. Compared with RAFCN [23] and SSAtNet [40], CCENet achieves gains of 1.37% and 1.04% in mean F1 score and 0.86% and 0.73% in OA, which indicates the superiority of class-level information aggregation over traditional attention-based schemes. In comparison with OCRNet, our model achieves increments of 0.5% and 0.45% in mean F1 score and OA, which demonstrates the effectiveness of the cascade object boundary refinement architecture for the semantic segmentation of high-resolution aerial images. It is worth noting that our method has a strong capacity to distinguish confusing category pairs such as impervious surface and building, or tree and low vegetation, so it outperforms the others in per-class F1 by a large margin. As for efficiency, although the cascade refinement framework introduces additional computation, CCENet still has much lower computational complexity than self-attention-based networks while achieving the best results.
For an intuitive comparison, sample segmentation results of SegNet, HRNet, DANet, and the proposed CCENet are depicted in Fig. 5. The first row demonstrates that the networks without global class-level information fail to recognize buildings with different facades, whereas the proposed CCENet makes relatively accurate predictions. This indicates that our model alleviates the problem of high intraclass heterogeneity in aerial images. Besides, the examples in the second row illustrate that CCENet performs better than the other methods when land covers have similar visual appearance, such as low vegetation and tree, which shows that our model has a stronger ability to distinguish ground objects with interclass homogeneity. To summarize, though the complex scenes contain ambiguous objects, the qualitative comparisons show that our CCENet provides more accurate labeling than the other methods owing to the introduction of CaEM.
2) Comparisons on the ISPRS Potsdam: To further validate the effectiveness of our network, we conduct experiments on the Potsdam dataset, and the numerical results are reported in Table IX. Our model outperforms the other nine models by a considerable margin on both metrics, achieving 88.98% in mean F1 score and 89.27% in OA. To be specific, our CCENet achieves improvements of 0.97% and 0.56% in mean F1 score, and 0.68% and 0.44% in OA, with respect to the competitive RAFCN [23] and DANet [22]. In contrast with the recently proposed SSAtNet [40] and OCRNet [24], our CCENet achieves 1.15% and 0.36% improvements in mF1 score, which verifies the effectiveness of the proposed CaEM and the cascade refinement design. As for per-class F1, though our model ranks second in the categories of tree and car, its segmentation results are still very close to the best values. Notably, CCENet remarkably surpasses the other competitors in recognizing the category of clutter, which is challenging owing to its varied occurrences and complicated textures. This indicates that leveraging classwise contextual dependencies helps learn more discriminative features, yielding better classification accuracy.
Furthermore, visual results are presented in Fig. 6. As shown in the first row, although the cars are sheltered by intricate surroundings, the boundaries between areas of trees and cars in the segmentation result of CCENet tend to be more precise and smoother. In the second row, the aerial scene is complicated, and the buildings, cars, clutter, and roads show interclass homogeneity. This poses an enormous challenge to the other competitors, while our model enhances the classwise semantic features to distinguish similar objects of different semantic categories.
3) Comparisons on the iSAID Dataset: To evaluate the importance of the proposed CaEM and cascade framework on data with more categories, we further conduct experiments on the iSAID dataset. As shown in Table X, we report per-class IOU and mIOU for a comprehensive comparison. It can be observed that our model achieves state-of-the-art performance. Specifically, it surpasses SegNet [25] and HRNet [27] by a large margin, mainly because SegNet and HRNet consist only of stacks of traditional convolution layers with limited receptive fields, which fail to extract context information effectively. Although Deeplabv3+ [13], SSAtNet [40], and DANet [22] enlarge the receptive field with additional contextual aggregation modules, our model still reaches better results, which shows the importance of class-aware prior information for semantic segmentation. Moreover, the proposed CCENet achieves 4.14% and 1.74% improvements over ACFNet [31] and OCRNet [24], which is attributed to our explicit supervision on the intermediate results and the cascade refinement structure. Finally, we report the state-of-the-art result of FarSeg [51] for a competitive comparison: the proposed CCENet outperforms FarSeg [51] in 13 categories and achieves 1.01% higher mIOU.
We further visualize the segmentation results in Fig. 7. In the first row, with the help of cascade refinement, our model precisely distinguishes background (black), ground track field (blue), and soccer ball field (light blue), while the comparison models cannot accurately delineate the edges of the ground track field. In the second row, the comparison methods mistake large vehicles (dark blue) for small vehicles, while our model handles this confusion, which shows the importance of exploiting class-aware prior information to moderate the problem of interclass homogeneity.

V. CONCLUSION
In this article, a novel coarse-to-fine CCENet is proposed for the semantic segmentation of high-resolution aerial images. In this framework, classwise global context information is effectively exploited by the proposed CaEM, which alleviates the issues of interclass homogeneity and intraclass heterogeneity. With the help of CaEM, CCENet refines object boundaries using class-aware prior information. The ablation studies and comparative experiments verify the superiority of the proposed framework in terms of labeling accuracy and computational efficiency. On three publicly available datasets, both quantitative and qualitative results show that our CCENet provides more precise labeling results than other well-known algorithms. In the future, we will incorporate the idea of metric learning among class representations into domain adaptation to narrow the domain gap between the two ISPRS datasets, thus addressing the domain shift problem.