Enhanced Lightweight End-to-End Semantic Segmentation for High-Resolution Remote Sensing Images

Deep learning based methods have shown promising performance in semantic segmentation of high-resolution remote sensing (HRRS) images. However, due to the multiscale property and complexity of HRRS images, tackling scale variance and obtaining global context information remain challenging. In this paper, we propose an enhanced lightweight end-to-end semantic segmentation (ELES2) framework for HRRS images, in which a superpixel segmentation pooling (SSP) module is embedded for result refinement, leading to more accurate end-to-end semantic segmentation. Besides, compensation connections (CC) are applied between encoder blocks to establish long-range dependencies. In addition, a dense dilated convolutional pyramid (DDCP) module is proposed to generate dense features at different scales and capture global context information. Experiments show that ELES2 achieves mIoU values of 80.16% and 73.20% on the ISPRS Potsdam and Vaihingen benchmark datasets, respectively, using only 12.62M parameters and 13.09G FLOPs. The experimental results demonstrate that our method achieves a promising balance between segmentation accuracy and computational efficiency compared with state-of-the-art semantic segmentation models.


I. INTRODUCTION
Recently, with the continuous development of satellite technology, high-resolution remote sensing (HRRS) images have been widely used in global and regional scale Earth observation and analysis [1]. Semantic segmentation of HRRS images aims to interpret an image by partitioning it into semantically meaningful objects and assigning each a predetermined label. It can be widely applied in many applications, such as land use surveys, natural disaster detection, environmental monitoring, precise vegetation analysis, and urban planning [2].
With the continuous maturation of deep learning technology, convolutional neural network (CNN) based models show stronger classification capabilities than shallow models (e.g., support vector machines, SVM [3]) on large-scale datasets, owing to the use of sparse representation, weight sharing, pooling, etc. [4]. These classical CNN-based models take original image patches as input and output a single category probability vector, which makes them difficult to apply directly to pixel-level classification for semantic segmentation.
Under this circumstance, the fully convolutional network (FCN) model [7] was proposed for semantic segmentation, adopting trainable transposed convolutional layers and a layer-by-layer up-sampling strategy. Since then, deep learning based semantic segmentation methods have emerged in rapid succession [8]-[11]. UNet [8] and SegNet [9] both adopt an encoder-decoder structure to recover location information while retaining high-level semantic features. To further improve segmentation accuracy, PSPNet [10] and DeepLabV3+ [11] introduced a pyramid pooling module and an atrous spatial pyramid pooling module, respectively, which obtain multiscale features with pooling operations and dilated convolutions.
Semantic segmentation is challenging owing to multiscale objects, so several studies aim to extract multiscale features and global context information to enhance segmentation accuracy. In EncNet [12], a context encoding module is proposed to capture the scene-dependent global context as channel-wise attention. PSANet [13] introduces the modeling of long-range correlation for each spatial position. In [14], a multiscale design is introduced to aggregate context information through different branches. TreeUNet [15] adopts a Tree-CNN block to transmit feature maps via concatenating connections and further fuse multiscale representations. In [16], a multiscale pyramid pooling module is introduced to extract multiscale features for semantic segmentation of HRRS images. In [17], the authors attempt to learn and reason about global relationships, thereby capturing long-range multiscale relations. On the other hand, some studies focus on refining the semantic segmentation of HRRS images. In [18], a conditional random field (CRF) is combined with a pretrained network to enhance the segmentation results. Besides, a multitask FusionNet [19] is proposed to perform semantic segmentation and edge detection jointly, and an edge-aware regularization is applied to refine the segmentation prediction. Though the above-mentioned methods can achieve decent segmentation performance, their large number of parameters and low prediction speed severely limit their application scenarios.
Besides, several studies aim to solve the multiscale problem using superpixels. In [20], Mostajabi et al. obtained 14 sub-images based on superpixels from the input images. In [21], Kwak et al. applied superpixels to generate class scores, which are used to predict object classes. In [22], superpixel images are used as input to enhance segmentation by exploiting the texture and detail information within superpixels. Superpixels have also been used for post-processing: in [23], superpixels and a CRF are combined as a post-processing step to enhance the segmentation results. However, such post-processing increases the computational cost and reduces segmentation efficiency.
To solve these problems, we propose a novel lightweight end-to-end semantic segmentation framework for HRRS images that achieves a promising balance between segmentation accuracy and computational efficiency. A superpixel segmentation pooling (SSP) layer is introduced into the network to improve computational efficiency and make the network less sensitive to noise and segmentation scale. Besides, we introduce compensation connections (CC) so that the network performs semantic segmentation more efficiently. Furthermore, a dense dilated convolutional pyramid (DDCP) module is proposed to deal with the phenomenon of different objects sharing the same spectrum and to further alleviate the multiscale problem.

A. THE PROPOSED ELES2 FRAMEWORK
Our proposed ELES2 framework can be applied to any segmentation network based on encoder-decoder structure.
In this work, we choose LinkNet [24] as the backbone of our method due to its efficiency and lightweight structure. LinkNet bypasses the input of each encoder block to the output of its corresponding decoder block. In this way, it can recover spatial information lost in the downsampling stage, which can then be reused by the decoder. By sharing the knowledge learned by the encoder at each layer, the decoder can use fewer parameters. The overall flowchart of our proposed ELES2 is given in Fig. 1. Compensation connections are applied between the encoder blocks of the backbone network to use parameters efficiently. Besides, a DDCP module is introduced in ELES2 to enhance segmentation performance. Apart from this, an SSP layer is embedded in the model so that we can achieve more accurate end-to-end segmentation. The details are introduced as follows.

B. SUPERPIXEL SEGMENTATION POOLING
Although post-processing steps can benefit segmentation results, they severely delay the segmentation process. To achieve end-to-end segmentation, we propose an SSP layer that uses superpixel boundaries to correct the output feature map, where a superpixel is an irregular, visually significant block of adjacent pixels with similar texture, color, brightness, and other characteristics. In this paper, we choose the simple linear iterative clustering (SLIC) [25] algorithm for superpixel segmentation due to its efficiency and robustness. Compared with other superpixel segmentation methods, SLIC is compatible with different forms of segmentation maps and performs well in terms of running speed, compactness of the generated superpixels, and contour preservation.
For an input high-resolution remote sensing image $M$, we obtain a superpixel segmentation image by applying the SLIC algorithm, which clusters pixels with similar characteristics into superpixels; these superpixels are used to constrain the upsampled feature maps, thereby improving the accuracy of semantic segmentation. For the superpixel segmentation image $S \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ represent the height and width of the input image and $C$ is the number of channels, we apply a matrix transformation to convert the superpixel segmentation image into a weight matrix $W \in \mathbb{R}^{H \times W}$. The weight matrix $W$ is a two-dimensional matrix with integer entries from 0 to $L$, where $L$ represents the number of superpixels. The weight matrix is then fed into the SSP module to perform a pooling operation on the upsampled feature map. The weight matrix $W$ is defined as

$$w_{i,j} = l, \quad \text{if } M_{i,j} \in M_l,$$

where $w_{i,j}$ represents the value in the $i$-th row and $j$-th column of $W$, and $M_l$ is the $l$-th superpixel region in $M$. As shown in Fig. 2, the superpixel segmentation map performs pooling on the feature map $F_{out}$ obtained from the last transposed convolutional layer of the network. Different from common pooling, which averages or maximizes over regions of fixed size, the SSP layer averages the values within each superpixel region of the feature map. For the output feature map $F_{out} \in \mathbb{R}^{H \times W \times N_{class}}$, where $N_{class}$ denotes the number of categories, and the superpixel segmentation image $I \in \mathbb{R}^{H \times W}$, $I$ is first one-hot encoded into $I' \in \mathbb{R}^{H \times W \times L}$. We define $F_s$ as the superpixel feature map, and the SSP operation can be expressed as

$$F_s = \sum_{i=1}^{L} \frac{\mathrm{sum}\!\left(F_{out} \cdot I'_{\{i\}}\right)}{\mathrm{sum}\!\left(I'_{\{i\}}\right)} \cdot I'_{\{i\}},$$

where $I'_{\{i\}}$ represents the $i$-th layer of $I'$, $\mathrm{sum}(\cdot)$ stands for the sum operation, and '$\cdot$' denotes the element-wise product between the feature maps.
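The per-superpixel averaging described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the function name and array layout are assumptions for illustration only.

```python
import numpy as np

def superpixel_pooling(f_out, sp_labels):
    """SSP sketch: average the feature map over each superpixel region.

    f_out:     (H, W, N_class) upsampled feature map F_out.
    sp_labels: (H, W) integer superpixel labels (the weight matrix W).
    Returns F_s of shape (H, W, N_class), where every pixel holds the
    mean feature vector of the superpixel it belongs to.
    """
    f_s = np.empty_like(f_out, dtype=float)
    for l in np.unique(sp_labels):
        mask = sp_labels == l                 # one-hot slice I'_{l}
        f_s[mask] = f_out[mask].mean(axis=0)  # per-superpixel average
    return f_s
```

In a full pipeline the labels would come from SLIC on the input image; here any integer label map works.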

C. COMPENSATION CONNECTIONS
Though using a deeper and wider structure can improve the performance of multilayer networks, it also increases the number of parameters and causes the vanishing gradient problem [26]. In [27], DenseNet is proposed to reduce computational cost and address the degradation problem via feature reuse and dense connections. Motivated by this idea, as shown in Fig. 1, we adopt compensation connections between the encoder blocks of our network, which improves the efficiency of the parameters. Different from DenseNet, we introduce direct connections from each encoder block to all subsequent encoder blocks. Consequently, the $\ell$-th encoder block receives the feature maps of all preceding encoder blocks:

$$F_\ell = H_\ell\!\left(\left[F_1, F_2, \cdots, F_{\ell-1}\right]\right),$$

where $[\cdot]$ is the concatenation of the feature maps $F_1, F_2, \cdots, F_{\ell-1}$ produced by encoder blocks $1, 2, \cdots, \ell-1$, respectively, and $H_\ell$ is defined as the function of the $\ell$-th encoder block, as illustrated in Fig. 3. As the backbone network used in our work has 4 encoder blocks, $\ell$ is set to 4.
The compensation connections enable every encoder block to receive a direct supervision signal from the preceding encoder blocks, so that feature maps are better reused between network streams. Therefore, we can reduce the computational cost and the number of parameters for semantic segmentation. It should be noted that 1 × 1 convolutions with different strides are used to downsample the feature maps output by each encoder block before concatenation, so as to align the sizes of the feature maps from different encoder blocks and further reduce the number of parameters.
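The alignment-and-concatenation step can be sketched as follows. This is an illustration only: plain strided subsampling stands in for the paper's strided 1 × 1 convolutions, and the function name is assumed.

```python
import numpy as np

def compensation_concat(features):
    """Sketch of a compensation connection: earlier encoder outputs are
    downsampled to the smallest spatial size and concatenated channel-wise.

    features: list of arrays shaped (C_i, H_i, W_i), ordered from the
    earliest (largest) to the latest (smallest) encoder block, with each
    spatial size an integer multiple of the last one.
    """
    h, w = features[-1].shape[1:]
    aligned = []
    for f in features:
        stride = f.shape[1] // h          # stride needed to match target size
        aligned.append(f[:, ::stride, ::stride])
    return np.concatenate(aligned, axis=0)  # channel-wise concatenation
```

In the actual network, the strided 1 × 1 convolutions would also mix channels and keep the parameter count low; the subsampling here only demonstrates the spatial alignment.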

D. DENSE DILATED CONVOLUTIONAL PYRAMID
In order to enlarge the receptive field of feature points, we propose a DDCP module, which obtains features at different scales by stacking dilated convolutions with different dilation rates, horizontally and vertically, thereby increasing the accuracy of semantic segmentation. The architecture of the DDCP module is given in Fig. 4.
For a dilated convolutional layer $L_d$ with dilation rate $d_L$ and kernel size $k_L$, the equivalent receptive field size $R_d$ can be calculated as

$$R_d = (k_L - 1)\, d_L + 1.$$

According to [28], stacking convolutional layers with dense connections provides a larger receptive field, which yields more holistic and higher-level semantic features. Suppose we stack two dilated convolutional layers with dilation rates $d_1$ and $d_2$, respectively; the total receptive field size $R_{total}$ is then

$$R_{total} = R_{d_1} + R_{d_2} - 1,$$

where $R_{d_1}$ and $R_{d_2}$ denote the receptive field sizes of the convolutional layers with dilation rates $d_1$ and $d_2$, respectively. As shown in Fig. 4, if the feature map $F_4$ output by the last encoder block is 32 × 32, the corresponding receptive field sizes of the DDCP layers from top to bottom are 3, 7, 15, and 31, respectively. The module thus generates multiscale features that not only cover a large scale range but also cover that range densely. In addition, 1 × 1 convolutional layers are used to align the sizes of the feature maps from each layer of the DDCP module.
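The receptive field arithmetic above can be checked with a short script. The 3 × 3 kernels and dilation rates (1, 2, 4, 8) are assumptions inferred from the reported sizes 3, 7, 15, and 31, not values stated explicitly in the text.

```python
def dilated_rf(kernel, dilation):
    """Equivalent receptive field of one dilated convolution: (k-1)*d + 1."""
    return (kernel - 1) * dilation + 1

def stacked_rf(kernels, dilations):
    """Cumulative receptive field after each stacked layer,
    using R_total = R_d1 + R_d2 - 1 for consecutive layers."""
    out, total = [], 0
    for k, d in zip(kernels, dilations):
        r = dilated_rf(k, d)
        total = r if not out else total + r - 1
        out.append(total)
    return out
```

Running `stacked_rf([3, 3, 3, 3], [1, 2, 4, 8])` reproduces the pyramid of receptive fields described in the text.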

A. DATASET AND METRICS
We evaluate our method on two state-of-the-art aerial image semantic segmentation benchmarks, i.e., the ISPRS 2D Semantic Labeling Challenge datasets for Vaihingen and Potsdam 1, consisting of very high-resolution true orthophoto (TOP) tiles and corresponding digital surface models (DSMs). The Potsdam dataset is composed of 38 high-resolution aerial images of 6000 × 6000 pixels, where 24 images are chosen for training and the remaining 14 images are used for testing. Each image has four spectral bands (red (R), green (G), blue (B), and near-infrared (NIR)), and each DSM has one band. The data are organized into six categories: impervious surface, building, low vegetation, tree, car, and clutter/background.
The Vaihingen dataset contains 33 TOP tiles and the corresponding DSMs, with an average size of 2494 × 2064 pixels. We use 16 images for training and 17 for testing; the dataset has the same six categories as the Potsdam dataset. Note that we do not use the DSMs in our experiments. Besides, because the clutter/background category is very different from the other categories, it is excluded from comparison in accordance with the ISPRS rules.
To evaluate both the segmentation accuracy and the computational efficiency of the semantic segmentation algorithms, we adopt three accuracy indicators (overall accuracy (OA), $F_1$ score, and mean pixel intersection-over-union (mIoU)) and three efficiency indicators (inference time, number of parameters, and floating-point operations (FLOPs)) to comprehensively evaluate the performance of the lightweight semantic segmentation model. Each indicator is introduced in detail below.
Semantic segmentation is essentially a pixel-level classification problem that aims to minimize the pixel classification error. For a binary classification task, four basic counts are used: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), which represent how samples of each actual class are assigned to the predicted classes. Overall accuracy (OA) refers to the probability that all samples are predicted as the correct classes:

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}.$$

The $F_1$ score is the harmonic mean of precision and recall, which allows the accuracy of each category to be evaluated by combining the two metrics. Based on the confusion matrix, precision represents the probability that a sample predicted as positive is actually positive:

$$\mathrm{precision} = \frac{TP}{TP + FP}.$$

Recall represents the probability that an actually positive sample is predicted as positive:

$$\mathrm{recall} = \frac{TP}{TP + FN}.$$

The $F_1$ score is then calculated as

$$F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$

The mean $F_1$ score over all categories is

$$mF_1 = \frac{1}{k} \sum_{i=1}^{k} F_1^{(i)},$$

where $k$ represents the total number of categories and $mF_1$ is the mean $F_1$ score. Intersection-over-union (IoU) is one of the most effective accuracy evaluation indicators for semantic segmentation. It represents the ratio of the intersection to the union of sets $A$ and $B$:

$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}.$$

The mean pixel intersection-over-union (mIoU) is the average IoU over all categories:

$$\mathrm{mIoU} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{IoU}^{(i)}.$$

We take the prediction time of the model for a single image of 512 × 512 pixels as the inference time.
The number of parameters is an important indicator of the computational efficiency of an algorithm. It directly reflects the space complexity of the semantic segmentation algorithm and corresponds to the consumption of memory resources at the hardware level. Taking the convolutional layer as an example, its number of parameters is

$$\mathrm{Params} = (k_w \times k_h \times C_{in} + 1) \times C_{out},$$

where $k_w \times k_h \times C_{in}$ represents the number of weights of one convolution kernel, 1 accounts for the bias, and $C_{out}$ is the number of convolution kernels in the layer. Floating-point operations (FLOPs) refer to the total number of multiplications and additions in a neural network and measure the time complexity of the algorithm. Again taking the convolutional layer as an example,

$$\mathrm{FLOPs} = \left[C_{in} \times k_w \times k_h + (C_{in} \times k_w \times k_h - 1) + 1\right] \times C_{out} \times w \times h,$$

where $k_w$ and $k_h$ represent the width and height of the convolution kernel, $C_{in} \times k_w \times k_h$ denotes the number of multiplications, $C_{in} \times k_w \times k_h - 1$ represents the number of additions, 1 accounts for the bias, $C_{out} \times w \times h$ is the number of elements in the output feature map, and $w$ and $h$ represent the width and height of the feature map, respectively.
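The two complexity formulas can be sketched as helper functions; the names and the assumption that every layer has a bias are illustrative:

```python
def conv_params(k_w, k_h, c_in, c_out):
    """Parameter count of a conv layer: (k_w * k_h * C_in + 1) * C_out,
    where the +1 is the per-kernel bias."""
    return (k_w * k_h * c_in + 1) * c_out

def conv_flops(k_w, k_h, c_in, c_out, w, h):
    """FLOPs of a conv layer: multiplications + additions + bias,
    applied at every output feature map location."""
    mults = c_in * k_w * k_h
    adds = c_in * k_w * k_h - 1
    return (mults + adds + 1) * c_out * w * h
```

Summing these quantities over all layers gives the model totals (e.g., the 12.62M parameters and 13.09G FLOPs reported for ELES2).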

B. IMPLEMENTATION DETAILS
All experiments are implemented with PyTorch on two NVIDIA GeForce RTX 2080 Ti GPUs, and the optimizer is Adam with a learning rate of 1e-4. The focal loss [37] is used as the loss function, with the focusing parameter γ set to 0.5. Due to the limitation of computational resources, the input data are randomly cropped to 512 × 512 pixels. The batch size is fixed at 8, with 60 epochs for the network to converge. Besides, we perform the following data augmentation strategies for model training: (1) flipping randomly with 0.5 probability; (2) randomly scaling within the range [0.5, 2] and cropping back to the original image size; (3) normalizing all channels by subtracting the mean and dividing by the standard deviation of the dataset. In the testing phase, only the normalization strategy is applied.
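For reference, a scalar sketch of the focal loss term with γ = 0.5, assuming the standard definition from [37] without class weighting; here p denotes the predicted probability of the true class for one pixel:

```python
import math

def focal_loss(p, gamma=0.5):
    """Focal loss for one pixel: FL(p) = -(1 - p)^gamma * log(p).
    gamma > 0 down-weights well-classified (high-p) pixels so training
    focuses on hard examples; gamma = 0 recovers cross-entropy."""
    return -((1 - p) ** gamma) * math.log(p)
```

With gamma = 0.5, a confidently correct pixel (p = 0.9) contributes far less loss than an uncertain one (p = 0.5), which is the intended focusing effect.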
For the ISPRS Potsdam dataset, the comparison results are shown in Table 1. Our ELES2 achieves the best performance in several classes (i.e., Imp_suf, Low_veg, and Car) and in the OA, $mF_1$, and mIoU metrics. Although lightweight models such as LEDNet and BiSeNet V2 have lower computational costs than ELES2, they also deliver lower segmentation performance. Besides, ELES2 outperforms the other three lightweight models, BiSeNet, LiteSeg, and EMANet. While achieving segmentation accuracy comparable to DeepLabV3+, ELES2 (12.62M, 13.09G) requires only about one fifth of the parameters and one twentieth of the FLOPs of DeepLabV3+ (56.71M, 244.22G). Hence, our ELES2 better balances segmentation accuracy and computational cost for more efficient image segmentation.
The comparison results for the ISPRS Vaihingen dataset are shown in Table 2. Our ELES2 achieves top-ranking, though not optimal, segmentation performance. The main reason is that the ISPRS Vaihingen dataset contains few training samples, which makes shallow networks more prone to overfitting. However, ELES2 still maintains high segmentation accuracy for small-sample classes such as cars, since the DDCP in ELES2 can generate denser and more continuous multiscale features. The visual comparison results on the two datasets are shown in Fig. 5 and Fig. 6, respectively. It can be seen that our ELES2 produces fewer misclassifications than the other methods, especially the lightweight models. Moreover, the segmentation boundaries of ELES2 are more regular and closer to the ground truth.

D. ABLATION STUDIES
To verify the effectiveness of the modules (i.e., CC, DDCP, SSP) in our proposed framework, ablation studies are conducted on the ISPRS Potsdam dataset, with LinkNet as the baseline. We applied different module combinations to the baseline network. All modules improve performance over the baseline, where DDCP and CC bring extra parameters and SSP slightly increases the inference time. With all three modules, our ELES2 further improves performance and achieves the best semantic segmentation results. Fig. 7 illustrates the visual performance of our proposed method on the Vaihingen test dataset. It can be seen that our method shows strong segmentation performance with few classification errors. In particular, when SSP is applied, the segmentation result is further corrected: small numbers of misclassified pixels are reassigned to the correct categories, and the segmentation boundaries move closer to the actual object boundaries. After applying SSP, the segmentation result of our method needs no further refinement by other post-processing methods.
In addition, we analyzed the influence of the number of superpixels in SSP on the segmentation results of ELES2; the experimental results are shown in Fig. 8. With an increasing number of superpixels, the segmentation accuracy of ELES2 first increases and then decreases. When the number of superpixels is set too small, SSP cannot capture enough boundary information and will merge objects of different categories into one. On the contrary, when the number of superpixels is set too large, a whole object is divided into several small pieces that are classified separately, which increases the probability of misclassification. ELES2 achieves its optimal segmentation when the number of superpixels is 300. In our experiments, the number of superpixels for ELES2 is therefore set to 300 on both the ISPRS Vaihingen and Potsdam datasets. (This article has been accepted for publication in IEEE Access; DOI 10.1109/ACCESS.2022.3182370.)

IV. CONCLUSION
In this paper, we presented an ELES2 framework for efficient end-to-end semantic segmentation of HRRS images, where an SSP layer is embedded in the framework to solve the scale variance problem and refine the segmentation results. Besides, CC is introduced to enhance the representation capacity of the network, and DDCP is proposed to generate dense and continuous features at different scales. Experiments conducted on the Potsdam and Vaihingen datasets demonstrate the efficiency and superiority of the proposed method. ELES2 achieves promising segmentation accuracy while maintaining high computational efficiency, reaching mIoU values of 80.16% and 73.20% on the ISPRS Potsdam and Vaihingen datasets, respectively, with only 12.62M parameters and 13.09G FLOPs. Our study confirms that semantic segmentation algorithms for HRRS images can combine segmentation accuracy with computational efficiency. In future work, we will focus on lightweight semantic segmentation algorithms under small-sample and unsupervised conditions.