
ORBNet: Original Reinforcement Bilateral Network for High-Resolution Remote Sensing Image Semantic Segmentation



Abstract:

Semantic segmentation of high-resolution remote sensing images (HRRSIs) is a fundamental research topic in the field of remote sensing image processing. Many current CNN-based methods perform detailed segmentation by building an encoder–decoder network. However, the representative features of ground objects are often overlooked, and a semantic gap remains between high-level and low-level features, resulting in redundant information and erroneous annotation results. In this article, we propose an original reinforcement bilateral network (ORBNet) to improve the performance of HRRSIs semantic segmentation. The ORBNet consists of two branches, a detail branch and a semantic branch, which are responsible for extracting low-level features and high-level features, respectively. Feature alignment and fusion (FAF) modules align features at different levels between the two branches and produce shallow and deep features. Furthermore, we use a detail loss in the detail branch to supervise the generation of low-level features, and a class-specific discriminative loss to help the semantic branch distinguish the features of different ground objects. Spatial-channel attention (SCA) modules are used in the feature fusion stage to select representative features. We conducted extensive experiments on two open-source ISPRS remote sensing datasets, and the results verify the superior performance of our ORBNet.
Page(s): 15900 - 15913
Date of Publication: 22 December 2023


SECTION I.

Introduction

Semantic segmentation is a widely studied task in computer vision. Unlike image classification, semantic segmentation requires a category label for each pixel in an image. At present, this task is applied in various fields, such as land planning, autonomous driving, and environmental monitoring [1], [2]. Benefiting from the rapid progress of remote sensing satellite technology in recent years, numerous high-resolution remote sensing images (HRRSIs) have been acquired [3]. To support their analysis, semantic segmentation of HRRSIs has been studied by more and more scholars [4], [5], [6].

Traditional image segmentation algorithms often need manually designed features tailored to changes in image scenes, and they are not well suited for HRRSIs with complex scenes [7], [8], [9], [10]. With the rise of deep learning, CNN-based methods have been widely used in various image processing tasks, including semantic segmentation [11], [12], [13], [14].

The encoder–decoder structure is currently the mainstream structure in semantic segmentation [1], [2], [15], [16], [17]. The image is continuously downsampled in the encoder stage to capture deep features and continuously upsampled in the decoder stage to generate the prediction result. FCN [1] is a landmark network in the field of semantic segmentation that achieves end-to-end pixelwise prediction. Ronneberger et al. [2] proposed a U-shaped network called U-Net, which was initially used for medical image segmentation. Due to its simple structure and excellent performance, it quickly became widely used in other computer vision tasks.

Compared with other natural images, semantic segmentation of HRRSIs with complex ground objects is more difficult. As shown in Fig. 1, we enumerate the main characteristics of ground objects in HRRSIs. 1) High intraclass variance: the buildings marked with yellow lines differ greatly in appearance and color. 2) Low interclass variance: the regions indicated by blue lines are low vegetation and trees; although they belong to different classes, they have similar appearance characteristics. 3) Multiple scales: the buildings indicated by green lines are of different scales. 4) Multiple objects: many objects in the whole image need to be labeled. Among these four points, 3) and 4) are commonly observed in various RGB images, and convolutional neural networks have been proven to address them effectively, whereas 1) and 2) are issues specific to HRRSIs.

Fig. 1. Examples of challenges in HRRSIs semantic segmentation.

Segmenting HRRSIs accurately is challenging due to their high intraclass variance and low interclass variance. It is generally believed that eliminating the semantic gap between low-level and high-level features is an effective way to address this problem. To this end, we design a reinforcement bilateral network based on the original bilateral networks BiSeNet [18] and BiSeNetV2 [19]. Our network has two branches, which are responsible for extracting low-level features and high-level features, respectively. The network structures of BiSeNet and BiSeNetV2 are shown in Fig. 2(a) and (b). Our bilateral network, shown in Fig. 2(c), has several enhancements over BiSeNet and BiSeNetV2. First, we add a detail loss in the detail branch; unlike STDC [20], we apply the detail loss to the aligned features. Second, we use an ASPP module to extract multiscale features in the semantic branch, and a class-specific discriminative loss is used in the semantic branch to distinguish the features of different ground objects. The multiscale features and class-discriminative features can be viewed as enhancements of the high-level features. Third, we use four feature alignment and fusion (FAF) modules between the two branches to generate aligned features as well as shallow and deep features. The aligned features are used to enhance the detail branch, while the shallow and deep features are used by the attention modules in the feature fusion stage.

Fig. 2. Comparison of several semantic segmentation networks. (a) BiSeNet. (b) BiSeNetV2. (c) ORBNet (Ours).

In addition, to better distinguish similar ground objects, we use attention mechanisms in our network. In recent years, the attention mechanism has been widely used in computer vision [21], [22], [23]; it can establish dependencies between different objects in an image and select representative features. Our proposed SCA module improves on current spatial and channel attention for multiobject HRRSIs. The core contributions of this article are listed below.

  1. We propose an original reinforcement bilateral network (ORBNet) for HRRSIs semantic segmentation.

  2. Between the two branches, the proposed FAF modules are used to align different levels of features and generate shallow/deep features.

  3. The proposed SCA modules are used in the network to capture representative features and establish long-distance dependencies between pixels.

  4. We conduct extensive experiments on two well-known remote sensing datasets, and the experimental results confirm the good performance of our ORBNet.

SECTION II.

Related Works

Some research related to this article will be introduced in this section, including semantic segmentation, semantic segmentation in HRRSIs, and attention mechanism.

A. Semantic Segmentation

Semantic segmentation paves the way for a complete understanding of complex scenes. Since AlexNet [24] won ILSVRC 2012, CNN-based feature extraction has become the mainstream approach to semantic segmentation [25], [26], [27], [28], [29].

FCN [1] achieves end-to-end image semantic segmentation with fully convolutional layers. U-Net [2] improves the original encoder–decoder structure by connecting the encoder and decoder networks at each corresponding stage to supplement detail information for the feature maps in the decoder stage; originally used for medical image semantic segmentation, it is now widely applied to many other computer vision tasks. SegNet [15] combines the pooling indices with the decoder stage on the basis of the encoder–decoder structure, which preserves more detailed information. To obtain multiscale semantic information, PSPNet [30] uses the PPM module, which contains parallel pooling kernels of different sizes. DeepLab [31], [32], [16] proposes the ASPP module to obtain multiscale features of objects by using parallel atrous convolutions with different atrous rates. BiSeNet [18] and BiSeNetV2 [19] adopt a dual-path network to achieve real-time semantic segmentation. By recounting pixels with different distribution attributes, STLNet [33] obtains richer texture features. Xu et al. [34] proposed PIDNet, a network with three branches that efficiently extracts spatial, contextual, and edge information from images. To obtain accurate object position information, HRNet [35] maintains feature maps of different resolutions in parallel, and the features of different branches are fused with each other to obtain strong semantic information.

Besides, with the popularity of transformers in computer vision [36], [37], [38], [39], more and more semantic segmentation methods incorporate them. SegFormer [40] combines a transformer with a lightweight multilayer perceptron decoder for semantic segmentation. MPViT [41] improves the accuracy of transformers in image segmentation by embedding patches at different scales and forming a multipath structure. To retain the advantages of transformers while reducing computational complexity, TopFormer [41] reduces the number of tokens by using a token pyramid module and takes multiscale tokens as inputs to improve performance. SparseViT proposes a sparsity-aware adaptation method to find effective sparse configurations, thereby improving the inference speed of semantic segmentation.

B. Semantic Segmentation in HRRSIs

Semantic segmentation networks for general scenes often do not achieve ideal results when directly applied to HRRSIs. Therefore, more and more scholars have proposed networks for semantic segmentation of HRRSIs.

HCANet [42] combines U-Net and ASPP modules to achieve hierarchical information aggregation. In another work, Liu et al. [43] proposed a lightweight network that combines EfficientNet [44] with an attention mechanism. To effectively utilize the rich edge features of ground objects, Li et al. [45] designed a network that takes full advantage of multiscale features to generate complete and sharp land-cover boundaries. Chong et al. [46] proposed a network for small-object segmentation in remote sensing images, which is divided into two streams: a semantic stream and an edge stream. Zhang et al. [47] designed a hybrid encoder–decoder deep neural network whose main structure combines a CNN and a transformer, with a Swin Transformer used in the encoding stage to establish long-range dependencies between ground objects. Deng et al. [48] designed a network that uses the edge features of ground objects to complete segmentation; it adopts Ghost [49] as the feature extraction network, which greatly improves inference speed while maintaining accuracy.

The existing semantic segmentation networks for HRRSIs are primarily built upon general semantic segmentation models, which handle multiobject and multiscale scenarios well. However, these models often fall short in scenarios characterized by high intraclass variance and low interclass variance.

C. Attention Mechanism

At present, the attention mechanism has become a common tool in computer vision. To obtain features that are more helpful for prediction, an increasing number of methods filter image features by introducing attention mechanisms. Generally speaking, attention can be established separately along the spatial and channel dimensions of feature maps.

SENet [23] obtains a set of adaptive channel weights by global pooling to adjust the attention relationships between channels. Furthermore, CBAM [50] uses a mixed pooling method to obtain the relationships between features in both the spatial and channel dimensions. DANet [21] establishes spatial and channel relationships of features through matrix calculations. CCNet [22] proposes an attention module that establishes long-distance dependencies in the image by computing attention only along the horizontal and vertical directions of each pixel, greatly improving computational efficiency. In PANet [51], a feature pyramid attention module is proposed to capture multiscale features; in addition, at each stage of the decoder network, a global attention upsampling module is connected to the corresponding encoder stage to obtain richer detail features.

In the field of HRRSIs semantic segmentation, many methods use attention mechanism to enhance the performance. Huang et al. [52] introduced an attention-guided label refinement network based on U-Net. HAMNet [53] uses a variety of hybrid attention modules to establish long-distance dependency relationships between ground objects in HRRSIs.

SECTION III.

Methodology

First, the structure and forward process of ORBNet are introduced; then, each module of ORBNet is described in turn.

A. Original Reinforcement Bilateral Network

The architecture of the proposed ORBNet is shown in Fig. 3. First, the input image I \in \mathbf{R}^{3\times H \times W} goes through the detail branch and the semantic branch separately. The detail branch is responsible for capturing low-level features, which are necessary for segmenting the edges and textures of ground objects. In the detail branch, the input image passes through three consecutive convolution stages to obtain the feature map D \in \mathbf{R}^{256\times H/8 \times W/8}. In addition, we use the detail loss to supervise the feature maps of each stage in the detail branch.

Fig. 3. Original reinforcement bilateral network.

In the semantic branch, we use ResNet-101 [27] as the backbone, which produces the feature map K \in \mathbf{R}^{1024\times H/32 \times W/32}. Subsequently, the feature map K^{\prime} \in \mathbf{R}^{256\times H/32 \times W/32} is obtained by the ASPP module, which encodes the multiscale features of different ground objects. The process can be described as follows:
\begin{align*} K &= \text{backbone}(I) \tag{1}\\ K^{\prime} &= \text{ASPP}(K). \tag{2} \end{align*}

In order to further distinguish different ground objects, the number of channels of K^{\prime} is first reduced to twice the number of ground-object classes, yielding the feature map J \in \mathbf{R}^{2N\times H/32 \times W/32}, where N represents the number of ground-object classes. The feature map J is supervised by the class-specific discriminative loss. Then, J and K^{\prime} are concatenated to obtain the output S \in \mathbf{R}^{256\times H/32 \times W/32} of the semantic branch. The process can be described as follows:
\begin{align*} J &= \text{Conv}(K^{\prime}) \tag{3}\\ S &= \text{Conv}(\text{Cat}(J, K^{\prime})). \tag{4} \end{align*}
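As an illustration, the following is a minimal PyTorch sketch of the semantic-branch head in (3) and (4), assuming the ASPP output has 256 channels and six ground-object classes; the module name SemanticHead and the 1×1 kernel sizes are assumptions rather than details specified above.

```python
import torch
import torch.nn as nn

class SemanticHead(nn.Module):
    """Sketch of Eqs. (3)-(4): map the ASPP output K' to the class-discriminative
    map J (2N channels) and fuse it back into the branch output S."""
    def __init__(self, in_ch=256, num_classes=6):
        super().__init__()
        self.to_j = nn.Conv2d(in_ch, 2 * num_classes, kernel_size=1)          # J = Conv(K')
        self.fuse = nn.Conv2d(in_ch + 2 * num_classes, in_ch, kernel_size=1)  # S = Conv(Cat(J, K'))

    def forward(self, k_prime):
        j = self.to_j(k_prime)          # supervised by the class-specific discriminative loss
        s = self.fuse(torch.cat([j, k_prime], dim=1))
        return s, j

# usage: k_prime stands in for the ASPP output of shape (B, 256, H/32, W/32)
s, j = SemanticHead()(torch.randn(2, 256, 16, 16))
```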

The aligned features flow into the detail branch through the FAF modules, which also generate the shallow feature maps V_{1} \in \mathbf{R}^{64\times H/2 \times W/2} and V_{2} \in \mathbf{R}^{128\times H/8 \times W/8} and the deep feature maps T_{1} \in \mathbf{R}^{256\times H/8 \times W/8} and T_{2} \in \mathbf{R}^{256\times H/4 \times W/4}. The feature map M \in \mathbf{R}^{256\times H/8 \times W/8} is generated by the last FAF module. Subsequently, M passes through the SCA modules together with the shallow feature maps and the deep feature maps, respectively:
\begin{equation*} O = \text{SCA}(\text{Cat}(V_{1}, V_{2}), M) + \text{SCA}(\text{Cat}(T_{1}, T_{2}), M). \tag{5} \end{equation*}
Finally, the feature map O is upsampled to the original resolution to obtain the final prediction map Y:
\begin{equation*} Y = \text{Upsample}(O). \tag{6} \end{equation*}

B. Detail Loss

HRRSIs not only have high resolution but also contain ground objects at diverse and variable scales, so the images carry a large amount of detail information. We use the original ground truth to generate a detail map that supervises the detail features.

The Laplacian convolutional kernel is a second-order differential filter that detects edges in an image by calculating the differences between a pixel and its surrounding pixels. By applying the Laplacian convolutional kernel, the edge regions in the image are emphasized while other areas are suppressed. This allows for easier detection and extraction of the edge features in the image.

The ground truth is filtered with the Laplacian convolution kernel at 1×, 2×, and 4× downsampling rates, producing maps that contain a large amount of edge information about the ground objects. Each downsampled result is then restored to the original resolution by bilinear interpolation. Next, we dynamically fuse the three detail feature maps using a learnable 1×1 convolution to generate the final detail ground truth. The Laplacian convolution kernel is as follows:
\begin{equation*} \text{Laplacian Kernel}=\left[\begin{array}{ccc}-1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{array}\right]. \tag{7} \end{equation*}
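A minimal PyTorch sketch of this detail ground-truth generation is given below; the clamping and sigmoid used to keep the target in [0, 1], and the treatment of the label map as a single-channel float tensor, are assumptions rather than details stated above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailGroundTruth(nn.Module):
    """Laplacian filtering of the label map at strides 1, 2, and 4, bilinear
    restoration to full resolution, and a learnable 1x1 fusion (kernel of Eq. (7))."""
    def __init__(self):
        super().__init__()
        lap = torch.tensor([[-1., -1., -1.],
                            [-1.,  8., -1.],
                            [-1., -1., -1.]]).view(1, 1, 3, 3)
        self.register_buffer("laplacian", lap)
        self.fuse = nn.Conv2d(3, 1, kernel_size=1, bias=False)   # dynamic fusion of the three maps

    def forward(self, gt):                                       # gt: (B, 1, H, W), float
        maps = []
        for stride in (1, 2, 4):
            edge = F.conv2d(gt, self.laplacian, stride=stride, padding=1)
            edge = edge.abs().clamp(0, 1)                        # assumed normalization of edge responses
            edge = F.interpolate(edge, size=gt.shape[-2:], mode="bilinear", align_corners=False)
            maps.append(edge)
        return torch.sigmoid(self.fuse(torch.cat(maps, dim=1)))  # detail ground truth in [0, 1]
```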

Generally, the numbers of foreground and background pixels are unbalanced. To alleviate this problem, the dice loss is included in the detail loss. The detail loss L_{\text{detail}} and dice loss L_{\text{dice}} are defined as follows:
\begin{align*} L_{\text{detail}}\left(p_{d}, g_{d}\right)&=L_{\text{dice}}\left(p_{d}, g_{d}\right)+L_{\text{bce}}\left(p_{d}, g_{d}\right) \tag{8}\\ L_{\text{dice}}\left(p_{d}, g_{d}\right)&=1-\frac{2 \sum_{i}^{H \times W} p_{d}^{i} g_{d}^{i}+\epsilon}{\sum_{i}^{H \times W}\left(p_{d}^{i}\right)^{2}+\sum_{i}^{H \times W}\left(g_{d}^{i}\right)^{2}+\epsilon} \tag{9} \end{align*}
where p_{d} and g_{d} represent the predicted detail map and the corresponding detail ground truth, respectively, L_{\text{bce}} represents the binary cross-entropy loss, i indexes the pixels, and \epsilon is the smoothing coefficient, which is set to 1 in our experiments.
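A minimal PyTorch sketch of the detail loss in (8) and (9), assuming the network predicts detail logits that are passed through a sigmoid before the dice term:

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1.0):
    """Dice loss of Eq. (9); eps is the smoothing coefficient (set to 1 in the experiments)."""
    pred, target = pred.flatten(1), target.flatten(1)
    inter = (pred * target).sum(dim=1)
    denom = (pred ** 2).sum(dim=1) + (target ** 2).sum(dim=1)
    return (1 - (2 * inter + eps) / (denom + eps)).mean()

def detail_loss(pred_logits, detail_gt):
    """Detail loss of Eq. (8): dice loss plus binary cross-entropy on the detail maps."""
    return (dice_loss(torch.sigmoid(pred_logits), detail_gt)
            + F.binary_cross_entropy_with_logits(pred_logits, detail_gt))
```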

C. FAF Module

The features generated at each stage of the two branches are strongly related. Besides, during the encoding stage, continuous downsampling can cause varying degrees of offset in the pixels of the corresponding features. To alleviate the above problems, we use the designed FAF modules to align the features and generate new semantic features at the same time.

As shown in Fig. 4, our FAF module has two variants, which we denote FAF_{a} and FAF_{b}. Both FAF_{a} and FAF_{b} generate aligned features. In addition, FAF_{a} produces fused shallow features and FAF_{b} produces fused deep features. In our dual branch, we use a total of four FAF modules, the first two with the FAF_{a} structure and the last two with the FAF_{b} structure. We take FAF_{a} as an example to describe the module.

Fig. 4. Proposed FAF modules with two structures. (a) FAF_{a}. (b) FAF_{b}.

First, the feature map H \in \mathbf{R}^{4C\times H/2 \times W/2} is convolved and upsampled to obtain H^{\prime}, whose size is the same as that of the feature map L \in \mathbf{R}^{C\times H \times W}. Next, we stack H^{\prime} and L together and obtain two offset prediction maps \theta H^{\prime} \in \mathbf{R}^{2\times H \times W} and \theta L \in \mathbf{R}^{2\times H \times W} through multiple convolutions. We use \theta H^{\prime} and \theta L to pixel-align H^{\prime} and L, respectively:
\begin{equation*} X = \text{align}(L, \theta L) + \text{align}(H^{\prime}, \theta H^{\prime}) \tag{10} \end{equation*}
where align represents the feature alignment and X represents the final aligned feature map.

The other path of this module outputs the shallow features. The high-level feature map H is upsampled and its channels are reduced to 1 to obtain H^{\prime\prime} \in \mathbf{R}^{1\times H \times W}. After a sigmoid, H^{\prime\prime} is multiplied by L to obtain A \in \mathbf{R}^{C\times H \times W}. Next, A and L are concatenated and passed through a convolution to obtain the shallow feature map L^{\prime} \in \mathbf{R}^{C\times H \times W}. The process can be expressed as follows:
\begin{align*} A &= \text{Mul}(L, \text{Sigmoid}(\text{Upsample}(H))) \tag{11}\\ L^{\prime} &= \text{Conv}(\text{Cat}(A, L)). \tag{12} \end{align*}
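A minimal PyTorch sketch of FAF_{a} under stated assumptions: the align operation is implemented here as bilinear warping with F.grid_sample, the offset head predicts the two 2-channel offset maps as a single 4-channel output, and all kernel sizes and channel counts are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(feat, offset):
    """Warp `feat` by a per-pixel offset field (B, 2, H, W) given in pixels."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device, dtype=feat.dtype),
                            torch.arange(w, device=feat.device, dtype=feat.dtype),
                            indexing="ij")
    grid_x = (xs + offset[:, 0]) / max(w - 1, 1) * 2 - 1          # normalize to [-1, 1]
    grid_y = (ys + offset[:, 1]) / max(h - 1, 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)                  # (B, H, W, 2)
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)

class FAFa(nn.Module):
    def __init__(self, c_low, c_high):
        super().__init__()
        self.reduce = nn.Conv2d(c_high, c_low, 3, padding=1)      # H -> H'
        self.offsets = nn.Conv2d(2 * c_low, 4, 3, padding=1)      # theta L and theta H' stacked
        self.gate = nn.Conv2d(c_high, 1, 1)                       # H -> 1-channel map H''
        self.fuse = nn.Conv2d(2 * c_low, c_low, 3, padding=1)     # shallow fusion

    def forward(self, low, high):
        high_up = F.interpolate(self.reduce(high), size=low.shape[-2:],
                                mode="bilinear", align_corners=False)          # H'
        off = self.offsets(torch.cat([high_up, low], dim=1))
        aligned = warp(low, off[:, 0:2]) + warp(high_up, off[:, 2:4])          # Eq. (10)
        gate = torch.sigmoid(F.interpolate(self.gate(high), size=low.shape[-2:],
                                           mode="bilinear", align_corners=False))
        a = low * gate                                                         # Eq. (11)
        shallow = self.fuse(torch.cat([a, low], dim=1))                        # Eq. (12)
        return aligned, shallow
```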

D. Spatial-Channel Attention Module

We use the SCA module to fuse different semantic features. Below we will describe the mechanism of this module in detail.

As shown in Fig. 5, we divide the feature map F \in R^{C \times H \times W} into N parts along the channel dimension; each part can be described as F_{i} \in R^{C^{\prime} \times H \times W} with i=0,1,\ldots, N-1, where C^{\prime} = C/N. Each part is processed by a separate convolution, and the results are stacked together to obtain the new feature map X \in R^{C \times H \times W}:
\begin{equation*} X=\text{Cat}\left(\left[\text{Conv}_{0}(F_{0}), \text{Conv}_{1}(F_{1}),\ldots, \text{Conv}_{N-1}(F_{N-1})\right]\right). \tag{13} \end{equation*}

Fig. 5. Channel attention module (CAM).

Next, we similarly divide X into N parts along the channel dimension. The SE module is used to obtain the channel weight parameters of each part:
\begin{equation*} V_{i}=\text{SEM}\left(X_{i}\right), \quad i=0,1,\ldots, N-1 \tag{14} \end{equation*}
where SEM represents the SE module and V_{i} represents the channel weight parameter of the ith part.

Next, the channel weight parameters of all parts are concatenated:
\begin{equation*} V=\text{Cat}\left(\left[V_{0}, V_{1},\ldots, V_{N-1}\right]\right) \tag{15} \end{equation*}
where V represents the multiscale channel weight parameter.

After a softmax, we use V to adjust the relationships between the channels of the feature map X. The output of the channel attention module Y \in R^{C \times H \times W} is the sum of the input feature map F and the adjusted feature map X:
\begin{equation*} Y=\text{Add}\left(\left[F, \text{Mul}(X, \text{Softmax}(V))\right]\right). \tag{16} \end{equation*}
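A minimal PyTorch sketch of the channel attention module in (13)-(16); the number of groups N and the SE reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation block applied to one channel group."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return self.fc(x)                                  # (B, C', 1, 1) channel weights

class CAM(nn.Module):
    def __init__(self, channels, n_groups=4):
        super().__init__()
        c = channels // n_groups
        self.n = n_groups
        self.convs = nn.ModuleList(nn.Conv2d(c, c, 3, padding=1) for _ in range(n_groups))
        self.ses = nn.ModuleList(SEBlock(c) for _ in range(n_groups))

    def forward(self, f):
        parts = torch.chunk(f, self.n, dim=1)
        x = torch.cat([conv(p) for conv, p in zip(self.convs, parts)], dim=1)   # Eq. (13)
        v = torch.cat([se(p) for se, p in zip(self.ses, torch.chunk(x, self.n, dim=1))],
                      dim=1)                                                    # Eqs. (14)-(15)
        return f + x * torch.softmax(v, dim=1)                                  # Eq. (16)
```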

As shown in Fig. 6, the input feature map F is passed through 1×1 convolutions to obtain new feature maps A, B, P \in R^{C^{\prime} \times H \times W}, where C^{\prime} < C. The feature maps A and B are combined through matrix transposition and multiplication to obtain the attention map S \in R^{(H \times W)\times (H \times W)}, which encodes pixel-to-pixel dependencies in the spatial dimension. The feature map P goes through three different pooling methods to yield P^{\prime} \in R^{C^{\prime} \times H \times W}.

Fig. 6. Spatial attention module (SAM).

Then, the attention map S is matrix-multiplied with the feature map P^{\prime} to obtain the feature map F^{\prime}. Finally, we add the feature maps F^{\prime} and F to obtain the output feature map E \in R^{C\times H \times W}.

The final output of the SCA module can be expressed as
\begin{equation*} O = \text{Conv}(\text{Cat}(Y, E)) \tag{17} \end{equation*}
where O \in R^{C \times H \times W} represents the final output of the SCA module.
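A minimal PyTorch sketch of the spatial attention path and the SCA fusion of (17). The exact pooling operations applied to P are not reproduced here, so P is used directly after the 1×1 convolution; the reduction ratio and module names are assumptions.

```python
import torch
import torch.nn as nn

class SAM(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        c = channels // reduction
        self.to_a = nn.Conv2d(channels, c, 1)
        self.to_b = nn.Conv2d(channels, c, 1)
        self.to_p = nn.Conv2d(channels, c, 1)
        self.proj = nn.Conv2d(c, channels, 1)

    def forward(self, f):
        b, _, h, w = f.shape
        a = self.to_a(f).flatten(2).transpose(1, 2)            # (B, HW, C')
        bmat = self.to_b(f).flatten(2)                         # (B, C', HW)
        s = torch.softmax(a @ bmat, dim=-1)                    # (B, HW, HW) attention map S
        p = self.to_p(f).flatten(2)                            # (B, C', HW); pooling of P omitted here
        f_att = (p @ s.transpose(1, 2)).view(b, -1, h, w)      # aggregate over spatial positions
        return f + self.proj(f_att)                            # residual output E

# SCA fusion of Eq. (17): O = Conv(Cat(Y, E)), with Y from the CAM and E from the SAM
# fuse = nn.Conv2d(2 * channels, channels, 1); o = fuse(torch.cat([y, e], dim=1))
```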

Strip pooling is used in the CAM. Strip pooling is a pooling method that makes the network focus on ground objects that extend horizontally or vertically, and it can, to some extent, be seen as an attention mechanism. In HRRSIs, many ground objects have a strip shape, so strip pooling is adopted to improve the segmentation of these objects.

Fig. 7 shows the structure of the strip pooling module. A given feature map K \in R^{C \times H \times W} is fed to convolution layers to obtain K^{\prime} \in R^{C \times H \times W}. Next, X and X^{\prime} are obtained by reducing the number of channels with 1×1 convolutions.

Fig. 7. Strip pooling module (SPM).

Vertical strip pooling is calculated as
\begin{equation*} y_{c}^{v}=\frac{1}{H} \sum_{i=1}^{H} x_{c}(i, j) \tag{18} \end{equation*}
where x_{c}(i, j) is the value at position (i, j) of the cth channel of X and y_{c}^{v} is the cth output of vertical strip pooling.

Similarly, horizontal strip pooling is computed as
\begin{equation*} y_{c}^{h}=\frac{1}{W} \sum_{j=1}^{W} x^{\prime}_{c}(i, j). \tag{19} \end{equation*}
We upsample y_{c}^{v} and y_{c}^{h} to R^{C^{\prime} \times H \times W} and add them together:
\begin{equation*} M = \text{Add}(\text{Upsample}(y_{c}^{v}), \text{Upsample}(y_{c}^{h})). \tag{20} \end{equation*}

Next, the number of channels of M is restored to C by a 1×1 convolution, and the result is passed through a sigmoid and multiplied with K^{\prime} to obtain the relationship matrix Z \in R^{C \times H \times W}:
\begin{equation*} Z = \text{Mul}(\text{Sigmoid}(\text{Conv}(M)), K^{\prime}). \tag{21} \end{equation*}
Finally, the output P \in R^{C \times H \times W} is obtained by
\begin{equation*} P = \text{Conv}(\text{Add}(Z, K)). \tag{22} \end{equation*}
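A minimal PyTorch sketch of the strip pooling module in (18)-(22); the channel-reduction ratio and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StripPooling(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        c = channels // reduction
        self.pre = nn.Conv2d(channels, channels, 3, padding=1)     # K -> K'
        self.reduce_v = nn.Conv2d(channels, c, 1)                  # K' -> X
        self.reduce_h = nn.Conv2d(channels, c, 1)                  # K' -> X'
        self.pool_v = nn.AdaptiveAvgPool2d((1, None))              # average over the height, Eq. (18)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))              # average over the width, Eq. (19)
        self.restore = nn.Conv2d(c, channels, 1)
        self.post = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, k):
        k_p = self.pre(k)
        h, w = k.shape[-2:]
        yv = F.interpolate(self.pool_v(self.reduce_v(k_p)), size=(h, w),
                           mode="bilinear", align_corners=False)
        yh = F.interpolate(self.pool_h(self.reduce_h(k_p)), size=(h, w),
                           mode="bilinear", align_corners=False)
        z = torch.sigmoid(self.restore(yv + yh)) * k_p             # Eq. (21), with yv + yh as Eq. (20)
        return self.post(z + k)                                    # Eq. (22)
```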

E. Class-Specific Discriminative Loss

Ground objects in HRRSIs often have similar appearance characteristics. Many semantic segmentation networks rely on deep features to distinguish different ground objects; however, deep features alone struggle to separate visually similar objects. In our network, we apply the class-specific discriminative loss to the deep features extracted by the semantic branch to distinguish similar ground objects. The loss is computed as follows.

First, a 1×1 convolution is applied to the feature map produced by the ASPP module to obtain T \in \mathbf{R}^{2C\times H^{\prime} \times W^{\prime}}, where C represents the number of ground-object classes. Next, we split the feature map T into C groups along the channel dimension, corresponding to the C classes; the set of groups is denoted by \lbrace t_{i}\rbrace_{i=1,\ldots, C}. Each group contains two channels, where one channel t_{i}^{fg} represents the foreground and the other channel t_{i}^{bg} represents the background. The probability of each pixel belonging to the foreground or background is expressed as
\begin{align*} t_{i, o}^{fg}&=p\left(o=1 \mid t_{i}^{fg}\right) \tag{23}\\ t_{i, o}^{bg}&=p\left(o=0 \mid t_{i}^{bg}\right) \tag{24} \end{align*}
where o represents each pixel location in t_{i}^{fg} and t_{i}^{bg}.

Further, t_{i}^{fg} and t_{i}^{bg} are normalized as
\begin{align*} \hat{t}_{i, o}^{fg}&=\frac{e^{t_{i, o}^{fg}}}{e^{t_{i, o}^{fg}}+e^{t_{i, o}^{bg}}} \tag{25}\\ \hat{t}_{i, o}^{bg}&=\frac{e^{t_{i, o}^{bg}}}{e^{t_{i, o}^{fg}}+e^{t_{i, o}^{bg}}}. \tag{26} \end{align*}
We optimize this process using multiple binary cross-entropy loss terms:
\begin{equation*} \mathcal{L}_{\text{dis}}=-\frac{1}{N} \sum_{1}^{N} \frac{1}{HW} \sum_{r=1}^{HW} \frac{1}{C} \sum_{i=1}^{C}\left[y \log \left(\hat{t}_{i, o}^{fg}\right)+(1-y) \log \left(\hat{t}_{i, o}^{bg}\right)\right]. \tag{27} \end{equation*}
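A minimal PyTorch sketch of the class-specific discriminative loss in (23)-(27), assuming the label map has been downsampled to the spatial size of T and that a softmax over each foreground/background pair implements (25) and (26):

```python
import torch
import torch.nn.functional as F

def class_discriminative_loss(t, label, num_classes):
    """t: (B, 2*num_classes, H, W) from a 1x1 conv on the ASPP features;
    label: integer map (B, H, W) at the same spatial resolution."""
    b, _, h, w = t.shape
    log_prob = F.log_softmax(t.view(b, num_classes, 2, h, w), dim=2)   # per-class fg/bg, Eqs. (25)-(26)
    fg = F.one_hot(label, num_classes).permute(0, 3, 1, 2).float()     # y = 1 where the pixel is class i
    loss = -(fg * log_prob[:, :, 0] + (1 - fg) * log_prob[:, :, 1])    # Eq. (27), per pixel and class
    return loss.mean()

# usage with illustrative shapes: 6 classes, 16x16 feature resolution
loss = class_discriminative_loss(torch.randn(2, 12, 16, 16),
                                 torch.randint(0, 6, (2, 16, 16)), num_classes=6)
```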

SECTION IV.

Experiments

A. Experimental Settings

1) Dataset

The Vaihingen and Potsdam datasets are two widely used semantic labeling datasets provided by the ISPRS committee.

The Vaihingen dataset contains 33 aerial image tiles with an average resolution of 2494×2064, while the Potsdam dataset contains 38 tiles with a resolution of 6000×6000. Following the official split, we use 40 images for training, of which 16 belong to the Vaihingen dataset and 24 to the Potsdam dataset. Meanwhile, we use 31 images for testing, of which 17 belong to the Vaihingen dataset and 14 to the Potsdam dataset. Both datasets are annotated with six ground-object classes: impervious surfaces, buildings, low vegetation, trees, cars, and clutter.

2) Implementation Details

Limited by GPU memory, we crop all images into 512×512 patches with an overlap of 1/3. We use the SGD optimizer with a weight decay of 0.0005. Based on past experience, we set the batch size to 8 and the initial learning rate to 0.003. To keep training stable and ensure that the model is continuously and effectively updated, we use the following learning rate decay strategy:
\begin{equation*} lr = lr_{\text{init}} \times \left(1-\frac{\text{current\_iterations}}{\text{max\_iterations}}\right)^{0.9}. \tag{28} \end{equation*}
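A minimal sketch of the poly learning-rate schedule in (28), applied to an SGD optimizer with the settings reported above (initial learning rate 0.003, weight decay 0.0005); the iteration loop is only indicative:

```python
def poly_lr(lr_init, current_iter, max_iter, power=0.9):
    """Poly learning-rate decay of Eq. (28)."""
    return lr_init * (1 - current_iter / max_iter) ** power

# for it in range(max_iter):
#     for group in optimizer.param_groups:
#         group["lr"] = poly_lr(0.003, it, max_iter)
#     ...train one iteration...
```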

3) Evaluation Metrics

Comparing only the visual segmentation results would lead to significant errors when evaluating different methods. To provide a fair comparison, we select three indicators commonly used in semantic segmentation: overall accuracy (OA), mean IoU (mIoU), and mean F1 score. OA directly evaluates pixelwise classification accuracy, mIoU evaluates the segmentation quality based on region integrity, and the F1 score accounts for both the precision and recall of the classification results. OA, IoU, and F1 are calculated as follows:
\begin{align*} \text{OA}&=\frac{N_{\text{TP}}+N_{\text{TN}}}{N_{\text{TP}}+N_{\text{FP}}+N_{\text{FN}}+N_{\text{TN}}} \tag{29}\\ \text{IoU}&=\frac{N_{\text{TP}}}{N_{\text{TP}}+N_{\text{FP}}+N_{\text{FN}}} \tag{30}\\ F_{1}&=\frac{2 \times P \times R}{P+R}, \quad P=\frac{N_{\text{TP}}}{N_{\text{TP}}+N_{\text{FP}}}, \quad R=\frac{N_{\text{TP}}}{N_{\text{TP}}+N_{\text{FN}}}. \tag{31} \end{align*}
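A minimal NumPy sketch of how OA, mIoU, and mean F1 in (29)-(31) can be computed from a class confusion matrix; it assumes every class appears in the test set (no division-by-zero handling):

```python
import numpy as np

def metrics_from_confusion(cm):
    """cm[i, j] = number of pixels of class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    oa = tp.sum() / cm.sum()                             # Eq. (29) over all pixels
    iou = tp / (tp + fp + fn)                            # Eq. (30), per class
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (31), per class
    return oa, iou.mean(), f1.mean()
```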

B. Hyperparameter Setting

1) Class-Specific Discriminative Loss Weight Parameter Setting

Our training process uses a variety of loss functions. In our experiments, we set the weight parameter of the main loss function \lambda _{1} and the detail loss function \lambda _{2} to 1. Furthermore, we add an auxiliary loss after the third block of the backbone network of the semantic branch. The weight parameter of the auxiliary loss function \lambda _{3} is set to 0.4.

For the weight parameter of the class-specific discriminative loss function \lambda_{4}, we conducted several experiments to find its optimal value. As shown in Fig. 8, the network achieves the best results when \lambda_{4} = 0.6. Specifically, the OA is 90.87% and the mean F1 is 89.98% on the Vaihingen dataset; on the Potsdam dataset, the OA is 91.25% and the mean F1 is 92.62%. Unless otherwise specified, \lambda_{4} is set to 0.6 in all subsequent experiments. In the above experiments, the semantic branch uses ResNet-50 as the backbone, and all other loss functions are used at the same time.

Fig. 8. Comparison results for different values of the weight parameter of the class-specific discriminative loss function \lambda_{4}. (a) Experiments on Vaihingen. (b) Experiments on Potsdam.

2) Backbone Setting

To extract high-level features of ground objects in the semantic branch, we need to use a deeper backbone network. We choose VGG-16 [25], ResNet-50 [27], and ResNet-101 for comparative experiments to choose the best backbone.

From the experimental results in Table I, it can be seen that using ResNet as the backbone is significantly better than using VGG. On the Vaihingen (Potsdam) dataset, ResNet-50 brings improvements of 0.97, 0.52, and 0.54 in OA, mIoU, and F1 compared to VGG. Our ORBNet achieves excellent results thanks to the deeper structure of ResNet; in addition, ResNet uses residual connections, which help avoid vanishing gradients. ResNet-50 and ResNet-101 are two representative ResNet variants that share the same basic structure but differ in depth. In the following comparison experiments with other algorithms, we use ResNet-101 as the backbone, and ResNet-50 is used in the ablation experiments.

TABLE I Different Backbone Network Comparison Experiments

C. Ablation Study for Multiple Loss Functions

We further verify the role of each loss function in the optimization process. \mathcal{L}_{ce} represents the main loss function, \mathcal{L}_{\text{aux}} the auxiliary loss function, \mathcal{L}_{de} the detail loss, and \mathcal{L}_{\text{cd}} the class-specific discriminative loss. The total loss of the network \mathcal{L}_{\text{total}} is represented as follows:
\begin{equation*} \mathcal{L}_{\text{total}}=\lambda_{1} \cdot \mathcal{L}_{ce}+\lambda_{2} \cdot \mathcal{L}_{de} + \lambda_{3} \cdot \mathcal{L}_{\text{aux}} + \lambda_{4} \cdot \mathcal{L}_{\text{cd}}. \tag{32} \end{equation*}

As shown in Table II, we tested different combinations of the four loss functions on the Vaihingen dataset. When the main loss and the auxiliary loss are used together, the network achieves 90.83%, 81.77%, and 89.82% in OA, mIoU, and F1, respectively. Adding the detail loss improves OA and mIoU by 0.04% and 0.03%, respectively, and adding the class-specific discriminative loss improves OA and mIoU by 0.03% and 0.08%, respectively. The experimental results show that both the detail loss and the class-specific discriminative loss have positive effects in their respective branches. The network works best when all loss functions are used together.

TABLE II Ablation Study for Multiple Loss Functions

D. Ablation Study for Bilateral Branches

We conducted ablation experiments to evaluate the effectiveness of the bilateral branches in our design. The results in Table III demonstrate the impact of incorporating the detail branch alongside the semantic branch. When the network retained only the semantic branch, it achieved an OA of 89.97%; adding the detail branch increased the OA by 0.49%. This outcome underscores the importance of both low-level and high-level features for accurate semantic segmentation.

TABLE III Ablation Study of Bilateral Branches on the Potsdam Dataset

E. Module Effectiveness Evaluation

1) FAF Module

In our network, the FAF modules supplement the detail branch with aligned features and provide different levels of semantic features for the feature fusion stage. Tables IV and V show the ablation experiments on the FAF modules, where FAF indicates that the network uses only the aligned features, low-level indicates that the low-level features from the first two FAF modules are used, and high-level indicates that the high-level features from the last two FAF modules are used.

TABLE IV Ablation Study for Feature Alignment and Fusion Modules on the Vaihingen Dataset
TABLE V Ablation Study for Feature Alignment and Fusion Modules on the Potsdam Dataset

Our benchmark network uses ResNet-50 as the semantic branch backbone. The results on the Vaihingen dataset are shown in Table IV. When the outputs of all FAF modules are used together, the network performs best, achieving 90.92% OA, 90.02% mean F1, and 82.08% mIoU. The experimental results show that both the multilevel features and the aligned features are essential for HRRSIs semantic segmentation. Table V shows the experiments on the Potsdam dataset, from which similar conclusions can be drawn.

The detail loss is used to supervise the aligned features from the FAF modules. We visualize the aligned feature maps in three stages of the detail branch. It can be seen from Fig. 9 that when the FAF modules are used, the feature maps not only contain the edge and texture features of the ground objects but also high-dimensional features, which will enhance the performance of HRRSIs segmentation.

Fig. 9. Visualization of the detail maps. (a) Input image. (b) Label of the input image. (c) First detail feature map without the FAF modules. (d) Second detail feature map without the FAF modules. (e) First detail feature map with the FAF modules. (f) Second detail feature map with the FAF modules.

Furthermore, we visualize the results of several different models in our experiments. As can be seen from Fig. 10, when the model uses all FAF modules, the accuracy and completeness of the segmentation results are the best.

Fig. 10. Comparison of segmentation results of different models. (a) Input image. (b) Label of the input image. (c) +ASPP. (d) +ASPP+FAF. (e) +ASPP+FAF+shallow features. (f) +ASPP+FAF+shallow features+deep features.

2) Attention Module

The SCA module is responsible for capturing and selecting the key features for segmentation.

We test the SCA module on the Vaihingen and Potsdam datasets, respectively. Tables VI and VII show the corresponding experimental results. Our network achieves relatively good results even without the SCA module, which is mainly due to the enhanced bilateral network design. When only one of the attention modules (CAM or SAM) is used, the performance of the network improves. When CAM and SAM are used together, the network improves OA, mean F1, and mIoU by 0.07% (0.37%), 0.2% (0.19%), and 0.32% (0.4%) on the Vaihingen (Potsdam) dataset. Fig. 11 compares the results obtained before and after using the SCA module; the segmentation accuracy is clearly higher with the SCA module. As shown in Fig. 11(d), the network correctly segments impervious surfaces compared with Fig. 11(c). This improvement can be attributed to the SCA module's ability to capture crucial features in both the spatial and channel dimensions, thereby enhancing the model's overall performance.

TABLE VI Ablation Study for SCA Modules on the Vaihingen Dataset
TABLE VII Ablation Study for SCA Modules on the Potsdam Dataset
Fig. 11. Comparison of segmentation results of different models. (a) Input image. (b) Label. (c) w/o SCA. (d) w/ SCA.

As shown in Fig. 12, we select the heatmap of the output from the SCA module for comparison. The heatmap has a total of six channels, corresponding to six ground objects. We exclude the heatmap corresponding to the last channel, which represents the background. It is evident that the utilization of the SCA module has significantly improved the response of the object corresponding to the heatmap. The conclusion drawn from the aforementioned experiments is that the SCA module effectively enhances the features related to the corresponding categories of channels. In other words, the SCA module enables the network to prioritize and focus on the features that require attention.

Fig. 12. Heatmaps before and after using the SCA module. The first column is the input map, and the following columns are the heatmaps for the channels corresponding to the impervious surface, buildings, low vegetation, trees, and car categories. The first row is the effect after using the SCA module.

F. Comparison With State-of-the-Art Methods

1) Effectiveness on the Vaihingen Dataset

We compare our ORBNet with other semantic segmentation models on the Vaihingen dataset, including FCN-dCRF, Dilated FCN, DLR, PSPNet, U-Net, DeepLabV3, BiSeNet, HCANet, DLNet, DANet, CCNet, and MFANet. Table VIII shows the detailed comparison results. In addition, in order to facilitate the understanding and analysis of the segmentation effect of each ground object, we also separately list the F1 scores of each ground object.

TABLE VIII Experimental Results on the Vaihingen Dataset

Overall, our ORBNet achieves the highest values in all three evaluation metrics: the OA is 91.76%, the mIoU is 83.86%, and the mean F1 is 91.10%. Compared with MFANet, the best of the other methods, ORBNet achieves improvements of 1.16% and 1.76% in OA and mean F1, respectively. It should be noted that our ORBNet-ResNet101 achieves the highest F1 scores in all four categories.

Although FCN achieves good predictions for buildings, it performs poorly on the tree and car categories. This is because buildings occupy more pixels in the image, so the fully convolutional structure of FCN can learn their features well, whereas trees and low vegetation have similar appearances and colors and cars occupy few pixels, which FCN cannot distinguish well. U-Net uses skip connections between the encoder and decoder networks to supplement detailed information, so it improves the segmentation accuracy for cars and other small objects.

Compared with U-Net, DeepLabv3+ achieves a significant improvement, not only because it uses a deeper backbone network but also because it uses the ASPP structure for multiscale feature extraction. Our previously proposed MFANet also adopts a similar bilateral structure to capture multilevel semantic features. ORBNet is enhanced on the basis of MFANet and finally achieves the best results on all indicators.

Overall, our ORBNet segments better than the baseline. For example, in the first sample image shown in Fig. 13, ORBNet segments the clutter class with noticeably higher completeness than the baseline. In the second sample image, the baseline performs poorly on buildings, with holes appearing inside large continuous building regions, whereas ORBNet does not exhibit this issue.

Fig. 13. Comparison of segmentation results on the Vaihingen dataset.

2) Effectiveness on the Potsdam Dataset

Similarly, we conducted comparative experiments with other methods on the Potsdam dataset. As shown in Table IX, our ORBNet achieves the highest values in all three evaluation metrics: the OA is 91.92%, the mIoU is 87.44%, and the mean F1 is 93.17%.

TABLE IX Experimental Results on the Potsdam Dataset

Fig. 14 shows the segmentation results on the Potsdam dataset; the predictions of our proposed ORBNet are clearly closer to the labels. In addition, segmenting small objects in HRRSIs has always been difficult. Compared with ground objects such as buildings, cars account for far fewer pixels in HRRSIs, making them small targets. During the experiments, we found that ORBNet achieved significantly better F1 scores for car segmentation than the other methods, mainly owing to its full utilization of semantic features at different levels.

Fig. 14. Comparison of segmentation results on the Potsdam dataset.

G. Complexity Analysis

We compare the complexity of our model with several representative models, including FCN, U-Net, DeepLabV3, PSPNet, and BiSeNet. Table X shows the computational complexity of our proposed method and the compared methods. U-Net and FCN have few parameters and low computational complexity because their structures are simple, whereas DeepLabV3, PSPNet, and BiSeNet adopt more complex network designs to improve performance. Although the complexity of ORBNet is relatively high, its performance is the best among the compared methods.

TABLE X Computational Complexity
SECTION V.

Conclusion

In this article, we designed ORBNet for semantic segmentation of high-resolution remote sensing images. It has a dual-branch structure that collects semantic features at different levels in HRRSIs. The proposed FAF module not only aligns features at different levels but also generates fused shallow and deep features. The proposed SCA module models dependency relationships in the spatial and channel dimensions. The detail loss and the class-specific discriminative loss are used in the detail branch and the semantic branch, respectively, to enhance the generation of the corresponding features. We conducted comparative experiments with other popular methods, and the results show that our method exhibits excellent performance. However, our network has many parameters, which limits its deployment on practical hardware devices. In addition, annotating remote sensing data requires substantial manpower. In future work, we will not only consider making the model lightweight but also focus on few-shot learning and weakly supervised learning.
