A Synergistical Attention Model for Semantic Segmentation of Remote Sensing Images


Abstract—In remotely sensed images, high intraclass variance and interclass similarity are ubiquitous due to complex scenes and objects with multivariate features, making semantic segmentation a challenging task. Deep convolutional neural networks can address this problem by modeling the context of features and improving their discriminability. However, current learning paradigms model feature affinity in the spatial and channel dimensions separately and then fuse them in a sequential or parallel manner, leading to suboptimal performance. In this study, we first analyze this problem empirically and summarize it as attention bias: when affinity is modeled only in the spatial or channel domain, the network's capability to distinguish weak, discretely distributed objects from wide-range objects with internal connectivity is reduced. To jointly model spatial and channel affinity, we design a synergistic attention module (SAM), which allows channelwise affinity extraction while preserving spatial details. In addition, we propose a synergistic attention perception neural network (SAPNet) for the semantic segmentation of remote sensing images. Its hierarchically embedded synergistic attention perception modules aggregate SAM-refined and decoded features. As a result, SAPNet enriches inference clues with the desired spatial and channel details. Experiments on three benchmark datasets show that SAPNet is competitive in accuracy and adaptability with state-of-the-art methods. The experiments also validate the hypothesis of attention bias and the efficiency of SAM.
Index Terms—Attention bias, contextual affinity, remote sensing images, semantic segmentation, synergistic attention.

I. INTRODUCTION
Semantic segmentation of remote sensing images aims to produce pixelwise categorical labels to facilitate interpretation of remote sensing data [1], [2], [3]. The semantically parsed annotation enables an intuitive perception of targets and, therefore, has been widely adopted in downstream tasks, such as land-cover mapping [4], [5], water resources management [6], [7], and disaster assessment [8], [9], among others. Due to the homogeneity between remote sensing and natural images, convolutional neural networks (CNNs) trained on large natural image datasets show strong adaptability and generalization capability when applied to remote sensing tasks [10], [11], [12], [13], [14]. The fully convolutional network (FCN) [15], for example, has become a baseline for semantic segmentation. The encoder-decoder architecture has also been adopted for this task: SegNet [16] gradually recovers the encoded abstract feature maps to the original spatial size to compensate for the sampling loss, and U-Net [17] deploys multiple skip connections between encoder and decoder stages to retain details. Nevertheless, the abovementioned approaches suffer from local receptive fields that deliver information over only a short-range context. Therefore, it is necessary to model a sufficiently wide context to enrich the inference clues of CNNs.
To address this issue, dilated convolution was designed to extract multiscale features that help to obtain more accurate semantic information [18]. Dilated convolution, also named atrous convolution, paves a feasible way to adaptively modify the size of the field-of-view of the convolution filter. Consequently, the atrous convolution-based networks [19], [20] can aggregate a wider range of contextual information. Unfortunately, the global context cannot be well-extracted by stacking atrous convolutions [21], [22].
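As a back-of-the-envelope illustration of this effect (not part of the cited papers), the effective kernel size of a k x k filter with dilation rate d is k + (k - 1)(d - 1), so a stack of dilated convolutions grows the receptive field much faster than a stack of standard ones:

```python
def effective_kernel(k: int, d: int) -> int:
    """Effective size of a k x k kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)

def receptive_field(layers):
    """Receptive field of a stack of stride-1 convolution layers.

    `layers` is a list of (kernel_size, dilation) pairs; each layer
    adds (effective_kernel - 1) pixels to the receptive field.
    """
    rf = 1
    for k, d in layers:
        rf += effective_kernel(k, d) - 1
    return rf

# Three standard 3x3 convolutions vs. three dilated ones (rates 1, 2, 4).
standard = receptive_field([(3, 1)] * 3)             # 7
dilated = receptive_field([(3, 1), (3, 2), (3, 4)])  # 15
```

Stacking rates 1, 2, and 4 more than doubles the receptive field of three standard 3 x 3 layers, which is the intuition behind the atrous designs in [19] and [20].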
Attempting to enrich the spectral-spatial context, Wang et al. [23] proposed to combine spectral, spatial, and semantic information before classification via a multikernel support vector machine, in which the spectral, spatial, and semantic information are each processed by an individual kernel. Moreover, 3-D convolution has been utilized: a 3-D fast learning block (a depthwise separable convolution block and a fast convolution block) followed by a 2-D CNN was introduced to extract spectral-spatial features [24]. Tang et al. [25] developed a 3-D octave convolution with a spatial-spectral attention network to capture discriminative spatial-spectral features for the classification of hyperspectral images. Wang and Liang [26] developed a lightweight three-layer 3-D convolutional network module for the spectral-spatial feature extraction of hyperspectral images. Alternatively, spectral-spatial fusion mechanisms have been investigated. Li et al. [27] skillfully designed a gated mechanism to aggregate the context in the spatial and spectral domains. Furthermore, Zhang [28] developed a two-stream spectral-spatial feature extraction network via 3-D hybrid convolution associated with a multilevel aggregation subnetwork. However, such designs integrate multiple modules, often with massive numbers of parameters, hindering the availability and scalability of segmentation networks for remote sensing images with variable objects and complex scenarios.
Inspired by human perceptual and cognitive systems, the attention mechanism has advanced natural language processing and computer vision [29]. The attention mechanism allows networks to focus on salient regions or aspects of specific objects in complex scenes while filtering out unimportant perceived information. In vision tasks, significant progress was made by SENet [30], which recalibrates channel weights based on their importance. This was followed by the proposal of self-attention, with a surprising ability to capture long-range dependencies. Nonlocal blocks [31] were initially designed for video classification and were successfully transferred to object detection and semantic segmentation. Similarly, DANet [32] implements a spatial attention module and a channel attention module (CAM) in parallel and then conducts fusion operations to provide contextual information from both domains. The convolutional block attention module (CBAM) [33] seamlessly integrates a lightweight attention module into a CNN to achieve competitive results in classification and detection. In addition, OCRNet [34] extracts relational context and uses it to sharpen the boundaries of objects in a self-attentive manner.
Remote sensing images acquired from high altitudes normally cover wide areas with complex and diverse backgrounds. The spatial details and potential semantic contents in remote sensing images are abundant even when the spatial resolution is relatively low [35], [36]. Thus, the ground objects are not well recognized by most computer vision networks, and an in-depth exploration of the visual features of objects in remote sensing images is necessary. Moreover, objects have typical variants in different geographical areas. For example, forests have different colors and shapes in different areas and at different times, and the Beijing National Stadium and the Sapporo Dome have significant visual differences. Since CNNs cannot fully extract the latent intraclass and interclass visual correlations of ground objects in remote sensing images, the attention mechanism provides an alternative to enlarge interclass dissimilarities and reduce intraclass variations. Driven by DANet [32], SCAttNet [37] adopts two lightweight attention modules that individually learn spatial and channel correlations, refining the learned representations adaptively. In addition, HMANet [38] exploits categorywise correlations beyond local-region attention to boost the discriminative capability of the segmentation model. Although impressive results have been achieved by adapting the attention mechanism, the mutually independent scheme of modeling spatial and channel affinities is far from optimal.
Whether deployed in a sequential or a parallel fashion, such attention mechanisms cannot model all relationships between the spatial and channel domains. For example, in traditional channel attention, each channel is connected to the other channels while spatial responses within a particular feature map are ignored. Similarly, interactions between channels are missing in the spatial attention procedure. The consequence of this drawback is a condensation of attentional features, which we name attention bias. TESA [39] captures interdependencies self-attentively along all dimensions of the tensor, using a postfusion of vector similarities to compensate for such bias. Although the fused similarities are more accurate than the individual ones, the vector similarities initially extracted by TESA are independent, which inevitably leads to attention bias. In contrast, the human visual system naturally preserves the integrity of multidimensional contextual information when understanding spatial and channel interactions; this integration allows humans to perceive and understand panoramas intelligently and accurately. Thus, a solution that alleviates attention bias is necessary.
To sum up, there are two problems with the existing attention-based semantic segmentation networks for remote sensing images.
1) Different classes of objects in remote sensing images may share similar color or texture properties, while objects of the same class may have various spatial sizes or shapes. Both textural and spatial information are therefore necessary for accurate ground object recognition. Deploying spatial and channel attention modules sequentially or in parallel leads to one-sided contextual information, disregarding the completeness of the context across the spatial and channel domains. Eventually, this design may yield suboptimal segmentation of weak and small objects or inconsistency inside uniformly represented objects.

2) From the computational complexity perspective, calculating the dependencies of the spatial and channel domains separately incurs higher memory and computation costs, as it may involve massive matrix multiplications. Therefore, it is a high priority to reduce matrix manipulations while retaining sufficient affinity extraction and aggregation, in which the spatial context is considered while generating channelwise dependencies.

To solve these two problems, in this article, a synergistic attention module (SAM), which collaboratively models spatialwise and channelwise correlations in one unique attention map, is devised to capture contextual affinity across the spatial and channel domains simultaneously. This module allows the network to leverage a more comprehensive affinity, in which both spatial and channel correlations are involved, to refine representations. Essentially, we unify the attention for both dimensions in a single module that incorporates spatial affinity when computing channelwise attention maps. Furthermore, both channelwise and spatialwise interactions are jointly modeled as attentional similarities to complement inference clues. Finally, we implement the plug-and-play SAM to form a synergistic attention perception neural network (SAPNet) for semantic segmentation. Experiments are conducted to validate its efficiency and superiority on three remote sensing benchmarks. The threefold contributions of this article are summarized as follows.
1) We undertake a theoretical and experimental analysis to demonstrate attention bias and its impact on the semantic segmentation of remote sensing images. This leads to the design of a consolidated attention module that addresses attention bias and retains integrated feature representations. SAM models well-preserved context and spatial and channel dependencies with the rationale of multidimensional nonlocal attention. Moreover, the proposed SAM requires fewer matrix manipulations, reducing computational complexity and execution time.

2) A novel neural network called SAPNet is proposed. SAPNet hierarchically deploys SAMs between the encoder and the decoder. Thus, the contextual information is more comprehensive, without massive spatial or channel loss. Consequently, the uncertainty of classifying a pixel into easily confused classes is significantly reduced.

3) Extensive experiments are conducted on three representative remote sensing semantic segmentation benchmarks [40], [41], [42]. The results show that SAPNet is superior to several state-of-the-art methods. Furthermore, the efficacy and efficiency of SAM are validated and analyzed.

The rest of this article is organized as follows. Section II presents related work on the semantic segmentation of remote sensing images and affinity modeling. Section III introduces the proposed method in detail. Section IV contains experiments, comparisons, and discussions that validate SAPNet on three remote sensing semantic segmentation benchmarks. Finally, Section V draws the conclusions.

II. RELATED WORK

A. Affinity Modeling
Contextual affinity is used to refine learned representations that are typically generated by convolutional backbones [43], [44]. Early affinity modeling approaches were realized by deploying dilated convolutions. DeepLab [18] designs and utilizes pyramid dilated convolutions with multiple rates to enlarge receptive fields, allowing features to contain a wider range of perceptual information than the neighboring pixels. On this basis, DenseASPP [45] densifies the dilation rates to cover more extensive scale ranges. However, due to the discontinuity of the convolution kernels at different scales, dilated convolution leads to the loss of local features. In addition, the captured long-range context may be irrelevant.
Recently, the graph convolutional network (GCN) has been adopted to describe non-Euclidean data that lack shift invariance. For semantic segmentation, Bronstein et al. [46] proposed a dual graph convolutional network (DGCNet) that models the global context of the input feature via two orthogonal graphs in a single framework. However, DGCNet requires the convolutional feature map, and only a slight improvement has been obtained in segmenting natural images. Moreover, Zhang [47] developed a CNN-enhanced heterogeneous graph convolutional network (CHeGCN) for park segmentation, constructing the relational topology of a binary classification task. Nevertheless, constructing a relation graph for a multiclass task would be very complex due to the inherent diversity of ground objects in remote sensing images, hindering the extension of GCN-based methods.
To selectively capture beneficial long-range context, the attention mechanism has been proposed. On the one hand, the attention mechanism determines whether correlations between input elements should be considered; on the other hand, it quantifies how much weight should be assigned to them. To date, the attention mechanism has been widely used in remote sensing, including scene classification [48], [49], object detection [50], [51], pansharpening [52], [53], and change detection [54], [55]. The essence of an attention-based deep neural network is to incorporate the similarity between different channels or positions to enhance pixel representations. As a result, the enhanced representations are more discriminative, and the predictions usually have higher accuracy.
Early attention mechanisms for vision tasks focus on modeling affinity in either the channel or the spatial dimension. For example, SENet [30] first highlighted the informative feature channels and weakened the less useful ones by adaptively recalibrating the channels with specific weights. Concurrently, Wang et al. [31] introduced nonlocal operators to capture spatialwise correlations and enrich spatial details, which inaugurated the exploration of the self-attention mechanism. However, 1-D contextual affinity is insufficient to produce discriminative features. Instead of identifying features only in the spatial domain, CBAM [33] attempts to emphasize informative features in both the spatial and channel domains by stacking channel attention and spatial attention. As a semantic segmentation-specific attention network, DANet [32] adopts the self-attention mechanism instead of stacking convolutions; more concretely, it deploys the position attention module (PAM) and CAM in parallel to capture dependencies. More recently, the transformer and its variants, built on the self-attention mechanism, have also been proposed for semantic segmentation. However, the initial tokenization at the beginning of a transformer can break the inherent details of ground objects, such as structures and boundaries, and for dense prediction, such details are essential for localization and recognition [56].
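To make the channel-recalibration idea concrete, the following NumPy sketch mimics the SENet-style squeeze-and-excitation pipeline with random stand-in weights (the real module learns w1 and w2 as fully connected layers within the network):

```python
import numpy as np

def se_recalibrate(feat, w1, w2):
    """SENet-style channel recalibration (illustrative NumPy sketch).

    feat: (C, H, W) feature map; w1: (C//r, C) and w2: (C, C//r)
    are stand-ins for the two learned fully connected layers.
    Squeeze: global average pooling -> per-channel descriptor (C,).
    Excite:  FC + ReLU, then FC + sigmoid -> per-channel weights.
    Scale:   reweight each channel of the input feature map.
    """
    z = feat.mean(axis=(1, 2))                  # squeeze: (C,)
    s = np.maximum(w1 @ z, 0.0)                 # FC + ReLU: (C//r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))         # FC + sigmoid: (C,)
    return feat * s[:, None, None]              # scale channels

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2                          # toy sizes, reduction r
feat = rng.normal(size=(C, H, W))
out = se_recalibrate(feat,
                     rng.normal(size=(C // r, C)),
                     rng.normal(size=(C, C // r)))
```

Because the sigmoid weights lie in (0, 1), every channel is attenuated rather than amplified, which is the "recalibration" behavior described above.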
To sum up, nonlocal techniques provide a simple yet effective way of adaptively capturing position/pixelwise contextual information. The nonlocal-based module can be flexibly embedded into different stages of a CNN backbone, operating on feature maps of different sizes and thus enabling multiscale feature refinement and context enrichment.
Although exceptional results have been achieved by introducing attention mechanisms, the sequential or parallel deployment of attention modules is individually biased: the generated attention maps miss the information from the counterpart dimension. Unlike existing attention methods, we propose to synergistically model the contextual affinity across the spatial and channel dimensions. Unlike recently adopted transformers, this one-stage attention module allows joint and simultaneous modeling of the contextual affinity in a unified attention map.

B. Semantic Segmentation of Remote Sensing Images
Remote sensing images are normally acquired at high altitudes. They often cover diverse and complex ground objects with high intraclass variance and interclass similarity. Semantic segmentation labels every pixel for the delineation of objects. Methods are typically implemented with statistical analysis and handcrafted feature-aided learning or data-driven deep learning steps.
Some state-of-the-art remote sensing semantic segmentation approaches adopt the encoder-decoder architecture due to its stable and reliable feature extraction and transformation capability. To boost localization accuracy and maintain feature consistency, U-Net has been extensively analyzed and explored for remote sensing imagery. TreeUNet [57] adaptively increases the classification accuracy at the pixel level by building a Tree-CNN block, which is constructed with a confusion matrix to fuse multiscale features. ResUNet-a [58] extends the U-Net baseline by introducing residual connections, atrous convolutions, pyramid scene parsing, pooling, and multitasking inference. To our knowledge, this network achieves excellent accuracy at the cost of massive parameters.
Recently, attention-based deep neural networks have been introduced for the semantic segmentation of remote sensing images. For example, Li et al. [59] proposed to use a deep-layer CAM and a shallow-layer spatial attention module to segment large-scale satellite imagery. Moreover, the squeeze-and-excitation (SE) design has been extended to the spatial dimension to focus on patchwise semantics, bridging the semantic gap between high- and low-level features [60]. AFNet [61] designs a multipath encoder structure, a multipath attention-fused module, and a refinement attention-fused block for the semantic segmentation of very-high-resolution remote sensing images. HMANet [38] hybridizes three levels of representations, namely space, channel, and category, to augment learned features. Similarly, Li et al. [62] proposed HCANet, which extracts and aggregates cross-level contextual information from pixels, superpixels, and the global space. In addition, an adaptive fusion network was designed by Liu et al. [63], enhancing the feature points from either low- or high-level feature maps with the constructed attention modules. LANet [64] introduces a patchwise attention module to incorporate locality information and then adaptively fuses cross-level attention via a devised attention embedding module.
In the context of semantic segmentation of remote sensing images, spatial details and channel information are both important to highlight the difference between diverse objects. Although the incorporation of the attention mechanism enables the segmentation network to capture correlations in all dimensions, the attention module inevitably misses the intrinsic spatial-channel interactions, leading to attention bias. To capture these interactions, this study strives to synergistically model multidimensional dependencies in a single module. This module generates a single map instead of aggregating multiple contextual affinity maps.

III. METHOD
First, an overview of the proposed network is presented in Section III-A. Next, the attention bias is discussed based on theoretical and practical analysis in Section III-B. Then, Section III-C thoroughly illustrates the proposed SAM, and a complexity analysis of SAM is given in Section III-D. Finally, the pipeline of synergistic attention perception (SAP) and the framework of SAPNet are presented in Section III-E.
A. Overview

Fig. 1 shows a flowchart of the proposed method. The designed network, based on the encoder-decoder architecture, deploys one SAM and four SAP modules between the encoder and decoder blocks. SAM refines the input features with spatial and channel dimensional affinities, and SAP aggregates the encoded, decoded, and SAM-refined features to preserve informative details. Therefore, the discriminability of pixelwise representations is enhanced before prediction. The details are given in the following.

B. Attention Bias
Contextual affinity plays a pivotal role in augmenting representations, highlighting valuable details, and disregarding irrelevant parts. Therefore, it is essential to introduce rich details into learned representations for pixel-level dense prediction [65]. Owing to advances in attention mechanisms, the primary current focus in affinity modeling is on exploring attention modules efficiently and effectively; among all the attention variants, the nonlocal rationale is generalized to the spatial and channel dimensions. To demonstrate the attention bias, we take DANet [32] as an example. DANet has a position attention module (PAM) for spatial contextual affinity and a CAM for channelwise contextual affinity. Fig. 2 shows the pipelines of PAM and CAM, and SAM is shown in Fig. 5 (top). Given an input feature F_in ∈ R^{H×W×C}, where H, W, and C denote height, width, and channel, respectively, PAM adopts the vanilla nonlocal block. Unlike PAM, CAM feeds features into three branches without convolution to preserve the original channel correlations. The contextual affinities of PAM and CAM have sizes of HW × HW and C × C, respectively. From the perspective of feature transformation, spatial affinity is the intradimensional correlation between arbitrary position pairs, ignoring the impacts of other channels even at the same position. As for CAM, each channel is connected with the other channels while spatial interactions are discounted.

Remote sensing images usually cover wide areas with high intraclass variance and low interclass differences. As a result, weak and small objects are often misclassified. Moreover, consistency inside large objects cannot be guaranteed. To address these issues, we propose SAM, which relieves attention bias by synergistically modeling contextual affinity in a single attention map.
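The shapes involved make this bias concrete. In the following NumPy sketch (which omits DANet's learned query/key projections), the spatial affinity collapses the channel axis and the channel affinity collapses the spatial axes:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

C, H, W = 16, 8, 8
feat = np.random.default_rng(1).normal(size=(C, H, W))
x = feat.reshape(C, H * W)                      # flatten spatial axes

# PAM-style spatial affinity: every position attends to every other
# position, but the channel axis is summed away -> (HW, HW).
spatial_aff = softmax(x.T @ x, axis=-1)

# CAM-style channel affinity: every channel attends to every other
# channel, but the spatial layout is summed away -> (C, C).
channel_aff = softmax(x @ x.T, axis=-1)
```

Neither map indexes the dimension that was flattened away: spatial_aff carries no channel index and channel_aff carries no positional index, which is precisely the attention bias analyzed above.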
We further conduct an exploratory experiment on the ISPRS Vaihingen benchmark to verify the attention bias issue. The predefined objects in the remote sensing image can be divided into two categories: weak and discretely distributed objects (WDOs), and wide-range objects with strong internal connectivity (WSOs). As shown in Fig. 3, three subpatches of Vaihingen are used to visually understand WDOs and WSOs. The WSOs are delineated with blue lines, while the WDOs are delineated with yellow or white lines.
The experimental models are shown in Fig. 4. SegNet with a ResNet-50 backbone is adopted as the baseline. At the end of the encoder, five different modules are embedded for feature refinement: in addition to deploying a single attention module (SAM, PAM, or CAM), the sequential and parallel fusion models are also examined. Besides, we test the attention bias with these attention models embedded in the proposed SAPNet. The comprehensive experimental results are presented in Section IV-D. In general, modeling spatial or channel contextual affinity independently leads to attention bias; extracting channel attention, for instance, ignores the spatial correlations inside the specific channels. As a result, the refined features might not be sufficiently discriminative. Therefore, we suggest that affinities shall be modeled synergistically to represent both positional and channel information in a unique attention map.

C. Synergistic Attention Module
As previously discussed, existing nonlocal attention is biased toward specific dimensions while ignoring the others. Moreover, the computations are redundant in sequential or parallel fusion models. To model the contextual affinity in a single attention map without sacrificing computational efficiency, SAM is devised.
The core pipeline is shown in Fig. 5, in which the bottom subplot shows the structure of the feature transformation and reorganization module (F&R). The input feature map is denoted as F_in ∈ R^{C×H×W}, where H, W, and C represent height, width, and channel, respectively. First, F&R is deployed to extract contextual priors over the horizontal and vertical coordinates of the spatial dimension. Widthwise and heightwise global average pooling are designed to obtain richer context horizontally and vertically, ensuring spatial consistency when computing the channelwise affinity; they generate Pool_W(F_in) ∈ R^{C×1×W} and Pool_H(F_in) ∈ R^{C×H×1}. Using different pooling sizes with regard to height and width enables the extraction of global contextual information along both directions. By repeating the pooled features H and W times along the collapsed axes, reconstructed features F_w ∈ R^{C×H×W} and F_h ∈ R^{C×H×W} are obtained. Afterward, the reconstructed features are split along the corresponding axial direction and concatenated. Therefore, the output of F&R is F_fr ∈ R^{C×S×(H+W)}, where S = H = W in the experiments.
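Our reading of the F&R transformation can be sketched in NumPy as follows; the split-and-concatenate step is simplified under the paper's assumption S = H = W:

```python
import numpy as np

def feature_transform_reorganize(f_in):
    """F&R sketch (assumes square inputs, S = H = W, as in the paper).

    f_in: (C, H, W). Width- and height-wise global average pooling
    give directional context vectors, which are broadcast back to
    (C, H, W) and concatenated along the spatial axis into
    (C, S, H + W).
    """
    C, H, W = f_in.shape
    assert H == W, "sketch assumes square feature maps (S = H = W)"
    pool_w = f_in.mean(axis=1, keepdims=True)    # (C, 1, W): height pooled
    pool_h = f_in.mean(axis=2, keepdims=True)    # (C, H, 1): width pooled
    f_w = np.broadcast_to(pool_w, (C, H, W))     # repeat H times
    f_h = np.broadcast_to(pool_h, (C, H, W))     # repeat W times
    return np.concatenate([f_w, f_h], axis=2)    # (C, S, H + W)

f_fr = feature_transform_reorganize(
    np.random.default_rng(2).normal(size=(8, 6, 6)))
```

The output keeps one pooled directional context per spatial coordinate, which is what lets the subsequent channel affinity remain spatially aware.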
F&R outputs a refined feature map that carries spatial context as priors. Subsequently, the transposed F_fr^T is multiplied with F_fr to calculate the contextual affinity A ∈ R^{(H+W)×C×C}, where A(i, j) is the correlation between the ith and jth channels at a specific spatial position. Finally, matrix multiplication and an elementwise summation are implemented, injecting a 3-D contextual affinity in which the spatial context serves as the prior in calculating the channelwise correlation; thus, both spatial and channel contexts are aggregated. This step is implemented as

F_SAM(j) = α Σ_i A(i, j) F_fr(i) + F_in(j)

where α is a learnable coefficient, F_SAM ∈ R^{C×H×W} is the refined representation with spatial and channel contextual affinities, and F_SAM(j) is the jth channel of F_SAM. The contextual affinity captured by SAM does not compress the spatial information while capturing the channel-dependent attention maps. SAM assumes that spatial correlations can be treated as contextual priors and injected into the channelwise attention calculation. In this way, SAM reaches a holistic contextual view, boosting the discriminability of learned representations. This property makes the network robust to the diverse and complex scenarios of remote sensing images.
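The affinity computation can likewise be sketched in NumPy. This is an illustrative reading of the module: the softmax normalization is a common convention we assume, and the residual fusion with F_in (weighted by the learnable α) and the reshape back to (C, H, W) are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sam_affinity(f_fr):
    """Sketch of SAM's synergistic affinity (our reading of the paper).

    f_fr: (C, S, L) with L = H + W. For each spatial position k, the
    (C, S) slice is multiplied by its transpose, giving a channel-by-
    channel affinity conditioned on spatial context: an (L, C, C)
    tensor rather than a spatially blind (C, C) matrix.
    """
    C, S, L = f_fr.shape
    affinity = np.empty((L, C, C))
    refined = np.empty_like(f_fr)
    for k in range(L):
        f_k = f_fr[:, :, k]                      # (C, S) slice at position k
        affinity[k] = softmax(f_k @ f_k.T, -1)   # (C, C), spatially aware
        refined[:, :, k] = affinity[k] @ f_k     # reweight channels
    return affinity, refined

f_fr = np.random.default_rng(3).normal(size=(8, 6, 12))
A, refined = sam_affinity(f_fr)
```

Because each (C, C) slice of A is computed at a specific spatial position, the channel correlations never discard the spatial layout, which is the synergy the module is named for.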

D. Complexity of SAM
Given an input feature of size C × H × W, the computational complexity of PAM is O((HW)^2 C), while CAM has a complexity of O(C^2 HW). PAM and CAM model the dependencies of the spatial and channel dimensions individually. To model 3-D dependencies, sequential and parallel architectures calculate spatial and channelwise correlations separately, yielding higher complexity and requiring more GPU memory. SAM instead jointly models the spatial and channel dimensional correlations, with a computational complexity of O(C^2 (H + W)S), where S = H = W in the experiments. Therefore, SAM is of the same order as CAM while introducing only a small number of additional matrix multiplications.
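Treating each entry of an affinity map as one inner product, these orders of growth can be compared with simple arithmetic (a rough operation count under that assumption, not a measured benchmark):

```python
def pam_ops(C, H, W):
    """PAM spatial affinity: an (HW x HW) map over C channels
    -> O((HW)^2 C) entries-times-channels."""
    return (H * W) ** 2 * C

def cam_ops(C, H, W):
    """CAM channel affinity: a (C x C) map over HW positions
    -> O(C^2 HW)."""
    return C ** 2 * H * W

def sam_ops(C, H, W):
    """SAM synergistic affinity: (C x C) maps at (H + W) positions of
    length S, with S = H = W -> O(C^2 (H + W) S)."""
    return C ** 2 * (H + W) * H

# At a typical feature resolution, SAM stays on CAM's order of
# magnitude while PAM is nearly an order of magnitude larger.
ratio = pam_ops(256, 64, 64) / sam_ops(256, 64, 64)
```

With H = W, sam_ops is exactly twice cam_ops, confirming that the synergistic map costs the same order as channel attention alone.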

E. Synergistic Attention Perception Neural Network
Fig. 6 presents the schematic of the proposed SAPNet. The network consists of an encoder-decoder baseline, four SAP modules, and one SAM. The upper branch is the feature encoding path, while the bottom one is the decoding path. Four SAP modules are embedded hierarchically to refine features at different scales. The encoded features at the smallest scale have a very small spatial size, where limited spatial details are available; furthermore, there is no scale variation in transferring this feature to the decoder stage. Therefore, the output features of the encoder are refined with a single SAM.
The encoded feature is denoted as F_e(l), 1 ≤ l ≤ 5, which represents the output feature of a specific encoder stage. Correspondingly, F_d(l), 1 ≤ l ≤ 5, denotes the output of the corresponding decoder block. Specifically, F_d(5) can be formed as

F_d(5) = F_SAM(F_e(5))

where F_SAM(·) refines F_e(5). The SAP module is utilized to aggregate the encoded and decoded features to prevent information loss, with SAM deployed as a branch for feature refinement. Fig. 7 shows the SAP module. SAP first aggregates the upsampled features from the decoder with the encoded ones. The top branch then outputs contextual-affinity features via SAM. Finally, concatenation followed by a 1 × 1 convolution generates the refined features

Ref(F_d(l)) = Conv_{1×1}(Concat(F_SAM(F_a(l)), F_a(l))), F_a(l) = Up(F_d(l+1)) + F_e(l)

where Ref(F_d(l)) represents the input feature for decoder l, derived from the fusion of the features from the former decoder and the counterpart encoder, Up(·) denotes the upsampling operator, and Concat(·) denotes the concatenation operator. Overall, the encoder-decoder baseline is augmented with multiscale feature refinement, in which all-inclusive contextual affinity and well-preserved original details are merged. As a result, comprehensive clues are provided for the probabilistic delimitation of each pixel.
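A shape-level NumPy sketch of one SAP step follows. The elementwise-sum aggregation, nearest-neighbor upsampling, and identity stand-in for the SAM branch are simplifying assumptions made for brevity, not the paper's exact operators:

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbor 2x upsampling of a (C, H, W) map."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def sap_fuse(f_dec, f_enc, refine, w):
    """SAP sketch (our reading of Fig. 7).

    f_dec:  (C, H/2, W/2) feature from the previous decoder stage.
    f_enc:  (C, H, W) feature from the counterpart encoder stage.
    refine: SAM-like refinement branch (here an arbitrary callable).
    w:      (C, 2C) stand-in weights for the 1x1 fusion convolution.
    """
    agg = upsample2x(f_dec) + f_enc                # aggregate enc/dec
    fused = np.concatenate([refine(agg), agg], 0)  # (2C, H, W)
    return np.tensordot(w, fused, axes=1)          # 1x1 conv -> (C, H, W)

rng = np.random.default_rng(4)
C, H, W = 8, 8, 8
out = sap_fuse(rng.normal(size=(C, H // 2, W // 2)),
               rng.normal(size=(C, H, W)),
               refine=lambda f: f,                 # identity stand-in for SAM
               w=rng.normal(size=(C, 2 * C)))
```

The fused output has the same shape as the encoder feature, so the module can be dropped between any encoder/decoder pair, which is how SAPNet deploys it hierarchically.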

IV. EXPERIMENTS AND DISCUSSION
We have conducted extensive experiments on three benchmarks to compare the performance of the proposed method with several state-of-the-art methods. We also evaluate attention bias with four attention models based on two baselines.
A. Datasets
1) ISPRS Vaihingen Dataset: It was acquired from an aerial platform with a spatial resolution of 9 cm. Six land-cover classes were labeled at the pixel level; clutter serves as a background class and is ignored in the accuracy evaluation. The publicly available data consist of 33 true orthophoto tiles with an average spatial size of 2494 × 2064 pixels. The red, green, and near-infrared (NIR) bands form false-color images. The numbers of training, validation, and testing images are 14, 2, and 17, respectively.
2) ISPRS Potsdam Dataset: It was also acquired from an aerial platform but with a spatial resolution of 5 cm. The same land-cover classes as Vaihingen were labeled at the pixel level. Four bands, R, G, blue (B), and NIR, are used. The spatial size of each image is 6000 × 6000. There are 26 images for training, four images for validation, and eight images for testing.
3) DeepGlobe Land-Cover Classification Dataset: It was acquired from the satellite with a spatial resolution of 0.5 m and three bands (R, G, and B). Compared with aerial images, satellite imagery suffers more from the diversity of land covers. The dataset contains 1146 images with a spatial size of 2448 × 2448 pixels. Among them, 802 images are used for training, 114 images are used for validation, and 230 images are used for testing.

B. Implementation Details
The experiments were implemented using PyTorch with an NVIDIA 3090 GPU under Linux OS. All images with ground truth were cropped to a spatial size of 256 × 256 for training, validation, and testing. The settings and hyperparameters are listed in Table I. Moreover, standard scaling (0.5, 1.5, and 2.0), rotation (90°, 180°, and 270°), and horizontal and vertical flips were employed for data augmentation. We applied softmax cross entropy as the loss function for all methods in the experiments.
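The geometric part of this augmentation (rotations and flips, excluding scaling) can be sketched as follows; the function and its signature are our own illustration, not the paper's code:

```python
import numpy as np

def augment(image, label, rng):
    """Random 0/90/180/270-degree rotation plus horizontal and vertical flips,
    applied identically to an image (H, W, C) and its label mask (H, W)."""
    k = int(rng.integers(0, 4))              # number of 90-degree rotations
    image = np.rot90(image, k, axes=(0, 1))
    label = np.rot90(label, k, axes=(0, 1))
    if rng.random() < 0.5:                   # horizontal flip
        image, label = image[:, ::-1], label[:, ::-1]
    if rng.random() < 0.5:                   # vertical flip
        image, label = image[::-1], label[::-1]
    return image.copy(), label.copy()

rng = np.random.default_rng(42)
img = np.arange(256 * 256 * 3, dtype=np.float32).reshape(256, 256, 3)
msk = np.arange(256 * 256).reshape(256, 256) % 6
aug_img, aug_msk = augment(img, msk, rng)
```

Applying the identical transform to the image and its mask keeps the pixelwise labels aligned, which is the essential constraint for segmentation augmentation.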
To make a convincing comparison, we reproduced and evaluated 12 methods with the same settings as in their original papers, except for the batch size and the optimizer. Among these methods, FCN-8s [15], SegNet [16], U-Net [17], and DeepLab V3+ [18] were pioneering encoder-decoder networks. CBAM [33] and DANet [32] were selected as attention-based natural image segmentation networks. RAANet [66], MACU-Net [67], SCAttNet [37], LANet [64], and HMANet [38] are state-of-the-art semantic segmentation methods developed specifically for remote sensing images. In addition, a hybrid network, ResUNet-a [58], was included as a nonattentive segmentation network for remote sensing images. It is worth noting that the reported results may not be identical to those in the original papers. In particular, the reproduced results of RAANet, MACU-Net, SCAttNet, LANet, HMANet, and ResUNet-a are slightly lower in accuracy than those published originally. We also partitioned the data randomly, so our partition may not be exactly the same as those used in other works.

C. Evaluation Metrics
Similar to SCAttNet [37] and LANet [64], we adopted the overall accuracy (OA) and average F1-score (AF) to evaluate the performance of all methods. The OA is defined as the ratio of correctly classified pixels to all pixels.
For classwise evaluation, the F1-score is calculated from precision and recall as

F1 = (2 × precision × recall) / (precision + recall)

where precision and recall are calculated as

precision = TP / (TP + FP), recall = TP / (TP + FN) (8)

where TP, FP, and FN are the numbers of true positive, false positive, and false negative samples, respectively. Furthermore, the AF over all classes is also calculated.
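Computed from a confusion matrix, the metrics above look as follows; this is a generic sketch, not the authors' evaluation code:

```python
import numpy as np

def overall_accuracy(conf):
    """OA: correctly classified pixels over all pixels; conf[true, pred]."""
    return np.trace(conf) / conf.sum()

def per_class_f1(conf):
    """Classwise F1 from precision and recall; eps guards empty classes."""
    eps = 1e-12
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp               # predicted as class c but wrong
    fn = conf.sum(axis=1) - tp               # pixels of class c that were missed
    precision = tp / np.maximum(tp + fp, eps)
    recall = tp / np.maximum(tp + fn, eps)
    return 2 * precision * recall / np.maximum(precision + recall, eps)

conf = np.array([[5, 1],
                 [2, 4]])
oa = overall_accuracy(conf)      # 9 / 12 = 0.75
af = per_class_f1(conf).mean()   # AF: average F1 over classes
```

Accumulating one confusion matrix over the whole test set and then deriving OA, F1, and AF from it matches the definitions used here.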

1) Embedding Different Attention Models Into SegNet:
The exploratory experiments were conducted on the ISPRS Vaihingen benchmark to validate the attention bias hypothesis described in Section III-B. We reimplemented four models, whose details are shown in Fig. 4. PAM-only denotes the network that deploys PAM after the encoder to refine the representation. CAM-only is similar to PAM-only but replaces PAM with CAM. CAM + PAM sequential embeds PAM followed by CAM sequentially. CAM + PAM parallel organizes PAM and CAM in parallel and then fuses them with an elementwise summation. Note that the proposed SAM is also embedded into the same baseline. The dotted lines with different colors represent the corresponding flows of the models in the experiments. The standard SegNet [16] with ResNet50 as the backbone serves as the baseline. The hyperparameter settings and data partition are the same for each model for fairness. The classwise F1-score and AF on the test set for each method are shown in Fig. 8.
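Treating PAM and CAM as black-box refiners, the two combined wirings reduce to a few lines; the helper names and toy stand-ins below are illustrative only:

```python
import numpy as np

def sequential(cam, pam):
    # CAM + PAM sequential: PAM followed by CAM, as described above
    return lambda f: cam(pam(f))

def parallel(cam, pam):
    # CAM + PAM parallel: refine independently, fuse by elementwise summation
    return lambda f: cam(f) + pam(f)

# Toy stand-ins for the attention modules, just to show the wiring.
cam = lambda f: f + 1.0          # pretend channel attention
pam = lambda f: 2.0 * f          # pretend position attention
f = np.full((4, 8, 8), 3.0)
seq_out = sequential(cam, pam)(f)   # cam(pam(f)) = 2*3 + 1 = 7 everywhere
par_out = parallel(cam, pam)(f)     # cam(f) + pam(f) = 4 + 6 = 10 everywhere
```

The point of the comparison is that both wirings refine spatial and channel affinity in separate passes and only merge the results afterward, which is where the attention bias enters; SAM instead models both within one attention map.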
It can be seen that PAM-only and CAM-only produce lower accuracy than the combined models. Moreover, while reaching almost the same overall level of accuracy, PAM-only performs better on impervious surfaces and on WDOs such as trees and cars, whereas CAM-only performs better on WSOs such as buildings and low vegetation. The combined models show significant improvement in performance: the OA of CAM + PAM sequential and CAM + PAM parallel increases by more than 3% over the PAM-only and CAM-only models. This observation verifies that both spatial and channel affinities are essential for feature refinement. However, the integration of the two attention-refined features is implemented as a postprocessing step, introducing attention bias. Our proposed SAM synergistically models the contextual affinity across the channel and spatial dimensions within one attention map, enriching the inference clues. The exploratory experiment with standard SegNet as the baseline shows a remarkable improvement for the proposed method, about 2% higher than CAM + PAM sequential and CAM + PAM parallel.
[Table II: Numerical results on the ISPRS Vaihingen dataset of different attention models based on SAPNet. Categorywise F1-score, AF, and OA are listed; bold indicates the best.]
2) Embedding Different Attention Models Into SAPNet: We further examined the attention bias based on the proposed SAPNet (see Fig. 6) on the ISPRS Vaihingen benchmark. Four models were implemented with different attention modules, namely, PAM-only, CAM-only, CAM + PAM sequential, and CAM + PAM parallel (see Fig. 4). The numerical results are listed in Table II, and the training losses are presented in Fig. 9. Table II reveals a steady increase in AF and OA with the proper integration of the PAM and CAM models. Notably, the proposed SAP model correctly classified 89.7% of pixels, a margin of 3.2% over the CAM + PAM parallel model. Superior performance is also observed in the value of AF. As for the classwise analysis, the F1-score of cars increases by more than 20% over the models with a single-dimensional context. Regarding buildings, their strongly consistent interior distribution benefits from the 3-D context, which allows SAP to achieve the best F1-score of 96.2.
A convincing comparison between the PAM-only and CAM-only models corroborates the hypothesis of attention bias in segmenting remote sensing images. As predefined, impervious surfaces, buildings, and low vegetation belong to the WSOs. When calculating channel attention maps, the global average pooling causes less loss for these classes; as a result, the F1-scores of these three categories are higher for CAM-only than for PAM-only. As for the WDOs, the global average pooling of CAM-only loses spatial details, dropping the F1-score by 3.6% for trees and 8.4% for cars relative to PAM-only. PAM captures the spatial correlations that are essential for WDOs. CAM + PAM sequential and CAM + PAM parallel exhibit successive increases. The AF and OA results suggest that the parallel fashion is better than the cascaded one for remote sensing images. Optimally, the unified manner of SAM, embedded in SAP, captures a 3-D context, leading to significant accuracy improvement.
All comparative models were trained with the same softmax cross-entropy loss. Fig. 9 presents the training losses; lower training loss generally corresponds to higher accuracy here. Compared with the other models, SAP reduces the loss markedly, converging steadily and comparatively fast. The training loss curves agree closely with the numerical results.
To sum up, SAM can capture both spatial and channel contexts to boost pixelwise representations, avoiding information loss in learning. Thus, the provided contextual information is more beneficial for inference.
1) Results on the ISPRS Vaihingen Dataset: Table III quantitatively compares the results of all methods on the ISPRS Vaihingen dataset. AF and OA strongly favor SAPNet, reaching 91.26 and 89.1%, respectively. Compared with natural image-specific methods, including FCN-8s, SegNet, U-Net, and DeepLab V3+, SAPNet improves performance by at least 13.92 in AF and 13.6% in OA. CBAM and DANet introduce attention modules; although their performance improvement is not significant, they confirm that exploiting context is a promising way to refine learned representations. RAANet embeds CBAM into DeepLab V3+ to enrich the contextual information, raising OA by 0.3% over DeepLab V3+. Based on U-Net, MACU-Net further realigns semantic features with multiscale skip connections and asymmetric convolutions, achieving an OA of 78.8%. The lightweight SCAttNet rebuilds the attention steps for feature refinement in remote sensing images, yielding about a 1% OA gain over DANet and CBAM. HMANet incorporates more contextual information for higher accuracy. LANet produces the second-best OA by attentively bridging the gap between high- and low-level features. ResUNet-a achieves an OA similar to LANet by hybridizing abundant optimization strategies; however, LANet has lower complexity and fewer parameters than ResUNet-a. Fig. 10 visualizes the predictions on the testing set. Attention-embedded networks yield better quality than the earlier encoder-decoder architectures. Moreover, SAPNet segments WDOs and WSOs with the best consistency with the ground truth. Fine boundaries and accurate localization and classification of all classes are realized with a simple strategy.
To sum up, the WDOs and WSOs in the Vaihingen dataset are well-distinguished by SAPNet, due to unbiased contextual affinity. This affinity jointly models the spatial and channel information that boosts the discriminative capability of semantic segmentation models effectively.

2) Results on the ISPRS Potsdam Dataset: As evident from Table IV, attention-based methods outperform baseline encoder-decoder methods. CBAM and DANet again demonstrate the scalability of the attention mechanism in capturing nonlocal correlations, bringing the OA to almost 88%. However, the heterogeneity of remote sensing images makes natural image-specific approaches far from optimal. With spatial and channel context information, HMANet produces the second-highest performance on the ISPRS Potsdam dataset, even slightly higher than the complicated ResUNet-a. Due to the injection of sufficient contextual affinity, SAPNet achieves the highest OA of 91.8% and AF of 93.10.
Visual inspections are shown in Fig. 11. We can clearly see that SAPNet segments both WDOs and WSOs much more finely than the other state-of-the-art methods. Standard encoder-decoder variants are generally inferior in mapping ground objects: the WDOs are incomplete, and inconsistent blurs appear inside the WSOs. With the incorporation of multiscale context, DeepLab V3+ partially resolves the blurs. By injecting attentive context, as CBAM and DANet do, the main parts of objects are well-distinguished. However, attention bias makes this kind of method suboptimal. After synergistically modeling contextual affinity, the spatial details and spectral correlations are well-extracted and integrated. As shown in Fig. 11(b), the impervious surfaces are delineated as consistently as in the ground truth, and the buildings on the left are well-outlined too. As a specific category of WDOs, cars are prone to errors with blurry edges in the compared methods in Fig. 11(a). Likewise, the compared methods are unable to handle complex scenes that cover multiple contiguous classes. The visual inspections in Fig. 11(c) exhibit high consistency between SAPNet and the ground truth. With the jointly modeled affinity, SAPNet strengthens the discriminability of the learned representations of cars, leading to better visual performance. In summary, the quantitative and qualitative evaluations generally agree on the effects of contextual affinity. Moreover, synergistic attention lets SAPNet extract comprehensive context at multiple scales, yielding significant accuracy improvement on the ISPRS Potsdam dataset.
3) Results on the DeepGlobe Dataset: The results on the DeepGlobe satellite images are shown in Table V. Although the spatial resolution is lower than that of the aerial images and the observation range is wider than the ISPRS benchmarks, the well-annotated masks and massive data provide favorable preconditions for training an optimal network. Statistically, SAPNet attains 92.75 AF and 92.9% OA. Compared with the state-of-the-art remote sensing-specific methods, AF and OA increase by 1.79 and 1.3% over ResUNet-a, respectively. In addition to the overall evaluation, the categorywise F1-scores also lead. Moreover, a trend similar to the ISPRS benchmarks is observed: attentively modeling contextual affinity helps the network learn more discriminative representations to boost the dense predictions. CBAM sequentially incorporates a spatial attention module and CAM, increasing AF and OA by 1.95 and 2.7% over DeepLab V3+, respectively. Moreover, remote sensing-specific methods account for the characteristics of ground objects and design corresponding fine-tuning tricks or transformations; as a result, these methods produce better performance. Specifically, ResUNet-a exhibits the second-best OA with 91.8%. In summary, incorporating spatial correlations into channelwise contextual affinity also lends strong support to recognizing ground objects from satellite imagery.
Visual inspections are shown in Fig. 12, with three samples selected from the testing set. WDOs and WSOs should be segmented with high certainty and consistency by a robust neural network. In Fig. 12(a), the rangeland and the neighboring agricultural land share similar visual features and irregular shapes, and are thus hard to classify. ResUNet-a integrates multiple strategies and attains a comparably smooth mask. The three attention variants of remote sensing-specific methods also boost the results significantly. However, attention bias produces blurs inside the rangeland and coarse-grained boundaries between different objects. Since water areas exhibit strong spectral absorbance in the visible and NIR bands compared with other kinds of ground objects and are vulnerable to the surrounding context, accurate water extraction requires discriminative features. Besides channelwise correlations, spatial details are also necessary because water areas always present irregular shapes and local connectivity. SAPNet satisfies both requirements: various water areas with different shapes and surfaces are well-mapped. To sum up, SAPNet delivers the best qualitative visualization due to the synergistically modeled contextual affinity at multiple scales.

F. Efficiency Analysis
The inference time and floating-point operations (FLOPs) of the different attention models embedded into SAPNet are presented in Table VI. The FLOPs are calculated with an input image size of 256 × 256 × 3. The FLOPs of the proposed SAP are slightly higher than those of CAM-only. However, they are far less than those of PAM-only, CAM + PAM sequential, and CAM + PAM parallel, which involve more than three times the FLOPs of SAP. Compared with the sequential and parallel models, SAP reduces the FLOPs for producing channel and spatial attention maps by more than 69%. As for time consumption, the inference time is averaged over 500 subpatches for all methods. SAP ranks second in speed (see details in Table VI): it is 3.53 ms slower than PAM-only in exchange for a 5.1% increase in OA.
In conclusion, the synergistically modeled contextual affinity provides the network with richer clues for inference. These clues boost the discriminability of the learned representations. Specifically, the efficiency evaluation shows a promising inference speed and a low FLOPs requirement compared with the PAM-only, CAM + PAM sequential, and CAM + PAM parallel models.

V. CONCLUSION
In this study, we first specify the attention bias of existing paradigms in segmenting ground objects in remote sensing images. Instead of separately modeling spatial and channel attention maps or fusing them in a postprocessing step, the proposed SAM jointly models spatialwise and channelwise correlations in one attention map, which is informative of spatial and channel interactions. Incorporating SAM, the developed SAPNet accounts for the multiscale nature of ground objects, reaching competitive results on three representative benchmarks.
In the future, two aspects will be further studied. On the one hand, the feature representation capacity of latent geometric property in Euclidean space is limited. It is promising to form a non-Euclidean feature space that enables high-fidelity feature representation. On the other hand, a few-shot semantic segmentation network for remote sensing images is required due to the high costs of labeling samples.