Hybridizing Euclidean and Hyperbolic Similarities for Attentively Refining Representations in Semantic Segmentation of Remote Sensing Images

Attention mechanisms (AMs) have revolutionized semantic segmentation networks for interpreting remote sensing images (RSIs) owing to their ability to establish contextual dependencies. Nevertheless, because of the complex scenes and diverse objects in RSIs, a variety of details and correlations are not available in Euclidean space. Therefore, a similarity-hybrid attention module (SHAM) is devised to attentively learn the hyperbolic and Euclidean attention maps between any two positions, followed by a weighted elementwise summation. The hybrid attention maps possess the latent geometric properties of both Euclidean and hyperbolic spaces. Taking the commonly used fully convolutional network (FCN) as the baseline, a hybrid attention-enhanced neural network (HAENet) that embeds SHAM is presented. Experiments on the International Society for Photogrammetry and Remote Sensing (ISPRS) Potsdam and DeepGlobe benchmarks show that it outperforms the compared methods. In addition, the ablation study validates the effectiveness of SHAM compared with other attention modules.

Fundamentally, fully convolutional networks (FCNs) exhibit impressive performance in extracting rich features and have been extended to semantic segmentation (SS) of RSIs. However, FCN-based approaches are inherently limited by small receptive fields and local context [1].
To capture contextual information from a broader range, atrous spatial pyramid pooling (ASPP) with multiscale dilation rates was proposed [2]. However, the global context remains uncaptured because stacked convolutional layers are inherently local.
Alternatively, the attention mechanism (AM) provides an efficient way of capturing and incorporating global context [3]. An early work, SENet [4], recalibrates channelwise weights to highlight the informative channels of feature maps. Furthermore, as a milestone, the nonlocal neural network (NLNNet) utilizes a self-AM (SAM) that models positionwise correlations to refine input features [5]. Since then, the SAM has been the primary choice for capturing long-range contextual information in SS, and the resulting Transformer has also achieved considerable success [6]. Specifically, several RSI-targeted networks, namely HCANet [7], LANet [8], and HMANet [9], have remarkably improved segmentation accuracy by designing and deploying novel SAM variants. Nonetheless, the existing methods are defined in Euclidean space, in which the features are flattened to satisfy the axioms of Euclidean geometry [10].
However, Bronstein et al. [11] have shown that images also exhibit a highly non-Euclidean latent structure. Besides, in several applications, the dissimilarity measures constructed by experts tend to behave in a non-Euclidean way. Therefore, Euclidean space cannot always provide the most powerful or meaningful geometric representations, and it is necessary to exploit hyperbolic representations to take advantage of this property.
Since Ganea et al. [12] derived the hyperbolic neural network, projecting feature vectors from Euclidean space onto a hyperboloid has enabled representations to capture fundamental data properties, including non-Euclidean visual phenomena and clustering behavior [13]. In RSIs especially, the imaging altitude is always high, which introduces distortions into the representations in Euclidean space. Therefore, the Euclidean vectorwise similarity alone is suboptimal for describing the relation between two elements; the latent non-Euclidean similarity is equally essential. Apart from designing a loss function to fine-tune the training of a network [14], it is suggested to incorporate hyperbolic representations into the network directly.
Motivated by the SAM, a similarity-hybrid attention module (SHAM) is proposed to generate and fuse Euclidean and hyperbolic similarities, which are measured with Euclidean and hyperbolic representations, respectively. Specifically, we devise a hyperbolic projection and tokenization flow, based on a pseudo-hyperboloid, that adapts to the normal attention workflow. The calculated hyperbolic distance (HD) then generates a hyperbolic attention map (HAM). Two contributions are summarized as follows.
1) To comprehensively and effectively exploit Euclidean and hyperbolic geometric representations, we propose SHAM, which post-fuses the contextual affinities of the two spaces. In addition to performing self-attention in Euclidean space, two parallel branches form a pseudo-hyperboloid to project feature vectors and measure their correlations in hyperbolic space. In this way, the post-fused attention map involves the similarities of both Euclidean and hyperbolic representations, making the refined features more discriminative.
2) Based on SHAM, the hybrid attention-enhanced neural network (HAENet) is devised to segment RSIs. Experiments on the Potsdam and DeepGlobe benchmarks exhibit competitive performance. Moreover, the ablation study demonstrates the effects of SHAM.

A. Overview
As illustrated in Fig. 1, HAENet inherits the encoder-decoder architecture. Given an input image, the feature encoder outputs the corresponding representation, which is then fed into the SHAM. There, the encoded features are refined with global context by a high-fidelity similarity derived from hetero-curvature spaces that carry different visual properties. Finally, the SHAM-refined feature is upsampled and used for classification.

B. Hyperbolic Projection and Tokenization
Building on the Riemannian manifold and its metric, the hyperbolic projection and tokenization module, termed HP&T, projects each (Euclidean) feature vector onto the hyperboloid (the Poincaré model in this study) with the exponential map (see Fig. 2). In the formal definition (1), x is the feature vector in Euclidean space, v is the anchor, c is a hyperparameter governing the curvature and radius of the Poincaré model, $\lambda_v^c$ is the conformal factor, and $\oplus_c$ denotes Möbius addition. The anchor is set to the origin in practice, which simplifies (1). After projection, every Euclidean vector has its hyperbolic counterpart. We endow these counterparts with sequential positions to achieve tokenization. Thus, HP&T produces position-assisted hyperbolic representations for the subsequent operations.
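The projection step can be illustrated with a minimal sketch. It assumes the exponential map at the origin takes the form commonly used in the hyperbolic neural network literature, exp_0^c(x) = tanh(sqrt(c)·||x||) · x / (sqrt(c)·||x||); whether this matches the letter's simplified equation term by term is an assumption. The function names `expmap0` and `hp_and_t` are hypothetical and are not taken from the authors' code or from geoopt.

```python
import math


def expmap0(x, c=1.0, eps=1e-12):
    """Map a Euclidean vector x onto the Poincare ball of curvature -c.

    Assumed form (anchor at the origin):
        exp_0^c(x) = tanh(sqrt(c) * ||x||) * x / (sqrt(c) * ||x||)
    """
    sqrt_c = math.sqrt(c)
    norm = math.sqrt(sum(v * v for v in x))
    if norm < eps:
        # The origin maps to itself.
        return [0.0 for _ in x]
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * v for v in x]


def hp_and_t(feature_vectors, c=1.0):
    """Project every position's feature vector and attach its sequential index."""
    return [(pos, expmap0(vec, c)) for pos, vec in enumerate(feature_vectors)]


# Three positions with 2-D features (toy values, not from the paper).
tokens = hp_and_t([[3.0, 4.0], [0.0, 0.0], [0.1, -0.2]], c=1.0)
for pos, gyro in tokens:
    print(pos, gyro)
```

Every projected vector has norm below 1/sqrt(c), so it lies inside the ball. Practical implementations usually also clip points slightly away from the boundary for numerical stability; that step is omitted here.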

C. Hyperbolic Distance
Analogous to the distance in Euclidean space, HD is used to measure the similarity between two arbitrary gyrovectors (see Fig. 3). Inherently, HD captures the hyperbolic properties of a specific position in the RSI, compensating for the information lost by the Euclidean vector. In this study, the induced distance (3) is defined for two gyrovectors x and y that belong to the same Poincaré model. HD can therefore be normalized to provide a hyperbolic similarity for each pair of positions.
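As an illustration of the distance computation, the sketch below uses the standard induced distance on the Poincaré ball, d_c(x, y) = (2 / sqrt(c)) · artanh(sqrt(c) · ||(−x) ⊕_c y||), together with the usual closed form of Möbius addition. That this is exactly the expression in (3) is an assumption, and the function names are hypothetical.

```python
import math


def mobius_add(x, y, c=1.0):
    """Mobius addition x (+)_c y on the Poincare ball of curvature -c."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    coef_x = 1 + 2 * c * xy + c * y2
    coef_y = 1 - c * x2
    denom = 1 + 2 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]


def poincare_distance(x, y, c=1.0):
    """Induced distance between two gyrovectors of the same Poincare model."""
    sqrt_c = math.sqrt(c)
    diff = mobius_add([-a for a in x], y, c)
    norm = math.sqrt(sum(d * d for d in diff))
    # Clamp the argument so that atanh stays finite near the boundary.
    arg = min(sqrt_c * norm, 1 - 1e-12)
    return 2 / sqrt_c * math.atanh(arg)


# Toy gyrovectors inside the unit ball (not from the paper).
print(poincare_distance([0.1, 0.2], [-0.3, 0.4]))
print(poincare_distance([0.0, 0.0], [0.5, 0.0]))  # equals ln(3) for c = 1
```

The distance grows without bound as either point approaches the boundary of the ball, which is why some normalization is needed before it can serve as an attention weight.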

D. Similarity-Hybrid Attention Module
SHAM jointly models and aggregates self-attentive dependencies in Euclidean and hyperbolic spaces in parallel. As depicted in Fig. 1 (pink box), five sub-branches are devised. Initially, the input feature is convolved with a kernel size of 1 × 1, generating five representations for the self-attention paradigm. The first and second branches (from the top) implement vanilla self-attention, which captures long-range dependencies in Euclidean space. The resulting attention map $A_e \in \mathbb{R}^{HW \times HW}$ encodes Euclidean similarity and is computed from the (Euclidean) feature maps $F_{11} \in \mathbb{R}^{HW \times C}$ and $F_{21} \in \mathbb{R}^{C \times HW}$, where C is the channel number.
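The Euclidean branch can be sketched as follows, assuming the vanilla self-attention form in which the product of the two feature maps is normalized row-wise by Softmax. The normalization axis and the absence of a scaling factor are assumptions, and `euclidean_attention` is a hypothetical name.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def euclidean_attention(f11, f21):
    """Return the HW x HW map Softmax(F11 F21).

    f11: HW x C feature map (list of rows).
    f21: C x HW feature map (list of rows).
    """
    hw, channels = len(f11), len(f21)
    logits = [
        [sum(f11[i][k] * f21[k][j] for k in range(channels)) for j in range(hw)]
        for i in range(hw)
    ]
    return [softmax(row) for row in logits]


# Toy example with HW = 3 positions and C = 2 channels (not from the paper).
a_e = euclidean_attention(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]],
)
for row in a_e:
    print(row)
```

Each row of the result sums to one, so row i can be read as the distribution of attention that position i pays to every other position.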
To compute the hyperbolic similarity matrix, the convolved $F_3$ and $F_4$ are first fed into HP&T. Then, one of the tokenized gyrovectors associated with the sequential positions is reshaped to $HW \times C_h$ and the other to $C_h \times HW$, where $C_h$ is the dimension of the gyrovectors. Following (3), the pairwise distances are computed and passed through a Softmax layer, yielding the attention map $A_h \in \mathbb{R}^{HW \times HW}$ that encodes hyperbolic similarity. Here, $F_{31} \in \mathbb{R}^{HW \times C_h}$ and $F_{41} \in \mathbb{R}^{C_h \times HW}$ are the (hyperbolic) gyrovectors.
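A minimal sketch of the hyperbolic branch is given below. The text states only that the pairwise distances are followed by a Softmax layer; negating the distances so that closer pairs receive larger weights is an assumption of this sketch, as is the row-wise normalization. The distance is written in the arcosh closed form, which is mathematically equivalent to the artanh form based on Möbius addition. All function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def poincare_distance(x, y, c=1.0):
    """Poincare-ball distance in its arcosh closed form."""
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    diff2 = sum((a - b) ** 2 for a, b in zip(x, y))
    arg = 1 + 2 * c * diff2 / ((1 - c * x2) * (1 - c * y2))
    return math.acosh(max(arg, 1.0)) / math.sqrt(c)


def hyperbolic_attention(f31, f41, c=1.0):
    """Return an HW x HW map from pairwise hyperbolic distances.

    f31: HW x C_h gyrovectors (one per row).
    f41: C_h x HW gyrovectors (one per column).
    Assumption: logits are negated distances, so near pairs weigh more.
    """
    hw = len(f31)
    columns = [[f41[k][j] for k in range(len(f41))] for j in range(hw)]
    logits = [
        [-poincare_distance(f31[i], columns[j], c) for j in range(hw)]
        for i in range(hw)
    ]
    return [softmax(row) for row in logits]


# Toy gyrovectors inside the unit ball (not from the paper).
g = [[0.1, 0.2], [-0.3, 0.4], [0.5, 0.0]]
g_t = [[v[k] for v in g] for k in range(2)]
for row in hyperbolic_attention(g, g_t):
    print(row)
```

With identical gyrovectors in both branches, each position attends most strongly to itself, because its self-distance is zero.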
To aggregate the similarities from the hetero-curvature spaces, a weighted summation is applied to $A_e$ and $A_h$, giving the similarity-hybrid attention map $A_{sh} = \alpha A_e + \beta A_h \in \mathbb{R}^{HW \times HW}$, where α and β are two learnable coefficients (both initialized to 0.5). Finally, the refined feature maps $F_{re}$ are obtained from $A_{sh}$ and $F_{51} \in \mathbb{R}^{C \times HW}$, with ⊕ denoting elementwise summation.
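The fusion and refinement steps can be sketched under stated assumptions. The weighted sum of the two attention maps follows the description directly. The refinement is assumed to take the form F_re = F ⊕ (F_51 A_sh), where F is the input feature; both the residual term and the order of the product are assumptions. The function names are hypothetical.

```python
def hybrid_attention(a_e, a_h, alpha=0.5, beta=0.5):
    """Weighted elementwise sum of the Euclidean and hyperbolic maps."""
    return [
        [alpha * e + beta * h for e, h in zip(row_e, row_h)]
        for row_e, row_h in zip(a_e, a_h)
    ]


def refine(f_in, f51, a_sh):
    """Assumed refinement: f_in + f51 @ a_sh.

    f_in, f51: C x HW feature maps; a_sh: HW x HW attention map.
    """
    hw = len(a_sh)
    aggregated = [
        [sum(row[k] * a_sh[k][j] for k in range(hw)) for j in range(hw)]
        for row in f51
    ]
    return [
        [f + a for f, a in zip(row_f, row_a)]
        for row_f, row_a in zip(f_in, aggregated)
    ]


# Toy maps for HW = 2 positions and C = 1 channel (not from the paper).
a_e = [[1.0, 0.0], [0.0, 1.0]]
a_h = [[0.5, 0.5], [0.5, 0.5]]
a_sh = hybrid_attention(a_e, a_h)             # alpha = beta = 0.5
print(a_sh)                                   # [[0.75, 0.25], [0.25, 0.75]]
print(refine([[0.0, 0.0]], [[1.0, 3.0]], a_sh))  # [[1.5, 2.5]]
```

When α + β = 1 and both inputs are row-stochastic, the hybrid map is row-stochastic as well. In training, α and β would be learnable parameters rather than fixed constants.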
In summary, vanilla self-attention still suffers from uncertainty in segmenting RSIs, especially for easily confused pixels and pixels near edges, because the similarity in Euclidean space cannot provide sufficient discriminability. SHAM allows the network to learn the hybrid similarity in a single module at an acceptable computational cost. Consequently, the refined features can supply more comprehensive contextual cues for inference.

A. Datasets
Two benchmarks, namely, International Society for Photogrammetry and Remote Sensing (ISPRS) Potsdam and DeepGlobe, are examined. The Potsdam benchmark consists of 38 annotated aerial imagery tiles (20/4/14 tiles for training/validation/test) with six categories (clutter is ignored in evaluation). Every tile has a size of 6000 × 6000 pixels with a spatial resolution of 5 cm. The DeepGlobe benchmark is acquired from a satellite platform with a spatial resolution of 50 cm. Specifically, 1146 images of size 2448 × 2448 pixels are available (803/171/172 images for training/validation/test).

B. Evaluation Metrics
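To evaluate the performance, we calculate the pixelwise intersection over union (IoU) as $\mathrm{IoU} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$, where TP, FP, and FN are the numbers of true positives, false positives, and false negatives in image i for class j, respectively. Moreover, the mIoU over k classes is given as $\mathrm{mIoU} = \frac{1}{k} \sum_{j=1}^{k} \mathrm{IoU}_j$. The F1 score is $F_1 = 2 \cdot \mathrm{precision} \cdot \mathrm{recall} / (\mathrm{precision} + \mathrm{recall})$, where precision = TP/(TP + FP) and recall = TP/(TP + FN), and the overall accuracy (OA) is the fraction of correctly classified pixels.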
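The metrics above can be computed from per-class counts as in the following sketch. The counts are illustrative values, not results from the paper, and the function names are hypothetical.

```python
def iou(tp, fp, fn):
    """Intersection over union for one class."""
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def mean_iou(per_class_iou):
    """Average IoU over k classes."""
    return sum(per_class_iou) / len(per_class_iou)


# Illustrative counts for two classes: (TP, FP, FN).
counts = [(80, 10, 10), (50, 25, 25)]
ious = [iou(*c) for c in counts]
print(ious, mean_iou(ious))
print([f1_score(*c) for c in counts])
```

Note that F1 can equivalently be written as 2TP / (2TP + FP + FN), which makes clear that it is never smaller than the IoU of the same class.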

C. Implementation Details
Three bands (R, G, and B) are used as the input channels. In practice, we split the raw images into subpatches with a spatial size of 256 × 256 for training. The comparative methods are reproduced on a Tesla V100 GPU with 32 GB of memory. The batch size is 16, and the learning-rate policy is poly decay with an initial learning rate of 0.0001 and a momentum of 0.9. Several data augmentations are deployed, including 90°, 180°, and 270° rotations and horizontal and vertical flips. The SGD optimizer and the cross-entropy loss are used in this study. ResNet50 with eight-times downsampling is adopted as the backbone, and the maximum number of epochs is set to 500. We produce the hyperbolic embeddings with reference to geoopt (see footnote 1). The feature embeddings are first transformed from Euclidean space to hyperbolic space and then mapped onto the Poincaré model for distance (similarity) calculation. The procedure is position-related; thus, the pseudo-hyperbolic space and its related parameters do not need to be trained or optimized. All comparative methods are trained from scratch without bells and whistles.

In the first group of experiments, we compare the proposed HAENet with state-of-the-art (SOTA) methods, including LANet, HCANet, and HMANet. Several fundamental networks are also compared, namely FCN-8s, U-Net, and DeepLab V3+. In the second group, we compare several attention modules built on FCN, including the squeeze-and-excitation block (SEB) of SENet [4] (termed FCN + SEB), the convolutional block attention module (CBAM) [15] (termed FCN + CBAM), and the dual attention block (DAB) of DANet [16] (termed FCN + DAB). In addition, FCN + nonlocal block (NLB) [5] is evaluated as the ablative model; it is identical to SHAM with the hyperbolic-related branches removed.
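For readers unfamiliar with the poly decay policy mentioned in the implementation details, a minimal sketch follows. The initial learning rate of 0.0001 comes from the text; the decay power of 0.9 is a commonly used value and is an assumption here (the 0.9 reported in the text is the SGD momentum, and the decay power is not stated). The function name `poly_lr` is hypothetical.

```python
def poly_lr(step, max_steps, base_lr=1e-4, power=0.9):
    """Poly decay: base_lr * (1 - step / max_steps) ** power.

    power=0.9 is an assumed value; the source does not state it.
    """
    return base_lr * (1 - step / max_steps) ** power


# Learning rate at a few points of a hypothetical 500-epoch schedule.
for epoch in (0, 125, 250, 375, 500):
    print(epoch, poly_lr(epoch, 500))
```

The schedule starts at the initial learning rate and decays smoothly to zero at the final step.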

D. Results of Potsdam Dataset
As presented in Table I, the categorywise F1 score, mean F1, and mIoU are collected on the test set. In general, our HAENet outperforms the other methods on the Potsdam benchmark. The mean F1 score and mIoU of HAENet are the best among the recently proposed SOTA methods, such as LANet, HCANet, and HMANet. An increase of 1%/2% in mean F1 score/mIoU is obtained compared with HMANet. Some typical baselines are susceptible to the imbalanced distribution, intraclass variations, and interclass similarities of RSIs, resulting in low accuracy, below 87%/72% in mean F1 score/mIoU. Although RSI-targeted networks have made extensive efforts, they ignore the high-dimensional non-Euclidean properties. For buildings that suffer from occlusion, hyperbolic representations can stretch the occluded part and inject it into the gyrovector for similarity measurement. HAENet reaches an F1 score of over 97% for buildings. At the same time, the hybrid similarity aggregation allows for more distinguishable representations of cars, with the F1 score rising by about 3% compared with the second-best network. The predictions for two random samples of the Potsdam test set are shown in Fig. 4, where the vast majority of pixels are correctly classified.

1 https://github.com/geoopt/geoopt

E. Results of DeepGlobe Dataset
The DeepGlobe benchmark has a lower spatial resolution and covers a broader range than the aerial images, so fine-grained visual features are difficult to learn. As shown in Table II, the accuracy of all methods degrades. However, the proposed HAENet reaches the highest accuracy, with 82.93%/67.78% in mean F1 score/mIoU. Specifically, HAENet leads in all categories except rangeland, for which its F1 score is 0.02% lower than that of HMANet. The F1 score for water areas exceeds 95%. Concerning imaging conditions, satellite RSIs are orthographic and insensitive to light. Accordingly, the gain in mean F1 score over the second-best method is less than 1%. The predictions for two random samples of the DeepGlobe test set are shown in Fig. 5. In summary, satellite RSIs have a less distinctive hyperbolic property, although a slight improvement over the other SOTA methods is still achieved.

F. Ablation Study
With the same setup and the FCN-8s baseline, we embedded SEB, CBAM, and DAB at the end of the encoder. The results are listed in Tables I and II. Overall, the proposed SHAM yields the best refinement of the encoded representations. Compared with SEB, the mIoU on the Potsdam test set rises from 72.29% to 84.28%. On DeepGlobe, the increase drops slightly to about 7% in mIoU. CBAM and DAB, which fuse two attention modules in cascaded and parallel manners, respectively, have similar effects on the two datasets. As described in III-C, FCN + NLB is the ablative model obtained by removing the hyperbolic-related branches. Numerically, NLB refines the learned feature maps with respect to position. However, the latent non-Euclidean similarity is not introduced.
By fusing the attention maps of the hetero-curvature spaces, SHAM exploits cues that are invisible in Euclidean space to measure the similarity of different objects more accurately, producing a higher-fidelity similarity through the incorporation of hyperbolic geometry.

IV. CONCLUSION
This letter proposes a novel SHAM, which captures the latent non-Euclidean visual properties by attentively fusing position-associated attention maps from Euclidean and hyperbolic spaces. The experiments conducted on the ISPRS Potsdam and DeepGlobe benchmarks validate its efficacy and its superiority to several methods. Moreover, the ablation study examined the effects of SHAM. This study opens a new direction for the interpretation of RSIs from a non-Euclidean view.