Introduction
Semantic segmentation, a typical task for remotely sensed imagery, involves classifying every pixel in an image into categories such as roads, vegetation, buildings, and water bodies [1]. In recent years, deep learning techniques have significantly advanced the accuracy of semantic segmentation of remotely sensed images, especially when handling complex patterns and diverse variations inherent in remote sensing scenes [2]. Specifically, convolutional neural networks (CNNs) [3] and vision transformers (ViTs) [4] are now commonly used backbone networks for semantic segmentation of remotely sensed images, evidenced by continuous evolution of methods that deliver state-of-the-art performance [5], [6], [7], [8].
However, these widely employed network architectures also face challenges when segmenting high-resolution remotely sensed images. CNNs, due to their limited receptive fields, struggle to capture long-range semantic dependencies present within high-resolution images [9]. Although ViTs possess a global receptive field, their quadratic complexity makes them challenging to deploy for high-resolution images [10]. To address these challenges, researchers have started to investigate architectures based on a newly introduced network called Mamba [11].
Mamba, a network based on state space models (SSMs) [12], was initially applied to large language models [13], [14]. Mamba functions as a sequential network similar to a recurrent neural network (RNN) [15], capable of incorporating prior information and predicting subsequent states. It efficiently compresses long-term contextual information through a selective mechanism that attends to or ignores inputs as needed. When applied to vision tasks, this network can achieve a balance between a global receptive field and linear complexity [16], indicating great promise in the segmentation of remotely sensed images.
Drawing on the success of ViTs [4], which introduced the transformer architecture to vision tasks, extensive research [17], [18], [19] has successfully integrated Mamba into image processing tasks. Similar to ViT, which crops an image into patches and flattens them before feeding them into the transformer model, Mamba processes flattened image patches as sequences. However, unlike ViT, which computes multihead self-attention among these image patches, Mamba processes image patches sequentially. Consequently, numerous scanning directions [17], [18], [19], [20], [21] of image patches are possible.
Extensive research has explored new scanning directions and their combinations, attempting to enhance Mamba's performance in image understanding. Fig. 1 displays 12 commonly used scanning directions (D1–D12). D1–D4 involve sequentially scanning every row or column of image patches in a “Z”-shaped pattern. D5–D8 involve sequential scanning of image patches in diagonal directions. D9–D12 perform “S”-shaped serpentine scanning of image patches. However, existing studies have not comprehensively compared their effectiveness. Therefore, there is an urgent need for a comparative study to quantitatively evaluate the impact of various scanning directions and their combinations on the performance of Mamba-based methods in typical remote sensing tasks such as semantic segmentation.
Fig. 1. In total, 12 commonly used scanning directions in Vision Mamba. Images are cropped into patches according to a predefined size, and these patches are then modeled as sequences based on specific scanning direction(s).
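To make these scanning directions concrete, the following Python sketch (our own illustrative code, not taken from any of the cited implementations) generates the patch-visit orders for a few representative directions on an H × W grid of patch indices; the function name and the mapping of outputs to the D-labels of Fig. 1 are assumptions made purely for illustration.

```python
import numpy as np

def scan_orders(h, w):
    """Return patch-visit orders (flat indices) for a few representative
    scanning directions on an h x w grid of patches.
    Direction labels loosely follow Fig. 1 and are illustrative only."""
    grid = np.arange(h * w).reshape(h, w)
    return {
        "D1": grid.reshape(-1),                  # row-major, left to right
        "D2": grid.reshape(-1)[::-1],            # reverse of D1
        "D3": grid.T.reshape(-1),                # column-major, top to bottom
        "D4": grid.T.reshape(-1)[::-1],          # reverse of D3
        # "S"-shaped serpentine scan: every other row is reversed (cf. D9)
        "D9": np.concatenate(
            [row if i % 2 == 0 else row[::-1] for i, row in enumerate(grid)]
        ),
    }

# Example: reorder a flattened sequence of patch embeddings (L = h * w tokens)
h, w, c = 4, 4, 8
tokens = np.random.rand(h * w, c)        # flattened patch embeddings
order = scan_orders(h, w)["D9"]          # serpentine visit order
serpentine_sequence = tokens[order]      # sequence fed to the SSM for this scan
```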
In this article, we designed an experimental framework aimed at undertaking a comprehensive and fair comparison of various scanning strategies for Vision Mamba, tailored specifically for the semantic segmentation of high-resolution remotely sensed images. It is important to note that our designed framework does not represent a new or modified Mamba architecture. Instead, our focus is on offering new insights into various potential scanning strategies for Mamba-based semantic segmentation, a topic that has not been adequately addressed in the existing literature. We evaluate 22 scanning strategies in our comparative experiments, including 12 individual scanning directions and 10 combinations of scanning directions. Each scanning strategy is tested on the LoveDA [22], ISPRS Potsdam, ISPRS Vaihingen, and UAVid [23] datasets.
The main contributions of this article are as follows.
This article summarizes commonly used scanning strategies of Vision Mamba.
For the first time, this study quantitatively assesses the influence of various scanning strategies of Vision Mamba on the accuracy of semantic segmentation of remotely sensed imagery, using a specifically designed experimental framework.
Related Work
A. Representative Semantic Segmentation Methods
With the rapid development of deep learning, numerous neural networks for image processing have been proposed, effectively extracting features and serving as powerful backbone networks. ResNet [24] addresses the challenges of vanishing and exploding gradients in deep networks by introducing residual modules, enhancing the stability of deep networks and becoming a commonly used encoder in remote sensing tasks [25]. ConvNeXt [6] further modernizes ResNet by introducing depthwise convolutions, layer normalization, and other architectural refinements, improving both parameter and computational efficiency and thereby enhancing feature extraction capabilities. Swin Transformer [26] optimizes ViT by implementing a hierarchical window attention mechanism, effectively addressing the computational complexity challenges of ViT in handling high-resolution images and making it suitable for high-resolution remote sensing image processing [7]. Additionally, Hong et al. [27] introduced a foundation model based on the generative pretrained transformer for multispectral and hyperspectral remote sensing image analysis, which holds great potential for applications in semantic segmentation of remotely sensed imagery. Recently, autoregressive networks, such as Mamba [11], RWKV [28], [29], and xLSTM [30], [31], have shown excellent performance in image classification tasks, demonstrating their potential as new backbone networks for image feature extraction.
In the task of semantic segmentation of remotely sensed images, encoder-decoder architectures are frequently used [32]. These architectures perform feature extraction and dimensionality reduction of the input using the aforementioned backbone networks, and then restore the original resolution in the decoder. The large-scale variations [33] of objects in remotely sensed images must be handled effectively by the decoder [5]. To this end, the DeepLab [34] series of networks combines atrous convolutions and pyramid pooling modules, allowing the model to expand its receptive field while capturing both local and global information, making it capable of handling objects with significant scale variations. Unlike DeepLab, UperNet [35] employs a different strategy to handle multiscale information. UperNet introduces a feature pyramid network [36] that merges feature maps of different scales, integrating global information and local details to enhance the recognition of small objects and complex backgrounds. With the advent of Transformers [4], whose self-attention mechanisms can effectively capture long-range dependencies and contextual information in images, numerous studies, such as DNLNet [37], ABCNet [38], and MANet [39], have effectively applied attention mechanisms, demonstrating their effectiveness in improving model performance on remote sensing imagery.
B. Development of SSM
RNNs [40] are classic sequential networks that calculate the current hidden state from the previous hidden state and the current input. SSMs [12] follow a similar principle in continuous time, mapping an input signal $x(t)$ to an output $y(t)$ through a hidden state $h(t)$ as follows:
\begin{align*}
h^{\prime} \left( t \right) &= Ah\left( t \right) + Bx\left( t \right) \tag{1}\\
y \left( t \right) &= Ch\left( t \right) + Dx\left( t \right) \tag{2}
\end{align*}
However, such SSMs are unable to handle discrete data inputs. Since many data types are discrete, enabling SSMs to cope with discrete data is meaningful. This is effectively achieved by the structured state space sequence model (S4) [41], which employs the zero-order hold technique to discretize the SSM, as described by the following equations:
\begin{align*}
{{h}_k} &= \bar{A} {{h}_{k - 1}} + \bar{B}{{x}_k} \tag{3}\\
{{y}_k} &= \bar{C} {{h}_k} + \bar{D}{{x}_k} \tag{4}\\
\bar{A} &= {{e}^{\Delta A}} \tag{5}\\
\bar{B} &= \left( {{{e}^{\Delta A}} - I} \right)\ {{A}^{ - 1}}B \tag{6}\\
\bar{C} &= C \tag{7}
\end{align*}
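For illustration, the following minimal sketch applies the zero-order hold formulas (5)-(7) to a toy single-input single-output SSM and then runs the recurrence (3)-(4) over a sequence. It is a didactic example only, not the hardware-efficient parallel scan used in S4 or Mamba, and all parameter values are arbitrary.

```python
import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, B, delta):
    """Zero-order hold, Eqs. (5)-(6): A_bar = exp(dA), B_bar = (exp(dA) - I) A^{-1} B."""
    A_bar = expm(delta * A)
    B_bar = (A_bar - np.eye(A.shape[0])) @ np.linalg.inv(A) @ B
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, D, x):
    """Recurrence of Eqs. (3)-(4) over a 1-D input sequence x of length L."""
    h = np.zeros(A_bar.shape[0])
    y = np.zeros_like(x)
    for k, xk in enumerate(x):
        h = A_bar @ h + (B_bar * xk).ravel()
        y[k] = float(C @ h) + D * xk
    return y

# Toy example with a 2-state SSM and random parameters (illustrative only)
rng = np.random.default_rng(0)
A = -np.eye(2) + 0.1 * rng.standard_normal((2, 2))   # roughly stable state matrix
B = rng.standard_normal((2, 1))
C = rng.standard_normal((1, 2))
A_bar, B_bar = discretize_zoh(A, B, delta=0.1)
y = ssm_scan(A_bar, B_bar, C, D=0.0, x=rng.standard_normal(64))
```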
However, due to the issue of linear time invariance [43], the matrices $\bar{A}$, $\bar{B}$, and $\bar{C}$ remain fixed for every input, which prevents the model from emphasizing or suppressing particular elements of a sequence. Mamba [11] addresses this limitation with a selective scan mechanism (S6) that makes the SSM parameters functions of the input, allowing the model to selectively compress contextual information while retaining linear complexity.
C. Development of Vision Mamba
Since Mamba is a sequential network that cannot directly process two-dimensional image data, exploring methods to serialize images is meaningful. The first attempt, Vim [18], similar to ViT [4], crops an image into patches and flattens them. It performs both forward (D1) and reverse (D2) scans of image patches in rows before merging them, as shown in Fig. 2(b). Similarly, VMamba [17] builds on the foundation of Vim by adding two vertical scanning directions (D3, D4), as shown in Fig. 2(c). PlainMamba [19] adopts a serpentine scanning approach (D9, D10, D11, D12), as illustrated in Fig. 2(d).
Fig. 2. (a) Flattening scanning strategy, consistent with Samba. (b) Scanning forward and backward after flattening, followed by merging, consistent with Vim. (c) Scanning in four directions sequentially, followed by merging, consistent with VMamba. (d) Serpentine scanning in four directions, consistent with PlainMamba.
These efforts are all based on the hypothesis that varying scanning directions of image patches can potentially enhance Mamba's understanding of images. However, there is a lack of comprehensive and quantitative comparisons of model performances under different scanning directions in their work. For example, Vim and PlainMamba lacked essential ablation studies to validate their scanning methods. In VMamba, the results for horizontal scan combinations (i.e., D1 and D2) were not reported, while the four-directional scanning (i.e., D1, D2, D3, and D4) achieved only a 0.3% higher accuracy on ImageNet [3] compared to the unidirectional scanning D1. Considering the likely fluctuations in model performance during training, this marginal improvement is inadequate to confirm the effectiveness of multidirectional scanning.
D. Mamba-Based Semantic Segmentation
As Vision Mamba continues to evolve, numerous studies have assessed its performance in semantic segmentation tasks, particularly in the domains of medical imaging and remote sensing. In these studies, different scanning strategies were also considered to test their impact on Mamba's image understanding capability.
U-Mamba [44] represented the first attempt to merge Mamba with the UNet [45] architecture for semantic segmentation of medical images. However, due to its simplistic architectural design, its performance fell short of the then state-of-the-art segmentation methods. Subsequently, several enhanced methods [46], [47], [48], [49], [50] emerged, using bidirectional scanning with Vim and/or four-directional scanning with VMamba.
In the context of remote sensing, Samba [21] was the first study to introduce Mamba into semantic segmentation of remotely sensed images, in which image patches are flattened in the same manner as in ViT, as shown in Fig. 2(a). Later, RS3Mamba [51] used the four-directional scanning method of VMamba to construct an auxiliary encoder for semantic segmentation. Similarly, RSMamba [20] expanded on VMamba's four-directional scanning by adding four additional diagonal directions (i.e., D5, D6, D7, and D8) in its encoder-decoder architecture.
Experimental Framework
To thoroughly assess the impact of scanning strategies on Mamba's performance in semantic segmentation tasks with high-resolution images, we have designed a specific semantic segmentation framework using an encoder-decoder architecture to facilitate quantitative comparisons of scanning strategies.
A. Overall Architecture
The overall framework is shown on the left-hand side of Fig. 3. Images are divided into patches in the encoder section and then sequentially fed into four Vision Mamba scan (VMS) blocks for progressive downsampling. To ensure the fairness of the experiments, we consistently use UperNet [35], a widely adopted state-of-the-art segmentation decoder, to produce the segmentation results.
B. VMS Block
The VMS block is a residual structure with a skip connection, whose main body consists of two branches. One branch uses a depthwise convolution layer to extract features, performs S6 computations [11] on scans in various directions, and subsequently merges them. The other branch consists of a linear mapping followed by an activation layer.
Although similar to Mamba, this architecture differs in a key aspect, namely the form of image scanning, referred to as the 8-direction scan (8D scan) block, shown on the right-hand side of Fig. 3. As the number of scanning directions considered in our experiments ranges from 1 (i.e., unidirectional) to 8 (i.e., a combination of eight individual scanning directions), we designed eight potential scanning slots within the 8D scan block: Dn1, Dn2, Dn3, …, Dn8. After being separately processed along each of these eight scanning directions, image patches undergo feature extraction through the S6 block, and the features from all eight directions are subsequently merged. When the number of considered scanning directions is 1, 2, or 4, the scanning directions are repeated 8, 4, and 2 times, respectively, to fill the eight potential slots.
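The direction-filling and merging logic of the 8D scan block can be summarized with the hypothetical sketch below. The S6 computation is replaced by a placeholder sequence operator so that only the scanning structure is visible; the slot-tiling rule follows the description above, whereas the averaging-based merge and all names are our own illustrative assumptions rather than the exact implementation.

```python
import torch

def eight_direction_scan(tokens, orders, directions, seq_op):
    """tokens:    (L, C) flattened patch embeddings
    orders:       dict mapping direction name -> visit-order index tensor of length L
    directions:   the 1, 2, 4, or 8 directions chosen for a given experiment
    seq_op:       per-direction sequence operator (stand-in for the S6 block)"""
    # Tile the chosen directions to fill the eight potential scanning slots,
    # e.g., [D1, D2] -> [D1, D2, D1, D2, D1, D2, D1, D2]
    slots = (directions * (8 // len(directions)))[:8]

    merged = torch.zeros_like(tokens)
    for d in slots:
        order = orders[d]
        inverse = torch.argsort(order)     # map the sequence back to image order
        y = seq_op(tokens[order])          # per-direction sequence modeling
        merged = merged + y[inverse]       # accumulate in the original patch order
    return merged / len(slots)             # simple averaging merge (assumption)

# Usage with a dummy sequence operator and the D1/D2 orders of a 4 x 4 grid
L, C = 16, 8
tokens = torch.randn(L, C)
orders = {"D1": torch.arange(L), "D2": torch.arange(L).flip(0)}
out = eight_direction_scan(tokens, orders, ["D1", "D2"], seq_op=lambda s: s.cumsum(dim=0))
```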
Experiments
A. Datasets
To minimize the potential influence of varying characteristics of different datasets on the results, we investigated the impact of scanning strategies across four commonly used semantic segmentation datasets in remote sensing, namely ISPRS Vaihingen, ISPRS Potsdam, LoveDA, and UAVid. In our experiments, the dataset settings are consistent with those widely used in other studies [5], [21], as detailed in the following.
ISPRS Vaihingen consists of 33 high-resolution remotely sensed images, featuring a spatial resolution of 9 cm and varying image sizes (on average 2494 × 2064 pixels). These images cover near-infrared, red, and green bands, and are categorized into six classes: impervious surface, building, low vegetation, tree, car, and clutter. Images labeled with IDs 1, 3, 5, 7, 11, 13, 15, 17, 21, 23, 26, 28, 30, 32, 34, and 37 are used for training, while the remaining 17 images are used for validation.
ISPRS Potsdam has the same categories as ISPRS Vaihingen but features a spatial resolution of 5 cm. This dataset contains 38 images, each with an identical image size of 6000 × 6000 pixels. It covers four spectrum bands: red, green, blue, and near-infrared, with only the RGB channels being used in our study. Images with the following IDs 2_10, 2_11, 2_12, 3_10, 3_11, 3_12, 4_10, 4_11, 4_12, 5_10, 5_11, 5_12, 6_07, 6_08, 6_09, 6_10, 6_11, 6_12, 7_07, 7_08, 7_09, 7_10, 7_11, and 7_12 are used for training. The remaining 14 images are used for validation. Consistent with ISPRS Vaihingen, the clutter category is excluded from the result evaluation.
LoveDA [22] contains 1669 validation images, 1796 test images, and 2522 training images. All images have a size of 1024 × 1024 pixels, with a spatial resolution of 30 cm, covering 7 categories: background, building, road, water, barren, forest, and agricultural. The validation set is used for performance evaluation in our study.
UAVid [23] is a high-resolution remote sensing video dataset specifically designed for urban scene understanding, captured using unmanned aerial vehicles. The dataset consists of 42 high-resolution video sequences, each containing frames at a resolution of 3840 × 2160 pixels and providing detailed information on various urban elements. The UAVid dataset covers eight distinct classes: building, road, static car, tree, low vegetation, human, moving car, and background clutter. The dataset is split into 20 sequences for training, 7 for validation, and 15 for testing. It is notable for its dynamic urban environments and varied lighting conditions, making it a valuable resource for evaluating semantic segmentation algorithms in real-world scenarios. The validation set is used for performance evaluation in our study.
B. Training Settings
In this study, we utilized widely adopted training settings to ensure an effective comparison, as detailed in Table I. Due to the limited number of training samples available, data augmentation is employed to prevent overfitting [52]. Random resize, random crop, random flip, and photometric distortion are consistently applied for data augmentation in our experiments. Experiments are performed using two RTX 4090D GPUs.
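As an example of the joint image-and-mask augmentation listed above, a minimal NumPy sketch of the random flip and random crop steps might look as follows (the random resize and photometric distortion follow the same joint pattern); the crop size and probabilities are illustrative and do not reproduce the exact settings of Table I.

```python
import numpy as np

def augment(image, mask, crop=512, rng=np.random.default_rng()):
    """Apply a random horizontal flip and a random crop jointly to an image
    (H, W, 3) and its label mask (H, W). Parameter values are illustrative."""
    if rng.random() < 0.5:                      # random horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    h, w = mask.shape
    top = rng.integers(0, h - crop + 1)         # random crop location
    left = rng.integers(0, w - crop + 1)
    return (image[top:top + crop, left:left + crop],
            mask[top:top + crop, left:left + crop])

# Example on a dummy 1024 x 1024 tile
img = np.zeros((1024, 1024, 3), dtype=np.uint8)
lbl = np.zeros((1024, 1024), dtype=np.uint8)
img_aug, lbl_aug = augment(img, lbl)
```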
C. Patching Methods
The patch size employed for image cropping and the stride used during scanning can also affect the experimental results. Currently, most Mamba-based visual tasks adopt a patch size of 4 × 4 and a stride of 4. To identify the most suitable patch size and stride for subsequent experiments, this study carries out ablation experiments with various patching methods across the considered datasets. To quantify the impact of patching methods on segmentation, the mean intersection over union (mIoU) is used as the accuracy metric in the ablation tests. For consistency, the D1 scanning direction is used in the ablation experiments. The stride also plays a crucial role, as it determines the sequence length and thus the computational load. We use floating point operations (Flops) to quantify the computational load, calculated using one randomly generated 512 × 512 image as input. Doubling the stride reduces the number of patch tokens by a factor of four, resulting in an approximately fourfold reduction in computation. Considering practicality, the minimum stride in our experiments is set at 4, because smaller strides would demand excessively high computational resources, making training impractical on two 24-GB GPUs. Patch sizes considered in our experiments include 4 × 4, 8 × 8, 16 × 16, and 32 × 32, each paired with a stride equal to the patch size so that a whole image is segmented without overlap. Additionally, strides smaller than the patch size are also considered to allow for overlapping scanning of images.
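The relationship between patch size, stride, and sequence length can be made explicit with a small sketch: a strided convolution whose kernel equals the patch size embeds the patches (overlapping when the stride is smaller than the patch size), and the resulting token count, which largely determines the Flops, scales roughly as (image size / stride)². The embedding dimension below is an arbitrary illustrative value.

```python
import torch
import torch.nn as nn

def patch_tokens(image_size=512, patch=4, stride=4, embed_dim=96):
    """Embed patches of a (3, image_size, image_size) input with a strided
    convolution and return the resulting sequence length."""
    embed = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=stride)
    x = torch.randn(1, 3, image_size, image_size)
    tokens = embed(x).flatten(2).transpose(1, 2)   # (1, L, embed_dim)
    return tokens.shape[1]

for patch, stride in [(4, 4), (8, 8), (8, 4), (16, 16), (32, 32)]:
    L = patch_tokens(patch=patch, stride=stride)
    print(f"patch {patch:2d}, stride {stride:2d} -> sequence length {L}")
# Doubling the stride divides the token count (and thus roughly the Flops) by four.
```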
D. Experiments Design
Fig. 4 shows the 22 scanning strategies tested in our experiments. Experiments (Exp) 1–12 consist of individual directional scans, where each of the eight potential scanning directions in the 8D scan block is set to that single direction. Exp 13–18 represent six sets of bidirectional scanning experiments, where the two directions are repeated four times to fill the eight potential scanning directions in the 8D scan block. Exp 19–21 consist of three sets of four-directional scans, in which four scanning directions are repeated twice in the 8D scan block. For example, in Exp 19, the eight potential scanning directions, from Dn1 to Dn8, are D1, D2, D3, D4, D1, D2, D3, and D4. Exp 22 combines all eight scanning directions as input to the 8D scan block. This arrangement ensures that the parameters and computational load of these experiments remain consistent. We use the mIoU metric to assess the overall effectiveness of segmentation, and the IoU score for each class, across the four datasets.
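For completeness, the per-class IoU and mIoU metrics used throughout the experiments can be computed from a confusion matrix as in the following sketch; this is a standard formulation and is not tied to any particular evaluation toolkit.

```python
import numpy as np

def confusion_matrix(pred, label, num_classes, ignore_index=255):
    """Accumulate a (num_classes x num_classes) confusion matrix from flat arrays."""
    valid = label != ignore_index
    idx = num_classes * label[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def iou_scores(conf):
    """Per-class IoU = TP / (TP + FP + FN); mIoU is their mean over valid classes."""
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return iou, np.nanmean(iou)

# Example with random predictions over a 6-class map
rng = np.random.default_rng(0)
pred = rng.integers(0, 6, size=512 * 512)
label = rng.integers(0, 6, size=512 * 512)
per_class_iou, miou = iou_scores(confusion_matrix(pred, label, num_classes=6))
```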
The scanning strategies investigated in our experiments (as shown in Fig. 4) include those already adopted in previous studies. Specifically, the scanning strategies in Exp 1, Exp 13, Exp 19, Exp 21, and Exp 22 correspond to those used in Samba [21], Vim [18], VMamba [17], PlainMamba [19], and RSMamba [20], respectively.
Results
A. Impact of Patching Methods
The segmentation accuracies resulting from various patch sizes and strides are presented in Table II. A consistent finding emerged from the performance analysis: for image inputs of 512 × 512 pixels, a 4 × 4 patch size with a stride of 4 yields the highest segmentation accuracy on all datasets. When the stride was decreased toward 4 with a fixed patch size, Mamba did not exhibit bottlenecks in handling the resulting longer sequences, indicating its potential to manage even smaller strides efficiently. When the stride was fixed, reducing the patch size improved segmentation performance, suggesting that the Mamba architecture is more adept at processing finer image patches. Based on these findings, we consistently used a 4 × 4 patch size and a stride of 4 in the subsequent experiments (i.e., Exp 1–22).
B. Impact of Scanning Strategies
Tables III–VI present the semantic segmentation accuracies for ISPRS Vaihingen, ISPRS Potsdam, LoveDA, and UAVid, respectively, using the 22 scanning strategies detailed in Fig. 4.
Analyzing the outcomes of these experiments, we observed an interesting phenomenon across all four datasets used in our experiments. The segmentation accuracies resulting from the 22 scanning strategies were very similar. Taking into account the small performance differences among scanning strategies within each dataset, as well as the performance variations of a single scanning strategy across the four datasets, there was no apparent indication of any specific scanning strategy outperforming the others, regardless of their complexity or whether they involved single or multiple scanning directions. Any slight performance fluctuations observed were probably attributable to the randomness of the training process.
C. Comparisons With Representative Methods
A Mamba-based method was compared with representative state-of-the-art methods in remote sensing semantic segmentation tasks. In our comparative experiments, Vision Mamba with unidirectional scanning was used as the encoder, and UperNet as the decoder. The compared methods used Swin Transformer [26], ConvNeXt [6], and ResNet [24] as encoders, and UperNet [35], PSPNet [53], and DeepLabV3+ [34] as decoders within the encoder-decoder architecture.
To ensure fairness, all methods were trained using a fully supervised paradigm across the four benchmark datasets considered. The semantic segmentation accuracies of these methods are shown in Table VII. The results indicate that the combination of Vision Mamba with unidirectional scanning and UperNet achieved the best segmentation performance across all four datasets. Mamba's superior performance in segmenting remotely sensed images can be attributed to its advantage in contextual understanding. In high-resolution remotely sensed images, there are often categories representing large-area objects [22], such as water bodies, forests, and agriculture, where Mamba's segmentation performance significantly outperforms that of CNN-based methods constrained by their limited receptive fields [20]. Additionally, due to Mamba's strong inductive bias, it also performs better than ViT-based methods when training samples are limited. However, studies [21] have pointed out that Mamba-based segmentation currently lacks sufficient attention to areas with complex local features. The superiority of Vision Mamba in semantic segmentation of remotely sensed images has also been demonstrated in other studies [20], [54], [55] through similar comparative experiments.
Discussion and Future Work
For semantic segmentation of high-resolution remotely sensed images, our study finds that the utilization of particular scanning directions or combinations of different scanning directions, as proposed in existing Mamba-based approaches, does not effectively improve segmentation accuracy. Therefore, in the Vision Mamba framework, using a ViT-like flattening approach (i.e., D1 scanning) remains effective for semantic segmentation of such images. Moreover, employing a unidirectional scanning strategy such as D1 also reduces computational demands, allowing for deeper network stacking within limited computational resources.
Exploring the generalization of language models to vision tasks is of significant importance for the development of deep learning, as evidenced by the success of ViT [4]. Recent advancements in recurrence-based models [11], [28] have prompted ongoing exploration of effective ways to integrate them into visual tasks. Current efforts focus on designing strategies for scanning image patches to enhance the model's understanding of image sequences. However, our investigation into semantic segmentation of remotely sensed images revealed that Mamba-based models are not sensitive to different scanning strategies. Therefore, it is arguable that efforts would be better directed toward exploring means other than scanning strategies to enhance the Mamba model's understanding of remotely sensed images.
Our work does not intend to discredit the extensive efforts to improve scanning strategies in Vision Mamba but rather to demonstrate that these improvements have limited effects on semantic segmentation of remotely sensed images. This phenomenon is explainable: remotely sensed images differ from conventional images in terms of features. On the one hand, compared with conventional images such as pictures of people, the difference between patches representing the same semantic class in remotely sensed images is minimal once converted into sequences. On the other hand, the causal linkage within the resulting sequences is weaker than that of conventional images.
However, the effectiveness of varying scanning strategies in other types of datasets with more apparent causal relationships in sequences, such as COCO_Stuff [56] and Cityscapes [57], remains to be verified, which presents an interesting area for future work. Additionally, exploring the effectiveness of different scanning strategies in semantically segmenting multispectral [58], [59], hyperspectral [60], [61], and multimodal remotely sensed images [62] also represents a valuable direction for future research.
While experimenting with different patching methods, we discovered an interesting phenomenon: reducing the stride improved segmentation accuracy, albeit at the expense of increased computational demands. This suggests that Mamba may perform better with strides smaller than the smallest stride (i.e., 4) used in our experiments for processing images of 512 × 512 pixels. However, the rapid (quadratic) growth of the sequence length as the stride decreases precluded experimentation with smaller strides given the computational resources available to us. Investigating more efficient computational methods to accommodate denser scanning is a meaningful direction for future study.
Conclusion
This study quantitatively investigated the impact of 22 scanning strategies in Mamba-based approaches for semantic segmentation of high-resolution remotely sensed images, across the ISPRS Vaihingen, ISPRS Potsdam, LoveDA, and UAVid datasets. The experimental outcomes demonstrated no discernible enhancement in segmentation accuracy from the various scanning strategies, whether individual scanning directions or their combinations. Therefore, for remotely sensed images, a simple flattening method was deemed sufficient in Mamba-based approaches. However, the effectiveness of multidirectional scanning methods for conventional images still requires validation.
Our study also found that Mamba-based methods benefited from reducing the stride, leading to improved performance at the cost of increased computational resources. Therefore, it is valuable to develop more efficient computational methods to support denser scanning.