MSGFormer: A DeepLabv3+ Like Semantically Masked and Pixel Contrast Transformer for MouseHole Segmentation

In semantic segmentation, the efficient representation of multi-scale context is of paramount importance. Inspired by the remarkable performance of Vision Transformers (ViT) in image classification, researchers have proposed a number of semantic segmentation ViTs, most of which achieve impressive results. However, these models often fail to utilize multi-scale context effectively, disregard intra-image semantic context, and neglect the global context of the training data, i.e., the semantic relationships among pixels across different images. In this paper, we introduce Sliding Window Dilated Attention and combine it with Spatial Pyramid Pooling (SPP) to form a novel decoder called Sliding Window Dilated Attention Spatial Pyramid Pooling (SwinASPP). By adjusting the sliding-window dilation rates, this decoder captures multi-scale contextual information at different granularities. Additionally, we propose the Semantic Attention Block, which integrates semantic attention operations into the encoder. By adopting our proposed supervised pixel-wise contrastive learning algorithm, we shift the training strategy for semantic segmentation from intra-image to inter-image. Our experiments demonstrate that these methods lead to performance improvements on the SanJiangYuan MouseHole dataset and Cityscapes.


I. INTRODUCTION
Semantic segmentation is one of the fundamental tasks in computer vision; it aims to assign a semantic label to each pixel in an image. Fully Convolutional Networks (FCNs) [1] are the pioneering work that treats semantic segmentation as a pixel-level prediction task, and many subsequent works have been inspired by FCNs.
Following the tremendous success of transformers in natural language processing (NLP), many scholars have proposed incorporating transformers into visual tasks. Dosovitskiy et al. [2] proposed the Vision Transformer (ViT) for image classification, which achieved remarkable performance. Subsequently, to demonstrate the effectiveness of transformers in semantic segmentation, Zheng et al. [3] proposed SETR, which achieved state-of-the-art results on ADE20K and Pascal Context. Currently, the mainstream approach employs a transformer backbone pretrained on ImageNet [4] as the encoder, in conjunction with a CNN-based decoder, for finetuning on the semantic segmentation task. CNN-based decoder designs primarily focus on the problem of utilizing multi-scale contextual representations. To integrate multi-scale contextual information, most of these works incorporate atrous convolution [5] or pooling operations into the Spatial Pyramid Pooling (SPP) module [6], [7], [8]. SegFormer [9] designs a lightweight MLP as the decoder. However, CNNs possess only a limited receptive field: used as decoders, they model only local dependencies among pixels, neglecting long-range dependencies to some extent. Although atrous convolution can expand the receptive field, its scope is still limited. The segmentation performance of SegFormer [9] relies excessively on the capacity of the encoder, which may compromise the upper bound of the model's performance. Analyzing the above situation, we argue that a major issue with current ViTs for semantic segmentation is their inadequate utilization of multi-scale contextual information, which limits performance. To overcome this limitation, we propose a novel attention mechanism, termed sliding window dilated attention. By setting varying dilation rates r, we can capture global contextual information at different granularities. Coupled with sliding window dilated attention, an SPP module evolves into sliding window dilated attention spatial pyramid pooling (SwinASPP), similar to the Atrous Spatial Pyramid Pooling (ASPP) [7] and Pyramid Pooling Module (PPM) [10], which can exploit multi-scale representations for semantic segmentation.
The method of finetuning a pretrained transformer backbone as the encoder with SwinASPP as the decoder still has a limitation: it cannot exploit the semantic-level contextual information within images. Jin et al. [11] proposed ISNet, which enhances pixel representations by aggregating image-level and semantic-level contextual information in the decoder. However, ISNet is a CNN-based method that only incorporates semantic-level context in the decoder, while the encoder remains unchanged. To address this issue, we use the SeMask Attention Block, which integrates semantic information into the hierarchical vision transformer architecture and utilizes semantic context to enhance the global features captured by the transformer backbone. We insert a semantic layer after each stage of the transformer backbone, employ a lightweight semantic decoder to accumulate semantic features from all stages, and use our SwinASPP decoder for the main pixel-level prediction.
Although we employ semantic-level context and image-level multi-scale context, both only consider the local dependency relationships within a single image, neglecting the "global" context of the entire dataset, i.e., the semantic relationships between pixels across images. The remarkable success of contrastive learning in unsupervised representation learning [12], [13] has demonstrated the effectiveness of exploiting global context within training data to enhance performance. Motivated by this, we propose a pixel-wise contrastive algorithm for supervised semantic segmentation. Specifically, in addition to the pixel-wise cross-entropy loss for pixel classification, we employ a pixel-wise contrastive loss that computes pixel-to-pixel contrast, enforcing pixel embeddings to be close to positive samples while pushing them away from negative samples. As pixel-level class labels are available during training, positive samples are pixels belonging to the same class, while negative samples come from different classes. In this way, global attributes of the embedding space can be captured to better reflect the inherent structure of the training data and achieve more accurate segmentation predictions.
The contributions of this paper are threefold:
• We propose Sliding Window Dilated Attention. By setting different dilation rates and combining it with the SPP module, we design a transformer-based decoder that explores multi-scale context information for semantic segmentation.
• We propose the SeMask Attention Block, which incorporates semantic prior information into a pretrained vision transformer backbone, providing semantic context to the encoder.
• We propose a supervised, pixel-wise contrastive learning approach for semantic segmentation. This method exploits the global context of the training data by extending the training strategy beyond individual images to multiple images. By computing pixel-to-pixel contrast, our method leverages semantic relationships among pixels and between pixels and semantic regions.

II. RELATED WORK
A. SEMANTIC SEGMENTATION
Semantic segmentation can be regarded as the extension of image classification from the image level to the pixel level. FCN [1] is a representative work in semantic segmentation: a fully convolutional network capable of performing pixel-level classification end-to-end. Since then, in order to achieve precise segmentation, researchers have continuously improved semantic segmentation CNNs from various aspects, such as enlarging the receptive field [5], [7] to capture semantic context.

B. MULTI-SCALE CONTEXT INFORMATION
In order to enhance pixel representations in semantic segmentation networks, designing a reasonable context-aggregation scheme is a common approach.
In this work, we design a transformer-based decoder with global attention to explore multi-scale contextual information for semantic segmentation.

C. SEMANTIC CONTEXT INFORMATION
Zhang et al. [21] proposed a context encoding module to capture and utilize semantic contextual information in images, selectively emphasizing the feature maps related to each category. OCRNet [24], ACFNet [42], SCARF [43], and EMANet [44] model contextual relationships within specific semantic class regions based on coarse segmentation. References [11] and [45] proposed specially designed modules that aggregate image-level and semantic-level contextual information in the decoder to enhance pixel representations. More recently, IDRNet [46] employs an intervention-driven approach to transform pixel-level representations into semantic-level representations, and then executes a deletion-diagnostics [47] procedure to model the relationships between semantic-level representations. These works are CNN-based and capture semantic context only in the decoder. In this work, we argue that this approach risks losing semantic information during the encoding stage. Therefore, we propose capturing semantic context during the encoding stage of the segmentation vision transformer.

D. GLOBAL CONTEXT INFORMATION OF TRAINING DATA
Recently, unsupervised contrastive learning [12], [13], [48], [49] has been the most widely used method for learning representations without labels. It only requires learning to distinguish data in an abstract semantic feature space, making the model both simpler and more generalizable. Subsequently, [50], [51], [52], [53] demonstrated that label information can help contrastive learning in image-level pre-training. Although some works [54], [55], [56] have addressed contrastive learning in dense prediction tasks, they typically treat contrastive learning as a pre-training step for dense image embeddings and compute contrast among pixels of augmented versions of the same image, exploiting only local context within a single image. References [57] and [58] propose mining contextual information beyond individual images to further augment pixel representations. In a recent study, [59] aggregates dataset-level contextual information beyond the input images using a memory module.
We propose a pixel-to-pixel contrastive learning method for supervised semantic segmentation, which explores the global pixel relationships in the training data.

III. METHOD
In this section, we introduce the semantically masked and pixel-wise contrastive transformer in detail. First, we give an overview of our transformer encoder. Then, we elaborate on our SwinASPP decoder. Finally, we introduce the loss functions used in our model; in particular, the contrastive loss is described in detail. During training, the outputs of the transformer layer are the inputs of the semantic layer. The semantic layer returns intermediate semantic prior features and semantically masked maps (Fig. 3(b)). The semantically masked maps from each stage are aggregated by the SwinASPP decoder for the final dense pixel prediction, while the semantic prior features from each stage are aggregated by a lightweight upsample-and-sum semantic decoder to predict the semantic prior for our model.

A. SEMANTICALLY MASKED ENCODER
The encoder has four stages, and each stage consists of two layers: the transformer layer, which stacks N_f Unified Transformer blocks (Fig. 3(a)), and the semantic layer, which stacks N_c Semantic Attention blocks (Fig. 3(b)). A transformer layer followed by a semantic layer forms our SeMask layer.

1) UNIFIED TRANSFORMER LAYER
This layer unifies convolution and self-attention in a concise transformer encoder. In the shallow layers, it uses convolutional blocks to reduce computational redundancy; in the deep layers, it uses self-attention to capture long-range global dependencies. By stacking local and global unified transformer blocks hierarchically, it can flexibly integrate their cooperative capabilities to promote representation learning while balancing computational complexity and accuracy.

2) SEMANTIC ATTENTION BLOCK
The feature values Z_V have shape N × C, where C is the embedding dimension. M_Q generates a semantic graph, and the semantic attention matrix is computed from both M_K and M_Q. This matrix is passed through a softmax and used to update Z_V, as shown in Fig. 3(b). The semantic attention operation can be defined as follows:

Z_out = Z_V + λ · Linear(softmax(M_Q M_K^T) Z_V)

That is, we apply matrix multiplication between the feature values and the semantic attention weights. The resulting product is passed through a linear layer and multiplied by a learnable scalar constant λ for smooth fine-tuning. Following a residual connection, we ultimately obtain the adapted features. These features contain abundant semantic information, which we refer to as semantically masked features. Subsequently, the semantic query M_Q is used to refine the semantic prior map.
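As a concrete illustration, the following is a minimal PyTorch sketch of the semantic attention operation described above. The module name, the way M_Q and M_K are produced (linear projections of the incoming features), and the zero initialization of λ are our assumptions for illustration, not the exact implementation.

```python
import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    """Minimal sketch of the semantic attention operation.

    z: patch features of shape (B, N, C) from the transformer layer.
    M_Q, M_K: semantic queries/keys of shape (B, N, n_cls); producing them
    with linear projections of z is an illustrative assumption.
    """
    def __init__(self, embed_dim, n_cls):
        super().__init__()
        self.to_q = nn.Linear(embed_dim, n_cls)   # produces semantic query M_Q
        self.to_k = nn.Linear(embed_dim, n_cls)   # produces semantic key M_K
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.lam = nn.Parameter(torch.zeros(1))   # learnable scalar lambda

    def forward(self, z):
        m_q = self.to_q(z)                        # (B, N, n_cls) semantic map
        m_k = self.to_k(z)                        # (B, N, n_cls)
        attn = torch.softmax(m_q @ m_k.transpose(-2, -1), dim=-1)  # (B, N, N)
        z_sem = self.proj(attn @ z)               # update Z_V with the weights
        out = z + self.lam * z_sem                # residual connection
        return out, m_q                           # m_q refines the semantic prior
```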

B. SwinASPP DECODER
To incorporate multi-scale features, we combine the spatial pyramid pooling architecture with sliding window dilated attention, obtaining the novel SPP module called SwinASPP. The structure contains five branches: one shortcut connection, one image pooling branch, and three sliding window dilated attention branches with r = (1, 2, 4). The results of the five branches are concatenated, and an MLP layer reduces the channel dimension of the fused feature map to 512. The map is then upsampled to 1/4 of the image size and fused with the output of the encoder's first stage. Finally, a 1×1 convolution produces the segmentation logits at H/4 × W/4 × N_cls resolution.
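The branch layout can be summarized in a short sketch. The code below assembles the five branches under our own naming; the channel counts and the projection-plus-sum fusion with the stage-1 features are assumptions, and SlidingWindowDilatedAttention is sketched in the next subsection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwinASPP(nn.Module):
    """Sketch of the five-branch SwinASPP head (names/channels are assumptions).

    Branches: a shortcut connection, an image pooling branch, and sliding
    window dilated attention with r = 1, 2, 4; outputs are concatenated,
    reduced to 512 channels, upsampled to 1/4 scale, fused with the
    encoder's stage-1 features, and classified by a 1x1 convolution.
    """
    def __init__(self, in_ch, low_ch, n_cls, fuse_ch=512):
        super().__init__()
        self.shortcut = nn.Conv2d(in_ch, in_ch, 1)
        self.pool_proj = nn.Conv2d(in_ch, in_ch, 1)      # image pooling branch
        # SlidingWindowDilatedAttention is sketched in the next subsection
        self.attn = nn.ModuleList(
            SlidingWindowDilatedAttention(in_ch, dilation=r) for r in (1, 2, 4))
        self.mlp = nn.Conv2d(5 * in_ch, fuse_ch, 1)      # reduce fused channels
        self.low_proj = nn.Conv2d(low_ch, fuse_ch, 1)    # match stage-1 channels
        self.cls = nn.Conv2d(fuse_ch, n_cls, 1)          # segmentation logits

    def forward(self, x, low_level):                     # low_level: 1/4 scale
        h, w = x.shape[-2:]
        pooled = F.interpolate(self.pool_proj(F.adaptive_avg_pool2d(x, 1)),
                               size=(h, w), mode='bilinear', align_corners=False)
        feats = [self.shortcut(x), pooled] + [a(x) for a in self.attn]
        fused = self.mlp(torch.cat(feats, dim=1))
        fused = F.interpolate(fused, size=low_level.shape[-2:],
                              mode='bilinear', align_corners=False)
        return self.cls(fused + self.low_proj(low_level))  # (B, n_cls, H/4, W/4)
```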

1) SLIDING WINDOW DILATED ATTENTION
Conventional self-attention mechanisms possess a global receptive field but incur significant computational cost. To capture multi-scale contextual information from diverse receptive fields while maintaining a balanced computational complexity, we propose sliding window dilated attention with varying dilation rates.
Sliding window dilated attention with rate r can be written as

SWDA(Q, K, V; r) = softmax(Q K_r^T / √d) V_r,

where Q refers to the query matrix, K_r and V_r refer to the key and value matrices obtained after applying the sliding window operation with dilation rate r to K and V, respectively, and d is the channel dimension.
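One plausible realization of this operation, sketched below in PyTorch, gathers for each query position a k × k neighborhood of keys and values sampled with dilation r via F.unfold, so larger r widens the receptive field at the same cost. The window size of 3 and the projection layout are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlidingWindowDilatedAttention(nn.Module):
    """Sketch of sliding window dilated attention over 2D feature maps."""

    def __init__(self, dim, dilation=1, window=3):
        super().__init__()
        self.r, self.k, self.scale = dilation, window, dim ** -0.5
        self.q = nn.Conv2d(dim, dim, 1)
        self.kv = nn.Conv2d(dim, 2 * dim, 1)

    def forward(self, x):                                 # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)          # (B, HW, C)
        k, v = self.kv(x).chunk(2, dim=1)
        pad = self.r * (self.k - 1) // 2                  # keep H, W unchanged
        # unfold gathers the k*k dilated neighborhood for every position
        unfold = lambda t: F.unfold(t, self.k, dilation=self.r, padding=pad)
        k = unfold(k).view(b, c, self.k * self.k, h * w).permute(0, 3, 1, 2)
        v = unfold(v).view(b, c, self.k * self.k, h * w).permute(0, 3, 1, 2)
        # per-position local attention: (B, HW, 1, C) x (B, HW, C, k*k)
        attn = (q.unsqueeze(2) @ k) * self.scale          # (B, HW, 1, k*k)
        attn = attn.softmax(dim=-1)
        out = (attn @ v.transpose(-2, -1)).squeeze(2)     # (B, HW, C)
        return out.transpose(1, 2).view(b, c, h, w)
```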

C. LOSS FUNCTION
During training, both a cross-entropy loss and a pixel contrastive loss are utilized. The total loss L_T is the sum of two pixel-wise cross-entropy losses, L_1 and L_2, and a pixel-wise contrastive loss L_v^NCE. Loss L_1 is calculated on the primary prediction from the SwinASPP decoder, while loss L_2 is derived from the semantic prior prediction of our lightweight decoder. Because the cross-entropy loss only explores relationships between pixels within a single image, it overlooks the global context across the entire training set. Therefore, the pixel contrastive loss is introduced to enhance intra-class compactness and inter-class separability across all images: during training, pixels of the same class are continuously pulled together, while pixels of different classes are pushed apart.

1) CROSS-ENTROPY LOSS
The semantic segmentation task assigns a semantic class to each pixel in an image, treating it as pixel-level classification. Specifically, let the encoder-decoder produce a dense feature map F ∈ R^(H×W×D). A segmentation head g_SEG then maps F into a categorical logits map O = g_SEG(F) ∈ R^(H×W×|C|). We define our losses on O and M as

L_1 = −(1/HW) Σ_{i,j} 1_c · log softmax(O[i, j]),   L_2 = −(1/HW) Σ_{i,j} 1_c · log softmax(M[i, j]),

where [i, j] denotes the current predicted pixel, c denotes the ground-truth label of pixel [i, j], and 1_c converts the class label stored in the ground truth into one-hot format. F denotes the main prediction of the network, and M denotes the semantic prior prediction.
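In code, the two terms reduce to standard cross-entropy calls. A minimal sketch, assuming O and M share the same class space:

```python
import torch.nn.functional as F

def segmentation_ce_losses(O, M, gt):
    """Sketch: L1 on the main prediction O, L2 on the semantic prior M.

    O: (B, |C|, H, W) logits from the SwinASPP head.
    M: (B, |C|, H, W) logits from the lightweight semantic decoder
       (sharing the class space with O is an assumption made here).
    gt: (B, H, W) integer ground-truth labels.
    """
    l1 = F.cross_entropy(O, gt)   # main pixel-level prediction loss
    l2 = F.cross_entropy(M, gt)   # semantic prior prediction loss
    return l1, l2
```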

2) PIXEL-WISE CONTRASTIVE LOSS
a: PIXEL-TO-PIXEL CONTRAST
The cross-entropy loss treats each pixel prediction independently, without considering the relationships between pixels within the same image or across different images.
To address this problem, we employ a pixel contrastive learning approach that regularizes the embedding space and explores the global structure of the training data. Essentially, our contrastive loss treats the pixels of training images as data samples. For a pixel v with ground-truth semantic label c, the positive samples are other pixels belonging to class c, while the negative samples are pixels not belonging to class c. The supervised pixel-level contrastive loss is defined as

L_v^NCE = (1/|P_v|) Σ_{v+ ∈ P_v} −log [ exp(v · v+/τ) / (exp(v · v+/τ) + Σ_{v− ∈ N_v} exp(v · v−/τ)) ],   (6)

where v+ is a pixel embedding of a positive sample from P_v, N_v is the collection of negative pixel embeddings, and the temperature hyperparameter satisfies τ > 0. Note that the positive samples, negative samples, and the current pixel v are not limited to the same image.
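A minimal sketch of the per-anchor loss follows, assuming ℓ2-normalized embeddings and treating the positive and negative sets as given tensors; the temperature value is a placeholder.

```python
import torch

def pixel_contrastive_loss(v, positives, negatives, tau=0.1):
    """Sketch of Eq. (6): supervised InfoNCE for one anchor pixel embedding.

    v: (D,) l2-normalized anchor embedding.
    positives: (P, D) embeddings of pixels with the same class as v.
    negatives: (N, D) embeddings of pixels from other classes.
    tau: temperature > 0 (the value here is an assumption).
    """
    pos = torch.exp(v @ positives.t() / tau)          # (P,) anchor-positive sims
    neg = torch.exp(v @ negatives.t() / tau).sum()    # scalar negative mass
    # average over all positives, as in the per-anchor form of the loss
    return (-torch.log(pos / (pos + neg))).mean()
```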

b: PIXEL-TO-REGION CONTRAST
In contrastive learning, the memory bank is a key technique that helps learn good representations by exploiting a large amount of data during training. However, since segmentation involves an enormous number of pixel samples, approaches that store all training pixel samples directly, such as the traditional memory [60], significantly slow down training. Maintaining only a few recent batches in a queue, as in [61], [62], and [63], is not optimal either, because the most recent batches contain only a limited number of images, reducing the diversity of pixel samples. Hence, we create a pixel queue for each category.
For each category, we randomly select a small number of pixels (i.e., U) from each image in the latest mini-batch and add them to a queue of size T ≫ U. In practice we find this strategy effective, but the sampled pixel embeddings are sparse and exploit only a small fraction of the information in each image. Therefore, we further construct a region memory bank that stores more representative embeddings aggregated from the semantic regions of each image.
In particular, for a segmentation dataset with |C| semantic classes, our region memory has size |C| × N × D, where D is the dimension of pixel embeddings and N is the size of the region memory. The (c, n)-th element of the region memory is obtained by average-pooling the D-dimensional feature vectors of all pixel embeddings labeled as class c in the current image. The region memory allows our pixel contrastive loss to explore relationships from pixel to region: when computing Eq. 6 for the current pixel v belonging to class c, the stored region embeddings of the same class c are treated as positive samples, while region embeddings not belonging to class c serve as negatives. Hence, the overall training objective is

L_T = L_1 + α · L_2 + β · L_v^NCE,

where α and β are coefficients. We empirically set α = 0.4 and β = 0.2.
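A sketch of the two memory structures is given below; the queue sizes T, N, and U, the FIFO update rule, and the class/shape handling are illustrative assumptions.

```python
import torch

class ClassMemory:
    """Sketch of the per-class pixel queue and region memory.

    For each class c we keep a FIFO queue of T pixel embeddings, fed with U
    randomly sampled class-c pixels per image, plus a region memory of N
    entries, each the average-pooled embedding of all class-c pixels in one
    image. Sizes below are placeholders.
    """
    def __init__(self, n_cls, dim, T=8192, N=32, U=16):
        self.U = U
        self.pixel_q = [torch.zeros(T, dim) for _ in range(n_cls)]
        self.region_m = torch.zeros(n_cls, N, dim)
        self.p_ptr = [0] * n_cls
        self.r_ptr = [0] * n_cls

    def update(self, emb, gt):
        """emb: (HW, D) pixel embeddings of one image; gt: (HW,) labels."""
        for c in gt.unique().tolist():
            pix = emb[gt == c]
            # region memory: average-pool all class-c embeddings of the image
            n = self.region_m.shape[1]
            self.region_m[c, self.r_ptr[c] % n] = pix.mean(dim=0).detach()
            self.r_ptr[c] += 1
            # pixel queue: enqueue U randomly sampled class-c pixels
            sel = pix[torch.randperm(len(pix))[: self.U]]
            q, t = self.pixel_q[c], self.pixel_q[c].shape[0]
            for e in sel:
                q[self.p_ptr[c] % t] = e.detach()
                self.p_ptr[c] += 1
```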

IV. EXPERIMENTS
A. DATASETS AND METRICS
1) MOUSEHOLE DATASET
In the grasslands of the Sanjiangyuan Region, rodent pests are one of the factors accelerating grassland degradation: extensive digging, root excavation, and grass consumption by rodents lead to the widespread death of pasture. We employ the MouseHole dataset to study the relationship between rodent pests and grassland degradation. The dataset comprises 7,562 finely annotated RGB images captured in the Sanjiangyuan Region of Qinghai Province, covering four semantic classes: eroded grassland around a mousehole, non-eroded grassland around a mousehole, stone, and cow dung. All images have a resolution of 512×512. A total of 6,187 images form the training set, 688 the validation set, and 687 the test set.

2) CITYSCAPES
Cityscapes is one of the most challenging scene parsing datasets, containing 5,000 finely annotated images with 19 categories. It comprises 2,975/500/1,525 images for the training, validation, and test sets, respectively.

3) METRICS
We report mean Intersection-over-Union (mIoU) over all classes for evaluation.
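For reference, a common confusion-matrix formulation of mIoU is sketched below; this is our own sketch, not the exact evaluation code used in the experiments.

```python
import numpy as np

def mean_iou(pred, gt, n_cls):
    """Sketch: mIoU from a confusion matrix. pred/gt are integer label maps."""
    mask = (gt >= 0) & (gt < n_cls)               # drop ignored/invalid labels
    cm = np.bincount(n_cls * gt[mask] + pred[mask],
                     minlength=n_cls ** 2).reshape(n_cls, n_cls)
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    with np.errstate(divide='ignore', invalid='ignore'):
        iou = inter / union                       # nan for absent classes
    return float(np.nanmean(iou))                 # mean over present classes
```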

B. IMPLEMENTATION DETAILS
1) TRAINING
All experiments presented in this section are implemented with the MMSegmentation codebase on a server with 8 NVIDIA GeForce GTX 1080Ti GPUs. We employ the UniFormer [38] backbone and SwinASPP to comprehensively validate the proposed algorithm, and we follow the training hyper-parameter conventions of [9]. To ensure fairness, the backbone is initialized with weights pretrained on ImageNet [4], while the remaining layers are randomly initialized. For data augmentation, we use scaling with a ratio randomly sampled from (0.5, 0.75, 1.0, 1.25, 1.5, 1.75), color jitter, and horizontal flipping. We randomly crop large images and pad small images to a uniform size of 512 × 512 for the MouseHole dataset and 768 × 768 for Cityscapes. We train with the AdamW [64] optimizer with a base learning rate γ_0, scheduled by the polynomial annealing policy γ = γ_0 · (1 − N_iter/N_total)^0.9, with a linear warmup for the first 1,500 iterations. We set γ_0 to 0.00006 and weight decay to 10^-2, and train for 160K iterations with a batch size of 16 for the MouseHole dataset and 8 for Cityscapes.
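The resulting schedule can be written compactly; the linear-warmup start factor below is an assumption.

```python
def poly_lr(iter_idx, base_lr=6e-5, total=160_000, warmup=1_500, power=0.9):
    """Sketch of the schedule used here: linear warmup for 1,500 iterations,
    then polynomial decay gamma = gamma_0 * (1 - iter/total)^0.9."""
    if iter_idx < warmup:
        return base_lr * (iter_idx + 1) / warmup   # warmup ramp (assumed form)
    return base_lr * (1 - iter_idx / total) ** power
```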

2) INFERENCE
To handle varying image sizes during inference, we keep the aspect ratio constant and resize images to a fixed shorter-edge resolution; the predictions are then rescaled to the original dimensions before computing the evaluation metrics.
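A sketch of this inference protocol, with the shorter-edge target as a placeholder value:

```python
import torch.nn.functional as F

def infer_variable_size(model, img, short_edge=768):
    """Sketch of the inference protocol: resize the shorter edge while
    keeping the aspect ratio, predict, then rescale the logits back to the
    original resolution. The short_edge value is an assumption."""
    _, _, h, w = img.shape
    scale = short_edge / min(h, w)
    resized = F.interpolate(img, scale_factor=scale, mode='bilinear',
                            align_corners=False)
    logits = model(resized)
    return F.interpolate(logits, size=(h, w), mode='bilinear',
                         align_corners=False)
```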

C. ABLATION STUDIES
1) SwinASPP DECODER
To show that SwinASPP improves efficiency, we compare it with DeepLabv3+ [31] and SegFormer [9]. To ensure a fair comparison, all models use UniFormer [38] as the encoder. Table 1 compares parameters, FLOPs, and mIoU. In Table 2, we investigate the influence of various dilation rates on performance.

2) SeMask BLOCK
We conducted ablation studies on different variants of the SeMask Block, investigating the impact of semantic attention and the number of SeMask blocks (N_c), with results reported using single-scale inference on the MouseHole val set. In Table 3, replacing the Semantic Attention Block with a simple self-attention block on the UniFormer-S variant shows that plain attention does not improve results, demonstrating the effectiveness of our SeMask Block. In Table 4, we study the impact of the number of SeMask attention blocks by varying N_c within each semantic layer of the UniFormer-S variant; we observe that N_c = [1, 1, 1, 1] is the optimal setting.

3) PIXEL-WISE CONTRASTIVE LOSS
We verify the design of our contrastive loss. In Table 5, our baseline employs UniFormer as the encoder and SwinASPP as the decoder. We incorporate pixel contrast and region contrast separately and observe consistent gains (pixel contrast from 77.74% to 77.86%, region contrast from 77.74% to 77.97%). Finally, combining both forms of contrast yields further improved segmentation performance, highlighting the necessity of jointly considering pixel-to-pixel contrast and pixel-to-region contrast.

TABLE 5. Ablation on pixel contrast and region contrast. Both contribute to performance improvements, but their combination yields even better results.

D. COMPARISON WITH STATE-OF-THE-ART METHODS
1) MOUSEHOLE DATASET
Utilizing SeMask UniFormer as the encoder and SwinASPP as the primary predictor, trained with the cross-entropy and pixel-level contrastive losses, we achieve a leading performance of 78.3% mIoU. The comparison of our results with those of other models is presented in Table 6.

2) CITYSCAPES
We conducted experiments on the Cityscapes dataset and report the results in Table 7. Our model achieves competitive performance with an mIoU of 81.43%, showing that our approach obtains better feature representations for semantic segmentation.

3) QUALITATIVE RESULTS
In Fig. 4, we compare the qualitative results of SegFormer and MSGFormer on the MouseHole dataset. The results demonstrate that our MSGFormer generates segments that are both more accurate (second and fourth rows) and more complete (third row) in complex grassland scenes. Fig. 5 shows that our approach obtains significant improvements in challenging areas such as small objects and object boundaries. This is because our transformer encoder captures semantic context, while pixel contrast enables discriminative representations that retain more detailed semantic information. The improved regions are marked with solid boxes.

V. CONCLUSION
In this work, we observe several limitations of current semantic segmentation ViT models, including the lack of an efficient decoder to utilize multi-scale context and the disregard for the rich semantic relations among pixels across different images. Additionally, directly finetuning the segmentation encoder fails to consider the image's semantic context comprehensively. We therefore propose Sliding Window Dilated Attention and integrate it into the SPP to capture multi-scale contextual information at different granularities efficiently. By means of pixel-wise contrastive learning, we obtain cross-image category-discriminative representations under a supervised setting, learning global context from the training data. We also propose the Semantic Attention Block, which utilizes semantic attention to capture semantic context and enhance the semantic representation of feature maps. Finally, experiments on the MouseHole dataset of the SanJiangYuan project and on the public Cityscapes dataset demonstrate that our approach improves semantic segmentation performance. We believe the transformer architecture proposed in this work holds important reference value for future research in this field.

FIGURE 1. Comparison between the current transformer-based segmentation model (left) and ours (right). MSGFormer incorporates an additional semantic layer within the encoder backbone to utilize semantic priors and employs a pixel-wise contrastive learning algorithm for semantic segmentation. These simple modifications yield performance improvements.

FIGURE 2. The overall structure of the network. After the unified transformer layer, we introduce a semantic layer with SeMask blocks (Fig. 3(b)) to capture semantic context. Semantic maps at each stage are aggregated using a simple upsample + sum operation and supervised for semantic context using a weighted CE loss. In the ASPP decoder, sliding-window dilated attention with dilation rates r = 1, 2, 4 captures features at different granularities. The final output feature maps undergo two distinct processes: i) a 1 × 1 convolution reduces the channel dimension to N_cls, supervised by a CE loss for the network's main prediction, and ii) a projection head maps the high-dimensional pixel embeddings into 256-dimensional ℓ2-normalized feature vectors used for computing the contrastive loss.

FIGURE 3. SeMask block. As shown in (a), N_f Uniformer blocks are stacked in each transformer layer, and N_c semantic attention blocks, shown in (b), are stacked in each semantic layer at every stage (Fig. 2). The output Z from the last Uniformer block is fed into the first semantic attention block of the semantic layer.

FIGURE 5. Qualitative results on Cityscapes-Val. The improved areas are marked with yellow solid boxes.

TABLE 6. Performance comparison on the MouseHole dataset.

TABLE 7. Performance comparison on Cityscapes-Val. All experimental results are obtained under an input size of 768 × 768. We use the results from the officially trained MMSegmentation models.

TABLE 1. Comparison of SegFormer and DeepLabV3+ with ours. ''Ours'' refers to using only UniFormer-S as the encoder with SwinASPP as the decoder.

TABLE 3. Ablation on semantic attention. A simple self-attention block results in performance degradation.