Self Reinforcing Multi-Class Transformer for Kidney Glomerular Basement Membrane Segmentation

The precise segmentation of the glomerular basement membrane (GBM) can aid pathologists in making accurate pathological diagnoses. However, conventional methods solely focus on segmenting the GBM from the background, disregarding the interconnections between the GBM and similar surrounding tissues, which leads to imprecise boundary segmentation of the GBM. To address this issue, we employ a multi-category segmentation method to model the distinctions and interconnections between the GBM and its similar surrounding tissues. Our experimental results demonstrate that this method segments GBM regions with blurred boundaries more accurately. Historically, scholars have primarily used convolution to build models; this approach has the limitation that only local information is modeled, without effectively extracting global information. To address this, we propose a more reasonable structure combining convolution and attention, which we call the Self-Reinforcing Attention Mechanism. Experimental results indicate that adding this attention mechanism helps GBM segmentation by yielding more continuous boundaries. Finally, we incorporate the feature maps of each layer of the model into the loss function, allowing the model to focus on semantic information at varying scales while also providing control over the model's focus by adjusting the weights. Our experimental results demonstrate that the proposed method has higher performance and better generalization ability than state-of-the-art approaches.


I. INTRODUCTION
Chronic glomerular disease (CGD) is a long-term condition in which the glomeruli do not function properly. CGD can worsen over time and eventually lead to kidney failure. Even mild CGD can increase the risk of developing serious conditions such as cardiovascular disease, heart attack, and stroke. Additionally, CGD can exacerbate diabetes. However, with early detection and treatment, many people with the condition can live long lives.

Transmission electron microscopy (TEM) is an effective screening technique for CGD, as it can observe pathological changes of various glomerular cell structures that cannot be resolved under a light microscope. TEM provides basic or important diagnostic information for 44.3% of patients and is therefore necessary for the examination of nephropathy.

The associate editor coordinating the review of this manuscript and approving it for publication was Essam A. Rashed.
With the advent of deep learning, some scholars have proposed multi-stage network architectures. Hao et al. [26] proposed a CNN-based two-stage network called MN-Net. The network initially employs a detection model to locate glomeruli in whole-slide images and subsequently employs a classification module to classify glomerular diseases. Yang et al. [27] also proposed a multi-stage model. In contrast to the MN-Net [26] proposed by Hao et al., this model incorporates a lesion identification component to identify glomerular lesions relevant to the disease. Hao et al. [26] emphasized the significance of GBM segmentation in the diagnosis of glomerular diseases. It can be seen from Figure 1 that the most challenging aspect of glomerular basement membrane (GBM) segmentation is the non-uniform shape and the gray-level changes of the image. The contrast between the GBM segment and surrounding tissues such as dense matter, endothelial cells, and podocytes is low, and there is no obvious boundary line. Additionally, the shape and width of the GBM segment vary from patient to patient. TEM's gray-scale distribution is wide due to the uncertainty of sample preparation and uneven illumination during imaging, making it difficult to judge the GBM with the naked eye. It is precisely because of these difficulties that early segmentation of the GBM mainly relied on manual labor [16], [17], a labor-intensive process. Later, scholars used various means to assist in segmenting the GBM. Ong et al. [18] utilized adaptive window-based tracking to segment the GBM. Kamenetsky et al. [19] segmented the GBM through region division and dynamic contour modeling. Wu et al. [21] obtained the center line of the GBM by interpolating manually marked points.
Existing CNN-based approaches have shown continuous performance improvement through the introduction of various mechanisms and structures. However, most existing models are convolution-based binary classification models that focus too much on the GBM and ignore other tissue information in the feature maps. These models fail to recognize the distinctions and interconnections between the GBM and its surrounding similar tissues. Inspired by the shortcomings of traditional models, we propose SRAFormer (Self-Reinforcing Attention Transformer) with the following modules:

II. RELATED WORK
The earlier techniques for measuring the thickness of the GBM primarily relied on manual labor [16], [17], but this was a labor-intensive process. In 1993, Ong et al. [18] utilized adaptive window-based tracking to segment glomerular TEM images, and since then, several semi-automatic or fully automatic methods have been proposed. Kamenetsky et al. [19] and Rangayyan et al. [20] achieved GBM segmentation and measurement through region division and dynamic contour modeling. Wu et al. [21] proposed a method that obtains the center line of the GBM by interpolating manually marked points and then auto-segments the GBM through distance mapping and low-pass filtering. Dikman et al. proposed another method that uses thresholding and morphological operations with no manual marking. Liu et al. [23] not only segmented the GBM but also measured its length and counted the number of slits. Some studies also apply deep learning to detecting kidney disease, and precise segmentation of the GBM will promote these studies as well.
Hao et al. [26] proposed a CNN-based network called MN-Net. This network is mainly divided into two parts: a glomerulus detection network and a classification network. The detection network is utilized to locate glomeruli on whole-slide images, and the classification network then outputs the disease classification results. The authors also mentioned that GBM segmentation could further improve the reliability of this model. Yang et al. [27] also proposed a multi-stage model. A detection model was built for glomerulus detection, and the detected glomeruli went through a classification model to determine the glomerular disease. A lesion identification model was then applied to find glomerular lesions relevant to the disease. All of the above works have shown that the segmentation of the GBM is crucial for the detection of kidney disease. Although these methods contributed to identifying the GBM, several challenges remain. Most of these methods require manual initialization and do not achieve complete automation. Furthermore, subjective errors may be unconsciously introduced during the initialization process. Some methods can only segment truncated GBM segments, or consider only the GBM without comprehensively considering the labels that affect the judgment. Therefore, a high-quality segmentation of the entire GBM image still poses a significant challenge.
In recent years, learning-based methods have emerged as a promising approach for medical segmentation. For instance, PraNet [11] proposed a reverse attention network to segment polyps, which enhances the model's attention to boundaries. While convolutional networks dominated the field for a while, the attention mechanism began to shine in computer vision (CV) with the introduction of ViT [4]. Pre-training on large-scale datasets is a vital prerequisite for the Transformer to achieve great success. ViT [4] compared the performance of the Transformer and ResNet after pre-training on datasets of different scales, demonstrating that only on large-scale or even ultra-large-scale datasets can the Transformer surpass or equal ResNet. Due to the strong specialization of the medical field, data annotation has always been expensive, resulting in small medical dataset scales that differ from traditional computer vision fields. Traditional convolutional neural networks (CNNs) have inductive bias and translation invariance that enable them to achieve good performance on small datasets. In contrast, Transformer models' global attention mechanism allows them to outperform CNN models when sufficient pre-training data is available. As a result, researchers have attempted to combine the advantages of both models in the hope of creating a model that can still leverage the benefits of attention mechanisms on small datasets. FCBFormer [2] consists of two branches: the FCB (fully convolutional branch) and the TB (transformer branch). The FCB branch is biased towards extracting the feature map's details, while the TB branch extracts global features. After fusion by the PH (prediction head), the output of the two branches forms the prediction. This attempt by FCBFormer achieved state-of-the-art results, demonstrating that combining convolution and the Transformer is feasible. However, this method requires a considerable amount of computation, and the interaction between convolution and the Transformer is too weak. Simultaneously, from another aspect, the previously mentioned techniques solely concentrate on the GBM area while disregarding other information in the image.
Different from methods that segment only the GBM and background, we additionally annotate several tissues, which assists the model in segmenting the GBM. Our experimental results show that multi-category segmentation improves the IoU by 2 percentage points compared with segmenting the GBM alone. Traditional GBM segmentation methods mostly use convolutional neural network-based models. In this paper, we add an attention mechanism to enable the model to learn the distinctions and interconnections between the GBM and other tissues globally. Due to the high cost of obtaining medical data, medical datasets are often small, and we want the model to still achieve good results on small datasets. Given the good performance of convolutional networks on small datasets, we propose a structure that combines convolution and attention mechanisms. It not only achieves good performance on small datasets but also has excellent global feature extraction ability. At the same time, in order to allow the model to pay attention to semantic information at different scales, we assign certain weights to the feature maps of each stage, and these feature maps participate in the construction of the loss function.

III. METHOD
A. OVERALL ARCHITECTURE
The SRAFormer (Self-Reinforcing Attention Transformer) architecture is depicted in Figure 2. Inspired by the success of U-Net in biomedical image segmentation, SRAFormer also employs a hierarchical structure similar to that of U-Net. It consists of an encoding path (left side) and a decoding path (right side). In the encoding path, considering that the amount of computation of the Transformer increases quadratically with the number of image pixels, we also use a patch embedding operation to divide the picture into many patches. Unlike the Swin Transformer [3] and ViT [4], which directly use convolution operations to complete this step, we use the depthwise separable convolution of MobileNets [25] to complete this operation. Specifically, we first use a patch-size × patch-size convolution kernel with stride [patch-size, patch-size] to decrease the spatial resolution of the picture to H/patch-size × W/patch-size. In the decoder path, due to the inevitable loss of semantic information along the encoder path, we utilize a U-Net-like structure with skip connections to restore the details of the image, fusing features at multiple scales simultaneously. At the same time, each feature map of the encoding path participates in the composition of the loss function with a certain weight.
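As a minimal sketch of the depthwise separable patch embedding described above (class and parameter names here are our own illustration, not the authors' code), a patch-size × patch-size depthwise convolution with matching stride shrinks H and W by patch-size, and a 1 × 1 pointwise convolution lifts the channel count from 3 to C:

```python
import torch
import torch.nn as nn

class DSPatchEmbed(nn.Module):
    """Illustrative depthwise-separable patch embedding (names assumed)."""
    def __init__(self, patch_size=4, in_ch=3, embed_dim=96):
        super().__init__()
        # Depthwise conv: one filter per input channel, stride = patch size.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=patch_size,
                                   stride=patch_size, groups=in_ch)
        # Pointwise 1x1 conv: lift channels from 3 to embed_dim.
        self.pointwise = nn.Conv2d(in_ch, embed_dim, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 3, 512, 512)
print(DSPatchEmbed()(x).shape)  # torch.Size([1, 96, 128, 128])
```

With patch-size 4, a 512 × 512 input becomes a 128 × 128 grid of 96-channel patch tokens, at a fraction of the parameter count of a dense 4 × 4 convolution.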
The basic components of the model and the core design concepts of this article will be explained in the following sections:

1) MULTI-CLASS SEGMENTATION
When medical experts make a pathological diagnosis, they often judge by comprehensively analyzing different tissues. However, in previous segmentation work on the GBM, most people have used binary segmentation, which directly segments the GBM from the background. This method fails to consider the relationships and differences between the GBM and surrounding tissues. As shown in Figure 3, when the GBM adheres to some tissues, binary segmentation does not provide a clear boundary. In order to comprehensively segment the GBM, we additionally marked several labels that may affect the accuracy of GBM segmentation: Compacts, Podocytes, Endothelial cells and Mesangial area. By labeling the surrounding tissues and employing multi-class segmentation without any modifications to the model, we observed an improvement in the model's boundary delineation.

2) SELF-REINFORCING ATTENTION MECHANISM
In order to comprehensively segment the GBM, we additionally marked several labels. For the model to judge the GBM from a global perspective like a specialist, we use an attention mechanism that allows the model to judge the relationships and differences between labels. However, the Transformer's success depends heavily on large-scale or even ultra-large-scale datasets, and due to the strong specialization of the medical field, data annotation has always been expensive, resulting in medical dataset scales that are small compared with traditional computer vision fields. Therefore, we also need the local feature extraction of traditional convolution to ensure that the model still performs well on small datasets. FCBFormer [2] employs a dual-branch approach consisting of both convolutional and Transformer components for medical semantic segmentation. However, this approach lacks interaction between the two branches, thus limiting the full utilization of the advantages of both convolution and the Transformer. To better integrate attention with the CNN, we propose a novel attention mechanism as follows:

x̂ = MSA(x) + x,
y = CRB(x̂) + x̂,

where x denotes the input features of self-attention, CRB denotes the convolution residual block, and MSA denotes standard multi-head self-attention.
We call this structure the Self-Reinforcing Attention Mechanism. First, the attention operation is performed on the feature map and the result is added back to the original feature map through a residual connection. In this way, the feature map contains both global information and texture information. Then we perform a convolution operation on the feature map; this step also has a residual connection. The reason for designing such a structure is that the attention mechanism is a long-distance modeling method. Under the constraints of the loss function, it can find the parts that are really important for GBM segmentation and weight them. Then, through the residual connection, the original feature map can contain more semantic information. A feature map with rich semantic information helps the model to better segment the GBM.
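The two residual steps described above can be sketched in PyTorch as follows. This is our own minimal re-implementation of the idea (the stand-in convolution residual block and the dimensions are assumptions, not the authors' exact design): attention plus a residual add, then a convolutional residual block with its own residual add.

```python
import torch
import torch.nn as nn

class SelfReinforcingAttention(nn.Module):
    """Sketch of the Self-Reinforcing Attention idea (module names assumed)."""
    def __init__(self, dim=96, heads=3):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Stand-in convolution residual block (CRB): conv -> norm -> act -> conv.
        self.crb = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.GroupNorm(1, dim),
            nn.SiLU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, x):                               # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)           # (B, HW, C)
        attn, _ = self.msa(tokens, tokens, tokens)      # global modeling
        x_hat = x + attn.transpose(1, 2).reshape(b, c, h, w)  # x̂ = MSA(x) + x
        return x_hat + self.crb(x_hat)                  # y = CRB(x̂) + x̂

x = torch.randn(1, 96, 16, 16)
print(SelfReinforcingAttention()(x).shape)  # torch.Size([1, 96, 16, 16])
```

The output keeps the input shape, so the block can be dropped into a hierarchical backbone stage without further plumbing.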

3) TRANSFORMER BLOCK
In [4], the vision transformer (ViT) was proposed to perform image recognition tasks by partitioning images into multiple patches with positional embeddings as inputs and pre-training on large datasets. ViT achieved state-of-the-art results in image recognition. However, due to the significant computational requirements of ViT, it cannot be directly applied to medical images, as their resolutions are often much larger. In [3], the Swin Transformer was introduced, which uses window attention and shifted-window methods to reduce computational complexity. We therefore adopt the Swin Transformer attention block in our work, in which an image is partitioned into non-overlapping windows, each containing M × M patches. The computational complexities of a global MSA module and a window-based approach on an image of h × w patches are:

Ω(MSA) = 4hwC² + 2(hw)²C,
Ω(W-MSA) = 4hwC² + 2M²hwC,

where C is the channel dimension; the former is quadratic in the patch number h × w, and the latter is linear when M is fixed. Based on these formulas, we can see that when the image is very large, the difference in computation will be very large. In Figure 2, there are two types of Swin Transformer blocks, called the window-based multi-head self-attention (W-MSA) module and the shifted window-based multi-head self-attention (SW-MSA) module, respectively. The former performs the attention calculation in a local window, with no correlation among windows. The latter applies the shifted-window mechanism to build global attention. Each Swin Transformer block contains a LayerNorm (LN) layer, a multi-head self-attention module, residual connections, and an MLP module. W-MSA and SW-MSA appear in pairs; two successive blocks with W-MSA and SW-MSA constitute the basic module of the Swin Transformer.
Based on such a window partition mechanism, consecutive Swin Transformer blocks can be formulated as:

ẑ^l = W-MSA(LN(z^(l−1))) + z^(l−1),
z^l = MLP(LN(ẑ^l)) + ẑ^l,
ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l,
z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1),

where ẑ^l and z^l represent the outputs of the (S)W-MSA module and the MLP module, respectively. When computing self-attention, we follow the previous works [5], [6]:

Attention(Q, K, V) = SoftMax(QKᵀ/√d)V,

where Q, K, V ∈ R^(M²×d) are the query, key and value, d is the dimension of the query or key, and M² is the number of patches in a window.

105896 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
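For concreteness, the scaled dot-product attention for a single window of M² patches can be transcribed directly (a bare sketch of the formula only; real Swin blocks add multi-head projections and a relative position bias, which are omitted here):

```python
import torch

def window_attention(Q, K, V):
    """SoftMax(Q K^T / sqrt(d)) V for one window of M^2 patches."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d**0.5   # (M^2, M^2) affinities
    return torch.softmax(scores, dim=-1) @ V    # weighted sum of values

M, d = 7, 32
Q, K, V = (torch.randn(M * M, d) for _ in range(3))
print(window_attention(Q, K, V).shape)  # torch.Size([49, 32])
```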

4) MULTISCALE SEMANTIC LOSS
The different stages of the model contain various semantic information. Owing to the small receptive fields in lower layers, their feature maps contain rich detail and texture information as well as clear boundary information, which play a crucial role in feature-map reconstruction and model segmentation. Correspondingly, high-level feature maps contain global semantic information. In this regard, we incorporate each layer's feature maps into the loss function with a certain weight, enabling the model to focus on the impact of semantic information at different scales on model performance. Following some parameter tests, we designated the weights of the four stages as 0.22, 0.37, 0.6 and 1, respectively. Our loss function for multiscale semantic information can be formulated as:

Loss = Σ_stage W_stage · CE_Loss_stage,   (11)

CE_Loss = −Σ_(i,j) Σ_(c=1..C) W_c · Gt_(ij,c) · log(Pred_(ij,c)),

where Pred_(ij,c) denotes the predicted probability of class c at location (i, j), Gt_(ij,c) denotes the ground truth at location (i, j), C denotes the number of classes, W_c denotes the class-balance weight used to address class imbalance, and W_stage denotes the weight of each stage's loss.
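A minimal PyTorch sketch of this weighted multi-stage loss follows (the function name, the bilinear upsampling of stage predictions to ground-truth resolution, and the toy shapes are our own assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def multiscale_semantic_loss(stage_logits, gt,
                             stage_weights=(0.22, 0.37, 0.6, 1.0),
                             class_weights=None):
    """Weighted sum of per-stage cross-entropy losses (illustrative sketch)."""
    total = 0.0
    for w, logits in zip(stage_weights, stage_logits):
        # Bring each stage's prediction up to ground-truth resolution.
        logits = F.interpolate(logits, size=gt.shape[-2:], mode='bilinear',
                               align_corners=False)
        # W_c class-balance weights map to the `weight` argument here.
        total = total + w * F.cross_entropy(logits, gt, weight=class_weights)
    return total

gt = torch.randint(0, 6, (2, 64, 64))  # 6 classes: background + 5 tissues
stages = [torch.randn(2, 6, 64 // 2**i, 64 // 2**i) for i in range(4)]
loss = multiscale_semantic_loss(stages, gt)
print(loss.item() > 0)  # True
```

Raising one stage's weight steers gradient flow toward that scale, which is the control knob the text describes.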

IV. EXPERIMENTS
We conduct experiments on our own datasets and Kvasir-SEG [24]. The datasets used in this paper come from 699 electron microscope images of kidney biopsies with a resolution of 2048 × 2048, provided by Southern Medical University. Considering the small amount of experimental data, in order to enhance the reliability of the experiments we use k-fold cross-validation, where k is set to 5. The data is divided into k folds; the model is trained on k − 1 folds, and the fold left out is used for testing. This process is repeated k times, each time selecting a different fold as the test set, yielding k models whose performance is averaged to obtain the final result. The collected data includes patients with minimal change disease (MCD), IgA nephropathy, membranous nephropathy (MN), thin glomerular basement membrane nephropathy, diabetic nephropathy, mild mesangial proliferative glomerulonephritis, and lupus nephropathy. The information annotated by pathologists on these original data includes compact matter, intact podocytes, podocytes, endothelial cells, mesangial area, and GBM. This paper focuses on the identification of the GBM. Compared to the simple segmentation of the GBM, we also retain several auxiliary segmentation labels here: compact matter, podocytes, and mesangial area. The Kvasir-SEG dataset is used for semantic segmentation of gastrointestinal endoscopy images. Specifically, it consists of annotated images of the gastrointestinal tract, including the esophagus, stomach, duodenum, and colon.
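The 5-fold protocol described above can be sketched in a few lines of plain Python (a generic k-fold split, not the authors' exact partitioning code):

```python
# Split sample indices into k folds; train on k-1 folds, test on the rest.
def k_fold_splits(n_samples, k=5):
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 699 images, 5 folds: five train/test partitions, each sample tested once.
splits = list(k_fold_splits(699, k=5))
print(len(splits))  # 5
```

Each of the k runs trains a model on roughly 559 images and evaluates on the remaining ~140; the five test scores are then averaged.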

A. TRAINING SETTINGS AND METRICS
During the training phase, we use k-fold cross-validation. Following [9] and [10], we employ two primary metrics, namely the mean Dice and mean IoU, for the purpose of evaluation. Additionally, we introduce several auxiliary metrics, namely Precision and Recall, in order to comprehensively assess the model's performance. These metrics can be represented as follows:

mIoU = intersect_area / union_area,
mDice = 2 · intersect_area / (pred_label_area + label_area),
Precision = intersect_area / pred_label_area,
Recall = intersect_area / label_area,

where intersect_area represents the intersection area of the prediction and ground truth over all classes, union_area represents the union area of the prediction and ground truth over all classes, pred_label_area represents the prediction area over all classes, and label_area represents the ground-truth area over all classes.
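In terms of the per-class areas defined above, the two primary metrics can be computed as below (our own sketch; the mean is taken over per-class scores, and the two-class toy numbers are invented for illustration):

```python
# mIoU and mDice from per-class intersection/union/prediction/label areas.
def mean_iou(intersect_area, union_area):
    return sum(i / u for i, u in zip(intersect_area, union_area)) / len(union_area)

def mean_dice(intersect_area, pred_label_area, label_area):
    return sum(2 * i / (p + l) for i, p, l in
               zip(intersect_area, pred_label_area, label_area)) / len(label_area)

inter, union = [50, 30], [100, 60]   # toy per-class areas (2 classes)
pred, label = [70, 40], [80, 50]
print(mean_iou(inter, union))              # 0.5
print(round(mean_dice(inter, pred, label), 4))  # 0.6667
```

Precision and Recall follow the same pattern, dividing intersect_area by pred_label_area and label_area respectively.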

B. IMPLEMENTATION DETAIL
We utilized the PyTorch framework and conducted training of the network on four NVIDIA GeForce RTX 3090 GPUs. The OS used was Ubuntu 14.04. The optimization algorithm employed was Adam [11], with the initial learning rate set to 6 × 10^−5. The betas were set to (0.9, 0.999), while the weight decay was set to 0.01. For the learning rate schedule, we implemented the poly schedule. The network was trained end-to-end with a batch size of 2. The model was trained for a total of 2200 epochs, with evaluations conducted and the model saved every 220 epochs. Data augmentation is a widely used technique to address the scarcity of medical data, and in our study we utilized it to improve the robustness and performance of the model. Our original images were 2048 × 2048 pixels in size, which is too large for the model. To address this, we randomly resized the input image to 512-3072 pixels and then randomly cropped an h × w image from the resized image. Here, we set h and w to 512, but for Kvasir-SEG we set h and w to 352, as per the convention set by [8], [12], and [13]. The cropped image undergoes an image-flip operation with probability p = 0.5. In addition to geometric distortion methods, we also utilized photometric distortion methods, which included performing a random brightness operation on the image, with the brightness β uniformly sampled from [−32, 32] and α set to 1. We also changed the contrast of the image, with the contrast factor uniformly sampled. Notably, photometric distortion methods were only applied to the image, whereas the rest of the augmentations were applied to both the image and the corresponding segmentation map.
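The geometric part of this pipeline can be sketched as follows. This is a minimal re-implementation of the described steps under our own assumptions (bilinear resize for the image, nearest resize for the mask, square resize targets); it is not the authors' augmentation code, and the photometric jitter, which applies to the image only, is omitted:

```python
import random
import torch
import torch.nn.functional as F

def augment(img, mask, crop=512, p_flip=0.5):
    """Random resize to 512-3072 px, random crop, random horizontal flip."""
    side = random.randint(crop, 3072)
    img = F.interpolate(img[None], size=(side, side), mode='bilinear',
                        align_corners=False)[0]
    # Masks use nearest resizing so class labels are never blended.
    mask = F.interpolate(mask[None, None].float(), size=(side, side),
                         mode='nearest')[0, 0].long()
    top = random.randint(0, side - crop)
    left = random.randint(0, side - crop)
    img = img[:, top:top + crop, left:left + crop]
    mask = mask[top:top + crop, left:left + crop]
    if random.random() < p_flip:  # flip image and mask together
        img, mask = img.flip(-1), mask.flip(-1)
    return img, mask

img = torch.randn(3, 2048, 2048)
mask = torch.randint(0, 6, (2048, 2048))
a_img, a_mask = augment(img, mask)
print(a_img.shape, a_mask.shape)  # torch.Size([3, 512, 512]) torch.Size([512, 512])
```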
To ensure an accurate evaluation of model performance, we also trained and evaluated mature and state-of-the-art models that predict full-scale segmentation maps. These models included Swin-T [3], BiSeNetV2 [7], STDC2 [8], FCBFormer [2], and PraNet [11]. We trained most of the models using the reliable mmsegmentation framework from OpenMMLab, with minimal modifications made to the models. However, since mmsegmentation lacked implementations of PraNet and FCBFormer, we utilized the official codebases provided by the authors of the respective papers for these two models.

C. EVALUATION
Figure 4 showcases various predictions generated by each model on our custom datasets. Our model's segmentation map consistently outperforms existing models, particularly in cases of challenging morphology, such as strip labels, point labels, and blob labels. Our model demonstrates superior segmentation performance across all label types compared to traditional convolutional network segmentation, resulting in a significant improvement in performance. Our approach is effective for several reasons. Firstly, unlike traditional binary segmentation, we employ a multi-class segmentation method and annotate the tissue surrounding the glomerular basement membrane (GBM). This allows the model to learn the distinctions and interconnections between the GBM and similar surrounding tissues, resulting in clearer segmentation of the boundaries. Secondly, we propose a novel architecture that better integrates convolution and the transformer. This approach adds an attention map to the original feature map, enriching the feature map with detailed and global information; moreover, to compensate for the loss of details caused by convolution and pooling, the attention map is also added to the convolutional feature map. Finally, we assign a certain weight to each layer's feature map when constructing the loss function. This enables the model to focus directly on semantic information at different scales and prioritize learning at a specific scale.
In Figure 5, we show some example predictions generated by each model for Kvasir-SEG. As evident from the segmentation maps, our model significantly surpasses traditional convolutional models. Specifically, the boundary division of the conventional convolutional models lacks clarity, smoothness, and completeness. For instance, the boundaries of STDC2 and BiSeNetV2 appear extremely blurry, and the polyps are intermittent and distributed in blocks. In contrast, our proposed model demonstrates superior segmentation performance on irregular boundaries and achieves a highly competitive recognition rate for small objects, as indicated by the segmentation diagram.

1) PRIMARY EVALUATION
The primary evaluations on our datasets are summarized in Table 1, where it is evident that SRAFormer surpasses existing models in all metrics. In fact, as shown in Table 1, our model performs similarly to the pre-trained Swin-Tiny, while significantly outperforming the non-pretrained Swin-Tiny model. From Table 2, it is evident that the multi-class segmentation of the GBM results in an IoU improvement of approximately 2.5 percentage points compared to binary classification, without any modifications. This highlights that the inclusion of GBM-like tissues in the segmentation target allows the model to learn the distinctions between the GBM and surrounding tissues, along with their internal relationships, resulting in improved segmentation accuracy. Table 3 displays the results of the primary evaluations on Kvasir-SEG. It should be noted that some previously proposed models, such as FCBFormer and PraNet, yield worse results than reported in their original papers. This could be attributed to slightly different training environments and split image sizes, and we provide the originally reported results of these models in Table 4.
FCBFormer and PraNet were originally binary-category segmentation models that only required segmentation of the detection target and background, while the other models were originally used for multi-category segmentation. Here, multi-category segmentation was directly transferred to binary-category segmentation. PraNet employs boundary enhancement operations, including the reverse attention module and enhanced boundary loss calculation in the loss function, which our multi-classification model does not include. Our model is more like a general model without task-specific optimization, but its performance surpasses that of PraNet and even the pre-trained FCBFormer. This is due to the efficient combination of the Transformer and convolution, which maximally retains image information.
One of the key factors contributing to FCBFormer's impressive results is its attention branch, which directly leverages the pyramid vision transformer v2 (PVTv2) pre-trained on ImageNet [15]. We conducted a test on FCBFormer without pre-training, which resulted in a 5-point drop in IoU. Interestingly, our non-pretrained model outperformed the pre-trained FCBFormer, further validating the strong performance of our model. Table 3 also serves as a cross-dataset test. Considering the results from both Table 1 and Table 3, it is evident that our model exhibits state-of-the-art performance on two distinct datasets, demonstrating its ability to transfer well to other datasets with only minor adjustments required.

2) ABLATION STUDY
In this section, we test the impact of each component of SRAFormer on the segmentation results. Table 5 presents the experiments on our dataset, and Table 6 presents the experiments on Kvasir-SEG.

a: EFFECTIVENESS OF SELF-REINFORCING ATTENTION
We investigated the importance of Self-Reinforcing Attention. As can be seen from Tables 5 and 6, the performance of the model improved on different datasets after incorporating Self-Reinforcing Attention. This indicates that Self-Reinforcing Attention can indeed compensate for the loss of feature-map details, and that the new interaction mode between convolution and the Transformer is also effective.

b: EFFECTIVENESS OF MULTISCALE SEMANTIC LOSS
We further investigated the impact of Multiscale Semantic Loss. As shown in Table 5 and Table 6, the addition of Multiscale Semantic Loss resulted in an increase in mIoU of 0.57% and 4.3%, respectively. These experiments demonstrate that the introduction of Multiscale Semantic Loss can assist the model in directly focusing on the impact of semantic information at different scales, thereby improving the model's performance.

c: EFFECTIVENESS OF SELF-REINFORCING ATTENTION AND MULTISCALE SEMANTIC LOSS
We also conducted experiments combining Self-Reinforcing Attention and Multiscale Semantic Loss to test whether the model's performance would be further improved by their joint effect. As shown in the experimental results in Table 5 and Table 6, the model performs better on GBM segmentation when the two are combined.

V. CONCLUSION
In this article, we introduce SRAFormer, an innovative architecture for segmenting the GBM. Unlike previous two-category segmentation methods, our model segments not only the GBM but also GBM-like tissues to aid segmentation. Compared to FCBFormer, which simply combines transformers and fully convolutional networks, our model employs a more efficient combination of convolution and transformer. We also made some modifications to the loss function. The experimental results show that SRAFormer achieves exceptional accuracy (mIoU = 63.1% on the GBM dataset, mIoU = 89.09% on the Kvasir-SEG dataset) without any pre-training. Another advantage of SRAFormer is its ability to transfer to other datasets with minimal modification.
However, because GBM mostly appears in bands with variable and irregular shapes, our model is still limited by the scarcity of data.In future research, we plan to explore the impact of semi-supervised learning on GBM segmentation and design a data generative model to compensate for the lack of data.

FIGURE 1. Electron microscopic image of renal biopsy. (a) is the original electron microscope image; (b) is the marked image, where red represents Compacts, green represents Podocytes, yellow represents Endothelial cells, blue represents Mesangial area, and purple represents GBM; (c) is the marked image in which only the GBM is marked.

1) Replace binary segmentation with multi-class segmentation. The ablation experiments demonstrate that the multi-classification model outperforms the binary classification model. This is primarily because the model is capable of learning the distinctions between labels, particularly those between the GBM and other similar tissues, leading to more precise boundary segmentation. Additionally, the model can capture the interrelationships among different labels and gain more comprehensive semantic knowledge, enabling it to segment the image as a unified whole.
2) Self-Reinforcing Attention Mechanism, which combines the local modeling ability of convolution with the global feature extraction ability of attention, allowing for better feature extraction and understanding of the interconnections between different labels.
3) Multiscale Semantic Loss, which incorporates the feature map of each stage into the loss function with a certain weight. This way, the loss function can make the model emphasize semantic information at different scales, and the model's focus can be adjusted by manipulating the weights.

H/patch-size × W/patch-size. Here, patch-size is set to 4. Then, we use C 1 × 1 convolution kernels to increase the channel number of the image from 3 to C. Compared to the Transformer patch embedding operation, this reduces the computation. Then, four consecutive Base Modules process the feature maps after patch embedding. The structure of the Base Module is depicted in Figure 2. Except for the last one, each Base Module halves the spatial resolution of the feature map and doubles the number of channels.

FIGURE 2. (a) The overall architecture of SRAFormer. (b) The Base Module of the backbone. (c) and (d) are the components of (b). (e) The Patch Embedding. (f) The Residual Block. W_x denotes the weight of the Stage x loss in the total loss (a); UP denotes nearest-neighbor upsampling (a); PH denotes the prediction head (a); Seg Loss denotes the cross-entropy loss (a); Base Module denotes the base module of the backbone (b); W-MSA denotes window multi-head attention (c); SW-MSA denotes shifted-window multi-head attention (c); LN denotes LayerNorm (c); Down denotes a 3 × 3 conv with stride 2 (d); RB denotes the residual block (f); GN denotes GroupNorm (f); SILU denotes the SiLU function (f).

FIGURE 3. Comparison of the segmentation results of binary segmentation and multi-category segmentation.

FIGURE 4. Segmentation results of different methods on our datasets.

FIGURE 5. Segmentation results of different methods on Kvasir-SEG.

TABLE 6. Ablation study for SRAFormer on Kvasir-SEG. These experiments all use multi-class segmentation, and the models are not pre-trained.

TABLE 1. Comparison of the metrics of different models on our datasets.

TABLE 2. Experimental comparison between two-category segmentation and multi-category segmentation using SRAFormer. No parameters or model structure were changed.

TABLE 3. Primary evaluation of different models on Kvasir-SEG and a cross-dataset test.

TABLE 4. Results originally reported for existing models.

TABLE 5. Ablation study for SRAFormer on our datasets. These experiments all use multi-class segmentation, and the models are not pre-trained. SRA denotes Self-Reinforcing Attention; MSLoss denotes Multiscale Semantic Loss.