Semi-Supervised Skin Lesion Segmentation With Coupling CNN and Transformer Features

Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia

An automatic skin lesion segmentation algorithm not only eases the dermatologist's workload in skin cancer analysis but also provides a platform for early cancer prediction. Over the years, several deep learning methods have been proposed to address the skin lesion segmentation problem. However, training deep models usually requires a large-scale annotated dataset, which is often not feasible in the medical domain due to the annotation burden. In addition, the low-data regime greatly increases the risk of overfitting. To address these limitations in an end-to-end manner, we propose to incorporate unlabelled samples during the training process. Our network offers a semi-supervised training scheme, wherein the first stage performs supervised training to learn the semantic segmentation map while the second stage uses an unsupervised technique to enrich the encoder module. Specifically, unlike prior work on skin lesion segmentation, we design a surrogate task on top of the convolutional and Transformer representations to learn data-driven features from the image itself, which alleviates the need for a large annotated dataset. The effectiveness of the proposed method is demonstrated on three skin lesion segmentation datasets, namely ISIC 2018 (Dice score 0.905), ISIC 2017 (Dice score 0.898), and PH2 (Dice score 0.940). In particular, we observed that including the unlabelled samples can increase the Dice score by 2%.


I. INTRODUCTION
Computer-Aided Diagnosis (CAD) is an important aid for medical experts, assisting them in daily diagnosis and treatment planning by interpreting medical images [1]. Deep Learning (DL) has provided a solid foundation for computer vision tasks, and CAD systems are no exception [2], [3]. Among the many medical image analysis tasks, image segmentation is a de facto step that cannot be neglected. Medical image segmentation is embedded in various medical applications, including skin lesion segmentation. Human skin consists of three layers, i.e., the dermis, epidermis, and hypodermis. The epidermis is a susceptible tissue in which severe solar radiation can trigger the embedded melanocytes to produce melanin at a significant level. Abnormal melanocyte growth results in melanoma, a potentially fatal skin cancer.
The associate editor coordinating the review of this manuscript and approving it for publication was Huiyu Zhou.
For 2022, the American Cancer Society estimated about 99,780 new melanoma skin cancer cases and 7,650 deaths, approximately 7.7% of all cases [4]. Early disease recognition plays a crucial role in medical diagnosis, as it has been reported that detecting melanoma in its early phases could increase the relative survival rate to 90% [5]. Although dermatologists can detect malignant melanoma in dermoscopy images, doing so is tiresome and requires considerable expertise [6]. Automatic skin lesion segmentation is therefore highly desirable and could assist the dermatologist in choosing an appropriate treatment.
Automatic segmentation aims to separate the desired regions from irrelevant ones by pixel-wise classification. Hence, for skin lesions, the segmentation task is most often a binarization that separates the malignant region from its surroundings. Automated skin lesion segmentation is complicated by intraclass factors, i.e., skin color, texture, tissue size, the geometrical shape of a lesion, and illumination and contrast differences caused by the various dermoscopic imaging tools, and by interclass factors such as the presence of hair, blood vessels, ruler marks, and occlusion. Conventional automatic skin lesion segmentation techniques are typically based on classical computer vision and machine learning approaches such as adaptive thresholding [7], active contours [8], region growing [9], and unsupervised clustering [10], [11]. These methods depend heavily on reliable hand-crafted features to separate lesion boundaries from the background. DL methods therefore revolutionized this domain through end-to-end automatic feature extraction and classification.
In the last decade, Ciresan et al. [12] made the first attempt to use convolutional layers for medical image segmentation. Afterward, several architectures were proposed to enhance segmentation performance, not only in the medical domain, such as the Fully Convolutional Network (FCN) [13], FC-DenseNet [14], and U-Net [15], the last of which targets medical image segmentation. These architectures advanced image segmentation in general, including for images obtained from the medical domain. U-Net, an encoder-decoder network with skip connections, has demonstrated State-of-the-Art (SOTA) performance in medical image segmentation since 2015. Consequently, various modifications have been introduced for different medical applications and image modalities, e.g., U-Net++ [16], U-Net3+ [17], ResU-Net [18], DenseU-Net [19], 3D U-Net [20], V-Net [21], and S3D U-Net [22]. In the skin lesion segmentation task, Ramani et al. [23] used the seminal U-Net for melanoma lesion segmentation. Bi et al. [24] employed a cascaded multistage FCN ensemble model to produce a segmentation map. MS-UNet [25] is a multi-stage U-Net-based model that uses a deep supervision loss scheme to learn better intermediate features, which in turn increases segmentation performance. These methods share a common problem: they cannot capture the long-range context information needed to localize semantic features accurately and produce consistent segmentation results. This drawback stems from the limited receptive field of the convolution layers in a Convolutional Neural Network (CNN). Losing localization features through the layers is undesirable for semantic segmentation, especially in the medical domain, which requires accurate extraction of the boundary regions of organs and tissues. Thus, the model needs to be supplemented with long-range dependencies and to learn conceptualized features from the image.
The strength of U-Net lies in the symmetrical design of its encoder-decoder and in the skip connections that link the encoder path to the decoder path. Feature representations in CNN layers lose their localization due to the successive convolution and downsampling operations. In addition, the successive upsampling operations make the model lose more detailed spatial features. Although the U-Net structure tries to mitigate this loss of global and contextual information with skip connections, these shortcuts are still insufficient. This observation suggests that the segmentation representation improves considerably if the model prevents the loss of spatial information while also capturing long-range dependencies and integrating them into the decoder path. Moreover, the U-Net model offers no mechanism for including an unlabelled dataset in the training stage to enrich its feature representation capacity.
In this paper, we propose to couple CNN and Transformer encoders to capture both local and global representations. Nevertheless, training CNN/Transformer models usually requires a large labelled dataset, which is not always available in the medical domain. Besides that, although integrating large encoder modules (e.g., CNN/Transformer) increases the model's freedom (a high number of parameters) to learn the underlying data distribution, a lack of labelled data results in an unstable and overfitted model. To overcome this limitation, we propose to incorporate unlabelled samples during the training process. In particular, we offer a semi-supervised training technique, where the first step takes advantage of supervised training to learn the semantic segmentation map, whereas the second step leverages the unlabelled data during training. Specifically, we design a surrogate task to learn data-driven features from the image itself, which alleviates the need for a large annotated dataset.
Our contributions can be summarized as follows:
• Coupling CNN and Transformer modules to model local and global representations
• A semi-supervised technique to utilize unlabelled samples during the training process
• State-of-the-art results on three challenging skin lesion segmentation benchmarks

II. RELATED WORKS
A. SKIN LESION SEGMENTATION
In contrast to conventional feature engineering methods, DL does not need hand-crafted feature extraction and can be used effectively in skin lesion segmentation [26], [27]. Yuan et al. [28] proposed a deep CNN with small convolutional kernels so that the model generalizes across various image acquisition qualities. Alahmadi et al. [29] proposed a network that captures both local and global representations of medical images using a supervised learning technique. MSU-Net [25] has been proposed as a multi-stage U-Net-based network that simultaneously captures low-level features and fused context information in two successive U-Net stages with a recursive perspective. Taghanaki et al. [30] proposed a modification of the U-Net skip connection that selects the most informative channels of the feature map at each stage and transfers them to the corresponding stage in the decoder path. This transformation reduced the number of parameters, which led to a lighter network and better feature aggregation. DSM [31] utilized a multi-scale connection block within the skip connection to handle variation in tissue size and aggregated the multi-stage outputs in the decoder path with a deep supervision strategy. DPFCN [32] employed a dense pooling scheme with overlapping windows to acquire dense feature maps. Xie et al. [33] proposed the MB-DCNN model with two segmentation networks, i.e., coarse-SN and enhanced-SN, alongside a mask-CN classification network. The localization information first extracted by coarse-SN was transferred to the classification network, and the resulting class activation map was fed into enhanced-SN to obtain an accurate lesion segmentation. Pourya et al. [34] addressed the automatic skin lesion segmentation challenge from a different perspective. Their design offers a multi-scale representation with a scale-wise fusion mechanism to alleviate the effect of background overlapping with the object of interest (the skin lesion). More precisely, their approach uses dilated pyramid convolution to capture a multi-scale representation, and a proposed scale-wise fusion module models the interaction among scales to enrich the feature representation in the boundary area.
All the reviewed methods share a common bottleneck: they ignore global context information, which is a crucial factor in the medical image segmentation task. Hence, a parallel module that compensates for the loss of global contextual representation seems necessary.

B. TRANSFORMER
Li et al. [35] employed dense deconvolutional layers with cascade pooling to extract features hierarchically and capture long-range dependencies. SSP [36] developed an FCN with shape prior information to preserve the global context of the segmentation region by penalizing non-star segmentation results. SegAN [37] learned the segmentation map with an adversarial learning strategy and a multi-scale loss to enhance long-range spatial dependencies. Wang et al. [38] leveraged spatial and channel attention simultaneously to recalibrate the feature representation by updating each feature value with a weighted sum of all other features. FCA-Net [39] proposed a factorized channel attention block to determine relevant channel patterns from feature maps. Abraham et al. [40], inspired by Attention U-Net [41], integrated a spatial attention gate into the encoder-decoder skip connections to better interweave localization feature maps with coarse feature maps, alongside a focal Tversky loss. CPFNet [42] applied a pyramid module to feature maps to capture global context. Attention Deeplabv3+ [43] applied a two-stage attention mechanism to capture informative channels and relevant scales from atrous convolution layers.
Unlike the attention mechanisms mentioned above, the Transformer introduced by Vaswani et al. [44] applied a self-attention mechanism to machine translation in the Natural Language Processing (NLP) domain, where it was a pure encoder-decoder network. Its success over traditional recurrent modules and layers carried it over to the vision domain. The pioneering Vision Transformer (ViT) by Dosovitskiy et al. [45] was a simple stacked encoder built from Transformer blocks. Following the success of ViT in major vision tasks, and given the known importance of attention mechanisms in segmentation, ViT has been broadly used either as a complement to CNNs or as a standalone backbone in these tasks. TransU-Net [46] was one of the earliest applications of ViT to medical image segmentation, where the Transformer complements CNNs in the encoder path to capture long-range dependencies. However, due to the quadratic computational complexity of Transformers, they were not offered as a standalone backbone until the Swin Transformer [47], whose computational complexity is linear in image size. Swin U-Net [48] is a pure Swin Transformer network based on the U-Net design. It captures long-range dependencies for better medical segmentation results, which matters because body organs and tissues are deformable.
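The quadratic-versus-linear contrast between global self-attention and windowed self-attention can be illustrated with the complexity expressions reported in the Swin Transformer paper [47]. The sketch below is illustrative only: the function names, the token grid sizes, the channel width of 96, and the window size of 7 are assumptions chosen for the example, not settings used in this work.

```python
def global_msa_cost(h, w, c):
    """Approximate cost of global self-attention on an h x w token grid with c channels."""
    tokens = h * w
    # The second term grows with the square of the number of tokens.
    return 4 * tokens * c**2 + 2 * tokens**2 * c


def window_msa_cost(h, w, c, m):
    """Approximate cost of self-attention restricted to non-overlapping m x m windows."""
    tokens = h * w
    # For a fixed window size m, both terms grow linearly with the number of tokens.
    return 4 * tokens * c**2 + 2 * m**2 * tokens * c


if __name__ == "__main__":
    # Hypothetical grid sizes: doubling the side length quadruples the token count.
    for side in (56, 112):
        print(side, global_msa_cost(side, side, 96), window_msa_cost(side, side, 96, 7))
```

Doubling the side length of the token grid multiplies the windowed cost by exactly four, while the global cost grows faster because of its quadratic term.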

C. SEMI-SUPERVISED SEGMENTATION
Semi-supervised techniques can be categorized into traditional approaches based on hand-crafted features and newer deep learning-based approaches. The former use prior knowledge (e.g., clustering) to perform feature matching, whereas the deep learning-based methods use representation learning to learn data-driven features. Bai et al. [49] developed an iterative procedure in which pseudo labels for mask-free images are predicted by the network and refined by a Conditional Random Field (CRF), and these labels are then fed back to the network. Zhang et al. [50] proposed a Deep Adversarial Network (DAN) that uses unlabelled data in a semi-supervised way to predict unannotated images. Yu et al. [51] developed a mean teacher model with uncertainty map guidance for semi-supervised left atrium segmentation. Zhang et al. [52] utilized shape-aware prior information to leverage the unlabelled data and impose a geometric shape constraint on the segmentation output. What all these methods have in common is their reliance on prior knowledge, which might not be available for every task. In contrast, our unsupervised technique learns a mapping function that is consistent over different augmentations. More precisely, we create two augmented versions of the input image, feed each augmented image into the encoder module, and then, using an auxiliary decoder module, minimize the cross-entropy loss between the two generated feature maps. Hence, using the cross-entropy loss, our encoder learns a mapping function that is robust to slight variations (e.g., augmentation) and consequently can learn a more generic representation from unlabelled samples.

III. PROPOSED METHOD
The overall structure of our proposed network is depicted in Figure 1. To incorporate unlabelled samples during the training process, our method utilizes an auxiliary decoder module to learn consistency across augmented views. In addition, our design combines CNN and Transformer encoders for a robust local-to-global representation. Each module is presented in the following subsections.

A. CNN REPRESENTATION
Figure 1 shows that our proposed method is built with two encoder branches that complement each other. We use the seminal U-Net [15] network in the first branch to model the CNN representation $E_{CNN}$. This architecture, parameterized by $\theta_1$, applies successive convolutional layers to a given image $x \in \mathbb{R}^{H \times W \times C}$ (where $H$, $W$, and $C$ are the spatial height, width, and number of channels, respectively) to extract pixel-level contextual information. More precisely, we follow the original structure of the U-Net model [15] and deploy a four-block encoder architecture, wherein each block uses two convolutional layers followed by ReLU activation and a max pooling operation to produce the feature map. The resulting feature map contains local semantic information; however, due to the local nature of the convolution operation, it is ineffective at capturing object-level (i.e., global) representation. Therefore, to alleviate this limitation, we utilize the Transformer module as a complementary feature extractor.
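As a rough illustration of the four-block encoder described above, the following sketch traces how the feature-map shape could evolve from block to block. It assumes padded convolutions that preserve spatial size, 2 × 2 max pooling at the end of every block, and channel widths that start at 64 and double per block; these are common U-Net defaults and are assumptions here, since the text does not list them. The function name `unet_encoder_shapes` is hypothetical.

```python
def unet_encoder_shapes(height, width, base_channels=64, num_blocks=4):
    """Return the (height, width, channels) of the feature map after each encoder block."""
    shapes = []
    h, w = height, width
    for block in range(num_blocks):
        channels = base_channels * 2**block  # assumed: width doubles per block
        # Two padded convolutions + ReLU keep h and w unchanged (assumption),
        # then 2 x 2 max pooling halves the spatial resolution.
        h, w = h // 2, w // 2
        shapes.append((h, w, channels))
    return shapes


if __name__ == "__main__":
    for shape in unet_encoder_shapes(224, 224):
        print(shape)
```

Under these assumptions a 224 × 224 input ends at a 14 × 14 feature map, which shows how quickly spatial detail is traded for channel depth.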

B. LONG-RANGE CONTEXTUAL REPRESENTATION
The objective of the second branch ($E_{TF}$, parameterized by $\theta_2$) is to compensate for the deficiency of convolutions in capturing long-range dependencies by utilizing a Transformer. Similar to [45], we feed the same input image $x \in \mathbb{R}^{H \times W \times C}$ used in the first branch to the Transformer module by dividing it into $N = \lfloor HW / p^2 \rfloor$ non-overlapping patches, where $p \times p$ is the dimension of each patch. A patch encoder $E(x_p; \omega)$ is then applied to the serialized patches to project them from the $p^2 \cdot C$ space to a $K$-dimensional embedding space. A 1-D learnable positional embedding $I_{pos} \in \mathbb{R}^{N \times K}$ is added to the projected sequence to preserve each patch's spatial information:

$$t_0 = [x_p^1 I;\; x_p^2 I;\; \dots;\; x_p^N I] + I_{pos}, \tag{1}$$

where $I \in \mathbb{R}^{(p^2 \cdot C) \times K}$ denotes the projected patch embedding. We then stack multiple Transformer blocks to learn long-range dependencies. Each Transformer block is composed of a Multi-head Self-Attention (MSA) module, which consists of $M$ parallel self-attention heads that learn different patch interactions:

$$t'_i = \mathrm{MSA}(\mathrm{Norm}(t_{i-1})) + t_{i-1}, \tag{2}$$

and a Multi-Layer Perceptron (MLP) module that learns long-range contextual dependencies by:

$$t_i = \mathrm{MLP}(\mathrm{Norm}(t'_i)) + t'_i, \tag{3}$$

where $\mathrm{Norm}(\cdot)$ denotes layer normalization [53] and $t_i \in \mathbb{R}^{\frac{HW}{p^2} \times d}$ represents the encoded semantic representation. In our design, we used the public implementation of the Vision Transformer [45] with three self-attention heads to encode the global representation.
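The patch-splitting step that precedes the Transformer blocks can be sketched in a few lines. This is a minimal, dependency-free illustration that represents an image as a nested list of shape H × W × C. The helper names `num_patches` and `split_into_patches` are hypothetical, and the patch size of 16 used in the usage lines is an assumption, since the text does not state the value of p.

```python
def num_patches(height, width, p):
    """Number of non-overlapping p x p patches, N = floor(H * W / p^2)."""
    return (height * width) // (p * p)


def split_into_patches(image, p):
    """Split an H x W x C nested list into N flattened patches of length p * p * C."""
    height, width = len(image), len(image[0])
    if height % p or width % p:
        raise ValueError("height and width must be divisible by the patch size")
    patches = []
    for top in range(0, height, p):          # row-major order over the patch grid
        for left in range(0, width, p):
            flat = []
            for row in range(top, top + p):
                for col in range(left, left + p):
                    flat.extend(image[row][col])  # append the C channel values
            patches.append(flat)
    return patches


if __name__ == "__main__":
    # Hypothetical setting: a 224 x 224 input with p = 16.
    print(num_patches(224, 224, 16))
    toy = [[[r * 4 + c] for c in range(4)] for r in range(4)]
    print(split_into_patches(toy, 2))
```

Each flattened patch would then be multiplied by the projection matrix and offset by its positional embedding before entering the first Transformer block.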

C. FEATURE FUSION
As presented in the previous two subsections, our encoder module applies both CNN and Transformer encoders to extract local and global representation. To combine these two feature sets, we first reshape the Transformer representation into the same spatial dimension as the CNN feature set, then we simply concatenate these two feature sets to create the final encoder representation.
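The reshape-and-concatenate fusion can be sketched as follows. The example assumes that the Transformer tokens are stored in row-major order and that the token grid already has the same spatial size as the CNN feature map; how the two resolutions are matched is not specified in the text, so this is an assumption. The function name `fuse_features` is hypothetical.

```python
def fuse_features(cnn_map, tokens, grid_h, grid_w):
    """Concatenate CNN features (grid_h x grid_w x C) with tokens (N x K) channel-wise."""
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")
    if len(cnn_map) != grid_h or len(cnn_map[0]) != grid_w:
        raise ValueError("CNN feature map must match the token grid size")
    fused = []
    for row in range(grid_h):
        fused_row = []
        for col in range(grid_w):
            token = tokens[row * grid_w + col]  # reshape: sequence index -> (row, col)
            fused_row.append(list(cnn_map[row][col]) + list(token))
        fused.append(fused_row)
    return fused


if __name__ == "__main__":
    cnn_map = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]        # 2 x 2 grid, C = 2
    tokens = [[0.1] * 3, [0.2] * 3, [0.3] * 3, [0.4] * 3]  # N = 4 tokens, K = 3
    print(fuse_features(cnn_map, tokens, 2, 2))
```

The fused map has C + K channels at every spatial position, so the decoder sees local and global features side by side.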

D. SEGMENTATION DECODER
Our decoder module uses the same structure as the seminal U-Net model to produce the segmentation map. More precisely, in the supervised part we use a four-block CNN decoder $D_{SUP}$ with parameters $\varphi$ (similar to the CNN encoder, but with upsampling in place of the pooling operation) to progressively increase the spatial dimension while reducing the number of feature channels in order to predict the skin lesion area. We apply the Dice loss $\mathcal{L}(\theta, \varphi; \mathcal{D}^{S})$ between the predicted segmentation map and the ground truth mask over the labelled dataset $\mathcal{D}^{S}$ to learn the segmentation task in a supervised manner, where $\theta = \theta_1 \cup \theta_2$ denotes the parameters of the CNN and Transformer encoders and $\varphi$ represents the network parameters related to the segmentation task.
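For reference, a common soft formulation of the Dice loss is sketched below on flattened per-pixel values. The smoothing constant `eps` and the function name `dice_loss` are assumptions; the text does not specify which variant of the Dice loss was used.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss between predicted probabilities and a binary mask (flat lists)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    # eps keeps the ratio defined when both the prediction and the mask are empty.
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


if __name__ == "__main__":
    mask = [1, 0, 1, 1]
    print(dice_loss([1, 0, 1, 1], mask))          # perfect overlap
    print(dice_loss([0.9, 0.2, 0.7, 0.6], mask))  # partial overlap
```

The loss approaches zero as the overlap between prediction and mask grows, which makes it less sensitive to the imbalance between lesion and background pixels than a plain per-pixel loss.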

E. SURROGATE TASK
One of the challenges in medical image segmentation is providing a large annotated dataset to train the segmentation network. To tackle this issue, we propose integrating a supplementary decoding head $D_{US}$ to alleviate the lack of labelled data during the training process. To this end, we include an auxiliary loss function that reduces the dissimilarity between the representations of two augmented versions ($x_1$ and $x_2$) of the same image, where $x_i = \mathrm{Aug}(x)$ and $\mathrm{Aug}(\cdot)$ denotes a random augmentation function. We denote this loss as $\mathcal{L}(\theta, \gamma; \mathcal{D}^{U})$, where $\gamma$ represents the network parameters related to the surrogate task. Several surrogate tasks have been proposed in the literature, including predicting rotation [54], solving jigsaw puzzles [55], and filling in removed parts of an image [56]. Note, however, that skin tissue in dermoscopy has no canonical orientation. In contrast, consider a rotated car image with the wheels above the roof; in this example, the rotation is easy to predict [57]. Therefore, the surrogate task should be chosen to suit the nature of the application. To accomplish this, we use an auxiliary dataset $\mathcal{D}^{U}$ with $N$ unlabelled samples. We apply the data augmentation technique twice to each image, resulting in a pair of augmented images $(Y^U_{i,1}, Y^U_{i,2})$ for each image $X^U_i$ in the dataset. Using the MSE loss, we force the encoder module to learn a feature representation space that is robust to slight variations (i.e., we reduce the feature dissimilarity between the two augmented versions of the image). The choice of the MSE loss over other losses was empirical and is consistent with its common use as a reconstruction loss in autoencoders. Equation 4 formulates the MSE loss as follows:

$$\mathcal{L}(\theta, \gamma; \mathcal{D}^{U}) = \frac{1}{N} \sum_{i=1}^{N} \left\| D_{US}\big(E(Y^U_{i,1}; \theta); \gamma\big) - D_{US}\big(E(Y^U_{i,2}; \theta); \gamma\big) \right\|_2^2, \tag{4}$$

where $E(\cdot; \theta)$ denotes the coupled CNN and Transformer encoder. The final objective function during training is a weighted sum of two terms dedicated to the semantic segmentation task and the surrogate task, respectively. The first term, $\mathcal{L}(\theta, \varphi; \mathcal{D}^{S})$, computed on the labelled dataset $\mathcal{D}^{S}$, is a function of the parameters $\theta$ and $\varphi$ of the semantic segmentation encoder-decoder. The second term, $\mathcal{L}(\theta, \gamma; \mathcal{D}^{U})$, is a function of the encoder parameters $\theta$ and the surrogate network parameters $\gamma$. Equation 5 gives the joint loss function of the network:

$$\min_{\theta, \varphi, \gamma} \; \mathcal{L}(\theta, \varphi; \mathcal{D}^{S}) + \lambda \, \mathcal{L}(\theta, \gamma; \mathcal{D}^{U}), \tag{5}$$

where $\lambda$ is a regularization weight that controls the contribution of the surrogate task.
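The interplay between the MSE consistency term and the weighted joint objective can be illustrated with a toy example. Everything below is a simplified stand-in: `toy_encoder` is a hypothetical one-parameter linear map rather than the coupled CNN and Transformer encoder, `augment` adds small random noise as a placeholder for the unspecified augmentation function, and the value of λ in the usage lines is arbitrary.

```python
import random


def mse(a, b):
    """Mean squared error between two equally sized feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def augment(x, rng, noise):
    """Placeholder augmentation: perturb every value with uniform noise (assumption)."""
    return [v + rng.uniform(-noise, noise) for v in x]


def toy_encoder(x, weight):
    """Hypothetical stand-in for the encoder followed by the auxiliary decoder."""
    return [weight * v for v in x]


def surrogate_loss(x, weight, rng, noise=0.05):
    """MSE between the representations of two augmented views of the same input."""
    view_1 = toy_encoder(augment(x, rng, noise), weight)
    view_2 = toy_encoder(augment(x, rng, noise), weight)
    return mse(view_1, view_2)


def joint_loss(supervised_loss, unsupervised_loss, lam):
    """Weighted sum of the supervised term and the surrogate term."""
    return supervised_loss + lam * unsupervised_loss


if __name__ == "__main__":
    rng = random.Random(0)
    unlabelled = [0.1, 0.5, 0.9]
    l_us = surrogate_loss(unlabelled, weight=2.0, rng=rng)
    print(joint_loss(supervised_loss=0.2, unsupervised_loss=l_us, lam=0.1))
```

Because the surrogate term depends only on the encoder side of the network, it can be computed on images that have no ground truth mask, which is what allows the unlabelled set to contribute gradients.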

A. DATASETS
We applied our proposed method to three publicly available dermoscopic datasets to demonstrate the efficacy of our method: ISIC 2017, ISIC 2018, and PH2.

B. IMPLEMENTATION DETAILS
We implemented our proposed method using the PyTorch framework on a single NVIDIA RTX 3090 GPU. All samples from the datasets were resized to a resolution of 224 × 224. In all experimental settings, the network weights were initialized with ImageNet pre-trained weights. For better convergence, we used a polynomial learning rate decay with an initial learning rate of $1 \times 10^{-3}$, where $i$ denotes the $i$-th epoch of training. We set the batch size and the total number of epochs to 4 and 100, respectively. The SGD optimizer with momentum 0.9 and weight decay 0 is employed. To compensate for the small number of samples in the datasets and to improve the generalization of our network, we utilized unlabelled samples during the training process, so that the unsupervised technique enriches the encoder representation. Note that when training on the ISIC 2018 dataset, we used ISIC 2017 samples as the unlabelled dataset. Similarly, unlabelled samples from ISIC 2018 are used when training the model on the ISIC 2017 and PH2 datasets.
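A polynomial learning rate decay is commonly written as the initial rate multiplied by (1 − i / total epochs) raised to a power. The sketch below uses the initial rate of 1 × 10⁻³ and the 100 epochs stated above; the exponent of 0.9 is a widely used default and is an assumption here, as is the function name `poly_lr`.

```python
def poly_lr(epoch, total_epochs=100, base_lr=1e-3, power=0.9):
    """Polynomially decayed learning rate for a given epoch (power is an assumed value)."""
    return base_lr * (1.0 - epoch / total_epochs) ** power


if __name__ == "__main__":
    for epoch in (0, 25, 50, 75, 99):
        print(epoch, poly_lr(epoch))
```

The schedule starts at the initial rate and decreases smoothly to zero at the final epoch, which avoids the abrupt drops of a step schedule.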

C. EVALUATION METRICS
For performance evaluation and comparison with other methods, we used four metrics, including sensitivity, accuracy, and the Dice similarity coefficient (DSC). They are computed from the following quantities: TP and TN represent the numbers of correctly classified skin lesion pixels and background pixels, respectively; FP is the number of background pixels that are mislabelled as skin lesion; and FN denotes the number of skin lesion pixels that are incorrectly predicted as background.
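The standard definitions of pixel-wise segmentation metrics in terms of TP, TN, FP, and FN can be sketched as follows. Specificity is included on the assumption that it is one of the four reported metrics, and the function name `segmentation_metrics` and the counts in the usage lines are hypothetical.

```python
def segmentation_metrics(tp, tn, fp, fn):
    """Common pixel-wise metrics computed from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),                # lesion pixels that were found
        "specificity": tn / (tn + fp),                # background pixels kept as background
        "accuracy": (tp + tn) / (tp + tn + fp + fn),  # all correctly labelled pixels
        "dsc": 2 * tp / (2 * tp + fp + fn),           # overlap of prediction and mask
    }


if __name__ == "__main__":
    # Hypothetical pixel counts for a single image.
    for name, value in segmentation_metrics(tp=40, tn=50, fp=5, fn=5).items():
        print(name, round(value, 4))
```

Unlike accuracy, the DSC ignores true negatives, so it is not inflated when the background dominates the image.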

D. RESULTS ON THE ISIC 2017 DATASET
We compared our method with the SOTA approaches under the same conditions. Table 1 compares the proposed network with the seminal U-Net [15], Att U-Net [61], DAGAN [62], TransUNet [46], MCGU-Net [63], MedT [64], FAT-Net [65], and MSA-UNet [66]. Our network improved the DSC and accuracy of the seminal U-Net model by 8.99% and 4.27%, respectively. Furthermore, compared with the CNN-based approaches [15], [63], our network produces better results on all metrics, which indicates the effectiveness of both the Transformer module incorporated in our structure and the semi-supervised technique used in our strategy. In addition, compared with the recently proposed MSA-UNet [66], our method exhibits better performance owing to the strength of the unsupervised technique it uses. A visual comparison of the results is shown in Figure 2. As can be seen from Figure 2, our proposed method produces smooth and precise segmentation results along the object boundary and effectively separates skin lesions with irregular shapes and scales from the overlapping background.

E. RESULTS ON THE ISIC 2018 DATASET
We compared our method with SOTA methods from the literature, including the seminal U-Net [15], Att U-Net [61], DAGAN [62], TransUNet [46], MCGU-Net [63], MedT [64], FAT-Net [65], and MSA-UNet [66]. The evaluation settings are the same for all methods for a fair comparison. The statistical comparison is presented in Table 2. As Table 2 shows, MCGU-Net [63], with its attention mechanism and pretrained VGG backbone, performs better than the other CNN-based approaches. MSA-UNet [66] outperformed both the CNN-based and the Transformer-based approaches because it combines CNN and Transformer modules. More precisely, MSA-UNet uses a pyramidal feature representation to compensate for the loss of global context in challenging samples, although it still falls short of the results achieved by our method. In addition, compared with both the CNN and Transformer methods, our semi-supervised training strategy not only achieved the highest score on most of the metrics but also outperformed all supervised learning strategies. Moreover, a visual comparison is shown in Figure 3. Even in challenging samples, such as those with low contrast between the lesion region and the neighboring pixels, our method still performs well.

F. RESULTS ON THE PH2 DATASET
Finally, for further comparison, we evaluated our method alongside several SOTA approaches, including the seminal U-Net [15], Att U-Net [61], DAGAN [62], TransUNet [46], MCGU-Net [63], MedT [64], FAT-Net [65], and MSA-UNet [66], on the PH2 dataset. As in the previous experiments, the settings are the same for a fair comparison. The statistical comparison is presented in Table 3. Att U-Net [61], which uses an attention mechanism, achieved better performance than U-Net. In addition, MSA-UNet [66] used a combination of CNN and Transformer rather than conventional convolution alone, which helped its encoder extract more discriminative features and resulted in better performance than the other SOTA approaches. Our method, which uses the semi-supervised segmentation strategy, outperformed the other SOTA approaches. Visual results are shown in Figure 4. Our method handles complex brightness and contrast distributions, which demonstrates the robustness and generalization of our network.

G. ABLATION STUDY
In this section, we conducted an ablation study to evaluate the impact of the proposed semi-supervised technique used in our pipeline and of the Transformer module coupled with the CNN encoder to enrich the encoder representation. Different settings were used to analyze the contribution of each strategy. Our goal was to demonstrate how these techniques can be effectively incorporated into a skin lesion segmentation network (see Table 4). In addition, we trained our model without the auxiliary task to demonstrate the effect of the unsupervised technique incorporated in our strategy (denoted as Baseline + Transformer). According to our findings, each strategy contributes to the model's performance, and together they provide a strong representation of the network features. Based on the experimental results shown in Table 4, using the Transformer module along with the hierarchical features of the seminal U-Net (baseline) helps the model learn a multi-scale representation with rich and generic features and significantly increases the model's performance. Moreover, the generalization performance is further enhanced by incorporating the auxiliary task. Our finding is in line with the semi-supervised literature [57], which reports that the auxiliary task can enrich the segmentation encoder and consequently lead to better performance. Moreover, in terms of model selection, it should be noted that our method is not limited to a specific segmentation network, such as U-Net, and can be incorporated into any segmentation network for a performance gain. To visually analyze the effect of the suggested modules on the segmentation results, we provide sample comparisons in Figure 5. Incorporating each module visibly improves the segmentation results. Specifically, compared with the Atrous-based and Baseline + Transformer methods, the final setting (the proposed method) works well in the boundary area without over- or under-estimation.
It should also be noted that in some of our experiments, as can be seen in Figure 6, our method fails to segment the skin lesion area in agreement with the ground truth mask because of the noisy annotation provided by the dataset. In clinical applications, noisy annotations are a common scenario, and they can substantially decrease model performance. This illustrates why clean annotation is important in the training process.
To visualize the effectiveness of our suggested network compared with the SOTA approaches, we provide Figure 7, which shows the segmentation results achieved by the MCGU-Net [63] and MSA-UNet [66] approaches alongside those of our network. It can be observed that our method produces smooth segmentation results with precise boundary separation. Compared with MCGU-Net [63], our method estimates the skin lesion boundary better, and it is in line with the MSA-UNet [66] approach. It is also worth mentioning that for the second sample, MCGU-Net underestimates the segmentation map while MSA-UNet slightly overestimates it. In contrast, our method produces a better segmentation map for the second sample, with only slight FN predictions.

V. CONCLUSION
In this paper, we proposed a semi-supervised technique to enhance the semantic segmentation task. In our strategy, we incorporate unlabelled samples during the training process to encourage feature learning. Our suggested network offers a semi-supervised training scheme, wherein the first stage performs supervised training to learn the semantic segmentation map while the second stage uses an unsupervised technique to enrich the encoder module. Experimental results on three public datasets demonstrated the effectiveness of our approach for the semantic segmentation task.
VOLUME 10, 2022