CACDU-Net: A Novel DoubleU-Net Based Semantic Segmentation Model for Skin Lesions Detection in Images

Skin lesion segmentation is a critical task in the field of dermatology as it can aid in the early detection and diagnosis of skin diseases. Deep learning techniques have shown great potential in achieving accurate lesion segmentation. With the help of these techniques, the lesion segmentation process can be automated, thus reducing the impact of manual operations and subjective judgments. This aids in improving the work efficiency of medical professionals by saving their time and lowering their corresponding effort, and in enabling better allocation of healthcare resources. This paper proposes a novel CACDU-Net model, based on the DoubleU-Net model, for performing skin lesion segmentation better. For this, firstly, the proposed model adopts a pre-trained ConvNeXt-T as an encoding backbone network to provide rich image features. Secondly, specially designed ConvNeXt Attention Convolutional Blocks (CACB) are utilized by CACDU-Net to refine feature extraction by combining ConvNeXt blocks with multiple attention mechanisms. Thirdly, the proposed model utilizes a specially designed Asymmetric Convolutional Atrous Spatial Pyramid Pooling (ACASPP) module between the encoding and decoding parts, using atrous convolutions at different scales to capture contextual information at different levels. The image segmentation performance of the proposed model is evaluated against existing mainstream models on two skin lesion public datasets, ISIC2018 and PH2, as well as on a private dataset. The obtained results demonstrate that CACDU-Net achieves excellent results, especially based on the two core metrics used for the evaluation of image segmentation, namely the Intersection over Union (IoU) and Dice similarity coefficient (DSC), according to which it surpasses all other models. Moreover, experiments conducted on the PH2 dataset show that CACDU-Net has strong generalization ability.

The associate editor coordinating the review of this manuscript and approving it for publication was Gustavo Callico.
The etiology of skin cancer is complex, and it often occurs in skin tissues exposed to sunlight. When skin cells lose control of their growth, they can develop into skin cancer, with melanoma being the deadliest type. Young and middle-aged individuals account for about two-thirds of malignant melanoma cases, while individuals aged 65+ account for about one-third. In 2018, the estimated number of melanoma cases was 287,700, with 60,700 deaths [2]. In recent years, the incidence of skin cancer has continued to increase. Early diagnosis and timely treatment are the most effective ways to cure melanoma. However, if a person is diagnosed late, in the advanced stage, the survival rate is only 15%. Medical researchers have summarized several clinical diagnostic methods for melanoma based on the color, shape, texture, and visual features of pigmented networks and streaks in the skin lesion area under dermoscopy. These methods include the asymmetry, border, color, and differential structure (ABCD) rule [3], pattern analysis [4], Meng's method [5], and the seven-point feature method [6]. However, the complexity of the skin lesion area, such as body hair, borders, and blood vessels, greatly hinders medical personnel from making accurate judgments. Thus, skin lesion segmentation remains a challenging task.
Currently, skin lesion segmentation methods can be divided into two categories [7], [8]: (i) traditional machine learning (ML) methods [9], such as edge-based [10], region-based [11], threshold-based [12], [13], and clustering-based segmentation methods [14], [15]; and (ii) deep learning (DL) methods. Traditional ML image segmentation methods analyze the differences between the foreground and background of the image and manually design features from information such as grayscale, contrast, and texture in the image for segmentation. With the rise of ML, segmentation methods that extract features purely manually became the mainstream methods at that time. However, these methods can miss a lot of detailed information. Also, due to limitations such as the complexity of designing and extracting the features, traditional ML has limited room for further development in the field of segmentation. DL can fully utilize the intrinsic information of images and has thus gradually become the preferred technology in the field of image segmentation. With the rapid development of convolutional neural networks (CNNs) [16] in the field of image segmentation, there are already specialized medical segmentation models that have achieved great success in on-site and assisted diagnosis. Significant breakthroughs have also been made in the field of skin lesion segmentation. Ghafoorian et al. [17] proposed a multi-branch deep CNN (DCNN) for extracting multi-scale contextual features. However, their network is too shallow to extract high-resolution features. With the development of batch normalization (BN) [18] and residual structures [19], the problems of network degradation and vanishing gradients were alleviated, which made it possible to build deeper networks. Yu et al. [20] reported that deep architectures can extract highly discriminative features for skin lesion segmentation, but these networks ignore global features because they focus on local contexts, thus limiting the use of deep architectures in achieving more accurate results.
Recently, attention mechanisms have become popular in DL for extracting global features to enable accurate segmentation. In [21], attention mechanisms were used in combination with the popular U-Net architecture [22] to select discriminative features by weighting different channels for different organs with varying sizes, shapes, and other features. However, the use of a single attention mechanism failed in lesions with complex features.
The motivation of this paper was to develop a skin lesion segmentation model based on DoubleU-Net that employs multi-scale feature extraction modules for the extraction of highly discriminative deep features, on the one hand, and attention mechanisms for refining the features extracted by the decoder after upsampling, on the other hand. The result of these efforts was a novel CACDU-Net model (https://github.com/1194449282/CACDU-Net), which demonstrated excellent performance in image segmentation experiments conducted on the ISIC2018 [23] and PH2 [24] public datasets, and on our own private dataset.
The main contributions of the paper reflect three aspects:

1) The DoubleU-Net [25] network architecture is improved by employing two U-Net networks, named Network1 and Network2, each consisting of encoder and decoder parts. The latest ConvNeXt-T CNN [26], which utilizes a large 7 × 7 convolution, followed by downsampling, and extracts features in four stages, is employed in the encoding stage of Network1. Different attention mechanisms, combined with standard convolution and ConvNeXt blocks for feature extraction, are applied in both the decoding stage of Network1 and the encoding and decoding stages of Network2.

2) Specially designed ConvNeXt Attention Convolutional Blocks (CACB) are used to provide attention information in both channel and spatial dimensions, focusing on the lesion itself rather than on irrelevant information such as body hair, bubbles, vessels, and measurement scales. Additionally, the use of a stacked U-shaped architecture effectively combines multi-level features, capturing long-range dependencies to obtain a global contextual view that helps the network achieve accurate segmentation of skin lesions.

3) A newly designed Asymmetric Convolutional Atrous Spatial Pyramid Pooling (ACASPP) module is utilized between the encoding and decoding parts to provide multi-scale semantic information to the network, which is helpful for identifying lesions of different sizes. Asymmetric convolution is employed by ACASPP in conjunction with dilated convolution, whereby different shapes of asymmetric convolution absorb information from different angles and different dilation rates of dilated convolution capture information at various scales.

A. MEDICAL IMAGE SEGMENTATION
With the development of artificial intelligence, CNNs have gradually been applied to medical image segmentation. Fully convolutional networks (FCN) [27] were pioneers in image segmentation, being able to predict every pixel with end-to-end, pixel-to-pixel training, thus solving the problem of spatial resolution. In 2015, Ronneberger et al. [22] proposed a new end-to-end semantic segmentation network called U-Net, based on FCN, which is suitable for medical image segmentation. The difference between U-Net and FCN is that U-Net uses convolutional operations with the same number of layers in the upsampling and downsampling stages and connects the downsampling and upsampling layers by skip-connections. Therefore, the features extracted by the downsampling layers can be directly transmitted to the upsampling layers, thus improving the pixel localization and segmentation accuracy of the network. Specifically, U-Net is a U-shaped symmetric encoder-decoder network that uses skip-connections to merge high-level and low-level semantic features [54]. Zhou et al. [28] proposed UNet++, based on the U-Net framework, using a series of nested and dense skip-path connections between the encoder and decoder subnetworks, further reducing the semantic gap between the encoder and decoder, and achieving better performance in liver segmentation tasks. UNet++ replaces the cropping and concatenation operations in the skip-connections of U-Net with dense convolutional operations to obtain better feature information and compensate for the information loss caused by sampling [49]. Inspired by U-Net and ResNet [19], DoubleU-Net [25] uses two encoders and two decoders, instead of the single pair in the U-Net model, to improve segmentation accuracy. Later, U2-Net [29], named this way because each encoding and decoding layer is nested with U-Net, has shown significant improvements based on some evaluation metrics. Google transplanted Self-Attention (SA) from natural language processing [30] to computer vision and proposed ViT [31] as the backbone. Due to its powerful feature extraction ability, how to combine ViT and its variants with U-Net to obtain better results has been the focus of researchers in recent years [49]. For instance, Swin-Unet [32] combines Swin Transformer [33] with U-Net, and shows better segmentation results. Similarly, SegFormer [34], built on the Transformer architecture, not only demonstrates high performance but is also very efficient, achieving state-of-the-art results with fewer parameters than other semantic segmentation models. Recently, in 2022, Huang et al. proposed an efficient hierarchical encoder-decoder network called MISSFormer [35], which, due to its unique design components, exhibits improved ability to capture long-range dependencies and local context. In the current paper, a novel CACDU-Net model is proposed, based on DoubleU-Net, which demonstrates improved performance in skin lesion segmentation.
82450 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

B. ASYMMETRIC CONVOLUTION
Asymmetric convolution is a type of convolution operation used in CNNs. Compared to regular convolution, asymmetric convolution has more adjustable kernel-shape parameters and stronger feature extraction capability. In regular convolution, the kernel is usually square, with equal width and height, hence it is referred to as symmetric convolution. In contrast, asymmetric convolution allows the width and height of the kernel to be set to different values, enabling the model to better adapt to features of different shapes. EDA-Net [36] uses an efficient asymmetric convolutional dense module that decomposes a 3 × 3 convolution into 1 × 3 and 3 × 1 convolutions to reduce computational cost. However, its performance degrades in semantic segmentation. To address this issue, Ding et al. [37] proposed a one-dimensional asymmetric convolution to enhance features in the horizontal and vertical directions, and then aggregate the acquired information into a kernel layer to ensure good image recognition performance. Recently, MACU-Net, proposed by Li et al. [38], applied asymmetric convolution blocks to the field of semantic segmentation, successfully improving the representational power of convolutional layers.
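To make the cost argument behind the 1 × 3 and 3 × 1 decomposition concrete, the following is a minimal plain-Python sketch. It is not taken from EDA-Net or from the CACDU-Net code; the helper `conv2d_valid` and the kernel values are hypothetical. It shows that, for a square kernel that can be written as the outer product of a vertical and a horizontal filter, applying the two one-dimensional filters in sequence gives the same output while using 6 weights instead of 9.

```python
def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation without padding ('valid' mode)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical one-dimensional kernels (illustrative values only).
vertical = [[1.0], [2.0], [1.0]]   # shape 3 x 1
horizontal = [[-1.0, 0.0, 1.0]]    # shape 1 x 3

# Their outer product is the equivalent square 3 x 3 kernel.
square = [[v[0] * h for h in horizontal[0]] for v in vertical]

# Small deterministic test image (6 x 6).
image = [[float((r * 5 + c) % 7) for c in range(6)] for r in range(6)]

two_step = conv2d_valid(conv2d_valid(image, horizontal), vertical)
one_step = conv2d_valid(image, square)

weights_square = 3 * 3       # weights per input/output channel pair
weights_asymmetric = 3 + 3   # weights for the 1 x 3 plus 3 x 1 pair
```

Only kernels of this outer-product form are exactly separable; a general 3 × 3 kernel is not, which is consistent with the accuracy trade-off mentioned above.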

C. ATROUS CONVOLUTION
In the DL field, atrous convolution (also known as dilated convolution) was first proposed in the DeepLab v1 model [38], [39] in 2014 to increase the receptive field and improve the accuracy of image segmentation. Subsequently, more and more DL models began to adopt dilated convolution. In 2015, Szegedy et al. [40] used dilated convolution in the Inception v3 network, which helped to capture a wider range of contextual information and thus improved the performance of image classification and object detection. In 2016, He et al. [19] proposed ResNet, which can better capture detailed information in images through the use of dilated convolution, which helps improve the performance of tasks such as image classification and object detection. In experiments, ResNet showed excellent results on the ImageNet dataset. To fully utilize the image features extracted by deep and shallow networks, a common solution is to fuse multi-scale features [54]. In semantic segmentation, parallel multi-branch structures are usually used to fuse features with different receptive fields. In the DeepLab v2 model proposed by Chen et al. [41] in 2017, an Atrous Spatial Pyramid Pooling (ASPP) module is used as a simple and effective decoder module for clear segmentation. The ASPP module uses dilated convolution with multiple different sampling rates in parallel and fuses them through pooling operations to extract feature information at different scales. This design can effectively capture object features of different sizes and resolutions, thereby improving the performance of the segmentation model. During the feature extraction process, shallow layers contain small receptive fields to represent geometric details, while deep layers contain large receptive fields to represent semantic information [54]. 
Based on the ASPP module, an Asymmetric Convolutional Atrous Spatial Pyramid Pooling (ACASPP) module is proposed further in this paper for use by the elaborated CACDU-Net model, which uses a different mechanism to adjust kernel sizes.

D. ATTENTION
Attention is a commonly used technique in DL. It assigns different weights to input information based on its relevance, and these weights can be adjusted under different circumstances. Therefore, attention mechanisms offer clear advantages in scalability and robustness. In the field of medical image segmentation, Oktay et al. [21] proposed Attention-UNet based on the U-Net network, which is a novel attention gate (AG) network for medical image processing that can focus more accurately on the regions of interest, suppress irrelevant features, and highlight useful features. Hu et al. [42] proposed the Squeeze and Excitation (SE) attention mechanism and verified it in multiple computer vision tasks, demonstrating that it can significantly improve the performance and generalization ability of CNN models. Woo et al. [43] proposed the Convolutional Block Attention Module (CBAM). Given an intermediate feature map, CBAM sequentially infers attention maps along two separate dimensions, channel and spatial, and then multiplies the attention maps with the input feature map element-wise for adaptive feature refinement. Recently, Transformer models have also been widely applied in the field of medical image segmentation. Their self-attention module captures long-range dependencies, whereas convolution only collects information from neighboring pixels. However, Transformers require training on large-scale datasets to obtain satisfactory results, which poses difficulties in their application to small medical image datasets. In summary, embedding appropriate attention modules at suitable locations in the network for skin lesion segmentation can reduce the impact of irrelevant information such as body hair and bubbles, and obtain more accurate segmentation results [54].

III. PROPOSED CACDU-NET MODEL
This section first introduces the overall structure of the proposed CACDU-Net model, shown in Figure 1, and then describes the details of each module.

A. OVERALL STRUCTURE
As shown in Figure 1, the proposed CACDU-Net model consists of two stacked U-Net structures, namely Network1 and Network2, which utilize different encodings to extract features and perform skip connections. More specifically, Network1 is used to extract coarser features, while Network2 is utilized to extract finer features. This design allows the model to achieve superior segmentation performance at different scales, thereby improving the overall segmentation accuracy. It is worth noting that the prediction results of Network1 are passed through a Sigmoid function to become the weights of Network2's input. Specifically, the Sigmoid function is applied to the output of Network1, an image of size 256 × 256 × 1, to transform it into a weight map of the same size, with values ranging from 0 to 1. Then, this weight map is multiplied element-wise with the input image of size 256 × 256 × 3 (the single-channel map is applied to each of the three channels), resulting in the input for Network2, which also has a size of 256 × 256 × 3. This enables Network2 to obtain high evaluation scores at an early stage and accelerate the prediction process. The superiority of this architecture is confirmed by the conducted ablation experiments, presented in Section IV.
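The hand-off between the two networks can be illustrated with a small plain-Python sketch. This is an assumption-laden toy, not the authors' implementation: the function name `gate_input`, the 2 × 2 image, and the logit values are all hypothetical, and a real model would perform the same operation on 256 × 256 tensors in a DL framework.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_input(image, logits):
    """Scale every channel of each pixel by sigmoid(logit) of that pixel.

    image:  H x W x 3 nested list (the original input image)
    logits: H x W nested list (raw single-channel output of Network1)
    """
    return [
        [
            [channel * sigmoid(logits[r][c]) for channel in image[r][c]]
            for c in range(len(image[0]))
        ]
        for r in range(len(image))
    ]


# Toy 2 x 2 RGB image and toy Network1 logits (hypothetical values).
image = [[[0.2, 0.4, 0.6], [1.0, 1.0, 1.0]],
         [[0.5, 0.5, 0.5], [0.0, 0.3, 0.9]]]
logits = [[0.0, 10.0],
          [-10.0, 2.0]]

network2_input = gate_input(image, logits)
```

Pixels that Network1 scores as confidently lesion (large positive logit) pass through almost unchanged, while pixels scored as background are suppressed towards zero, so Network2 starts from an input already focused on the candidate lesion region.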

B. NETWORK1
Figure 2 illustrates the overall structure of Network1. It can be seen that this network adopts a U-shaped architecture composed of encoding, middle, and decoding parts. Skip connections are inserted between the encoding and decoding parts to pass data through. ConvNeXt-T [26], pre-trained on the ImageNet dataset, is used as the encoding part. The middle part performs multi-scale feature extraction using ASPP, with dilation rates set to 6, 12, and 18. Unlike the decoding part of U-Net, all traditional 3 × 3 convolution kernels are replaced by ConvNeXt Attention Convolutional Blocks (CACB), described in Subsection III-E.

C. NETWORK2
Figure 3 shows the overall structure of Network2. It can be seen that this network also adopts a U-shaped architecture, which is completely symmetrical and composed of encoding, middle, and decoding parts. Skip connections are used to pass data through, and the network receives and aggregates features encoded by Network1. The middle part performs multi-scale feature extraction using an ACASPP module, with dilation rates set to 6, 12, and 18. Both the encoding and decoding parts use CACB blocks, which accelerate feature propagation and information flow. Unlike Network1, Network2 is designed to further extract features from the input data.

D. CONVNEXT
ConvNeXt [26] is a CNN designed to improve feature extraction capability and model performance. Starting from ResNet, it borrows many successful ideas from the Transformer, with improved accuracy and efficiency due to the larger kernel sizes and depthwise convolutions used. Five versions of the ConvNeXt network were proposed by its authors, namely T/S/B/L/XL, each involving four stages. The only difference between these versions relates to the number of channels and the number of repeated stacked blocks used in each stage [49]. The ConvNeXt-T network refers to the version with the smallest depth and width. Each feature resolution stage of the ConvNeXt-T network consists of multiple residual ConvNeXt blocks (Figure 4a).
As shown in Figure 4b, each ConvNeXt block includes a 7 × 7 depthwise convolution, two 1 × 1 layers, and a non-linear Gaussian error linear unit (GELU) activation [44]. Layer normalization (LN) [45] is used before the first 1 × 1 convolution layer. Unlike traditional designs, ConvNeXt replaces standard 3 × 3 convolutions with depthwise convolutions, uses an inverted bottleneck structure, and employs GELU and LN instead of the Rectified Linear Unit (ReLU) and BN [18], with fewer activation functions and larger convolution kernels of up to 7 × 7. As shown in Figure 4c, the ConvNeXt network utilizes a separate downsampling layer to downsample the features.
As the design of ConvNeXt-T ensures both accuracy and efficiency, it was chosen as the backbone network of the proposed CACDU-Net model.
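A rough parameter count helps explain why a 7 × 7 depthwise convolution remains affordable. The sketch below is illustrative only: the function names are hypothetical, the expansion ratio of 4 in the inverted bottleneck and the example width of 96 channels are assumptions drawn from the general ConvNeXt design rather than from this paper, and extras such as layer scaling are ignored.

```python
def convnext_block_params(dim, expansion=4, kernel_size=7):
    """Approximate trainable parameters of one ConvNeXt-style block.

    Assumed layout: depthwise k x k conv -> LayerNorm -> 1 x 1 conv (expand)
    -> GELU -> 1 x 1 conv (project). Biases are included.
    """
    depthwise = dim * kernel_size * kernel_size + dim    # one k x k filter per channel
    layer_norm = 2 * dim                                 # scale and shift
    expand = dim * (expansion * dim) + expansion * dim   # 1 x 1 conv, C -> 4C
    project = (expansion * dim) * dim + dim              # 1 x 1 conv, 4C -> C
    return depthwise + layer_norm + expand + project


def dense_conv_params(dim, kernel_size=7):
    """Parameters of a standard (non-depthwise) k x k convolution, C -> C."""
    return dim * dim * kernel_size * kernel_size + dim


block_params = convnext_block_params(96)   # 79200 under the stated assumptions
dense_params = dense_conv_params(96)       # 451680 for a single dense 7 x 7 layer
```

Under these assumptions, the whole block costs far fewer parameters than a single dense 7 × 7 convolution of the same width, because the large kernel mixes information only spatially and the channel mixing is delegated to the two 1 × 1 layers.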

E. CONVNEXT ATTENTION CONVOLUTIONAL BLOCK (CACB)
A single attention mechanism is insufficient to achieve satisfactory results in complex lesion segmentation. CBAM [43] integrates spatial and channel attention mechanisms to better extract useful feature information, reduce sensitivity to noise and irrelevant features, and improve model accuracy and robustness. Inspired by CBAM, a ConvNeXt Attention Convolutional Block, abbreviated as CACB, is proposed here, as shown in Figure 5. It consists of a 3 × 3 convolution, a ConvNeXt block, a CBAM channel attention module, and a CBAM spatial attention module; the two attention modules extract channel and spatial attention features, respectively. The 3 × 3 convolution is followed by BN and a ReLU activation function.
The channel attention block aims to make the neural network focus on global features and suppress unnecessary features such as body hair, measurement scales, blood vessels, and bubbles [54]. This module performs global max pooling and global average pooling on each channel of the input feature map, generating two vectors of shape $\mathbb{R}^{C \times 1 \times 1}$ (where $C$ represents the number of channels). These two vectors are then fed into a multilayer perceptron (MLP), which reduces the number of parameters by sharing weights between them. The MLP contains only one hidden layer, whose activation has a shape of $\mathbb{R}^{C/r \times 1 \times 1}$ (where $r$ represents the reduction ratio, which is set to 16 in this paper). The MLP is implemented through two fully connected layers and generates two processed channel attention vectors. Finally, these two vectors are added element-wise and processed by a Sigmoid activation function, and the result is restored to the same size as the input feature map. The channel attention block functioning is summarized in [54] as follows:

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big) \quad (1)$$

where $F$ denotes the input feature map, $\sigma$ denotes the Sigmoid activation function, $F^c_{avg}$ and $F^c_{max}$ denote the feature maps obtained after global average pooling and global max pooling along the channel dimension, respectively, and $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ denote the weights of the MLP.
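The channel attention steps described above can be sketched in plain Python as follows. This is a toy under stated assumptions, not the CACB implementation: the function names and all weight values are hypothetical, the ReLU in the hidden layer follows the original CBAM design rather than anything stated here, and the toy uses C = 4 with r = 2 instead of the paper's r = 16 so that the example stays readable.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def shared_mlp(descriptor, w0, w1):
    """C -> C/r -> C, using the same weights for both pooled descriptors."""
    hidden = [max(0.0, h) for h in matvec(w0, descriptor)]  # ReLU (assumed)
    return matvec(w1, hidden)


def channel_attention(feature_map, w0, w1):
    """feature_map: list of C channels, each an H x W nested list."""
    flat = [[v for row in channel for v in row] for channel in feature_map]
    avg_desc = [sum(values) / len(values) for values in flat]  # global average pooling
    max_desc = [max(values) for values in flat]                # global max pooling
    summed = zip(shared_mlp(avg_desc, w0, w1), shared_mlp(max_desc, w0, w1))
    weights = [sigmoid(a + m) for a, m in summed]
    refined = [[[v * weights[c] for v in row] for row in channel]
               for c, channel in enumerate(feature_map)]
    return weights, refined


# Toy input: C = 4 channels of size 2 x 2, reduction ratio r = 2.
feature_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[-1.0, 1.0], [-1.0, 1.0]],
    [[5.0, 5.0], [5.0, 5.0]],
]
w0 = [[0.1, 0.2, 0.3, 0.4], [-0.4, -0.3, -0.2, -0.1]]      # shape (C/r) x C
w1 = [[0.5, -0.5], [0.25, 0.25], [-1.0, 1.0], [0.0, 0.1]]  # shape C x (C/r)

weights, refined = channel_attention(feature_map, w0, w1)
```

Each channel receives a single weight between 0 and 1, so the block can only rescale whole channels; deciding where in the image to look is left to the spatial attention block.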
Different from the channel attention block, the spatial attention block can capture long-range dependency relationships to obtain a global contextual view, and selectively aggregate contextual information according to the spatial attention map to achieve more accurate segmentation of skin lesion boundaries [54]. The spatial attention block is more sensitive to lesion edges whose color is similar to that of the surrounding skin and thus can effectively extract the curve structure features of the edges. More specifically, average pooling and max pooling operations are first performed along the channel axis of the feature map to identify the regions with the maximum information in the feature map. Then, the results of the pooling operations are concatenated to create an efficient feature descriptor. Next, a convolutional layer is applied to the concatenated feature descriptor to generate the spatial attention map, which indicates the positions that should be emphasized or suppressed in the feature map. The specific operations are shown below:

$$M_s(F) = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) = \sigma\big(f^{7 \times 7}([F^s_{avg}; F^s_{max}])\big) \quad (2)$$

where $f^{7 \times 7}$ denotes a convolution operation with a filter size of 7 × 7, while the channel information of the 2D feature map is represented by $F^s_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^s_{max} \in \mathbb{R}^{1 \times H \times W}$ (where $H$ denotes the height and $W$ denotes the width), respectively.
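The spatial attention steps described above (pooling along the channel axis, concatenation, a 7 × 7 convolution, and a Sigmoid) can be sketched in plain Python. The function name, the zero padding at the borders, and the kernel values are assumptions made for illustration; a trained model would learn the 2 × 7 × 7 kernel.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spatial_attention(feature_map, kernel):
    """feature_map: C x H x W nested lists; kernel: 2 x k x k weights (k odd).

    Returns an H x W attention map with values in (0, 1).
    """
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    # Pool along the channel axis to get two H x W descriptors.
    avg_map = [[sum(feature_map[c][r][q] for c in range(channels)) / channels
                for q in range(width)] for r in range(height)]
    max_map = [[max(feature_map[c][r][q] for c in range(channels))
                for q in range(width)] for r in range(height)]
    pooled = [avg_map, max_map]  # concatenated 2-channel descriptor
    k = len(kernel[0])
    pad = k // 2
    attention = []
    for r in range(height):
        row = []
        for q in range(width):
            acc = 0.0
            for ch in range(2):
                for i in range(k):
                    for j in range(k):
                        rr, qq = r + i - pad, q + j - pad
                        if 0 <= rr < height and 0 <= qq < width:  # zero padding
                            acc += pooled[ch][rr][qq] * kernel[ch][i][j]
            row.append(sigmoid(acc))
        attention.append(row)
    return attention


# Toy 3 x 8 x 8 feature map and two hypothetical 2 x 7 x 7 kernels.
feature_map = [[[float((c + 1) * (r - q)) for q in range(8)]
                for r in range(8)] for c in range(3)]
uniform_kernel = [[[1.0 / 49.0] * 7 for _ in range(7)] for _ in range(2)]
zero_kernel = [[[0.0] * 7 for _ in range(7)] for _ in range(2)]

attention_map = spatial_attention(feature_map, uniform_kernel)
neutral_map = spatial_attention(feature_map, zero_kernel)
```

With an all-zero kernel every position receives the neutral weight 0.5, which is a convenient sanity check; any learned kernel moves individual positions above or below that value.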

F. ASYMMETRIC CONVOLUTIONAL ATROUS SPATIAL PYRAMID POOLING (ACASPP)
As reported in [37], square convolution kernels capture features with uneven scales. Specifically, the weights at the center crossing position (i.e., the kernel skeleton) have larger magnitudes, while the points in the corners contribute less to feature extraction. This design deficiency of the square convolution kernel can be compensated by the use of asymmetric convolution kernels. As shown in Figure 6a, ASPP applies convolutions with different atrous rates in parallel to extract features within different receptive field ranges, thus capturing multi-scale information. Based on this, the idea of asymmetric convolution proposed in [37] is combined with dilated convolution to design a novel ACASPP module, used by the proposed model to capture features from different receptive fields. As shown in Figure 6b, each dilation rate has two corresponding branches, i.e., a 1 × 3 convolution (horizontal kernel) and a 3 × 1 convolution (vertical kernel), each followed by BN and ReLU to improve numerical stability, in order to obtain a cross-shaped receptive field. The 3 × 3 convolution in ASPP captures features with a larger receptive field, while the horizontal and vertical kernels ensure the saliency of features on the skeleton, expanding the width of the network [49]. Then, the branches are concatenated, and a 3 × 3 convolution is used to restore the number of channels to that of the input. Finally, the result is added element-wise to the ASPP result and outputted. If $y[i]$ denotes the output signal and $x[i]$ denotes the input signal, then the atrous convolution can be represented as:

$$y[i] = \sum_{k=1}^{K} x[i + d \cdot k]\, w[k] \quad (3)$$

where $w[k]$ denotes the $k$-th kernel weight, $K$ denotes the kernel size, and $d$ denotes the dilation rate.
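A one-dimensional plain-Python sketch of atrous convolution makes the role of the dilation rate visible. The function names are hypothetical, the kernel index is zero-based in the code, and the effective-extent formula k + (k − 1)(d − 1) is the standard result for dilated kernels rather than something stated in this section.

```python
def atrous_conv1d(x, w, d):
    """y[i] = sum over k of x[i + d * k] * w[k], for every i where all taps fit."""
    span = d * (len(w) - 1)
    return [sum(x[i + d * k] * w[k] for k in range(len(w)))
            for i in range(len(x) - span)]


def effective_kernel_size(kernel_size, d):
    """Extent covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (d - 1)


x = list(range(10))   # toy input signal 0, 1, ..., 9
w = [1, 0, -1]        # toy difference kernel (hypothetical)

y_rate1 = atrous_conv1d(x, w, 1)   # compares samples 2 apart
y_rate2 = atrous_conv1d(x, w, 2)   # compares samples 4 apart, same 3 weights

# Extent of a 3-tap kernel at the dilation rates used in the middle parts.
extents = [effective_kernel_size(3, d) for d in (6, 12, 18)]
```

With rates 6, 12, and 18, a 3-tap kernel spans 13, 25, and 37 positions respectively, so the parallel branches see context at three clearly separated scales without any additional weights. The same holds along each axis for the 1 × 3 and 3 × 1 kernels, which is what produces the cross-shaped receptive field.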
If H k×k,r (x), where r denotes the dilation rate, represents an operation consisting of Conv2d convolution, BN, and ReLU activation function, then ASPP can be expressed as follows: and ACASPP is given by:

G. LOSS FUNCTION
Composite loss functions, especially those related to Dice, often achieve better segmentation results and higher model performance than single loss functions. In medical image segmentation, class imbalance often occurs during experiments, which can result in model training being biased towards densely distributed pixel classes, making it difficult for the model to learn the features of small objects and thus reducing the network's performance. Therefore, a combination of loss functions is used in the conducted experiments for segmentation supervision. The Binary Cross Entropy (BCE) loss function is widely used in various fields, including semantic segmentation. When using BCE, each pixel is evaluated independently, ignoring the labels of its neighbors, and lesion pixels and background pixels are weighted equally, which greatly helps the network convergence. Because the BCE loss can more effectively calculate the gradient values corresponding to different categories during backpropagation, the problem of vanishing gradients can be better tackled when using it. The BCE loss is defined as follows:

$$L_{BCE} = -\frac{1}{N}\sum_{i}\big[g_i \log p_i + (1 - g_i)\log(1 - p_i)\big] \quad (6)$$

where the sum runs over all $N$ pixels of the image, $g_i$ denotes the segmentation result of pixel $i$ produced by a physician, and $p_i$ denotes the segmentation result of pixel $i$ produced by the network.
The DSC loss function is named after the Dice similarity coefficient (DSC), which is a metric used to evaluate the similarity between two samples. The DSC loss function performs well in scenarios where there is a severe imbalance between positive and negative samples. During model training, it focuses more on mining the foreground region, making the predicted results closer to the actual results. However, when the predicted results are not identical to the pixel-wise ground truth, the DSC loss function can have a negative impact on backpropagation, which makes training a model difficult. On the other hand, using the DSC loss function can reduce the occurrence of overfitting. The DSC loss is defined as follows:

$$L_{DSC} = 1 - \frac{2\sum_{i} p_i g_i}{\sum_{i} p_i + \sum_{i} g_i} \quad (7)$$

where the sums run over all pixels $i$, and $p_i$ and $g_i$ denote the network prediction and the physician's annotation for pixel $i$, respectively.

To accelerate convergence of the network, alleviate the impact of gradient vanishing, minimize the class imbalance issues during backpropagation, and improve skin disease segmentation, a combination of these two loss functions is used for training the proposed model, as follows:
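As a rough illustration of the two losses and of one possible way to combine them, the following plain-Python sketch operates on flattened pixel lists. The function names, the clamping constant `eps`, the smoothing term `smooth`, and in particular the equal weights in `combined_loss` are assumptions made for this example; they are not the settings used for CACDU-Net.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy; pred in [0, 1], target in {0, 1}."""
    total = 0.0
    for p, g in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(g * math.log(p) + (1 - g) * math.log(1.0 - p))
    return total / len(pred)


def dice_loss(pred, target, smooth=1e-6):
    """Soft Dice loss: 1 - 2 * sum(p * g) / (sum(p) + sum(g))."""
    intersection = sum(p * g for p, g in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def combined_loss(pred, target, w_bce=1.0, w_dice=1.0):
    """Weighted sum of both losses; the weights here are hypothetical."""
    return w_bce * bce_loss(pred, target) + w_dice * dice_loss(pred, target)


# Toy example: 8 pixels, only 2 of which belong to the lesion.
target = [1, 1, 0, 0, 0, 0, 0, 0]
good_pred = [0.9, 0.8, 0.1, 0.1, 0.0, 0.0, 0.1, 0.0]
all_background = [0.0] * 8

loss_good = combined_loss(good_pred, target)
loss_trivial = combined_loss(all_background, target)
```

The all-background prediction is right on 6 of 8 pixels, yet its Dice loss is close to 1 because no lesion pixel is recovered. This is the class-imbalance behaviour that motivates adding a Dice term to the pixel-wise BCE term.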

A. DATASETS AND DATA PREPROCESSING
In the experiments, the International Skin Imaging Collaboration Challenge dataset (ISIC2018) [23], the PH2 dataset [24], and a private dataset are used. ISIC2018 is currently the largest skin lesion image dataset in the world, providing professionally annotated digital skin lesion images to facilitate the development of computer-aided diagnosis (CAD) for melanoma and other skin cancers [54]. The PH2 dataset was jointly collected by the Pedro Hispano Hospital in Matosinhos, Portugal, and the Dermatological Services Department of the University of Porto. The private dataset was provided by Peking Union Medical College Hospital and includes skin lesion images of acne and lupus erythematosus. ISIC2018 contains 2594 dermoscopy images with segmentation mask labels. For the experiments, this dataset was randomly divided into training, validation, and test sets at a ratio of 7:1:2. Prior to model training, one third of the training images were randomly selected and additional random body hair was simulated on them by means of a computer program. Additionally, during the training process, operations such as horizontal flipping, vertical flipping, random brightness, Gaussian blur, mean smoothing filtering, and random hue saturation were applied on the ISIC2018 training set (Figure 7). It should be noted that none of these additional operations was applied on the validation and test sets. The PH2 dataset, containing only 200 images, served as an additional set for testing the models trained on the ISIC2018 dataset. The private dataset, containing 1010 images, was randomly divided into training, validation, and test sets at a ratio of 8:1:1 for conducting experiments on it. Table 1 shows details of this splitting of the datasets for conducting the experiments for model performance comparison. The ablation study experiments, presented in Subsection IV-D4, were performed only on the ISIC2018 dataset.
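A random split at a fixed ratio can be sketched in a few lines of plain Python. The function name, the seed, and the rounding rule (round down, give the remainder to the last subset) are assumptions for illustration, so the resulting subset sizes may differ slightly from those reported in Table 1.

```python
import random


def split_indices(n_items, ratios, seed=0):
    """Shuffle indices 0..n_items-1 and cut them according to the given ratios."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    total = sum(ratios)
    subsets, start = [], 0
    for ratio in ratios[:-1]:
        size = int(n_items * ratio / total)  # round down
        subsets.append(indices[start:start + size])
        start += size
    subsets.append(indices[start:])  # last subset takes the remainder
    return subsets


# ISIC2018-sized example: 2594 images at a 7:1:2 ratio.
train_ids, val_ids, test_ids = split_indices(2594, (7, 1, 2))

# Private-dataset-sized example: 1010 images at an 8:1:1 ratio.
private_train, private_val, private_test = split_indices(1010, (8, 1, 1))
```

Fixing the seed keeps the three subsets identical across runs, which matters when several models are compared on the same test images.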

B. EXPERIMENTAL ENVIRONMENT
The experiments were conducted in PyTorch version 1.12.1 [46], using Python version 3.10.6 and the Ubuntu 22.04 operating system. All experiments were conducted on a computer equipped with a 12th Gen Intel® Core™ i5-12400 CPU, 16 GB RAM, and an NVIDIA GeForce RTX 3060 with 12 GB memory. The number of training epochs was set to 150. The Adam optimizer [47] was used with an initial learning rate of 1e-4, weight decay of 1e-6, momentum of 0.9, and batch size of 8. The input image size was set to 256 × 256 pixels for all models, except for Swin-Unet and MISSFormer, for which a size of 224 × 224 pixels was used.

C. EVALUATION METRICS
In the experiments, six evaluation metrics were used to measure the segmentation performance of compared models, namely the Intersection over Union (IoU), DSC, accuracy, sensitivity, specificity, and precision.
IoU, also known as the Jaccard index, is one of the most commonly used metrics in semantic segmentation. IoU is defined as the ratio of the overlap area between the predicted segmentation and the ground truth to their union area. In our case, it is calculated as:

$$IoU = \frac{TP}{TP + FP + FN} \quad (9)$$

where TP (true positives) represents the number of pixels correctly identified as being part of an object (i.e., a skin lesion, in our case), FN (false negatives) represents the number of pixels incorrectly identified as not being part of an object, and FP (false positives) represents the number of pixels incorrectly identified as being part of an object. DSC has become the most universally used metric in the evaluation of image segmentation models. It is defined as twice the overlap area between the predicted segmentation and the ground truth divided by the sum of pixels in both of them. DSC is calculated as follows:

$$DSC = \frac{2\,TP}{2\,TP + FP + FN} \quad (10)$$

Accuracy (Acc) is used to evaluate the overall pixel-level segmentation performance, calculated as follows:

$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \quad (11)$$

where TN (true negatives) represents the number of pixels correctly identified as not being part of an object. Sensitivity (Sen) represents the proportion of skin lesion pixels that are correctly segmented, as follows:

$$Sen = \frac{TP}{TP + FN} \quad (12)$$

Specificity (Spe) is defined as the proportion of non-lesion pixels that are correctly segmented, as follows:

$$Spe = \frac{TN}{TN + FP} \quad (13)$$

Precision (Pre) represents the proportion of pixels predicted as lesion that are truly lesion pixels, as follows:

$$Pre = \frac{TP}{TP + FP} \quad (14)$$
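The six metrics follow directly from the four confusion counts, as the plain-Python sketch below shows on a toy pair of flattened binary masks. The function names and the mask values are illustrative; a practical implementation would also guard against empty masks, where some denominators become zero.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN for two flattened binary masks (1 = lesion)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    return tp, fp, fn, tn


def segmentation_metrics(tp, fp, fn, tn):
    return {
        "IoU": tp / (tp + fp + fn),
        "DSC": 2 * tp / (2 * tp + fp + fn),
        "Acc": (tp + tn) / (tp + tn + fp + fn),
        "Sen": tp / (tp + fn),
        "Spe": tn / (tn + fp),
        "Pre": tp / (tp + fp),
    }


# Toy masks with 8 pixels each (hypothetical values).
pred = [1, 1, 1, 0, 0, 0, 0, 1]
truth = [1, 1, 0, 0, 0, 0, 1, 1]

counts = confusion_counts(pred, truth)   # (3, 1, 1, 3)
metrics = segmentation_metrics(*counts)  # IoU = 0.6, all others = 0.75
```

IoU and DSC are monotonically related through DSC = 2·IoU / (1 + IoU), so they always rank two models in the same order on a single image, although averages over a test set can differ.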

D. RESULTS AND ANALYSIS
The proposed CACDU-Net model was compared to the mainstream medical image segmentation models by conducting experiments on the aforementioned three datasets, the results of which are displayed in this subsection.

1) ISIC2018 DATASET
The ISIC2018 public dataset contains a relatively large number of skin images, including many difficult-to-segment images [54]. Therefore, the results obtained on this dataset are the most convincing among the three datasets used in the experiments, and the ablation study experiments, presented further below, were conducted only on this dataset. Table 2 presents the segmentation performance comparison results obtained by state-of-the-art models on this dataset using experimental configurations identical to those used for the proposed CACDU-Net model (the best result on each metric is shown in bold). Here, CACDU-Net achieved excellent results, especially based on the two core evaluation metrics used in image segmentation, namely IoU and DSC, according to which it outperforms all other models. More specifically, the first runner-up (DoubleU-Net) scored 0.0189 points less for IoU and 0.0120 points less for DSC. In addition, based on accuracy, the proposed CACDU-Net model also outperformed all models in skin lesion segmentation by leaving the first runner-up (U2-Net) behind by 0.0046 points. According to the other three evaluation metrics used, the proposed CACDU-Net model also performed well in this group, taking second place on sensitivity, third place on precision, and a shared fourth place on specificity. Figure 8 illustrates the loss variation curves of the proposed CACDU-Net model on both the training and validation sets, as well as its DSC and IoU training and validation curves. Figure 9 shows the Receiver Operating Characteristic (ROC) curves of the compared models, along with their Area Under the ROC curve (AUC) values, achieved on this dataset. As can be seen from this figure, the proposed CACDU-Net model clearly outperforms all other models, as its ROC curve is the closest one to the upper left corner, which indicates the best trade-off between the true positive rate and the false positive rate across decision thresholds.
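The paper does not detail how the ROC curves and AUC values were computed, so the following is only a generic sketch of the usual pixel-level procedure: a threshold is swept over the per-pixel lesion scores, each threshold yields one (false positive rate, true positive rate) point, and the AUC is obtained with the trapezoidal rule. The function names `roc_points` and `auc`, as well as the toy scores and labels, are assumptions introduced for this example.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points obtained by sweeping a threshold over the scores."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative pixel")
    # Visit pixels from the highest score to the lowest one.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        threshold = scores[order[i]]
        # Pixels with tied scores are handled together and yield a single point.
        while i < len(order) and scores[order[i]] == threshold:
            if labels[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under a piecewise-linear curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Toy example: six pixels with predicted lesion probabilities and true labels.
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    toy_labels = [1, 1, 0, 1, 0, 0]
    curve = roc_points(toy_scores, toy_labels)
    print("ROC points:", curve)
    print("AUC:", auc(curve))
```

A curve that passes through the upper left corner (FPR = 0, TPR = 1) gives an AUC of 1, whereas a random ranking of pixels gives an expected AUC of about 0.5.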
A visual comparison of the skin lesion segmentation results, achieved by different models on this dataset, is shown in Figure 10. Table 3 presents the segmentation performance comparison results obtained on the same dataset by other state-of-the-art models, whose results are taken from the specified literature sources (the best result on each metric is shown in bold). In this group, CACDU-Net also demonstrated excellent results, especially based on the two core evaluation metrics used in image segmentation (i.e., IoU and DSC) according to which it outperformed all considered models. More specifically, the first runners-up (ICL-Net and TransCeption, respectively) scored 0.0037 points less for IoU and 0.0010 points less for DSC. In addition, based on specificity and precision, the proposed CACDU-Net model also outperformed all considered models by leaving the first runners-up (TransCeption and M-CSAFN) behind by 0.0046 and 0.0081 points, respectively. Regarding the other metrics, CACDU-Net also performed well, by taking second place on accuracy and fourth place on sensitivity.

2) PH2 DATASET
In order to test the segmentation performance of the trained model on a new dataset and verify its generalization and robustness capabilities, experiments were conducted on the PH2 public dataset, which contains only 200 images. For this, the proposed model was trained on the ISIC2018 training set and tested on all PH2 images. Table 4 presents the segmentation performance comparison results obtained by these experiments (the best result on each metric is shown in bold). Again, the proposed CACDU-Net model outperformed all mainstream models based on the two main evaluation metrics, by scoring 0.0115 and 0.0074 points higher than the first runner-up (SegFormer) for IoU and DSC, respectively. In addition, based on accuracy, CACDU-Net also outperformed all mainstream models by leaving the first runner-up (SegFormer) behind by 0.0043 points. According to the other three evaluation metrics used, the proposed CACDU-Net model also performed relatively well, taking second place on sensitivity, sixth place on precision, and seventh place on specificity. In particular, CACDU-Net performed better in segmenting larger lesions, where U-Net usually failed to segment the entire lesion area, with shapes significantly different from ground truth images. These results demonstrate that the additionally introduced modules indeed improve the segmentation performance and lead to good generalization capabilities. Figure 11 shows the ROC curves of the compared models, along with their AUC values, achieved on this dataset. As can be seen from this figure, the proposed CACDU-Net model clearly outperforms all other models, as its ROC curve is the closest one to the upper left corner, which indicates the best trade-off between the true positive rate and the false positive rate across decision thresholds.
A visual comparison of the skin lesion segmentation results, achieved by different models on this dataset, is shown in Figure 12.

3) PRIVATE DATASET
Next, experiments were conducted on the private dataset. Compared with the ISIC2018 dataset, this dataset contains a smaller number of images, resulting in shorter training time and faster convergence of the network. However, the lesions in this dataset are shallower and have blurred edges, making segmentation more difficult. Table 5 presents the segmentation performance comparison results obtained by experimenting on this dataset (the best result on each metric is shown in bold). Again, the proposed CACDU-Net model outperformed all mainstream models based on the two core evaluation metrics, by scoring 0.0099 and 0.0072 points higher than the first runner-up (MISSFormer) for IoU and DSC, respectively. In addition, based on accuracy, CACDU-Net also outperformed all mainstream models by leaving the first runner-up (MISSFormer) behind by 0.0009 points. According to the other three evaluation metrics used, the proposed CACDU-Net model also performed relatively well, taking second place on sensitivity, fourth place on precision, and fifth place on specificity. Not achieving the best results on these three metrics indicates that the proposed model has certain shortcomings and limitations in accurately detecting diseased regions and excluding non-diseased regions. Figure 14 shows the ROC curves of the compared models, along with their AUC values, achieved on this dataset. As can be seen from this figure, the proposed CACDU-Net model clearly outperforms all other models, as its ROC curve is the closest one to the upper left corner, which indicates the best trade-off between the true positive rate and the false positive rate across decision thresholds.
The visual comparison of the skin lesion segmentation results achieved by different models on this dataset, shown in Figure 15, illustrates that existing mainstream models can effectively predict larger lesions, while CACDU-Net has a significant advantage in predicting multiple smaller lesions.

4) ABLATION STUDY
In order to verify whether each of the newly designed modules does indeed improve the network performance, ablation study experiments were conducted with U-Net and DoubleU-Net models, used as baselines, on the ISIC2018 dataset. The results of these experiments are shown in Tables 6 and 7 (the best result on each metric is shown in bold).
Considering the challenges posed by lesions with different shapes, colors, and blurry edges, contained in the images, the step-by-step addition of the designed modules to U-Net and DoubleU-Net demonstrated gradual improvement, compared to the baselines and configurations used in the previous steps, on five (out of six) evaluation metrics relative to U-Net and on four evaluation metrics relative to DoubleU-Net (including the two main metrics, IoU and DSC). For instance, the combined integration of all three designed modules into the original U-Net model gave up first place (to the 'U-Net+ConvNeXt+ACASPP' configuration) only on precision, and by only 0.0014 points. For DoubleU-Net, the 'DoubleU-Net+ConvNeXt+ACASPP' configuration demonstrated the best result for precision; however, for sensitivity, the best result was achieved by the baseline.
On the other hand, these experiments also demonstrated that DoubleU-Net has overall better performance than U-Net. Thus, it was preferred as a basis for the development of the proposed CACDU-Net model.

V. CONCLUSION
Fast and accurate segmentation of skin lesions is crucial for the subsequent treatment of melanoma and other skin cancers. Traditional methods are time-consuming and labor-intensive, heavily dependent on tuning a large number of parameters. In light of this, the paper has proposed a newly designed U-shaped encoder-decoder neural network model, called CACDU-Net. Firstly, it utilizes a pre-trained ConvNeXt-T network as an encoding part to provide rich image features, which allowed it to achieve high values of evaluation metrics at the beginning of the training, thus increasing the network's inference speed. Secondly, the proposed model uses specially designed ConvNeXt Attention Convolutional Blocks (CACB) to provide attention information in both channel and spatial dimensions, focusing on the lesion itself rather than on irrelevant information such as body hair, bubbles, vessels, and measurement scales. Additionally, the use of a stacked U-shaped architecture perfectly combines multilevel features, capturing long-term dependencies in obtaining a global contextual view to help the network achieve accurate segmentation of skin lesions. Thirdly, CACDU-Net utilizes a newly designed ACASPP module, inserted between the encoding and decoding parts, to provide multi-scale semantic information to the network, which is helpful for identifying lesions of different sizes. Based on ASPP, ACASPP adds an asymmetrically structured dilated convolution to finely extract multi-scale information and enhances the network's robustness. In terms of the loss function, a weighted sum of the commonly used binary cross entropy (BCE) and Dice similarity coefficient (DSC) loss functions is used to define a new loss function for solving the problem of extremely uneven numbers of positive and negative samples.
Results obtained by experiments conducted on three skin lesion image datasets confirmed that the proposed CACDU-Net model outperforms all existing mainstream models on at least half of the six evaluation metrics used, including the two main metrics for image segmentation evaluation, namely IoU and DSC. In addition, the proposed model demonstrated robustness and strong adaptability to multi-interference images, at the expense of utilizing a relatively large and computationally expensive neural network.
Importantly, the designed modules proposed in this paper can be used on their own in various U-shaped encoding-decoding networks to enhance their segmentation performance, which constitutes an additional contribution made in the area for use in practical applications.
In the future, we plan to explore the following research routes. Firstly, given the width of the network (96, 192, 384, 768) in the encoding stage of the ConvNeXt-T structure in Network1, adding appropriate operations to fully combine it with the U-Net network structure may further improve the segmentation accuracy. Secondly, we will explore simple post-processing methods, such as connected component analysis, constrained optimization, and linear or nonlinear smoothing, which may also help improve network performance. Thirdly, we will attempt to apply the proposed model to other medical imaging-related tasks, such as lung segmentation, heart segmentation, breast segmentation, and retinal vessel segmentation. We think that using the proposed CACDU-Net model for performing these medical image segmentation tasks, combined with appropriate preprocessing and post-processing techniques, can yield more advanced segmentation results.