ConvNeXt-ST-AFF: A Novel Skin Disease Classification Model Based on Fusion of ConvNeXt and Swin Transformer

Automatic classification of dermatological images is an important technology that assists doctors in performing faster and more accurate classification of skin diseases. Recently, convolutional neural networks (CNNs) and Transformer networks have been employed to learn the local and global features of lesion images, respectively. However, existing works mainly focus on utilizing a single neural network for feature extraction, which limits the model's classification performance. To tackle this problem, a novel fusion model, named ConvNeXt-ST-AFF, is proposed in this paper, combining the strengths of ConvNeXt and Swin Transformer (ConvNeXt-ST in the model's name). In the proposed model, the pretrained ConvNeXt and Swin Transformer networks extract local and global features from images, which are then fused using Attentional Feature Fusion (AFF) submodules (AFF in the model's name). Additionally, to enhance the model's attention on the regions of skin lesions during training, an Efficient Channel Attention (ECA) module is incorporated into the ConvNeXt network. Moreover, the proposed model employs a denoising module to reduce the influence of artifacts and improve the image contrast. The results of experiments conducted on two datasets demonstrate that the proposed ConvNeXt-ST-AFF model achieves higher classification performance, according to multiple evaluation metrics, than the original ConvNeXt and Swin Transformer, as well as other state-of-the-art classification models.


I. INTRODUCTION
Malignant melanoma is a type of skin cancer characterized by abnormal growth of skin cells, [1]. If not diagnosed early, malignant melanoma has a high mortality rate, [2]. According to data released by the World Health Organization, there were 104,350 reported cases of skin cancer and 11,650 deaths in the United States in 2019, [3]. Research has shown that early treatment of malignant melanoma can significantly reduce the mortality rate of patients, [4]. Dermoscopy is an essential tool used by physicians for diagnosing skin cancer, [5]. It helps reduce the reflection effect on the skin surface, providing doctors with clearer and more detailed images of lesions. This enables dermatologists to observe deeper characteristics of the affected areas. However, melanoma shares very similar features with other skin diseases, and even experienced dermatologists can achieve an accuracy rate of only around 75% in diagnosing melanoma using dermoscopy, [6]. Therefore, the diagnosis of melanoma remains a time-consuming and error-prone process.
In recent years, researchers have proposed artificial intelligence (AI)-assisted diagnostic methods to help physicians achieve faster and more accurate diagnosis of melanoma, [7]. Unlike traditional methods, such as the 7-point checklist [8] and the ABCD rule [9], which rely on the color and shape of the lesion area for classification, the AI-assisted diagnostic methods utilize deep features extracted from images to classify skin diseases, [10]. However, the extraction of deep features from images faces the following challenges: (1) the contrast between normal skin regions and lesion areas in dermoscopy images is low, making it difficult for a neural network to focus on the skin lesion area during feature extraction; (2) dermoscopy images often contain artifacts, such as body hair and blood vessels, which adversely affect the extraction of meaningful features; (3) the intra-class variation of lesion area features is high, while the inter-class similarity of lesion area features between different classes is also high, thus complicating the classification process.
To date, various methods have been proposed to address these challenges, but the accuracy of classification is still not ideal, [11]. In the field of AI-assisted diagnosis, convolutional neural networks (CNNs) and Transformer networks [12] are the mainstream feature extraction networks. CNNs traverse the feature maps of images using convolutional kernels of different sizes to extract features from different positions of the images. The advantage of CNNs lies in extracting local features of images, [13]. The core module of Transformer networks is the multi-head self-attention module, which can capture global features of images based on global contextual information, [14], [15]. In recent years, researchers have attempted to improve classification accuracy through model fusion, [16]. However, existing fusion methods mostly focus on fusing features extracted by different CNN networks [17], and do not effectively leverage both the local and global features of images. Additionally, researchers often preprocess dermoscopy images before feature extraction to remove image artifacts and improve contrast, [18]. To overcome the aforementioned problems, a novel fusion model is proposed here for skin disease classification.
The following are the main contributions of the paper:
1) The incorporation of an Efficient Channel Attention (ECA) module into the ConvNeXt network is proposed, enhancing the model's attention on the regions of skin lesions during training.
2) A denoising module is employed in the preprocessing step to reduce the influence of artifacts and improve the image contrast.
3) The utilization of Attentional Feature Fusion (AFF) submodules is proposed for feature fusion, allowing the model to dynamically allocate weights based on the importance of local and global features in the input during the training process, thereby enhancing the quality of fused features.

The remaining structure of this paper is the following. Section II provides an overview of related work done in the field of skin disease classification, including a brief summary of existing models and their corresponding research outcomes. Section III presents a detailed description of the proposed model. Section IV describes the experimental setup and the results of the experimental performance comparison of the proposed model with state-of-the-art models. Finally, Section V concludes the paper by summarizing the contributions of this study and setting up future directions for research.

II. RELATED WORK
CNNs have been widely used for image feature extraction, leveraging different-sized convolutional kernels to capture local features at different positions of the input images, [19]. With advancements in neural network technology, CNN models, such as VGG [20], ResNet [21], DenseNet [22], EfficientNet [23], and ConvNeXt [24], have gained significant attention in image classification. To focus more on the lesion areas, CNN models are often combined with attention mechanisms to improve classification performance. For instance, Zhang et al. [25] designed an Attention Residual Learning (ARL) module that combines residual learning and attention learning mechanisms to construct an Attention Residual Learning Convolutional Neural Network (ARL-CNN). These authors conducted experiments on the ISIC2017 dataset, demonstrating that their model adapts to focusing on the skin lesion areas during training. Wan et al. [26] proposed a Multi-Scale Long Attention Network (MSLANet) that fuses contextual information through three Long Attention Networks (LANets). Additionally, MSLANet extracts multi-scale information through self-supervised learning. Their network achieved area under the curve (AUC) values of 93.7% and 92.4% on the ISIC2017 and ISIC2020 datasets, respectively. However, CNN models are primarily adopted for extracting local features, making them less effective at capturing contextual information and long-range dependencies, thereby limiting their ability to extract global features.
In recent years, Transformer networks, based on the multi-head self-attention mechanism, have become popular for extracting global features by capturing global contextual information, [15]. Researchers have made various attempts to explore the classification performance of Transformer networks in the field of skin disease classification. For instance, Ayas [27] utilized the Swin Transformer model for skin disease classification and introduced a weighted cross-entropy loss to address class imbalance. The model achieved sensitivity and specificity of 82.3% and 97.9%, respectively, on the ISIC2019 dataset. He et al. [14] designed a Fully Transformer Network (FTN) that learns long-range contextual information to improve the baseline performance of CNNs in skin disease classification. These authors introduced a Spatial Pyramid Pooling (SPP) module in the multi-head attention (MHA) to reduce computational complexity and memory consumption. Cai et al. [28] employed the Vision Transformer (ViT) as a backbone network for extracting deep image features and fused the extracted features with patient metadata. Their method achieved an accuracy of 93.81% on the ISIC2018 dataset. However, these Transformer-based models tend to capture global features while neglecting local features.
To effectively utilize different types of features, designing feature fusion architectures is an effective approach that considers the feature extraction preferences of different networks to improve classification accuracy. In [29], Mahbod et al. proposed a Multi-Scale Multi-CNN (MSM-CNN) method for fusing features extracted by three CNN models on six different scales of cropped images, achieving a balanced multi-class accuracy of 86.2% on the ISIC2018 dataset. Maqsood and Damaševičius [17] presented a unified Computer-Aided Diagnosis (CAD) model for segmentation and classification of skin lesions. Their model fused features extracted by four pre-trained CNNs using a convolutional sparse image decomposition fusion method and employed univariate measurement and Poisson distribution feature selection methods to select the best classification features. The classification accuracy of this model on the HAM10000, ISIC2018, ISIC2019, and PH2 datasets was 98.57%, 98.62%, 93.47%, and 98.98%, respectively. However, existing works tend to focus on designing feature fusion architectures based on CNN models, without effectively combining the local and global features of images.
Therefore, combining the feature extraction advantages of both CNNs and Transformers to effectively utilize the local and global features of images for skin disease classification is a promising research direction.

III. PROPOSED MODEL: ConvNeXt-ST-AFF
In contrast to the methods employed in [16], which fuse image features extracted by VGG16, AlexNet, ResNet-18, and ResNet-101, the presented research opts for the fusion of two distinct types of feature extraction networks, namely CNNs and Transformers. This approach aims to effectively utilize both local and global image features for classification. Inspired by the work presented in [13], we chose to merge the features extracted from each block of the ConvNeXt and Swin Transformer networks, rather than solely fusing the final features. This strategy enables the Swin Transformer network to continually provide global feature information to the ConvNeXt network. To address the limitations of fixed-weight fusion methods, such as summation and concatenation, the proposed model employs Attentional Feature Fusion (AFF) submodules, allowing the network to dynamically adjust the weights of each input feature during training. To further enhance the integration of local and global image features, a channel shuffle operation is incorporated within each AFF submodule to improve information interaction between different channel weights. Additionally, to mitigate the class imbalance issues of the datasets utilized in the experiments, data augmentation is applied to the images before network training. Furthermore, denoising techniques are employed to reduce image noise and enhance contrast in dermatoscopic images. Based on these considerations, the proposed ConvNeXt-ST-AFF model was developed, with its overall structure illustrated in Figure 1.
The overall process of skin disease image classification can be divided into three steps, each contributing to enhancing the model's capability to effectively classify skin disease images:
1) The first step involves using a denoising module to remove artifacts and pseudo-features from the input skin disease images, while simultaneously enhancing their contrast. This preprocessing step helps improve the overall quality of the images, making it easier for subsequent modules to extract meaningful features.
2) In the second step, each denoised image is fed into the feature extraction module. In the proposed model, this module incorporates a Swin Transformer network and a ConvNeXt network, both of which are pre-trained on large-scale datasets. The Swin Transformer excels at capturing global patterns and long-range dependencies in the images, while the ConvNeXt network focuses on extracting local features. By combining both global and local features, the proposed model gains a comprehensive understanding of the skin disease images. Additionally, AFF submodules are utilized in this step to effectively fuse the features extracted by the Swin Transformer and ConvNeXt networks. The AFF submodules merge the local and global features, enabling the model to capture synergistic information from both sources, leading to more discriminative feature representations.
3) In the final step, the fused features from the feature extraction module are fed into the classification head, where the final classification of the skin disease images takes place.
The pseudo-algorithm of the proposed model is shown in Figure 2.
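To make the flow concrete, the following is a minimal PyTorch sketch of this three-step pipeline. The denoiser and extractor arguments, the feature dimension, and the class count are hypothetical stand-ins for the paper's modules, not the exact implementation.

```python
import torch
import torch.nn as nn

class ConvNeXtSTAFF(nn.Module):
    """Sketch of the three-step pipeline; the denoiser and extractor
    modules and the head size are illustrative assumptions."""
    def __init__(self, denoiser: nn.Module, extractor: nn.Module,
                 feat_dim: int = 768, num_classes: int = 8):
        super().__init__()
        self.denoiser = denoiser    # step 1: artifact removal, contrast enhancement
        self.extractor = extractor  # step 2: ConvNeXt + Swin Transformer + AFF fusion
        self.head = nn.Linear(feat_dim, num_classes)  # step 3: classification head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.denoiser(x)   # denoised image, same resolution as input
        f = self.extractor(x)  # fused local + global feature vector (B, feat_dim)
        return self.head(f)    # class logits
```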
In the following subsections, a detailed description of the functionality of each main module is separately provided.

A. DENOISING MODULE
The presence of artifacts, such as body hair and blood vessels, in skin disease images may have a negative impact on the model classification performance, [30]. Image denoising is an effective method for reducing artifacts in images, [31]. When using traditional wavelet-based image denoising methods, it is necessary first to determine the type of noise present in an image, such as Gaussian noise or salt-and-pepper noise, and then choose specific wavelets for denoising based on the noise type. Additionally, parameter adjustments need to be made manually for different types of images during the denoising process.
In recent years, neural network-based image denoising methods have demonstrated excellent performance. The most significant difference between them and traditional methods lies in their generalization ability. Through extensive training on image data, neural network-based methods automatically learn model parameters that adapt to different image data and noise distributions during training, without relying on manual parameter tuning. This results in better generalization across various types of images and noise distributions, [32].
Therefore, in our study, we employed neural network-based denoising methods. Specifically, we utilized the design of RED-Net [33] to construct a smaller denoising network that employs an end-to-end training approach. It takes the original skin lesion image as input, trains the network to understand the distribution of artifacts in skin lesion images, selectively reduces their impact, and ultimately outputs a denoised clean image. As shown in Figure 3, the denoising module used in this study consists of an encoder, a decoder, and skip connections. The encoder comprises five 2D convolutional blocks for feature extraction, while the decoder consists of five 2D transpose convolutional blocks for recovering the denoised image.
The denoising module of the proposed model employs symmetric convolutional layers and transpose convolutional layers to ensure consistency in the sizes of the input and output images. Specifically, the convolutional layers serve as an encoder for extracting image information. They are capable of learning the essential information within the image and eliminating features, including noise, that are irrelevant to network training. However, this process may result in the loss of certain image details. On the other hand, the transpose convolutional layers primarily focus on restoring fine details in the image. They upsample the low-resolution image output from the encoder, ensuring that the recovered image maintains the same resolution as the original image. Inspired by U-Net and VGG-UNet [34], we have introduced skip connections between the symmetric convolutional and transpose convolutional layers to better leverage shallow features, containing more detailed information, and deep features, containing more semantic information, for image recovery. Since the denoising module is used in the preprocessing step before image feature extraction, we have intentionally avoided using any pooling operations within the denoising module. This is because pooling operations have the potential to discard valuable details from the original image, resulting in the loss of critical information in the recovered image.
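A minimal PyTorch sketch of such a denoiser is given below. The channel width, kernel sizes, and exact placement of the skip connections are illustrative assumptions, since the text specifies only the block counts, the symmetry, and the absence of pooling.

```python
import torch
import torch.nn as nn

class SmallDenoiser(nn.Module):
    """RED-Net-style sketch: five conv blocks (encoder), five transpose-conv
    blocks (decoder), symmetric skip connections, and no pooling."""
    def __init__(self, ch: int = 64):
        super().__init__()
        self.enc = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(3 if i == 0 else ch, ch, 3, padding=1),
                           nn.ReLU(inplace=True)) for i in range(5)])
        self.dec = nn.ModuleList(
            [nn.Sequential(nn.ConvTranspose2d(ch, 3 if i == 4 else ch, 3, padding=1),
                           nn.ReLU(inplace=True) if i < 4 else nn.Identity())
             for i in range(5)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips, h = [], x
        for block in self.enc:
            h = block(h)
            skips.append(h)          # keep encoder features for the skips
        for i, block in enumerate(self.dec):
            h = block(h)
            if i < 4:
                h = h + skips[3 - i] # symmetric skip connection
        return h                     # denoised image, same resolution as input
```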
The denoising module not only reduces the impact of artifacts in the images but also effectively enhances the contrast between lesion and non-lesion areas.

B. FEATURE EXTRACTION MODULE
The feature extraction module of the proposed model utilizes ConvNeXt and Swin Transformer as feature extraction networks.
Due to the high cost of acquiring dermatological data, training a network from scratch solely using a dermatological dataset may result in insufficient training, leading to a lack of generalization ability.This problem can be addressed through the application of transfer learning.
Transfer learning is a machine learning method that leverages knowledge transferred from related learned tasks to a new task, enabling a network to achieve higher performance with limited training data, [3]. In the context of image classification tasks, it is common to employ a network pre-trained on a large-scale image dataset, such as ImageNet [35], and fine-tune it for the new task.
Mathematically, the operation of transfer learning is defined, as per [17], as:

$$d_s = \{(u_1^s, v_1^s), (u_2^s, v_2^s), \ldots, (u_x^s, v_x^s)\}, \quad u_i^s, v_i^s \in \psi,$$

where $d_s$ denotes the source domain, whose learning task is defined as $L_s$. The operation for the target domain $d_t$ is defined as:

$$d_t = \{(u_1^t, v_1^t), (u_2^t, v_2^t), \ldots, (u_y^t, v_y^t)\}, \quad u_i^t, v_i^t \in \psi,$$

where $x$ and $y$ denote the sizes of the training data in the two domains ($x \ll y$), and $v_i^s$ and $v_i^t$ denote the labels of the training data. The learning task for the target domain is defined in [17] as $L_t$. By employing this operation, all modified networks are trained on a specific dataset. The specific training process of transfer learning is illustrated in Figure 4.
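In practice, this amounts to loading ImageNet-pretrained weights and replacing the classification head before fine-tuning on the dermatological data. A sketch using torchvision (≥ 0.13) follows; the choice of the tiny variants and the 8-class head (for ISIC2019) are assumptions, as the exact network sizes are not restated here.

```python
import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained backbones and replace their classification
# heads for the new 8-class task; all layers are then fine-tuned.
convnext = models.convnext_tiny(weights=models.ConvNeXt_Tiny_Weights.IMAGENET1K_V1)
convnext.classifier[2] = nn.Linear(convnext.classifier[2].in_features, 8)

swin = models.swin_t(weights=models.Swin_T_Weights.IMAGENET1K_V1)
swin.head = nn.Linear(swin.head.in_features, 8)
```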

1) AFF SUBMODULE
Skip connections are a crucial component of CNNs. They merge low-level detail features with high-level semantic features through summation or concatenation operations, enabling the network to obtain richer semantic representations, [36]. However, traditional summation- or concatenation-based feature fusion methods can only assign fixed weights to different features.
To enhance the feature fusion operation in skip connections, the proposed model incorporates Attentional Feature Fusion (AFF) submodules [37]. AFF leverages two input features to generate dynamic fusion weights, allowing the network to adaptively select features based on their importance. The specific implementation process of the AFF submodule, depicted in Figure 5, is the following. Firstly, features X and Y are combined through an initial feature fusion operation. The fused feature is then fed into a Multi-Scale Channel Attention Module (MS-CAM), which generates a fusion weight α (ranging from 0 to 1), used as the weight for feature X, while (1 − α) is used as the weight for feature Y. These weights allow the network to perform adaptive feature fusion based on the importance of features X and Y. Through this attention-based feature fusion process, the network can dynamically adapt the weighted fusion of features X and Y, thereby enhancing the performance of the model.
The calculation process of the AFF submodule is defined in [37] as follows:

$$Z = M(X \uplus Y) \otimes X + (1 - M(X \uplus Y)) \otimes Y,$$

where $Z \in \mathbb{R}^{C \times H \times W}$ denotes the fused feature, $\otimes$ denotes element-wise multiplication, and $\alpha = M(X \uplus Y)$ denotes the fusion weight, where $\uplus$ represents the initial feature fusion and $M$ represents MS-CAM, which plays a pivotal role in aggregating local and global features. MS-CAM, shown in Figure 6, utilizes point-wise convolution with a 1×1 convolutional kernel to transform the channel dimension of the feature map. This selection offers the advantage of reducing the computational cost associated with using different-scale convolutional kernels. Through MS-CAM, the AFF submodule can effectively leverage information from different scales of the image and facilitate the effective fusion of local and global features. MS-CAM utilizes two branches to extract channel attention weights. The first branch employs Global Average Pooling to extract global feature attention, while the second branch directly uses Point-Wise Convolution (PWConv) to extract local feature attention. The local feature $L(X) \in \mathbb{R}^{C \times H \times W}$ is calculated, as shown in [37], as:

$$L(X) = \mathcal{B}(\text{PWConv}_2(\delta(\mathcal{B}(\text{PWConv}_1(X))))),$$

where the convolution kernels for PWConv$_1$ and PWConv$_2$ are of size $C/r \times C \times 1 \times 1$ and $C \times C/r \times 1 \times 1$, respectively, $\mathcal{B}$ denotes Batch Normalization, and $\delta$ denotes the ReLU activation. Given the local feature $L(X)$ and the global feature $g(X)$, the weight $\alpha$ in MS-CAM is calculated as follows:

$$M(X) = \sigma\big(L(X) \oplus g(X)\big),$$

where $M(X) \in \mathbb{R}^{C \times H \times W}$ denotes the attention weights generated by MS-CAM, $\oplus$ denotes element-wise summation, and $\sigma$ denotes the sigmoid function.
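The following PyTorch sketch illustrates MS-CAM and the AFF weighting defined above, assuming element-wise addition as the initial fusion ⊎ and a channel reduction ratio r = 4 (both assumptions).

```python
import torch
import torch.nn as nn

class MSCAM(nn.Module):
    """Multi-Scale Channel Attention Module [37]: a local branch of
    point-wise convolutions L(X) and a global branch g(X) built on
    global average pooling; M(X) = sigmoid(L(X) + g(X))."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        inter = max(channels // r, 1)
        def pwconv_block():
            return nn.Sequential(
                nn.Conv2d(channels, inter, 1), nn.BatchNorm2d(inter),
                nn.ReLU(inplace=True),
                nn.Conv2d(inter, channels, 1), nn.BatchNorm2d(channels))
        self.local_att = pwconv_block()               # L(X), keeps H x W
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), pwconv_block())  # g(X), 1 x 1 map
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # g(X) broadcasts over H x W in the element-wise summation
        return self.sigmoid(self.local_att(x) + self.global_att(x))

class AFF(nn.Module):
    """Z = alpha * X + (1 - alpha) * Y with alpha = M(X + Y)."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.mscam = MSCAM(channels, r)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        alpha = self.mscam(x + y)  # initial fusion: element-wise addition
        return alpha * x + (1.0 - alpha) * y
```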

2) SWIN TRANSFORMER SUBMODULE
The Swin Transformer [38] is utilized as a global feature extraction submodule for image processing by the proposed model. Figure 7 illustrates the structure of two consecutive Swin Transformer blocks, each consisting of a Windows Multi-Head Self-Attention (W-MSA) or Shifted W-MSA (SW-MSA) subblock, a Multi-Layer Perceptron (MLP) subblock (shown in Figure 8), two LayerNorm (LN) normalization layers, a DropPath layer, and two skip connections. W-MSA and SW-MSA subblocks are applied alternately in two consecutive Swin Transformer blocks.
The computational flow for the two consecutive Swin Transformer blocks is defined in [38] as follows:

$$\hat{z}^{l} = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1},$$
$$z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l},$$
$$\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^{l})) + z^{l},$$
$$z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1},$$

where $\hat{z}^{l}$, $z^{l}$, $\hat{z}^{l+1}$, and $z^{l+1}$ denote the output features of the W-MSA and MLP subblocks of the first block, and the output features of the SW-MSA and MLP subblocks of the second block, respectively.

The W-MSA subblock partitions the input feature map into multiple non-overlapping windows and computes self-attention within each window. This design significantly reduces the computational complexity of the network. However, due to the window partitioning, the use of W-MSA can lead to information loss between windows. To address this issue, the Swin Transformer submodule employs a SW-MSA subblock. The SW-MSA subblock incorporates an offset matrix $B \in \mathbb{R}^{M^2 \times M^2}$ when calculating self-attention, as follows [38]:

$$\text{Attention}(Q, K, V) = \text{SoftMax}\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V,$$

where $Q, K, V \in \mathbb{R}^{M^2 \times d}$ denote the query, key, and value matrices, respectively, $d$ denotes the dimension of the query and the key, and $M^2$ denotes the number of patches in a window. The values in matrix $B$ are obtained from the offset matrix $\hat{B} \in \mathbb{R}^{(2M-1) \times (2M-1)}$.
The SW-MSA subblock facilitates information propagation between windows. By combining the W-MSA and SW-MSA subblocks, the Swin Transformer submodule is able to efficiently process large-scale images and effectively capture their global features.
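For illustration, a minimal single-head version of the windowed self-attention with the offset term is sketched below; the multi-head projections, window partitioning, and cyclic shift of SW-MSA are omitted.

```python
import torch
import torch.nn.functional as F

def window_self_attention(q: torch.Tensor, k: torch.Tensor,
                          v: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    """Attention(Q, K, V) = SoftMax(Q K^T / sqrt(d) + B) V within one window.
    q, k, v: (num_windows, M*M, d); bias: (M*M, M*M) offset matrix B."""
    d = q.size(-1)
    attn = q @ k.transpose(-2, -1) / d ** 0.5 + bias  # scaled scores plus B
    return F.softmax(attn, dim=-1) @ v
```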

3) CONVNEXT SUBMODULE
In the feature extraction module of the proposed model, a ConvNeXt network [24] is used as a submodule for local feature extraction from the skin disease images. The ConvNeXt network consists of ConvNeXt blocks and Downsample blocks. As illustrated in Figure 9a, each ConvNeXt block includes skip connections, depthwise convolution layers, LN normalization layers, MLP blocks, Layer Scale, and DropPath. The downsample block structure is depicted in Figure 9c.
The attention mechanism can enhance the network's focus on the lesion regions during the training process, thereby improving the model performance, [39]. After comparing the effects of different attention modules, we ultimately chose to incorporate an Efficient Channel Attention (ECA) subblock [40] into the original ConvNeXt block to enhance its feature extraction capability. The structure of the ConvNeXt block, modified in this fashion, is illustrated in Figure 9b. ECA employs a channel attention mechanism that generates corresponding weight values for each channel in the feature map based on their importance. This allows the network to pay more attention to important feature channels. The implementation process follows these steps. When the feature map is input into ECA, it first undergoes global average pooling. Then, ECA adaptively computes the size of a one-dimensional convolution kernel and generates the weights for each channel in the feature map. Finally, the normalized weights are multiplied with the original feature map to produce the channel-weighted feature map. The ECA structure is depicted in Figure 10.
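A compact PyTorch sketch of an ECA subblock following these steps is given below; the adaptive kernel-size rule with γ = 2 and b = 1 follows the original ECA paper [40] and is assumed with respect to the settings used here.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: global average pooling, a 1-D convolution
    across channels with an adaptively sized kernel k, and a sigmoid that
    yields per-channel weights."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1                 # kernel size must be odd
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = x.mean(dim=(2, 3))                    # global average pooling -> (B, C)
        y = self.conv(y.unsqueeze(1))             # 1-D conv across channels
        w = self.sigmoid(y).squeeze(1)            # per-channel weights in (0, 1)
        return x * w[:, :, None, None]            # channel-weighted feature map
```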

4) FEATURE FUSION
Different feature extraction networks have different preferences for extracting features. Combining features extracted from multiple networks can improve the adaptability of the fused features and thus enhance classification performance. Therefore, in the proposed model, ConvNeXt and Swin Transformer networks are selected as the two feature extraction networks. The Swin Transformer network can establish long-range dependencies, effectively extracting global features from the images. On the other hand, the ConvNeXt network traverses the feature map of the input images with convolutional kernels to extract local features. To fully leverage the advantages of these networks, the feature extraction module of the proposed model utilizes an AFF submodule at the end, allowing it to dynamically generate corresponding weights based on the importance of the input features. The weighted features are then fused to make full use of both local and global features for skin disease classification.
The specific fusion process of the final AFF submodule follows these steps. Firstly, the features extracted by the Swin Transformer network are linearly projected to the shape of B × C × H × W. They are then concatenated with the features extracted by the ConvNeXt network. A 1 × 1 convolution is applied to transform the channel-shuffled [41] feature map to the desired number of channels. Finally, the fused feature map is input into MS-CAM to compute the weights. The features are then separately weighted and summed to obtain the fused features.
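The following sketch illustrates these steps, reusing the MSCAM class from the AFF sketch above; the tensor shapes and the two-group channel shuffle are assumptions.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Channel shuffle [41]: interleave channels across groups to improve
    information interaction between channel weights."""
    b, c, h, w = x.shape
    return (x.view(b, groups, c // groups, h, w)
             .transpose(1, 2).reshape(b, c, h, w))

class FinalFusion(nn.Module):
    """Sketch of the final fusion step: reshape Swin tokens to B x C x H x W,
    concatenate with ConvNeXt features, channel-shuffle, project back to C
    channels with a 1 x 1 convolution, then apply the MS-CAM weighting."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.mscam = MSCAM(channels)   # from the AFF sketch above

    def forward(self, conv_feat, swin_tokens, h, w):
        b, n, c = swin_tokens.shape                       # (B, H*W, C)
        swin_feat = swin_tokens.transpose(1, 2).reshape(b, c, h, w)
        fused = torch.cat([conv_feat, swin_feat], dim=1)  # concatenate channels
        fused = channel_shuffle(fused, groups=2)          # mix the two sources
        fused = self.proj(fused)                          # back to C channels
        alpha = self.mscam(fused)                         # fusion weights
        return alpha * conv_feat + (1 - alpha) * swin_feat
```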
Through the feature fusion of the AFF submodule, the proposed model can effectively utilize both local and global features of the images, thereby improving its classification performance.

C. DATA AUGMENTATION
Neural networks require extensive training on annotated data to achieve high performance. However, acquiring skin disease image data is costly, resulting in a relatively small number of images in skin disease datasets, along with a severe class imbalance issue [42]. This imbalance leads to lower classification accuracy for classes with fewer images [43]. Therefore, before conducting the model training, we employed data augmentation [44] to increase the diversity of image information in the utilized datasets. This helps mitigate the impact of class imbalance on the model training results while enhancing the model's robustness and generalization capabilities. The specific steps taken were the following: (1) The images in the training set were randomly cropped. Then, the cropped images were resized to 224 × 224 pixels. This step helped in focusing on relevant regions of the images while maintaining a consistent input size; (2) The RandAugment method [45] was applied to enhance the augmented images further. More specifically, in order to increase the diversity of the training set, each image underwent two randomly selected image augmentations out of fifteen available options, such as histogram equalization, contrast adjustment, color inversion, random rotation, increasing exposure, random horizontal (vertical) translation, etc. Utilization of these data augmentation techniques allowed us to increase the diversity of the images in the training set, thus effectively mitigating overfitting and enhancing the model's classification performance. Additionally, these techniques contributed to achieving a more stable training process, [46]. Experimentally, after applying these data augmentation techniques, we found that the values of all evaluation metrics for ConvNeXt trained with augmentation were significantly higher than those for ConvNeXt trained without it. Therefore, in the main experiments described in the next section, we chose to use the experimental results of ConvNeXt after applying data augmentation as the baseline for comparison.
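A torchvision sketch of this augmentation pipeline is shown below; num_ops=2 matches the two randomly selected operations per image described above, while RandAugment's magnitude is left at its default, which is an assumption.

```python
from torchvision import transforms

# Training-time augmentation: random crop resized to 224 x 224, then
# RandAugment applying two randomly chosen operations per image.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandAugment(num_ops=2),
    transforms.ToTensor(),
])
```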

IV. EXPERIMENTAL SETUP AND RESULTS

A. DATASETS
In the experiments, two datasets were used: a private dataset containing acne images and the publicly available ISIC2019 dataset.
The acne dataset was provided by Peking Union Medical College Hospital, with consent for its use given by all participants. The dataset contains 2,900 skin lesion images classified into six classes (Figure 11): acne (AC), melasma (ME), rosacea (ROS), discoid lupus erythematosus (DLE), Ota nevus (ON), and seborrheic dermatitis (SD). All images and their corresponding labels have undergone rigorous review by dermatologists. For conducting the experiments, this dataset was randomly split into a training set and a testing set in an 8:2 ratio, as shown in Table 1.
The second dataset used was the ISIC2019 dataset [47]. It is a publicly available large-scale skin lesion image dataset, consisting of 25,331 skin lesion images of eight classes of skin diseases (Figure 12): actinic keratosis (AK), basal cell carcinoma (BCC), benign keratosis (BKL), dermatofibroma (DF), melanoma (MEL), melanocytic nevus (NV), squamous cell carcinoma (SCC), and vascular lesion (VASC). For conducting the experiments, this dataset was also randomly split into training and testing sets with an 8:2 ratio, as shown in Table 2.

B. EXPERIMENTAL SETUP
In the model training, a hyperparameter initialization was performed as follows: (i) in the experiments on the ISIC2019 dataset, the initial learning rate was set to 0.01 and the number of epochs was set to 100; (ii) in the experiments on the acne dataset, the initial learning rate was set to 0.0005 and the number of epochs was set to 60. In both sets of experiments, a cosine annealing strategy was employed to reduce the learning rate during model training, with a minimum learning rate of 1e-6. The batch size was set to 16. All models were optimized using the stochastic gradient descent (SGD) optimizer [39], with a momentum of 0.9 and a weight decay of 0.0001. The loss function used was the cross-entropy loss function.
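These settings translate directly into PyTorch, as sketched below with the ISIC2019 values; the placeholder model stands in for the instantiated network.

```python
import torch

model = torch.nn.Linear(10, 8)  # placeholder; the actual network goes here

# SGD with momentum 0.9 and weight decay 1e-4; initial LR 0.01 annealed
# to 1e-6 by a cosine schedule over 100 epochs; cross-entropy loss.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6)
criterion = torch.nn.CrossEntropyLoss()
# scheduler.step() is called once per epoch after the training loop body
```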
The experiments were conducted on a host machine running Linux 3.10.0-1062.el7.bclinux.x86_64, equipped with an Intel(R) Xeon(R) Gold 6240 CPU @ 2.60 GHz and an NVIDIA Tesla V100S PCIe 32GB GPU. The CUDA version used was 11.0 and the PyTorch version used was 1.12.1.

C. EVALUATION METRICS
To evaluate the classification performance of the proposed model in comparison to state-of-the-art models used for skin lesion image classification, multiple metrics were used to achieve a comprehensive assessment.
A useful tool for visualizing the model classification performance is the confusion matrix.It presents a model's classification results for each class in a tabular form.An example of a simple confusion matrix is shown in Figure 13.
In Figure 13, True Positive (TP) represents the samples whose true class is positive and the model correctly identifies them as positive, False Negative (FN) indicates the samples whose true class is positive but the model incorrectly identifies them as negative, False Positive (FP) refers to the samples whose true class is negative but the model incorrectly identifies them as positive, and True Negative (TN) represents the samples whose true class is negative and the model correctly identifies them as negative.
The performance of multi-class classifiers can be evaluated using various metrics [48], briefly described below.
Accuracy (Acc), a.k.a. detection rate, is the most straightforward evaluation metric for classification tasks. It represents the proportion of correctly classified samples out of the total number of samples:

$$\text{Acc} = \frac{TP + TN}{TP + TN + FP + FN}.$$

Precision (Pre), a.k.a. positive predictive value (PPV), refers to the proportion of true positive samples among the samples classified as positive:

$$\text{Pre} = \frac{TP}{TP + FP}.$$

Recall (Rec), a.k.a. sensitivity or true positive rate (TPR), is the proportion of true positive samples out of all actually positive samples:

$$\text{Rec} = \frac{TP}{TP + FN}.$$

Specificity (Spec), a.k.a. selectivity or true negative rate (TNR), is the proportion of true negative samples out of all actually negative samples:

$$\text{Spec} = \frac{TN}{TN + FP}.$$

F1-score (F1) is the harmonic mean of precision and recall:

$$\text{F1} = \frac{2 \cdot \text{Pre} \cdot \text{Rec}}{\text{Pre} + \text{Rec}}.$$

The higher the F1-score, the better the balance between precision and recall.
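For reference, a small NumPy sketch computing these per-class metrics from a multi-class confusion matrix is given below.

```python
import numpy as np

def per_class_metrics(cm: np.ndarray):
    """Compute Acc, Pre, Rec, Spec, and F1 from a confusion matrix cm,
    where cm[i, j] counts true-class-i samples predicted as class j."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted as class i, true class differs
    fn = cm.sum(axis=1) - tp          # true class i, predicted otherwise
    tn = cm.sum() - tp - fp - fn
    acc = tp.sum() / cm.sum()         # overall accuracy (scalar)
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    spec = tn / (tn + fp)
    f1 = 2 * pre * rec / (pre + rec)
    return acc, pre, rec, spec, f1
```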

D. SELECTION OF NEURAL NETWORKS FOR USE IN THE PROPOSED MODEL
Tables 3 and 4 present the training results, obtained on the ISIC2019 and acne datasets, respectively, of six commonly used networks for image classification, including four CNNs and two Transformer networks (the median value of three different training runs performed on each dataset is shown in the tables). By comparing the training results, it can be observed that the ConvNeXt model performs the best among the four CNNs, according to all evaluation metrics, except for recall on the ISIC2019 dataset. The Swin Transformer demonstrates better classification performance among the two Transformer networks on both datasets, according to all evaluation metrics. Based on these results, it was decided to use ConvNeXt and Swin Transformer as the feature extraction sub-networks in the proposed fusion model.

E. ABLATION STUDY
Tables 5 and 6 present the results of each step of the ablation study performed with the proposed model on the ISIC2019 and acne datasets, respectively. In the first step, by adding the denoising module to ConvNeXt, the values of all metrics were improved, compared to those of ConvNeXt, except for precision on the acne dataset. In the second step, by incorporating the ECA attention module alone into ConvNeXt, the values of all metrics were improved, compared to those of ConvNeXt, except for precision on the ISIC2019 dataset and for recall, specificity, and F1-score on the acne dataset. In the third step, by directly fusing the features extracted by ConvNeXt and Swin Transformer, the values of all metrics were improved, compared to those separately achieved by ConvNeXt and Swin Transformer, except for precision w.r.t. ConvNeXt on the ISIC2019 dataset. In the fourth step, by adding AFF submodules to the combined scheme of ConvNeXt and Swin Transformer, further improvement was achieved on all metrics, compared to the previous step, except for recall, specificity, and F1-score on the acne dataset. Finally, in the last step, when using all modules, resulting in the proposed ConvNeXt-ST-AFF model, top values of all metrics were achieved, except for precision on the ISIC2019 dataset.
Table 7 displays the classification accuracy of the proposed ConvNeXt-ST-AFF model under different hyperparameter settings. Due to hardware limitations, we conducted the testing in two steps. In the first step, with a fixed batch size of 16, we evaluated the impact of different initial learning rates on the model's classification performance. We experimented with three commonly used initial learning rates: 0.01, 0.001, and 0.0005. The experimental results, shown in Table 7, demonstrate that the proposed model performs best on the ISIC2019 dataset (resp. acne dataset) when using an initial learning rate of 0.01 (resp. 0.0005). In the second step, we set the initial learning rate to the corresponding optimal value, determined for the particular dataset in the previous step, and tested the influence of different batch sizes on the model's classification performance. The obtained results indicate that when the batch size is set to 16, the model achieves its highest classification performance on both datasets.

Figure 14 illustrates the classification performance of the proposed ConvNeXt-ST-AFF model on the ISIC2019 dataset in the form of a confusion matrix, where the number of correctly classified images is represented by the diagonal values. Based on this confusion matrix, further calculations were performed to obtain the values of the evaluation metrics for each class, as shown in Table 8. As can be seen from this table, the proposed model performs especially well on the BCC, NV, and VASC classes, according to all metrics. However, for the AK and DF classes, the values of precision, recall, and F1-score are not ideal, because these classes contain fewer images, and images with higher similarity to other classes, than the remaining classes.

Figure 15 presents the classification performance of the proposed ConvNeXt-ST-AFF model on the acne dataset in the form of a confusion matrix, where the number of correctly classified images is represented by the diagonal values. Based on this confusion matrix, further calculations were performed to obtain the values of the evaluation metrics for each class, as shown in Table 9. As can be observed from this table, the proposed model performs especially well on the AC, DLE, ROS, and ON classes. However, its performance for the SD class is not ideal, because this class contains fewer images, and images with higher similarity to other classes, than the remaining classes.

H. DISCUSSION
Table 10 shows a performance comparison of the proposed model with state-of-the-art models, based on their respective results reported in the corresponding literature sources ("-" in the table indicates missing data in a source). The proposed model clearly outperforms all other models, according to all evaluation metrics used, except for accuracy, where the leader is the model described in [17]. Furthermore, compared to the lightweight models in [50] and [51], although the proposed model has a larger number of parameters, it has shown significant improvements in classification performance.

V. CONCLUSION
In the presented study, we have effectively mitigated the impact of class imbalance on classification performance by employing data augmentation.We also introduced an image denoising module for eliminating the artifacts, such as body hair in dermoscopic images, and for enhancing the contrast.Furthermore, we proposed a model fusion approach with adaptive weights whose effectiveness has been validated.
The ConvNeXt-ST-AFF model, proposed in this paper, was tested on the publicly available ISIC2019 dataset and a private acne dataset, showing excellent results on both. Moreover, the comparison of its results on the ISIC2019 dataset to those reported in the literature for state-of-the-art classification models has clearly demonstrated that the proposed model outperforms them, according to all evaluation metrics used, except for one of the models, and only in terms of accuracy.
This study demonstrates that the combination of ConvNeXt and Swin Transformer can effectively leverage the feature extraction advantages of different networks, thereby enhancing the model's feature extraction capability. This approach, which takes into account both local and global features of the input information, can be applied to improve the feature extraction capabilities of backbone networks in tasks such as object detection and image segmentation. Furthermore, it can also be effectively employed in other image processing domains, such as image enhancement, as well as in the field of bioinformatics, including applications in polyadenylation signal prediction, genome sequence analysis, and the analysis of complex human diseases.
The main limitation of this study lies in the increased computational burden resulting from the model fusion approach.In the future, we plan to optimize the network architecture of the proposed model by incorporating the latest lightweight model design techniques, such as MobileViT [56] and FasterNet [57], in order to reduce computational demands.Additionally, we aim to further refine the denoising module to improve its effectiveness in eliminating image artifacts and reducing the influence of irrelevant features on model training.

FIGURE 1. The overall structure of the proposed ConvNeXt-ST-AFF model.


FIGURE 2. The pseudo-algorithm of the proposed ConvNeXt-ST-AFF model.

FIGURE 4. Illustration of model training using transfer learning.

FIGURE 7. The structure of two consecutive Swin Transformer blocks.

FIGURE 9. (a) The original ConvNeXt block structure; (b) the ConvNeXt block structure used in the proposed model; (c) the down-sample block structure.

FIGURE 10. The ECA structure.

FIGURE 13. A simple confusion matrix.


FIGURE 14. The confusion matrix of the training results of the proposed ConvNeXt-ST-AFF model on the public ISIC2019 dataset.

FIGURE 15. The confusion matrix of the training results of the proposed ConvNeXt-ST-AFF model on the private acne dataset.


TABLE 1. Splitting the private acne dataset into training and testing sets.

TABLE 2. Splitting the public ISIC2019 dataset into training and testing sets.

TABLE 3. Training results of state-of-the-art networks on the public ISIC2019 dataset.

TABLE 4. Training results of state-of-the-art networks on the private acne dataset.

TABLE 5. Ablation study results on the public ISIC2019 dataset.

TABLE 6. Ablation study results on the private acne dataset.

TABLE 8. Training results of the proposed ConvNeXt-ST-AFF model on the public ISIC2019 dataset.

TABLE 9. Training results of the proposed ConvNeXt-ST-AFF model on the private acne dataset.

TABLE 10. Performance comparison of the proposed ConvNeXt-ST-AFF model with state-of-the-art models on the ISIC2019 dataset.