Recent Advances in Diagnosis of Skin Lesions Using Dermoscopic Images Based on Deep Learning

Skin cancer is one of the most threatening cancers, spreading to other parts of the body if not caught and treated early. During the last few years, the integration of deep learning into skin cancer diagnosis has been a milestone in health care, and dermoscopic images are right at the center of this revolution. This review focuses on the state of the art in automatic diagnosis of skin cancer from dermoscopic images based on deep learning. This work thoroughly explores existing deep learning methods and their application to diagnosing dermoscopic images. This study aims to present and summarize the latest methodology in melanoma classification and the techniques used to improve it. We discuss advancements in deep learning-based solutions for diagnosing skin cancer, along with some challenges and future opportunities to strengthen these automatic systems to support dermatologists and enhance their ability to diagnose skin cancer.

classification and proposes some directions regarding the current research status and future research.

B. CHALLENGES
Skin lesion classification assumes a fixed set of classification labels: for each input image, a label is selected from the label set and assigned to the image. Although the classification task seems simple, it is one of the core problems in the field of computer vision. Many seemingly different problems in computer vision (such as object detection and segmentation) can be reduced to image classification problems. The difficulties and challenges of skin disease classification and detection are summarized at three levels in this article: the instance level, the category level, and the semantic level, as outlined below.

1) INSTANCE LEVEL
For a single instance of skin cancer, size changes caused by differences in the image acquisition process, the lighting conditions, the shooting angle, and the distance, as well as non-rigid deformation of the object itself and partial occlusion by other objects, usually cause the apparent characteristics of the object instance to vary considerably.

2) CATEGORY LEVEL
Difficulties and challenges usually come from two directions. Firstly, there is a large intra-class difference when the apparent characteristics of objects belonging to the same class are quite different; the reasons are the instance-level changes mentioned above. Secondly, there is interference from the background: in an actual scene, the object might not appear against a spotless background; in fact, the background is often very complicated and interferes with the object of interest. This greatly increases the difficulty of identifying the skin lesion.
3) SEMANTIC LEVEL
Difficulties and challenges at this level are related to the visual semantics of images and are often very tough to deal with, especially at the current level of computer vision theory. A typical problem is what is called ''multistability'': the same image can have different interpretations, which are related not only to physical conditions such as the viewer's angle and focus, but also to the viewer's personality and experience, and this is precisely the part that a visual recognition system finds difficult to handle.

…and extensible. Keras also has the advantage of being simple, flexible, and powerful. Because of these features, Keras is viewed by newcomers as the go-to DL framework. PyTorch, developed by Facebook, offers an easy-to-use interface, and its popularity has gained momentum, particularly in academia; it is the main competitor of TF. MatConvNet is a CNN toolkit for Matlab, supporting both CPU and GPU; in fact, it supports not only CNNs but also other networks such as RNNs and LSTMs. Caffe is an early DL framework built with expression, speed, and modularity in mind, and is ideal for feedforward neural networks and image-processing tasks. Theano is a Python-based library whose development started in 2007 and which is good at dealing with multidimensional arrays. With the strong rise of TensorFlow, Keras, and PyTorch, the use of MatConvNet, Caffe, and Theano is declining day by day, and fewer and fewer researchers use them.
2) CONVOLUTIONAL NEURAL NETWORK BACKBONES FOR IMAGE CLASSIFICATION
A convolutional neural network (CNN), also known as a ''ConvNet'', is a specific type of feed-forward neural network with a stack of convolutional layers, each followed by pooling layers, in order to extract features from the input data and produce a set of high-level feature maps at each level of convolution. The feature-map information is summarized using pooling layers in order to reduce the number of parameters, and a fully connected layer produces the final classification [16]. The CNN structure evolution summarized in this article started with the neocognitron model, in which the convolutional structure first appeared. The LeNet [17] CNN structure became available in 1998. However, CNNs were then overshadowed by hand-designed features combined with classifiers such as the support vector machine (SVM). With the introduction of the rectified linear unit (ReLU) and Dropout, as well as the historic opportunities brought by graphics processing units (GPUs) and big data, CNNs ushered in a landmark breakthrough in 2012 with AlexNet [16]. Figure 6 presents the evolution of the CNN structure.
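The convolution-then-pooling pipeline described above can be illustrated with a minimal, framework-free sketch; the 4×4 toy image and the vertical-edge kernel below are invented for the illustration and do not come from any cited work:

```python
# Toy illustration of one convolution + max-pooling stage of a CNN.
# A 3x3 kernel slides over a 4x4 single-channel image (stride 1, no
# padding), producing a 2x2 feature map that max pooling then summarizes.

def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a nested-list image with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling over size x size windows."""
    return [[max(fmap[i + u][j + v]
                 for u in range(size) for v in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

image = [[1, 0, 0, 1],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [1, 0, 0, 1]]
edge_kernel = [[1, 0, -1],
               [1, 0, -1],
               [1, 0, -1]]  # crude vertical-edge detector

fmap = conv2d(image, edge_kernel)   # 2x2 feature map
pooled = max_pool(fmap, size=2)     # pooling reduces it to a 1x1 summary
```

Real backbones stack many such stages (with learned kernels and non-linearities between them) before the fully connected classifier.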

Today, researchers rarely build models from start to finish. Common features of classic models have been encapsulated in DL frameworks (such as TF or PyTorch), and researchers only make some modifications on this basis. All the literature collected in this study is based on the CNN model. Compared with traditional machine learning, the CNN model has excellent feature representation (automatically learned from raw data). Currently, the primary method of skin disease image recognition is to use a CNN in DL, together with pooling, for image recognition. The research work collected in this study adopted famous CNN architectures such as AlexNet [16], VGG (short for ''Visual Geometry Group'') [18], Inception [19], ResNet (short for ''residual neural network'') [20], DenseNet [21], EfficientNet [22], and so on. Figure 7 plots the state-of-the-art models' performance on the ImageNet dataset [23] from 2011 to 2021. Some researchers [24], [25], [26], [27], [28], [29], [30] have …

VOLUME 10, 2022

FIGURE 6. The historical evolution of CNN structure has changed from an early attempt to a historic breakthrough, and then to the current prosperity.

…images need to be labeled by experts with appropriate medical knowledge due to the similarity of lesion manifestations between various skin diseases. Currently, the acquisition of skin disease datasets is mainly divided into self-collected and public datasets. Self-collected datasets are usually not publicly available. Most published dermatological datasets consist of image data obtained by dermoscopic imaging and collected from dermatological image databases. Universities, in collaboration with renowned hospitals, also collect some datasets.

Regarding public datasets for studying melanoma, the most extensive collection of datasets can be found in the International Skin Imaging Collaboration (ISIC) repository. Details of these datasets, including the total number of images, the total number of disease classes, whether the dataset is publicly available (and free to use), and the papers using each dataset, are presented in Table 2.

Sensitivity (SE) measures the proportion of correctly identified positive samples, which is also known as recall or the ''true positive rate (TPR)''. Specificity (SP) is also called the ''true negative rate (TNR)''; the higher its value, the higher the probability that a negative diagnosis is correct. SP describes the ability of the classifier to detect negative samples.
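As a concrete illustration of these definitions, SE and SP can be computed directly from confusion-matrix counts; the counts below are made up for the example:

```python
# Sensitivity (recall/TPR) and specificity (TNR) from confusion counts.

def sensitivity(tp, fn):
    return tp / (tp + fn)   # fraction of actual positives detected

def specificity(tn, fp):
    return tn / (tn + fp)   # fraction of actual negatives detected

# Hypothetical melanoma screen: 80 true positives, 20 false negatives,
# 90 true negatives, 10 false positives.
se = sensitivity(tp=80, fn=20)   # 0.8
sp = specificity(tn=90, fp=10)   # 0.9
```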

The F-score is a trade-off between PREC and recall, also known as the ''F-measure''. The formula is expressed as:

Fβ = (1 + β²) × (PREC × recall) / (β² × PREC + recall)

where β is used to reconcile the importance of PREC and recall. When β = 1, they are equally important, and this is the commonly used F1-score.

The macro-average is computed as follows: for all the categories, compute the precision and recall, and then take their average value as the macro-average. A usage scenario might be the following: the amount of data per class is not taken into account, so each category is treated equally (because the precision and recall of each category are between 0 and 1), and the result is relatively strongly affected by classes with high PREC and high recall.
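The trade-off controlled by β can be checked numerically with a small helper (the precision/recall values are illustrative only):

```python
# F-beta score: beta > 1 weights recall more, beta < 1 weights precision more.

def f_beta(prec, rec, beta=1.0):
    b2 = beta * beta
    return (1 + b2) * prec * rec / (b2 * prec + rec)

f1 = f_beta(0.8, 0.6)            # beta = 1: harmonic mean of PREC and recall
f2 = f_beta(0.8, 0.6, beta=2.0)  # recall-weighted variant
```

With recall (0.6) lower than precision (0.8), the recall-weighted F2 comes out below F1, as expected.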

Generally speaking, a macro-average will compute the metric independently for each class and then take the average (hence treating all classes equally), whereas a micro-average will aggregate the contributions of all classes to compute the average metric. In a multi-class classification setup, micro-average is preferable if you suspect there might be class imbalance.
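The difference between the two averages is easiest to see on made-up counts for an imbalanced two-class problem (class A rare but cleanly predicted, class B common and noisy):

```python
# Macro vs. micro precision from per-class (tp, fp) counts.
# The counts are invented for the illustration.
counts = {"A": (9, 1), "B": (50, 50)}

def macro_precision(counts):
    precs = [tp / (tp + fp) for tp, fp in counts.values()]
    return sum(precs) / len(precs)        # every class weighted equally

def micro_precision(counts):
    tp = sum(t for t, _ in counts.values())
    fp = sum(f for _, f in counts.values())
    return tp / (tp + fp)                 # every prediction weighted equally

macro = macro_precision(counts)  # (0.9 + 0.5) / 2 = 0.7
micro = micro_precision(counts)  # 59 / 110, dominated by the big class B
```

Because micro-averaging is dominated by the frequent class, the gap between the two numbers is itself a quick imbalance diagnostic.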

Top-N accuracy is another metric, which indicates the capability of a classifier to predict the correct class within its first N attempts. This metric gives a deeper insight into the classifier's learning and discriminating ability.

Because of the similarity in color, texture, edge contour, and other features between different skin lesions, and the difference in pathological tissues between different patients, classifying skin cancer is a big challenge. Deep convolutional neural networks have been used for general and highly variable tasks across many studies [117], [139], [140], [143], [144], [145], [146], [147], [148], [149], [150].
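A minimal sketch of top-N accuracy; the three-class probability vectors and class names below are fabricated for the example:

```python
# Top-N accuracy: a prediction counts as correct if the true class
# appears among the N classes with the highest predicted scores.

def top_n_accuracy(scores, labels, n):
    hits = 0
    for probs, true_cls in zip(scores, labels):
        ranked = sorted(probs, key=probs.get, reverse=True)[:n]
        hits += true_cls in ranked
    return hits / len(labels)

# Hypothetical 3-class outputs (melanoma / nevus / keratosis).
scores = [{"mel": 0.6, "nev": 0.3, "ker": 0.1},
          {"mel": 0.2, "nev": 0.5, "ker": 0.3},
          {"mel": 0.1, "nev": 0.4, "ker": 0.5}]
labels = ["mel", "ker", "nev"]

top1 = top_n_accuracy(scores, labels, n=1)  # only the first sample is right
top2 = top_n_accuracy(scores, labels, n=2)  # every true class is in the top 2
```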

They can be used to classify skin lesions in two fundamentally different ways.

In the first, a CNN pretrained on another large dataset, such as ImageNet, can be applied as a feature extractor. In this case, classification is performed by another classifier, such as the k-nearest neighbors (kNN) algorithm, an SVM, or artificial neural networks (ANNs). In the second way, a CNN can directly learn the relationship between the raw pixel data and the class labels through end-to-end learning. In contrast to the classic workflow typically applied in machine learning, feature extraction becomes an integral part of classification and is no longer considered a separate, independent processing step. If the CNN is trained with end-to-end learning, the research can be divided into two different approaches: learning the model from scratch, and transfer learning.

…techniques also include methods such as contrast enhancement and intensity adjustment, space correction, binarization, morphological operations, gray-scaling, and noise reduction. At this stage, noise and other artifacts are removed from images. Fekri-Ershad et al. [157] applied a color-based image retrieval method to perform melanoma detection); model structure (which involves defining the data input and dimensions, as well as the network core modules, classifiers, loss function, and network output); training the model (which involves choosing a backbone, defining parameters, and constructing and performing training); and testing and applying the model. We can also roughly divide the process into four parts: input, network, training, and output. When we try to improve the effect of model training, we can optimize these four aspects. The traditional melanoma image classification method consists of multiple stages, and its framework is more complicated. The end-to-end CNN model structure can be put in place in one step, and the classification accuracy is greatly improved.
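The first of the two CNN usage modes described earlier (a frozen, pretrained CNN as feature extractor feeding a conventional classifier such as kNN) can be sketched as follows; the 2-D "embeddings" here are tiny made-up vectors standing in for the feature vectors a real backbone would produce:

```python
# kNN classification on top of (stand-in) CNN feature vectors.
import math
from collections import Counter

def knn_predict(train_feats, train_labels, query, k=3):
    """Label the query feature vector by majority vote of its k nearest
    training vectors (Euclidean distance)."""
    dists = sorted(
        (math.dist(f, query), lbl) for f, lbl in zip(train_feats, train_labels)
    )
    votes = Counter(lbl for _, lbl in dists[:k])
    return votes.most_common(1)[0][0]

# Made-up 2-D "embeddings": melanoma clusters near (1, 1), nevus near (0, 0).
feats = [(0.9, 1.1), (1.0, 0.8), (1.2, 1.0), (0.1, 0.0), (0.0, 0.2), (0.2, 0.1)]
labels = ["mel", "mel", "mel", "nev", "nev", "nev"]

pred = knn_predict(feats, labels, query=(0.95, 0.9), k=3)
```

In practice the features would come from the penultimate layer of a pretrained backbone, and the classifier on top (kNN, SVM, or a small ANN) is trained without touching the CNN weights.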

In the past few years, there has been an increasing tendency not only to develop and use different modern CNN backbones to solve complex real-world problems, but also to apply advanced techniques for achieving better training of these models. Examples include using generative adversarial network (GAN) models, the focal loss [28], [36], [52], [158], [159], transfer learning techniques, data augmentation methods, and the development of ensembles of CNNs.
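Of the techniques listed, the focal loss down-weights well-classified examples so that training concentrates on hard ones; a minimal scalar version for the binary case (parameter values are the commonly used defaults, shown for illustration) is:

```python
# Binary focal loss: (1 - pt)^gamma shrinks the loss of easy examples
# relative to plain cross-entropy; alpha balances the two classes.
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """p: predicted probability of the positive class; y: true label in {0, 1}."""
    pt = p if y == 1 else 1 - p          # probability of the true class
    a = alpha if y == 1 else 1 - alpha
    return -a * (1 - pt) ** gamma * math.log(pt)

easy = focal_loss(0.95, 1)   # confident and correct -> tiny loss
hard = focal_loss(0.10, 1)   # confident and wrong   -> large loss
```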

This study summarizes several basic guidelines regarding factors that influence model performance, as described by Ng [160]: (1) the expressive ability of the model (depth and width); (2) the learning rate; (3) the optimizer; (4) the learning-rate adjustment strategy. In DL, model overfitting often occurs, and methods to reduce its impact usually include data augmentation (which can increase the data size) and regularization.
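A minimal sketch of label-preserving augmentation on an image stored as nested lists (horizontal flip and a 90° rotation; real pipelines would add random crops, color jitter, and so on):

```python
# Two simple label-preserving augmentations: each transform yields an
# extra training sample from the same labeled image.

def hflip(img):
    """Mirror the image left-right."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

img = [[1, 2],
       [3, 4]]

augmented = [img, hflip(img), rot90(img)]  # 3 samples from 1 original
```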

Transfer learning improves learning in a new task by transferring knowledge from related tasks that have already been learned. For example, suppose there are three tasks, A, B, and C, that use the same network structure. For a deep neural network, the weights of the front CNN layers are very close across tasks: in the process of extracting object features, a CNN model may first extract vertical edges in its early layers, then horizontal edges, and then round areas. So the weights of these early CNN layers do not need to be retrained. To avoid repeating similar work, task C can continue training from the results of task A or B, which reduces the number of parameters to train and the training time.
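This freezing of early layers can be shown schematically; the layer names and parameter counts below are invented purely for the illustration:

```python
# Schematic transfer-learning setup: early layers keep the weights learned
# on task A/B and are frozen; only the later, task-specific layers and the
# new classifier head are retrained for task C.
layers = [
    {"name": "conv1", "params": 1_000,  "trainable": False},  # edges
    {"name": "conv2", "params": 5_000,  "trainable": False},  # simple shapes
    {"name": "conv3", "params": 20_000, "trainable": True},   # task-specific
    {"name": "fc",    "params": 2_000,  "trainable": True},   # new classifier
]

trainable = sum(l["params"] for l in layers if l["trainable"])
total = sum(l["params"] for l in layers)
# Only 22,000 of 28,000 parameters need training thanks to the reuse.
```

In a real framework the same effect is obtained by disabling gradient updates for the frozen layers (e.g. per-parameter `requires_grad = False` in PyTorch).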

Transferability is the criterion we need to consider when deciding which task's model to use. The larger the amount of data used to train the original model, the stronger the transferability; and the more similar the problem scenarios of the original model and the new problem, the stronger the transferability. The stronger the transferability, the fewer layers need to be retrained (i.e., the more layers can be frozen), and vice versa.

FIGURE 8. Flow chart of melanoma diagnosis based on a general convolutional neural network (CNN) model. Image processing is divided into image acquisition, image preprocessing, and dataset division. Image preprocessing includes image size adjustment, normalization, and noise removal. Melanoma image recognition mainly includes image feature extraction and classification models to classify the extracted features and output the results.

Generative adversarial networks (GANs) [196] provide a path for sophisticated domain-specific data augmentation and a solution to problems that require a generative model. They are based on a game-theoretic scenario in which the generator network must compete against an adversary; the generator network directly produces samples. During the past few years, GANs have developed rapidly [56], [62]. Zhao et al. [56] proposed a skin lesion image classification approach based on skin lesion augmentation with a style-based GAN and DenseNet201. This method generated high-quality skin lesion images and performed well on the ISIC 2019 dataset (its balanced multiclass accuracy reached 93.64%). Qin et al. [169] also applied style-based GAN data augmentation technology to improve skin lesion classification performance, while a cycle-consistent adversarial network (CycleGAN) for skin lesion image synthesis was adopted by Gu et al. [62]. Pollastri et al. [109] proved that a Laplacian Generative Adversarial Network (LAPGAN) can be employed to obtain an accuracy boost equivalent to 138% more real annotated images when the dataset is over 500 images. The basic idea of AutoAugment [197] is to use reinforcement learning to find the best image transformation strategy from the data itself, learning different augmentation methods for different tasks.

The latter two methods are often used for unsupervised data augmentation.

The classification of skin lesions has in recent years relied on the ensemble method to achieve highly accurate performance [29], [30], […]. Better results are also reported in a comparative study of DL architectures for melanoma detection using dermoscopic images [208]. Preprocessing methods such as illumination correction, contrast enhancement, and artefact removal are suggested to improve image quality and obtain a better generalization ability. Due to the imbalanced class distributions of skin lesions, various augmentation approaches are adopted in these methods. Various standard evaluation metrics, such as SP, SE, ACC, and the F-measure, are employed to evaluate the obtained results. Finally, experiments show that ResNet50 outperforms its counterparts, the AlexNet, Xception, VGGNet16, and VGGNet19 architectures, with a classification ACC as high as 92.08% and an F-score equal to 92.74%.
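A common ensemble recipe is soft voting: average the class-probability vectors produced by several CNNs and take the argmax. The probabilities below are fabricated to show a case where the ensemble overrides a single model's mistake:

```python
# Soft-voting ensemble: average per-class probabilities over models,
# then pick the class with the highest averaged probability.

def soft_vote(prob_sets):
    n = len(prob_sets)
    classes = prob_sets[0].keys()
    avg = {c: sum(p[c] for p in prob_sets) / n for c in classes}
    return max(avg, key=avg.get), avg

# Three models scoring one lesion; model 2 alone would say "nevus".
probs = [{"melanoma": 0.70, "nevus": 0.30},
         {"melanoma": 0.40, "nevus": 0.60},
         {"melanoma": 0.65, "nevus": 0.35}]

label, avg = soft_vote(probs)  # averaged melanoma probability ≈ 0.583
```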

A very interesting meta-analysis, including more than 200 studies emanating from the field of computer science, is reported by Dick et al. [208]. Combining all the results for automated systems gave a melanoma SE of 0.74 (95% CI 0.66-0.80) and an SP of 0.84 (95% CI 0.79-0.88). Although the SE was lower in studies that used independent test sets than in those that did not, the SP was similar. Moreover, in comparison with dermatologists' diagnoses, computer-aided diagnoses showed similar SEs and a 10-percentage-point lower SP, but the difference was not statistically significant. As the main conclusion of the meta-analysis, the ACC of computer-aided diagnosis for melanoma detection may be considered comparable to that of experts; nevertheless, the real-world applicability of these systems is as yet unknown and potentially limited owing to overfitting and the risk of bias in the available studies.

Responses to the main doubts arising from this type of analysis may be found in studies carried out mainly by physicians and focused on well-recognized DL CNN models. Among them, interesting results are reported by Brinker et al. [150], who compared AI algorithms to classifications made by 157 German dermatologists. Haenssle et al. [149] report results where, under less artificial conditions and across a broader spectrum of diagnoses, the CNN and most dermatologists performed on the same level; they [140] also compared the diagnostic performance of a CNN with that of a large international group of 58 dermatologists from 17 countries, including 30 experts with more than 5 years of dermoscopic experience. Their data clearly show that a CNN algorithm may be a suitable tool to aid physicians in melanoma detection, irrespective of their level of experience and training. An adequately trained DL CNN can provide a highly accurate diagnostic classification of dermoscopic images of melanocytic origin. Therefore, physicians of all levels of training and experience may benefit from assistance in the form of CNN image classification. In a study by Brinker et al. [117], a CNN trained exclusively with open-source images was capable of outperforming dermatologists of all levels of experience in dermoscopic melanoma image classification. The CNN had a lower variance of results, indicating a higher robustness of computer vision, compared to human assessment, for dermatologic image classification tasks [139]. Maron et al. [145] showed that the automated binary classification of dermoscopic melanoma and nevus images can be extended to a multiclass setting.

[…] The algorithm significantly enhances the performance of the melanoma classification, outperforming the benchmarks. Zhao et al. [56] applied inpainting algorithms to replace the pixel values and used a black top-hat filter on a grayscale image. Attia et al. [79] performed a survey on hair detection and also conducted experiments with hybrid CNNs. Since DL uses a set of cascaded, sequential layers that operate on the input data, each layer performs a non-linear processing operation to extract a hierarchical representation (achieved by the extraction of feature maps) of the input pixels based on their neighborhood. As the activation maps have higher values at the ''hair'' or ''ruler marking'' pixels, this achieves the purpose of detecting hair. After removal of the hair, the skin lesion becomes clearer; removing hair can help the classification model better identify the lesion location in the skin lesion image and improve the ACC of the classification results [56].
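The black top-hat operation mentioned above (morphological closing minus the original image) responds strongly to thin dark structures such as hairs on a brighter background. A framework-free sketch on a tiny grayscale grid (pixel values invented; a real pipeline would use e.g. OpenCV's morphology routines on full images):

```python
# Black top-hat = closing(image) - image. Closing (dilation then erosion)
# fills in thin dark structures, so the difference lights up exactly there.
# Naive min/max filters on nested lists, with borders clamped.

def _filter(img, size, agg):
    h, w, r = len(img), len(img[0]), size // 2
    return [[agg(img[u][v]
                 for u in range(max(0, i - r), min(h, i + r + 1))
                 for v in range(max(0, j - r), min(w, j + r + 1)))
             for j in range(w)]
            for i in range(h)]

def black_tophat(img, size=3):
    closed = _filter(_filter(img, size, max), size, min)  # dilate, then erode
    return [[c - o for c, o in zip(crow, orow)]
            for crow, orow in zip(closed, img)]

# Bright skin (value 9) crossed by a thin dark "hair" (value 1).
img = [[9, 9, 9, 9],
       [1, 1, 1, 1],
       [9, 9, 9, 9],
       [9, 9, 9, 9]]

hat = black_tophat(img)  # large response exactly on the hair row
```

The detected hair mask can then be handed to an inpainting step that replaces those pixels from their surroundings, as in the work cited above.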

Imbalanced classification is the problem of classification when there is an unequal distribution of classes in the training dataset. The degree of imbalance may vary, but a severe imbalance is more challenging to model and may require specialized techniques. Zhao et al. [56] propose a style-based GAN for skin lesion augmentation to address insufficient data samples, unbalanced data, and missing labels.

[…] publicly available skin lesion datasets. Moreover, most of the classification labels for dermoscopic skin lesion images are determined by pathological examination. Hekler et al. [218] illustrate the potential of DL to assist human assessment in histopathologic melanoma diagnosis.
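Besides GAN-based augmentation, a common lightweight remedy for class imbalance is to weight each class's loss inversely to its frequency; a sketch with invented class counts:

```python
# Inverse-frequency class weights: rare classes get larger loss weights,
# so the model is penalized more for ignoring them. Counts are invented.

def class_weights(counts):
    total = sum(counts.values())
    n_classes = len(counts)
    return {c: total / (n_classes * n) for c, n in counts.items()}

counts = {"nevus": 800, "melanoma": 150, "keratosis": 50}
weights = class_weights(counts)
# nevus ≈ 0.42, melanoma ≈ 2.22, keratosis ≈ 6.67
```

The resulting dictionary can be passed to the per-class weighting hook most frameworks expose for their cross-entropy losses.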

Smartphone applications (apps) provide users with an instant assessment of skin cancer risk and offer the potential for earlier detection and treatment, which could improve patient survival. Against the background of the high burden of skin cancer in the world and limited access to dermatological care, particularly in remote areas, AI diagnostic tools offer the possibility to improve triage and reduce the time to excision for correctly diagnosed melanomas. If the mobile device is used properly, this could also reduce morbidity resulting from unnecessary biopsies. In a review paper [219], Freeman et al. survey currently available apps, such as skinScan, SkinVision, and TeleSkin. To date, no skin cancer risk stratification smartphone app has received U.S. Food and Drug Administration (FDA) approval [219]. A combined reference standard comprising histology and clinical follow-up of benign lesions would provide more reliable and generalizable results. Smartphone algorithm-based apps for skin cancer all include disclaimers that the results should only be used as a guide and cannot replace health care advice [219].

In recent years, some research has emerged that uses the wavelength or polarization of light or combines sound information with skin lesion image information. In the field of biomedical imaging and diagnostics, polarization speckle is growing fast. Wang et al. [162] used DL to extract skin lesion information from polarization speckle and improved the performance in classifying benign and malignant skin lesions by 20%. Pölönen et al. [220] showed that the use of the spectral and spatial domains will increase the classification performance of CNNs. Dascalu et al. [221] acquired dermoscopy images with a skin magnifier with polarized light, applied a DL algorithm, and sonified the output in a first phase; in a second phase, they performed further analysis with a different DL model. Whether it is spectral information or sound information, these modalities have opened up a new way of thinking for skin lesion diagnosis. However, the existing public datasets hardly provide skin lesion data with light or sound information, so this is a challenge for most researchers.

Deep learning shows great potential in the image-based diagnosis of skin cancer. However, there is still a significant discrepancy between expectations and the true relevance of DL in current dermatological practice based on dermoscopy. In numerous studies we have cited, e.g. [27], […]

[…] for safety and clinical performance concerning the clinical investigation) and evaluated by the Certification Body during the CE mark certification process. The normative framework seems to limit the actual possibility for Small and Medium Enterprises to introduce smartphone applications, whereas the corresponding market may be more easily approached by large companies already qualified as Medical Device Manufacturers for other SW systems and/or equipment.

On the basis of the legislative framework, and in the authors' opinion, future research efforts should be focused on the adoption of DL-based software systems only by dermatologists, thus also matching the following deontological features involved in the diagnosis of skin cancers:
i. promotion of periodic visits to the specialist, whose attention may be captured by skin lesions that do not appear suspicious to the inexpert patient and would never be examined through a smartphone application;
ii. improvement of the psychological attitude toward the pathology of the patient affected by melanoma, who may be guided onto the correct diagnostic and therapeutic path rather than be abruptly informed by an app of the high oncological risk of the self-examined lesion.

According to the presented perspective, the main research topic should be the development of DL-based systems able to improve the diagnostic expertise of the dermatologist (not only to provide support and a second opinion for the examination of a single suspicious nevus). For the user, the software system should not appear as a black box; rather, the classification results should be easily related to well-known diagnostic methods (such as the ABCDE rules, the 7-Point Check List, and Menzies' score). As an example, the approach of semantic segmentation [223] based on DL (already successfully experimented with in other applications, such as the real-time segmentation of road traffic video for autonomous driving) could be investigated to provide an automatic system able to recognize the atypical features within dermoscopic images of suspicious lesions. Moreover, the metrics adopted to analyze the performance of the proposed software systems should themselves be revised to better show their efficacy in the clinical setting and the new intended aims. In detail, the differentiation between suspicious lesions to be excised and other types of classified nevi should be emphasized when the ROC curve is analyzed for the optimal tuning of DL software systems. Finally, the economic impact on clinical organizations, in terms of savings in the number of excisions as well as the costs associated with erroneous diagnoses, should be taken into account during the performance evaluation of the developed systems.