Discriminative Feature Learning for Skin Disease Classification Using Deep Convolutional Neural Network

Skin disease is now common among humans; in America in particular, millions of people suffer from various kinds of skin disease. These diseases often carry hidden dangers: they lead not only to a lack of self-confidence and psychological depression but also to a risk of skin cancer. Diagnosing them usually requires medical experts with high-level instruments because of the lack of visual resolution in skin disease images. Moreover, manual diagnosis of skin disease is often subjective, time-consuming, and labor-intensive. Thus, there is a need for a computer-aided system that diagnoses skin disease automatically. Most existing works on skin disease use convolutional neural networks (CNNs) with classical loss functions, which limits the model's ability to learn discriminative features from skin images. To address this problem, we propose a new framework that fine-tunes the layers of the ResNet152 and InceptionResNet-V2 models with a triplet loss function. In the proposed framework, we first learn an embedding of the input images in Euclidean space using the deep CNN models ResNet152 and InceptionResNet-V2. Second, we compute the L-2 distance between the corresponding images in Euclidean space and use the triplet loss function to learn discriminative features of skin disease images. Finally, we classify the input images using these L-2 distances. The human face skin disease images used in the proposed framework were acquired from a hospital in Wuhan, China. Experimental results and their analysis show the effectiveness of the proposed framework, which achieves better accuracy than many existing works on skin disease tasks.


I. INTRODUCTION
At present, skin diseases are among the most common diseases seen in people. Because the physical structure of the skin is affected by direct exposure to ultraviolet radiation (i.e., the long-term use of different types of high-frequency wireless equipment), skin cancer can develop. In the United States, according to a 2012 statistical report, more than 3.3 million non-melanoma cases, including basal cell carcinoma and squamous cell carcinoma, were recorded among 5.4 million individuals [1], whereas the number of new melanoma cases diagnosed in 2018 was estimated at over 91,270. The number of new cancer cases, whether of the skin, breast, lung, prostate gland, or colon, increases every year [2]. Additionally, research analysis shows that 20% of Americans will develop skin cancer during their life [3]. The diagnosis of skin diseases is very challenging because the symptoms of a skin disease develop through an extended, continuously changing process and occur in certain areas of the skin. To diagnose skin disease, many clues can be used, such as specific lesion anatomy, physical assortment, scaling, shade, and arrangement. Analyzing the specific components individually can make the recognition process quite complicated [4]. Four significant medical diagnosis methods are primarily used to analyze melanoma skin cancer correctly: the ABCD rules, the Menzies method, the 7-Point Checklist, and pattern analysis. All these methods achieve effective results for skin cancer diagnosis. However, distinguishing skin lesions [5] requires an extreme level of skill, and diagnosis by a human specialist relies on individual opinion that is barely reproducible [6], [7].
The associate editor coordinating the review of this manuscript and approving it for publication was Marcin Woźniak.
Computer-assisted diagnostic methods are more effective and perform well because they combine standard feature extraction methods with well-known classifiers such as SVM and ANN. The diagnostic systems [8]-[10] achieve effective results on melanoma (a specific skin cancer) but are ineffective on a wide range of skin disease categories.
Hand-crafted feature learning algorithms are usually dedicated to a single or a small number of subcategories and are not effective for classifying a larger number of skin disease categories. Moreover, a hand-crafted feature extraction algorithm is impractical because of the variation in the nature of skin diseases [11]. Addressing this issue requires feature learning [12] rather than feature engineering, so that the essential features are selected by the machine. In past years, many feature-learning-based classification approaches [13]-[17] have been proposed. However, they are limited to dermoscopy or even histopathology images and place particular emphasis on detecting mitosis, which is an indication of cancer [18], [19].
To deal with this problem, we propose a discriminative feature learning approach based on transfer learning for skin disease classification. The contributions of the paper are as follows:
• We propose a new deep CNN-based model for skin disease classification using a triplet loss function.
• To learn discriminative features from skin disease images, we fine-tune CNN-based models (i.e., ResNet152 and InceptionResNet-V2) with a triplet loss function. To the best of our knowledge, no one has used a triplet loss function on skin disease images.
• We perform layer-wise fine-tuning of the pre-trained deep CNN models, instead of block-wise fine-tuning [20], to improve the performance of the end-to-end learning method.

II. RELATED WORK
Recently, deep CNNs have become the preferred approach to feature learning and image classification. A network can be trained on a large-scale dataset using high-performance GPUs to achieve effective results. Many studies [21]-[23] on ImageNet [24] show that methods using deep CNNs obtain better performance than humans in object classification. Lately, Esteva et al. [25] designed a universal skin disease classification method by fine-tuning the VGG16 and VGG19 architectures. Their network obtained 60.0% Top-1 and 80.3% Top-3 classification accuracy, which significantly surpassed human performance in their experiments. They also encouraged the use of a similar approach to obtain better results.
Tajbakhsh et al. [26] explained that, when labeled input data are limited, using a pre-trained network is better than training a deep CNN from scratch; they solved the skin disease problem more effectively with a pre-trained network using images from another medical domain than by training a deep CNN from the beginning. Giotis et al. [27] developed a decision support system using a deep CNN that utilizes a different set of features, such as color, visual attributes, and lesion texture. Haenssle [28] composed a deep CNN method to classify dermoscopy images into binary diagnostic categories. Dorj et al. [29] designed an ECOC SVM based on a deep CNN to classify skin cancer images into four diagnostic categories. Han et al. [30] proposed an image classification method using a deep CNN that classifies clinical images of 12 skin disease categories. Mohd et al. [31] proposed an image classification method to classify melanoma; a dataset of four types of skin diseases was used in all experiments. Almansour and Jaffar [32] composed a technique using k-means clustering and a support vector machine (SVM) to classify melanoma; they also presented comparative results. German et al. [33] described a skin cancer diagnostic method that uses AdaBoost MC separately, with a dataset covering different categories of skin lesions used for cancer detection. Ioannis et al. [27] designed a support system that combines an image processing algorithm with a deep CNN to utilize different sets of features for melanoma, such as visual diagnostic attributes, color, lesion texture, affected area, and degree of damage.

III. METHOD
To preserve the stability of the method, we used two well-trained CNNs; the implementation details are discussed in Section VI. We fine-tune the networks to obtain a 128-D embedding $f(I)$ in the Euclidean space $\mathbb{R}^d$ and then calculate the L-2 distance between all input images using the corresponding 128-D embeddings. The L-2 distance is independent of factors of the input images such as physical assortment, scaling, shade, and arrangement. Although we do not directly compare images of different classes, minimizing the loss over these images draws images of the same class toward a single point in the Euclidean space. The triplet loss enforces a margin between every image of one class and the images of all other classes. A schematic view of the proposed method is shown in Figure 1.

IV. TRIPLET LOSS
Currently, deep metric learning approaches using the triplet loss are receiving more attention because of their ability to maintain performance with an extreme number of labels. Prabhu et al. [34] presented an extreme multi-label classification method. In general, the number of parameters in conventional multi-label classification methods increases linearly with the number of labels. Song et al. [35] and Weinberger et al. [36] learn an N-way softmax classifier with an extreme number of labels. Hence, a convolutional neural network trained with a triplet loss function learns a compact embedding that handles such classification problems effectively. The triplet loss learns good embeddings; furthermore, it ensures that a particular face image $I_a$ (anchor) is closer to all images $I_p^i$ of the same class (positive) than to any image $I_n^i$ of a different class (negative). Thus, our objective is

$$\|f(I_a) - f(I_p^i)\|_2^2 + \alpha_t < \|f(I_a) - f(I_n^i)\|_2^2, \quad \forall\, (I_a, I_p^i, I_n^i) \in T,$$

where $f(I_a)$, $f(I_p^i)$ and $f(I_n^i)$ are the embeddings of a triplet $(I_a, I_p^i, I_n^i)$ from the set $T$ of all possible triplets in the training set, which has cardinality $N$, and the threshold $\alpha_t$ is a predefined margin imposed between images of different classes. The triplet loss to be minimized over the Euclidean distances is defined as

$$L = \sum_{i=1}^{N} \left[ \|f(I_a) - f(I_p^i)\|_2^2 - \|f(I_a) - f(I_n^i)\|_2^2 + \alpha_t \right]_+ .$$

Many of the possible triplets already satisfy the constraint above. These triplets would not be active during training and would result in slower convergence. Hence, it is crucial to select hard triplets, which are active during training, to improve the performance of the model. In the next section, we discuss the triplet selection approach used in this method.
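The triplet loss of this section can be illustrated with a short NumPy sketch. It assumes the standard hinge form, summing max(0, d(anchor, positive) - d(anchor, negative) + margin) over triplets with squared L-2 distances; the function name `triplet_loss`, the toy embeddings, and the default margin of 0.2 are illustrative choices, not the authors' implementation.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Sum of hinge terms [||a - p||^2 - ||a - n||^2 + margin]_+ over all triplets.

    Each argument is an array of shape (num_triplets, embedding_dim).
    """
    d_ap = np.sum((anchor - positive) ** 2, axis=1)  # squared anchor-positive distance
    d_an = np.sum((anchor - negative) ** 2, axis=1)  # squared anchor-negative distance
    return float(np.sum(np.maximum(d_ap - d_an + margin, 0.0)))

# Toy 2-D embeddings (hypothetical values).
a = np.array([[1.0, 0.0], [0.0, 1.0]])
p = np.array([[1.0, 0.0], [0.0, 1.0]])
n = np.array([[0.0, 1.0], [0.0, 1.0]])
loss = triplet_loss(a, p, n, margin=0.2)
```

In this toy case the first triplet already satisfies the margin and contributes nothing, while the second has a negative that coincides with the anchor and contributes exactly the margin.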

V. TRIPLET SELECTION
To ensure fast convergence, it is essential to select triplets that violate the triplet constraint; however, such a selection introduces bias. Triplet selection must balance a trade-off: choosing hard triplets speeds up training but biases the selection, and an improper choice upsets this balance. To avoid this problem, we directly minimize the bias; in the unbiased situation, all possible triplets contribute equally. Given an image $I_a$, we select the hard positive $I_p^i$ as $\arg\max_{I_p^i} \|f(I_a) - f(I_p^i)\|_2^2$ and the hard negative $I_n^i$ as $\arg\min_{I_n^i} \|f(I_a) - f(I_n^i)\|_2^2$. Computing the argmax and argmin over the entire dataset is infeasible. Furthermore, mislabeled and poor-quality images would dominate the hard positives and negatives, which can lead to poor training. To overcome this issue, we generate triplets online: we create mini-batches of a few input samples and compute the argmax and argmin within each mini-batch [37].
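The within-mini-batch argmax and argmin can be sketched as follows. This is a minimal NumPy illustration, assuming every sample in the batch has at least one other sample of its own class and at least one sample of a different class; the helper names `pairwise_sq_dists` and `hardest_in_batch` and the 1-D toy embeddings are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(emb):
    """Squared L-2 distance between every pair of rows of emb."""
    sq = np.sum(emb ** 2, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T
    return np.maximum(d, 0.0)  # clip small negatives caused by rounding

def hardest_in_batch(emb, labels):
    """For each anchor, index of the farthest positive and the closest negative."""
    d = pairwise_sq_dists(emb)
    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    hard_pos = np.where(same & not_self, d, -np.inf).argmax(axis=1)
    hard_neg = np.where(~same, d, np.inf).argmin(axis=1)
    return hard_pos, hard_neg

# Toy mini-batch: five 1-D embeddings from two classes (hypothetical values).
emb = np.array([[0.0], [1.0], [3.0], [10.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1])
hard_pos, hard_neg = hardest_in_batch(emb, labels)
```

Masking with infinities keeps the selection vectorized: invalid candidates can never win the argmax or argmin.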
To obtain a meaningful representation of the positive L-2 distances, each mini-batch must contain a minimum number of input images from each class. In our case, we choose 200 images from each class, together with a few random negative images, to make a mini-batch. For a small mini-batch, generating the triplets can give an indecisive result: choosing the hardest negatives leads to bad local minima at the initial stage of training and, in particular, to a collapsed network (i.e., $f(I) = 0$). Hence, we reduce this problem by selecting negatives $I_n^i$ such that

$$\|f(I_a) - f(I_p^i)\|_2^2 < \|f(I_a) - f(I_n^i)\|_2^2 .$$

These negative images are farther from the anchor than the positive image, but their L-2 distance is still close to the anchor-positive distance. We call these negatives semi-hard, because they lie within the pre-defined margin $\alpha_t$.
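The semi-hard condition can be expressed as a simple mask over candidate negatives. The sketch below assumes that "semi-hard" means the negative is farther from the anchor than the positive but still inside the margin, i.e., d(a, p) < d(a, n) < d(a, p) + margin; the function name and the distance values are illustrative only.

```python
import numpy as np

def semi_hard_negatives(d_ap, d_an, margin=0.2):
    """Boolean mask of candidate negatives with d_ap < d_an < d_ap + margin.

    d_ap: squared anchor-positive distance (scalar).
    d_an: squared anchor-negative distances for all candidates (1-D array).
    """
    d_an = np.asarray(d_an, dtype=float)
    return (d_an > d_ap) & (d_an < d_ap + margin)

# Hypothetical distances: too hard, semi-hard, and easy, respectively.
d_an = np.array([0.5, 1.1, 1.5])
mask = semi_hard_negatives(1.0, d_an, margin=0.2)
```

Only the middle candidate is kept: the first is closer than the positive (a hard negative), and the third already satisfies the margin.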
A small mini-batch improves the convergence rate. In our case, we choose a mini-batch size of 800 samples to ease the selection of appropriate hard triplets from each mini-batch.

VI. DEEP CONVOLUTIONAL NEURAL NETWORK
Deep CNNs can be trained effectively in many cases through transfer learning [38], [39]. Researchers choose a pre-trained model and fine-tune its weights by backpropagation, instead of training the network from randomly initialized parameters. The earlier layers of the pre-trained network contain generic features, which can be quite useful in the decision function. In particular, those features are used directly when the network is trained on the new dataset.
In our method, we modified and fine-tuned two well-trained CNN architectures, ResNet152 [40] and InceptionResNet-V2 [41], for feature extraction. We customized both networks by removing the last fully connected layer. We first flatten the output feature map, then add one FC layer with 512 neurons, followed by a dropout layer with a rate of 0.3, and finally an L2 normalization layer with 128-D features, so that the network learns 128-dimensional embedding vectors as an identity descriptor. The shortcut connections in both networks preserve the strength of the error signals, which improves network performance and helps to reduce the vanishing-gradient problem.
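The embedding head described above can be sketched as a plain NumPy forward pass. Several details are assumptions rather than statements from the paper: the ReLU activation after the 512-unit layer, a separate linear projection to 128 dimensions before normalization, the 7x7x8 toy feature map, and the random weights, which stand in for a trained backbone and head.

```python
import numpy as np

def embedding_head(feature_map, w1, b1, w2, b2, drop_rate=0.3, training=False, rng=None):
    """Flatten -> FC(512) -> dropout(0.3) -> FC(128) -> L2 normalization."""
    x = feature_map.reshape(feature_map.shape[0], -1)   # flatten the backbone output
    x = np.maximum(x @ w1 + b1, 0.0)                    # FC layer, ReLU assumed
    if training:
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(x.shape) >= drop_rate         # inverted dropout mask
        x = x * keep / (1.0 - drop_rate)
    x = x @ w2 + b2                                     # projection to 128-D
    norm = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norm, 1e-12)                  # unit-length embeddings

# Toy inputs: 4 images with a hypothetical 7x7x8 feature map each.
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(4, 7, 7, 8))
w1 = rng.normal(scale=0.05, size=(7 * 7 * 8, 512))
b1 = np.zeros(512)
w2 = rng.normal(scale=0.05, size=(512, 128))
b2 = np.zeros(128)
emb = embedding_head(feature_map, w1, b1, w2, b2)
```

Because the output rows have unit length, squared L-2 distances between embeddings are bounded by 4, which keeps a fixed margin meaningful across batches.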
In this method, we fine-tune two well-trained networks on a skin disease dataset by re-initializing the data layers. We use an NVIDIA GTX 980 Ti GPU to accelerate the computation for all implementations and the TensorFlow library to define the networks (train-validation protocol) from the pre-trained models. In the data layer, the input data are specified by a text file (source parameter); each line of the text file gives an individual file name with its label. At training time, we set the new height and new width along with the number of input images and the batch size. As mentioned in the next section, we augment the input images and resize them to (224, 224), the same size used by the pre-trained networks, to be compatible with these models. Thus, our networks have the same input scale as the pre-trained networks used in our implementation. To optimize the networks, we use stochastic gradient descent (SGD) [42] with a learning rate of 0.0001 and a momentum of 0.8.
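For reference, one SGD step with the stated learning rate and momentum can be written as below. This sketch assumes the classical momentum formulation (velocity = momentum * velocity - lr * gradient); the paper does not state which variant was used, and the scalar weight and gradient are hypothetical.

```python
def sgd_momentum_step(w, grad, velocity, lr=1e-4, momentum=0.8):
    """One parameter update of SGD with classical momentum."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Two steps on a single scalar weight with a constant gradient of 2.0.
w, v = 1.0, 0.0
w, v = sgd_momentum_step(w, 2.0, v)
w, v = sgd_momentum_step(w, 2.0, v)
```

With a constant gradient the velocity grows from step to step, which is the effect momentum is meant to have along consistent descent directions.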

A. DATASET
The dataset for human face skin disease classification was built in [43]; the images were acquired from web pages by keyword searches. A total of 6144 images were obtained with labels of 14 classes. Dermatologists from Wuhan Union Hospital then classified all skin images and unified the 14 classes into five categories based on the characteristics of the diseases: acne, spots, blackheads, dark circles, and clean face. Clean faces are considered negative samples, and each type of skin disease (acne, spots, blackheads, and dark circles) is selected as positive samples. The ratio of positive to negative samples is then set to 1:1 to avoid the sample imbalance problem. Finally, we preprocessed these images with a few augmentation arguments (rescale, zoom with range 0.3, rotation with range 0.3, and horizontal flip), resized the inputs to a width and height of 224, and augmented the dataset to 14000 samples to improve the results.

B. EVALUATION METHODS
In this experiment, deep CNNs trained with a triplet loss function classify the extracted features into four categories: acne, blackheads, dark circles, and spots.
The accuracy, sensitivity (recall), and specificity are computed in this analysis as follows.
Accuracy is the probability that the method predicts correctly:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Sensitivity, or recall, is the true positive rate, i.e., the fraction of positive ailments that are not wrongly predicted as negative:

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

Specificity is the true negative rate:

$$\text{Specificity} = \frac{TN}{TN + FP}$$

where TP is true positive, i.e., correctly predicted as positive ailments (same diseases); TN is true negative, i.e., correctly predicted as negative ailments (different diseases); FP is false positive, i.e., wrongly predicted as positive ailments (same diseases); and FN is false negative, i.e., wrongly predicted as negative ailments (different diseases). We also used the AUC (area under the curve) for skin disease classification. The corresponding curve graphically illustrates the classification ability of our method by plotting the TPR (true positive rate) against the FPR (false positive rate), which are computed as follows:

$$\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}$$

A greater AUC value indicates a better model (see Figure 3).
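These measures can be computed per class from a confusion matrix in a one-vs-rest fashion. The sketch below assumes rows hold true labels and columns hold predicted labels; the function name `per_class_metrics` and the 2x2 counts are hypothetical and are not results from the paper.

```python
import numpy as np

def per_class_metrics(cm):
    """One-vs-rest accuracy, sensitivity, specificity, and FPR for each class."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp          # true class, predicted as something else
    fp = cm.sum(axis=0) - tp          # other class, predicted as this class
    tn = cm.sum() - tp - fn - fp
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),
    }

# Hypothetical two-class confusion matrix (rows: true, columns: predicted).
cm = [[90, 10],
      [20, 80]]
metrics = per_class_metrics(cm)
```

For the first class in this toy matrix, TP = 90, FN = 10, FP = 20 and TN = 80, so the accuracy is 0.85, the sensitivity 0.9 and the specificity 0.8.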

VIII. EXPERIMENTS
In all our experiments, we use a total of 12000 skin disease images of four categories to extract the 128-D features. During training, the dataset is split so that 10% of the training data are used for validation. In the next section, we describe the feature interpretation process in detail.

A. DEEP LEARNING FEATURE INTERPRETATION
To classify the skin diseases, we used the deep CNNs to extract the embedding $f(I)$ of an input image $I$ in a Euclidean space $\mathbb{R}^d$. Here, we explore different embedding dimensions and finally take 128 for all our experiments; a larger dimension may achieve better results after prolonged training.
It has been noted that an embedding trained as a 128-D floating-point vector can be quantized to 128 bytes without loss of accuracy, which is advantageous for large-scale image recognition and clustering. Smaller embedding sizes incur a minor loss of accuracy but can be better suited to mobile devices.
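One way to store a 128-D embedding in 128 bytes is uniform 8-bit quantization, sketched below. The scheme is an assumption (the text does not say how the quantization is done): it relies on the embedding being L2-normalized, so that every component lies in [-1, 1], and maps that range linearly onto the integers 0 to 255.

```python
import numpy as np

def quantize(emb):
    """Map components in [-1, 1] to one unsigned byte each."""
    scaled = (np.clip(emb, -1.0, 1.0) + 1.0) * 127.5
    return np.round(scaled).astype(np.uint8)

def dequantize(q):
    """Approximate inverse of quantize."""
    return q.astype(np.float32) / 127.5 - 1.0

# Hypothetical unit-length 128-D embedding.
rng = np.random.default_rng(0)
emb = rng.normal(size=128)
emb /= np.linalg.norm(emb)
q = quantize(emb)
err = float(np.max(np.abs(dequantize(q) - emb)))
```

The worst-case rounding error of this scheme is half a quantization step, about 1/255 per component; whether that leaves classification accuracy unchanged would have to be checked on real embeddings.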

B. CLASSIFICATION
Here, we classify the feature vectors of four classes of input images. After interpreting the features, we use the triplet loss function with a deep CNN to learn good embeddings and compute the L-2 distance between all input images in the d-dimensional Euclidean space ($\mathbb{R}^d$), which directly corresponds to image similarity. The L-2 distance is small between images of the same class and large between images of different classes, as required by the triplet constraint of Section IV. The distance between images is independent of different imaging factors.
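The text does not specify how the L-2 distances are turned into class labels. As one plausible illustration only, the sketch below assigns each query embedding to the class whose mean training embedding (centroid) is nearest; the nearest-centroid rule, the function names, and the 2-D toy embeddings are all assumptions.

```python
import numpy as np

def fit_centroids(emb, labels):
    """Mean embedding of each class."""
    classes = np.unique(labels)
    centroids = np.stack([emb[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(emb, classes, centroids):
    """Label of the centroid with the smallest squared L-2 distance."""
    d = np.sum((emb[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    return classes[np.argmin(d, axis=1)]

# Toy 2-D embeddings for two classes (hypothetical values).
train_emb = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
train_labels = np.array([0, 0, 1, 1])
classes, centroids = fit_centroids(train_emb, train_labels)
pred = predict(np.array([[0.2, 0.4], [5.5, 4.0]]), classes, centroids)
```

A k-nearest-neighbor vote over the same distances would be an equally reasonable alternative when classes are not compact around a single point.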
We used two pre-trained networks, ResNet152 and InceptionResNet-V2, to classify the preprocessed images. Figure 3 shows the area under the curve (AUC) for each skin disease category for the proposed networks, i.e., the fine-tuned ResNet152 and the fine-tuned InceptionResNet-V2. The loss function optimizes the training of the networks. We used 70% of the images as the training set, 10% as the validation set to obtain the early-stopping epoch, and 20% as the testing set to evaluate the method. Because of the visual similarity between different types of skin diseases, we augmented each class's images with the arguments mentioned before. Table 1 summarizes the accuracy and classification performance of the proposed networks, together with the number of samples used for prediction. The next section describes the results of the proposed method.
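A 70/10/20 split of image indices can be sketched as follows. The random shuffling, the seed, and the use of 12000 as the dataset size (the figure quoted in the experiments section) are illustrative assumptions; the paper does not describe how the split was drawn.

```python
import numpy as np

def split_indices(n, train=0.7, val=0.1, seed=0):
    """Shuffle indices 0..n-1 and cut them into train, validation, and test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(round(n * train))
    n_val = int(round(n * val))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(12000)
```

In practice a stratified split, drawn separately within each disease category, would keep the class proportions equal across the three subsets.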

IX. RESULTS AND DISCUSSION
To record the running time of the fine-tuned ResNet152 and the fine-tuned InceptionResNet-V2 architectures with triplet loss, we selected 188 iterations and extracted features of dimension 128. On a system with a 4 GB GPU with CUDA cores, training the fine-tuned ResNet152 with triplet loss takes 34142 seconds, while training the fine-tuned InceptionResNet-V2 architecture takes 43714 seconds. We recorded the performance of the proposed method with respect to several aspects, such as the effects of different parameter values, the training error, and the accuracy (true accept rate).

A. TRAINING LOSS AND ACCURACY
We plotted the loss and accuracy curves against the number of epochs, as shown in Figure 2. In addition, the confusion matrices of the proposed networks are shown in Figure 4. The rows of each confusion matrix denote the true labels, and the columns denote the predicted labels. The value in each cell shows the predictions for the corresponding pair of true and predicted labels. Here, the diagonal cells of the confusion matrices hold a high level of predictions, which indicates a low error rate (a high probability of accurate prediction) for each category of skin disease. To test the efficiency of the proposed method, we computed several evaluation measures, namely sensitivity (recall), specificity, and accuracy, from the confusion matrices; these are reported in Table 1. Additionally, we computed AUC (true accept rate against false accept rate) values for all categories of skin diseases.

B. EFFECT OF PARAMETERS
To balance the training and validation loss and to measure the impact of the margin on the classification results and the convergence rate, we tested the network performance for different values of $\alpha_t$, as shown in Figure 5. The classification accuracy improved as the threshold increased and started to drop once the threshold exceeded about 0.20. Therefore, we conclude that the best result is achieved at a threshold value of 0.20.

C. COMPARISON
Here, we present a comparative study with some other traditional methods to examine the significance of the proposed method. Our method classifies four types of skin diseases, i.e., acne, blackheads, dark circles, and spots, and many traditional methods likewise classify skin diseases. Hence, we compared the results of the proposed method (accuracy, specificity, and sensitivity) with some previous works, as shown in Table 2. The accuracy and specificity of our method are better than those of [25], [27], [31], [33], [44]-[49]. Although the specificity of [44] is higher than that of the proposed method, its sensitivity is comparatively very low.

X. CONCLUSION AND FUTURE WORKS
In this paper, we proposed a model using deep CNNs with a triplet loss function to improve skin disease classification. We fine-tune all layers of ResNet152 and InceptionResNet-V2 to address the problem of facial skin disease images. First, we extract the 128-D features (embeddings) of the training samples in Euclidean space and then compute the L-2 distances between corresponding images using the learned embeddings. After that, we perform the skin disease classification task based on the L-2 distances among images. The dataset used in the experiment consists of four types of skin diseases, i.e., acne, blackheads, dark circles, and spots. A total of 12000 input images were used for training, 2000 for testing, and 10% of the training data as the validation set to evaluate the method. Our method outperforms state-of-the-art works in skin disease classification. The proposed method can also be used for other disease classification tasks. Since the Dermnet images are organized by biological taxonomy, the performance of the proposed method could be improved by designing a dataset, with the help of a dermatologist, that follows a visually organized taxonomy.
VOLUME 8, 2020