Enhanced Skin Condition Prediction Through Machine Learning Using Dynamic Training and Testing Augmentation

In recent years, deep learning has taken the spotlight in automated medical bioimaging. However, current state-of-the-art scores stem primarily from well-tuned parameters and architectures. Research on dynamic data augmentation remains limited, even in the broader fields of machine learning and computer vision. In this study, we propose dynamic training and testing augmentation that increases performance significantly. The augmentation search framework used in this study requires fewer GPU hours than a conventional search algorithm, which must train a new model every time an augmentation policy is proposed. The speed-up is achieved by applying Bayesian optimization to an already trained model, so no retraining is needed when a new augmentation policy is proposed. We compare our method with a single model and with the ensemble model that won the ISIC 2019 challenge. We use EfficientNet, a recent compact yet highly accurate network architecture, as the backbone. Our method delivers superior results. Because the search requires substantial computational resources, we also share the searched augmentation policies so that other researchers can use them to improve performance on dermoscopic images.

model aware of invariance in the data domain. Only a few studies have focused on improving augmentation techniques [10], [11]. In contrast, much attention has been directed at improving machine learning architectures to enhance performance [12]-[15]. For example, most winners of the largest image dataset competition [16], held annually since 2010, achieved state-of-the-art scores by improving the model architecture [12], [13], [17]-[19]. All the winners have used the same augmentation technique, with small changes, as the one proposed in 2012 [17].
Image datasets cover many problem domains, from medical issues to everyday objects, such as disease, character, and animal classification. Several studies have explored the best augmentation for a single dataset. However, little research has explored whether augmentations can be combined or transferred between datasets. To the best of our knowledge, only Cubuk et al. have explored the possibility of transferring an augmentation policy to another dataset [10]. Their study inspired us to continue the exploration of augmentation policies for images in the medical domain.
We not only explore the application of the probabilistic augmentation framework to medical-domain images, specifically dermoscopic images used to diagnose skin cancer types, but also extend the idea of probabilistic augmentation from the training stage to the inference stage. The experimental results show that probabilistic augmentation can be applied to dermoscopic images, and that applying it at the inference stage improves the results further. We share the searched augmentation policies for use in future research. Thus, other researchers who use skin datasets can focus on tuning the model parameters alone and rely on our result instead of running their own search for the best augmentation policy. A brief overview of the probabilistic augmentation framework is presented in Fig. 1. We use EfficientNet as the backbone network [20]. EfficientNet is a recent, compact network with high performance. To summarize, the contributions of this paper are as follows:
• We exploit the advanced dynamic training augmentation called Fast AutoAugment to train the machine learning model for skin cancer diagnosis.
• We propose dynamic preprocessing on inference (DPI) in order to adapt to the environmental conditions under which the image was taken.
• We show that dynamic augmentation and DPI outperform the current state-of-the-art ensemble model for skin diagnosis.
• We present the searched augmentation policies, which required thousands of GPU hours to find, for future research.

II. SKIN LESIONS
In this section, we explain the general pathological hierarchy of skin lesions and briefly describe the dermoscopic characteristics of eight diagnosis categories from the dataset used in our experiment [28]. Hierarchically, there are two types of skin lesions: melanocytic and non-melanocytic. Based on the origin of the lesion, a melanocytic lesion or melanoma comes from a genetic defect in pigment-producing cells known as melanocytes. A melanoma-type cell can rapidly grow from stage 0 (melanoma in situ) to stage 4 (metastatic melanoma) without proper treatment [29]. Furthermore, based on the degree of malignancy, melanoma lesions can be classified into two categories: benign and malignant. Benign melanoma cells lack the ability to spread or interfere with nearby cells. Oftentimes, the benign melanoma cells will stop growing once they reach a certain size, while malignant melanoma cells never stop growing and can thus press and interfere with surrounding healthy cells [30].
Melanoma is a malignant melanocytic lesion that should be detected as early as possible. However, it is hard to detect melanoma early because its characteristics vary depending on its anatomic site [31]. Generally, the characteristics of melanoma are dark (brown or black) rhomboidal structures and asymmetric pigmented follicular openings (Fig. 2a). Another important characteristic in relation to facial area is the appearance of slate-gray globules and dots. The only effective treatment for melanoma is surgical excision.
A melanocytic nevus (NV) is a benign variant of melanocytic lesions. Unlike malignant melanoma, the NV has a symmetrical structure and color [21]. Note that a symmetrical structure in biology refers to a uniform pattern regardless of an irregular shape or the presence of stain. A melanocytic nevus sample can be seen in Fig. 2b.
Basal cell carcinoma (BCC) is a malignant form of non-melanocytic lesions. The BCC cells rarely grow but can spread if left untreated. The dermoscopic features of non-pigmented BCC are arborizing vessels and shiny white-red structureless areas, while those of pigmented BCC are maple leaf-like areas and large blue-gray ovoid nests [22]. Dermoscopic samples of BCC are presented in Fig. 2c. There are several treatment options for removing BCC cells: curettage and electrodessication, radiation therapy, photodynamic therapy, and surgery.
Actinic Keratosis (AK) is a variant of mini-squamous cell carcinoma that can be treated without surgery. However, AK should be diagnosed and treated as early as possible before it progresses and evolves into squamous cell carcinoma in situ. AK is commonly found in the facial area, and its characteristics take the form of red pseudo-network (strawberry-network), dotted/glomerular vessels, white structureless areas, a central mass of keratin, and diffuse yellow opaque scales [23], as shown in Fig. 2d.
Dermatofibroma is a benign variant of the non-melanocytic type of lesion. The most common characteristics are a central white patch and a peripheral pigment network [25]. Dermatofibroma is harmless because it is a benign type. Optional treatments, such as freezing or laser procedures, are available for aesthetic reasons. Sample images are presented in Fig. 2f.
A vascular lesion is a benign variant of the non-melanocytic type of lesion. Angiomas, angiokeratomas, pyogenic granulomas, and hemorrhage are included in this category. As shown in Fig. 2g, the characteristics are a solid purple color with well-circumscribed structures known as red clods or lacunes [31].
Squamous cell carcinoma is a thickened plaque or nodule. Its characteristics vary depending on the location where it appears. They include white circles, keratin, and blood spots [27], as shown in Fig. 2h. The primary treatment option is excision surgery.

A. EFFICIENTNET
A recent state-of-the-art architecture is used as the backbone in our experiment. Tan and Le proposed a compound scaling method to construct an efficient machine learning architecture. To obtain better performance, many researchers have tried scaling the width of the network [12], the depth of the network [13], or the resolution of the image [33]. However, none of them defines how to balance all the dimensions with a proper ratio. The input resolution is a crucial factor for the model to achieve superior performance. Furthermore, if the input resolution is increased, more layers and channels should be added to capture more patterns and to enlarge the model's receptive field.
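The balance between depth, width, and resolution can be illustrated with a short sketch. It assumes the base coefficients reported by Tan and Le (1.2 for depth, 1.1 for width, 1.15 for resolution); the function names are hypothetical, and the published model variants use rounded values rather than these exact multipliers.

```python
# Compound scaling sketch. ALPHA, BETA, and GAMMA are the base coefficients
# reported by Tan and Le; the function names are illustrative assumptions.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases


def compound_scale(phi: float) -> dict:
    """Return depth, width, and resolution multipliers for coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_factor(phi: float) -> float:
    """Approximate FLOPs growth: depth * width^2 * resolution^2."""
    s = compound_scale(phi)
    return s["depth"] * s["width"] ** 2 * s["resolution"] ** 2


if __name__ == "__main__":
    for phi in range(4):
        print(phi, compound_scale(phi), round(flops_factor(phi), 2))
```

Because the three bases are chosen so that depth × width² × resolution² is close to 2, each unit increase of the coefficient roughly doubles the computational cost.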
EfficientNet is a family of models built from a baseline model that was developed through a neural architecture search [34]. Neural architecture search is a framework for finding the best architecture in a search space under certain constraints. The main building block of the network is the mobile inverted bottleneck (MBConv) layer, which was proposed by Sandler et al. [35] and Tan et al. [36]. The baseline model, called EfficientNet b0, is illustrated in Fig. 3. The scaling details for the other expanded models can be seen in Table 1.
EfficientNet outperforms all its predecessors with similar computation and memory needs by a large margin. EfficientNet b4, which has a similar number of FLOPs and parameters to ResNet50 [13], performs 6 percentage points better (82% vs 76%) on the ImageNet dataset. The performance of EfficientNet b4 is the same as that of NasNet-A [34] with 5.7× lower computational resources and 4.7× lower memory needs.

B. AUTOAUGMENT
AutoAugment formulates the problem of finding the best augmentation technique for a dataset as a reinforcement learning (RL) problem. In this formulation, a state is represented by an image, which can be an original image or an image to which augmentation techniques have been applied. The action space consists of augmentation techniques for images, such as rotation, translation, and scaling. Because these techniques require a magnitude parameter, each action comes as a pair: an image transformation and a magnitude value as its parameter. For example, a rotation requires the angle through which the image is to be rotated. The reward in AutoAugment is based on a machine learning model's performance on a validation set after applying the augmentation techniques.
For each episode, an RL agent selects the proper augmentation technique to apply to the dataset. Subsequently, a backbone machine learning model is trained with a dataset in which the images have been augmented. Then the final accuracy is used as a reward to the RL agent to improve its next action.
After several episodes, the RL agent becomes able to pick the best augmentation policy. Because a completely new model is trained in each episode, this framework is computationally expensive. Cubuk et al. needed thousands of GPU hours to improve the state-of-the-art score [10].
Lim et al. proposed an improvement, called Fast AutoAugment, that overcomes the time constraint of the AutoAugment framework. Instead of re-training the model each time a new augmentation technique is selected, they train a model only once and pick the best augmentation technique by evaluating candidates at inference time [11].
Let us denote φ as an image transformation that can be used as an augmentation technique that receives an image x, magnitude m, and probability p. The result of the transformation can either be φ(x; m) with probability p or the original x itself with probability 1 − p. The illustration is shown in Fig. 4.
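A minimal sketch of this probabilistic transformation follows. The image is treated as an opaque object, and `apply_with_probability` and the toy `brighten` transform are hypothetical names introduced only for illustration.

```python
import random


def apply_with_probability(transform, x, magnitude, p, rng=None):
    """Return transform(x, magnitude) with probability p, else x unchanged."""
    if rng is None:
        rng = random
    if rng.random() < p:
        return transform(x, magnitude)
    return x


def brighten(pixels, magnitude):
    """Toy stand-in for an image operation: add magnitude to every pixel."""
    return [value + magnitude for value in pixels]


if __name__ == "__main__":
    rng = random.Random(0)
    image = [10, 20, 30]
    # About 30% of the calls return the brightened image.
    print(apply_with_probability(brighten, image, 5, 0.3, rng))
```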
Then, several image transformations can be stacked sequentially and denoted as $\Phi_n$, which contains $n$ augmentation techniques applied sequentially to the input image $x$. Formally, the definition can be expressed as follows:
$$\Phi_n(x) = (\phi_n \circ \phi_{n-1} \circ \cdots \circ \phi_1)(x)$$
At first, the training dataset $\delta$ is divided into $k$ folds. Each fold consists of a tuple $\langle \delta^{k}_{\mathrm{train}}, \delta^{k}_{\mathrm{val}} \rangle$. Then, several classifiers $f^{k}_{\theta}$ are trained concurrently, each on its dataset $\delta^{k}_{\mathrm{train}}$. Subsequently, using the Bayesian optimization approach [37], the top-n augmentation techniques are selected based on the lowest error rates of classifier $f^{k}_{\theta}$ when predicting dataset $\delta^{k}_{\mathrm{val}}$. The best augmentation techniques $\Phi_n$ from each fold are then merged. The final step is to train on the entire training set $\delta$ with the merged augmentation techniques $\Phi_{nk}$. The complete step-by-step procedure is outlined in Algorithm 1.
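The stacking and the per-fold selection can be sketched as follows. This is only an illustration under stated assumptions: a fixed candidate list stands in for the proposals that Bayesian optimization would generate, each fold is represented by a hypothetical error function, and all names are invented for the example.

```python
import random


def apply_subpolicy(x, ops, rng):
    """Apply a stack of (transform, magnitude, p) operations in sequence."""
    for transform, magnitude, p in ops:
        if rng.random() < p:
            x = transform(x, magnitude)
    return x


def top_n_policies(candidates, error_fn, n):
    """Keep the n candidates with the lowest validation error on one fold."""
    return sorted(candidates, key=error_fn)[:n]


def merge_over_folds(fold_error_fns, candidates, n):
    """Collect the top-n candidates of every fold into one merged list."""
    merged = []
    for error_fn in fold_error_fns:
        merged.extend(top_n_policies(candidates, error_fn, n))
    return merged


if __name__ == "__main__":
    def add(x, m):
        return x + m

    def mul(x, m):
        return x * m

    rng = random.Random(0)
    print(apply_subpolicy(1, [(add, 2, 1.0), (mul, 3, 1.0)], rng))

    # Toy candidates: each "policy" is reduced to a single magnitude value.
    candidates = [0.1, 0.5, 0.9]
    folds = [lambda c: abs(c - 0.4), lambda c: abs(c - 1.0)]
    print(merge_over_folds(folds, candidates, n=2))
```

In the actual framework, the error function of a fold would run the fold's trained classifier on augmented validation images; no model is retrained while candidates are scored.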

C. BAYESIAN OPTIMIZATION
The primary goal of Bayesian optimization is to find the minimum value of a function $f_\theta(x)$ over a finite input set $X$. First, Bayesian optimization uses an approximation function $\hat{f}_\theta(x)$ that represents the probabilistic function $f_\theta(x)$. Then, Bayesian optimization constructs an acquisition function $q$ that exploits the approximation function to pick the next input $x$ of $f_\theta(x)$ to be evaluated.
A Gaussian process is selected as the approximation function. The acquisition function used in our paper is the expected improvement function, which is built on the cumulative distribution function of the Gaussian distribution. It relies on ζ(x), the probability of improving on the current best value, where µ(·) and σ(·) are the mean and variance of ·, respectively. In this study, we use Bayesian optimization to find the best augmentation, with the loss given by the trained model as the target to be minimized.
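For reference, the commonly used closed form of expected improvement for a minimization target can be sketched as below. This is the standard textbook form, not necessarily the exact expression used in this study; `mu` and `sigma` are assumed to be the posterior mean and standard deviation of the Gaussian process at a candidate point, and `best` is the lowest loss observed so far.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(z):
    """Probability density function of the standard normal."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def probability_of_improvement(mu, sigma, best):
    """Probability that the candidate's loss falls below the best loss."""
    if sigma <= 0.0:
        return 1.0 if mu < best else 0.0
    return normal_cdf((best - mu) / sigma)


def expected_improvement(mu, sigma, best):
    """Expected amount by which the candidate improves on the best loss."""
    if sigma <= 0.0:
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    return (best - mu) * normal_cdf(z) + sigma * normal_pdf(z)


if __name__ == "__main__":
    # A candidate predicted slightly better than the best loss so far.
    print(probability_of_improvement(0.45, 0.1, 0.5))
    print(expected_improvement(0.45, 0.1, 0.5))
```

The candidate with the largest expected improvement is evaluated next, which trades off a low predicted loss against high predictive uncertainty.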

D. DYNAMIC PREPROCESSING ON INFERENCE (DPI)
In recent years, most researchers have focused on the best network architecture, while only a few have focused on data augmentation. To the best of our knowledge, no study has yet focused on a dynamic preprocessing framework for inference. To achieve dynamic augmentation on inference, we define $P_\vartheta$ as a convolutional neural network function that maps an image $x_i$ into the augmentation space, which contains all possible augmentations. The training objective is defined over pairs $(x_i, z_i)$, where $z_i$ is the top-k augmentation label of the i-th sample. The index $j$ on $z_{i,j}$ is a Boolean indicator of whether the j-th augmentation is one of the top-k augmentations. The label $z_i$ is derived from the loss function of the main model. The task of finding the top-k augmentations thus becomes a multilabel classification problem with the top-k augmentations as the labels. Because the magnitude of an augmentation is continuous, we discretize it into several levels of equal size. The model for inference augmentation is trained using $(x_i, z_i)$ to learn which augmentation works best for a given image. After training the network on the multilabel data, we use the maximum logit value to select the desired augmentation at inference.
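The labeling step can be sketched as follows, assuming that the "top-k" augmentations of an image are the k candidates for which the main model's loss is lowest. The function names and the toy numbers are hypothetical.

```python
def top_k_labels(losses, k):
    """Return a 0/1 list marking the k augmentations with the lowest loss.

    losses[j] is assumed to be the main model's loss on the image after
    applying the j-th candidate augmentation.
    """
    order = sorted(range(len(losses)), key=lambda j: losses[j])
    chosen = set(order[:k])
    return [1 if j in chosen else 0 for j in range(len(losses))]


def select_augmentation(logits):
    """Pick the index of the augmentation with the maximum logit."""
    return max(range(len(logits)), key=lambda j: logits[j])


if __name__ == "__main__":
    # Four candidate augmentations; the second and fourth give the lowest loss.
    print(top_k_labels([0.9, 0.2, 0.5, 0.1], k=2))
    print(select_augmentation([0.1, 2.0, -1.0]))
```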

E. DATASET
The International Skin Imaging Collaboration (ISIC) dataset is the largest dermoscopic dataset in the world with an open license and has a total of 25,331 dermoscopic images with ground-truth labels [28], [31], [38]. The data also come with clinical information, such as age and gender. The ground-truth labels cover 8 diagnostic categories: melanoma (MEL), melanocytic nevus (NV), basal cell carcinoma (BCC), actinic keratosis (AK), benign keratosis (BKL), dermatofibroma (DF), vascular lesion (VASC), and squamous cell carcinoma (SCC). A sample from each category in the dataset is shown in Fig. 5. The image resolution is 600 × 450.
The obvious problem with the dataset is an imbalanced class distribution, shown in Fig. 6. The NV class consists of 12,875 samples, while DF and VASC contain fewer than 300 samples.

IV. EXPERIMENT
In this experiment, we set the augmentation search space as follows: shear, translate, rotate, auto-contrast, invert, equalize, solarize, posterize, color, contrast, brightness, sharpness, and cutout. We employ EfficientNet, a state-of-the-art deep learning architecture proposed by Tan and Le [20], as the backbone model for classifying dermoscopic images. Specifically, we use the EfficientNet b4 model with exactly the same settings as Tan and Le. We call this model the probabilistic augmentation (PA) model. We trained for 512,000 steps with a batch size of 16. The Adam optimizer [39] is used with a learning rate of 0.0001 and an exponential decay of 0.99998. The search for augmentation is done using four independently trained models. Each independently trained model uses two searches. For every search, the top-5 policies with the lowest loss are added to the final augmentation policies. For the final training model, we used the same setup as for the augmentation policy search, but with a batch size of 64. With a probabilistic augmentation policy, more training steps are needed to achieve a well-trained model. This is required to ensure stochastic stability in the training phase.
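The settings above can be made concrete with a small sketch. It assumes that the exponential decay is applied once per training step, which the text does not state, and that no policy is selected twice across searches; the function names are hypothetical.

```python
BASE_LR = 1e-4       # initial learning rate stated in the text
DECAY = 0.99998      # exponential decay factor stated in the text
TOTAL_STEPS = 512_000


def learning_rate(step: int) -> float:
    """Learning rate after `step` updates, assuming per-step decay."""
    return BASE_LR * DECAY ** step


def total_policies(models: int = 4, searches: int = 2, top: int = 5) -> int:
    """Upper bound on the merged policies: models x searches x top-k."""
    return models * searches * top


if __name__ == "__main__":
    for step in (0, 100_000, TOTAL_STEPS):
        print(step, learning_rate(step))
    print("policies:", total_policies())
```

Under these assumptions the learning rate halves roughly every 34,657 steps, and the final policy set holds at most 40 policies.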
For preprocessing on inference, we used exactly the same architecture as the main model, EfficientNet b4. We call this second model the dynamic preprocessing on inference (DPI) model. We train the DPI model as follows: for each possible augmentation, we use 16 levels of magnitude, where level 0 means no augmentation is applied. Because we have 13 candidate augmentations with 16 levels each, and level 0 is shared by all of them, we have 13 × (16 − 1) + 1 = 196 possible outputs. The final preprocessing on inference is determined by the maximum logit given by the DPI model. The overall flow of our experiment can be seen in Fig. 7.
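The count of 196 outputs can be checked with a small index mapping. The layout below (index 0 for "no augmentation", followed by 15 non-zero levels per operation) is an assumption made for illustration; the text does not specify how the outputs are ordered.

```python
NUM_OPS = 13       # candidate augmentations in the search space
NUM_LEVELS = 16    # magnitude levels, where level 0 means "no augmentation"
NUM_OUTPUTS = NUM_OPS * (NUM_LEVELS - 1) + 1


def encode(op: int, level: int) -> int:
    """Map (operation, level) to an output index; level 0 maps to index 0."""
    if level == 0:
        return 0
    return 1 + op * (NUM_LEVELS - 1) + (level - 1)


def decode(index: int):
    """Inverse of encode; returns None for the shared no-augmentation index."""
    if index == 0:
        return None
    op, offset = divmod(index - 1, NUM_LEVELS - 1)
    return op, offset + 1


if __name__ == "__main__":
    print(NUM_OUTPUTS)
    print(encode(12, 15), decode(195))
```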
To compare our method, we ran two different scenarios. First, we used the searched augmentation policy to train the PA model. Second, we combined the first scenario with dynamic preprocessing on inference. We compared those scenarios with a single model trained using static augmentation and the ensemble model (the ISIC 2019 challenge winner).

V. EVALUATION
In this section, we describe several metrics that quantify the model's performance in our experiment. We divide our metrics into two categories: binary classification metrics and multiclass classification metrics.

A. BINARY CLASSIFICATION METRIC
Although the ISIC dataset has eight categories, we can measure the model's performance using binary classification metrics by calculating the correctness of the predictions for each class separately. Therefore, each metric yields eight different scores, one per class.
In a binary classification problem, we can assign a positive-negative class scenario directly. Binary classification metrics are commonly based on counting how many samples are correctly classified as positive (TP), correctly classified as negative (TN), falsely classified as positive (FP), and falsely classified as negative (FN), as shown in Table 2.

1) ACCURACY
Accuracy is a metric for determining the quality of the model's performance in categorizing both positive and negative classes. The score is calculated by comparing all correctly classified data with all data. The formal definition can be written as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

2) SENSITIVITY/RECALL
The sensitivity, also known as the recall score, measures how well the model discovers positive data among the actual positive data. The fewer positive samples are categorized as negative, the higher the recall score. From a medical perspective, it is important to minimize the number of sick persons categorized as healthy because such errors can jeopardize their lives. The recall score can be calculated as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

3) F1 SCORE
As mentioned previously, the recall score is very important from a medical perspective, but we do not want our model to categorize every sample as positive. Therefore, we also need an F1 score that balances recall and precision. Precision measures how well the model identifies the actual positive data among the positive-predicted data. The precision score is expressed as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
The F1 score is defined as follows:
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

FIGURE 11. Confusion matrix improvement of the DPI framework.
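The per-class scores described in this section can be computed with a one-vs-rest sketch such as the following. The function names and the toy labels are hypothetical, and undefined ratios are reported as 0.0 by convention in this example.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, TN, FP, FN, treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred != positive:
            tn += 1
        elif truth != positive and pred == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a ratio is undefined."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, and F1 from the four counts."""
    recall = safe_div(tp, tp + fn)
    precision = safe_div(tp, tp + fp)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "recall": recall,
        "precision": precision,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    y_true = ["MEL", "NV", "NV", "MEL", "BCC"]
    y_pred = ["MEL", "NV", "MEL", "NV", "BCC"]
    print(binary_metrics(*confusion_counts(y_true, y_pred, "MEL")))
```

Repeating the call with each of the eight categories as the positive class yields the eight per-class scores.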

4) AREA UNDER CURVE (AUC)
A binary classifier outputs a single probability that a sample belongs to the positive class. Instead of fixing a threshold for categorization, the AUC metric measures how well the data are ranked by their predicted probability. The AUC metric is obtained by calculating the area under the receiver operating characteristic (ROC) curve. The ROC curve plots the true-positive rate (TPR) against the false-positive rate (FPR). FPR and TPR are mathematically defined by:
$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
and
$$\mathrm{TPR} = \frac{TP}{TP + FN}$$
The final AUC value is the total area under the ROC curve. It is bounded between 0 and 1, and a higher value is better.
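The ranking interpretation of AUC can be sketched directly: the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The quadratic-time function below is a hypothetical illustration, not an optimized implementation.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of correctly ranked (positive, negative) pairs.

    labels: 1 for positive, 0 for negative; scores: predicted probabilities.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```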

B. MULTICLASS CLASSIFICATION METRIC
The common problem with medical datasets is a highly imbalanced number of samples per class. Therefore, to measure the model's performance, we cannot use a common accuracy metric that treats each sample equally, because it encourages the model to ignore samples from the minority classes.
Because we need to treat each class equally, we use balanced accuracy to measure how well the model performs on the dataset.

1) BALANCED ACCURACY
Balanced accuracy is a weighted accuracy score in which each sample is weighted according to the proportion of its class in the total sample, so that every class contributes equally. Mathematically, balanced accuracy can be written as follows:
$$\mathrm{Balanced\ accuracy} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FN_c}$$
where $C$ is the number of classes and $TP_c$ and $FN_c$ are the true-positive and false-negative counts of class $c$. This gives an overall score that represents the ability to identify each class equally, regardless of the imbalanced samples.
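A short sketch shows why this metric matters on an imbalanced dataset. It assumes the common definition of balanced accuracy as the mean of per-class recall; the function names and toy labels are hypothetical.

```python
def plain_accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # A model that always predicts the majority class.
    y_true = ["NV"] * 8 + ["DF"] * 2
    y_pred = ["NV"] * 10
    print(plain_accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred))
```

A model that ignores the minority class still reaches a high plain accuracy here, while its balanced accuracy drops to one half.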

VI. RESULT
We search for the best set of augmentation policies using Fast AutoAugment with Bayesian optimization. The searched augmentation probabilities and magnitudes can be seen in Fig. 8 and Fig. 9. The search requires around 300 hours of GPU time (Tesla V100) to complete. Thus, we present the details of the searched augmentation in Appendix A for use in future research. The results indicate that augmentation plays a significant role in medical image prediction. The comparison of accuracies between static augmentation and PA during training can be seen in Fig. 10. We use the term "static augmentation" to refer to the use of only a few augmentation policies, limited to crop, resize, and horizontal flip. In Fig. 10, static augmentation during the training phase provides a faster convergence rate than probabilistic augmentation. However, the weights are not stochastically tuned because only a few augmentations are used, which leads to memorization instead of generalization [10], [40]. Because the weights are not stochastically tuned, the accuracy cannot grow further, unlike training with probabilistic augmentation. A slow convergence rate is a drawback of employing a very large space of possible augmentations. The more augmentation possibilities there are, the slower the convergence.
As part of the multiclass classification, the confusion matrix is one of the most important visualization metrics. We can see in Fig. 11a and 11b that high accuracy is concentrated in classes with a high number of samples. At a glance, combining PA and DPI gives a better confusion matrix (Fig. 11b) than using only PA (Fig. 11a).
The experiments using PA and PA plus DPI are compared with a single model trained using static augmentation and with the ensemble model that took first place in the ISIC 2019 challenge without using an additional dataset. The ensemble method uses several EfficientNets [20] and SE-ResNeXts [41]. As can be seen in Tables 3 and 4, our methods outperform the other methods on all average metrics, including AUC, accuracy, sensitivity, and F1 score. Since the competition does not release labels for the test set, we report the ensemble model's performance metrics on both cross-validation and the competition leaderboard. Fig. 12 shows examples of images after dynamic augmentation is applied. The probability of the actual class increases significantly, enough to alter the network outcome.

VII. CONCLUSION
This study proposes a technique that combines multiple probabilistic training augmentations with dynamic augmentation on inference. The search framework uses Bayesian optimization to reduce the GPU hours compared with those required for a reinforcement learning-based augmentation search [10]. We present the searched augmentation in Appendix A for future research. In addition to being the first work to apply augmentation search to a medical dataset, our paper also proposes a novel technique to perform dynamic preprocessing on inference. We use two models to accomplish our goal: the first model predicts the skin condition, and the second model predicts the best augmentation for the inference stage. We use EfficientNet b4 for both models. The performance of PA and PA plus DPI proved to be better than that of the normal model and of the ensemble model that won the ISIC 2019 challenge. Although the proposed method has high accuracy, more steps are needed to train the model. The slow convergence rate of our method is the price of achieving stochastically tuned weights, which are rarely achieved using minimal augmentation.

APPENDIX THE SEARCHED POLICY
See Table 5.
TRYAN ADITYA PUTRA received the bachelor's degree in computer engineering and the master's degree in electrical engineering from Universitas Indonesia. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, National Taiwan University of Science and Technology. His research interests include machine learning, artificial intelligence, data mining, and mathematical modeling.
SYAHIDAH IZZA RUFAIDA received the bachelor's and master's degrees in computer science from Universitas Indonesia. She is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, National Taiwan University of Science and Technology. Her research interests include machine learning, data mining, and image processing.
August 2017, he has been a Professor and the Chairperson. His research interests include heterogeneous network integration, mobile service and platform design, distributed computing (P2P, cloud computing), and green and orange technology integration. He has published extensively in these areas, with 58 SCI indexed journal articles, 57 conference papers or book chapters, and led 12 MOST projects, 12 industry-academia projects, and two cross-university projects in the past ten years.