Diabetes Prediction Using Ensembling of Different Machine Learning Classifiers

Diabetes is a chronic illness comprising a group of metabolic diseases caused by a high level of sugar in the blood over a long period. The risk and severity of diabetes can be reduced significantly if a precise early prediction is possible. Robust and accurate prediction of diabetes is highly challenging because of the limited amount of labeled data and the presence of outliers and missing values in diabetes datasets. In this paper, we propose a robust framework for diabetes prediction that employs outlier rejection, filling of missing values, data standardization, feature selection, K-fold cross-validation, different Machine Learning (ML) classifiers (k-Nearest Neighbour, Decision Tree, Random Forest, AdaBoost, Naive Bayes, and XGBoost), and a Multilayer Perceptron (MLP). We also propose a weighted ensembling of different ML models to improve the prediction of diabetes, where the weights are estimated from the corresponding Area Under the ROC Curve (AUC) of each ML model. AUC is chosen as the performance metric and is maximized during hyperparameter tuning using the grid search technique. All experiments were conducted under the same experimental conditions using the Pima Indian Diabetes Dataset. Across all the experiments, our proposed ensembling classifier is the best performing classifier, with sensitivity, specificity, false omission rate, diagnostic odds ratio, and AUC of 0.789, 0.934, 0.092, 66.234, and 0.950, respectively, which outperforms the state-of-the-art results by 2.00 % in AUC. Our proposed framework outperforms the other methods discussed in the article on the same dataset, which can lead to better performance in diabetes prediction. Our source code for diabetes prediction is made publicly available.


I. INTRODUCTION
Diabetes is a familiar word in the present world and a crucial challenge in both developed and developing countries [1]. The insulin hormone produced by the pancreas allows glucose from food to pass from the bloodstream into the body's cells. A lack of that hormone due to malfunctioning of the pancreas causes diabetes, which can result in coma, renal and retinal failure, pathological destruction of pancreatic beta cells, cardiovascular dysfunction, cerebral vascular dysfunction, peripheral vascular diseases, sexual dysfunction, joint failure, weight loss, ulcer, and pathogenic effects on immunity [2]. Research on diabetes patients demonstrates that the prevalence of diabetes among adults (over 18 years old) rose from 4.7 % in 1980 to 8.5 % in 2014 and is growing rapidly in second and third world countries [3]. Statistical results from 2017 show that 451 million people were living with diabetes worldwide, a number projected to increase to 693 million by 2045 [4]. Another statistical study [5] shows the severity of diabetes, reporting that half a billion people have diabetes worldwide and that the number will increase by 25 % and 51 % by 2030 and 2045, respectively.
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Wei.
GitHub: https://github.com/kamruleee51/Diabetes-Prediction-Using-MLClassifiers
Data: https://www.kaggle.com/uciml/pima-indians-diabetes-database
Although there is no long-term cure for diabetes, it can be controlled and prevented if an accurate early prediction is possible. The prediction of diabetes is a challenging task, as the classes are not linearly separable in any of the attributes, as depicted in Fig. 1.
In recent years, plenty of methods have been proposed and published for diabetes prediction. An ML-based framework was proposed in [7], where the authors implemented Linear Discriminant Analysis (LDA) [8], Quadratic Discriminant Analysis (QDA) [9], Naive Bayes (NB) [10], Gaussian Process Classification (GPC) [11], Support Vector Machine (SVM) [12], Artificial Neural Network (ANN) [13], AdaBoost (AB) [14], Logistic Regression (LR) [15], Decision Tree (DT) [16], and Random Forest (RF) [17] with different dimensionality reduction and cross-validation techniques. They also performed extensive experiments on outlier rejection and filling of missing values to boost the performance of the ML models, obtaining a highest AUC of 0.930. In [18], the authors employed three ML classifiers (DT, SVM, and NB) to predict the likelihood of diabetes with maximum accuracy. They demonstrated that NB is the best performing model, with an AUC of 0.819. The AB and bagging ensemble techniques using the J48 (C4.5) DT as a base learner, together with standalone J48, were studied and implemented in [19] for the classification of diabetes mellitus. Their experimental results indicate that the AB ensemble method is better than bagging and standalone J48 DT. Genetic programming for the prediction of diabetes was proposed in [20], where the framework outperformed the other techniques implemented by the authors. The authors of [21] employed four ML methods (DT, ANN, LR, and NB) to classify the risk of diabetes mellitus, where they improved robustness using bagging and boosting techniques. The experimental results show that the RF algorithm gives the best results among all the employed algorithms. A Gaussian Process (GP)-based classification technique was proposed in [22] using three different kernels (linear, polynomial, and radial basis function) and compared against traditional LDA, QDA, and NB.
The authors also performed extensive experiments to search for the best cross-validation protocol. Their experiments demonstrate that the GP-based classifier with the K10 cross-validation protocol is the best performing classifier for diabetes prediction. Although numerous frameworks have been published in recent years, the precision and robustness of diabetes prediction still require improvement.
In this paper, we propose a new pipeline for diabetes prediction from the PIMA Indians Diabetes dataset. Preprocessing is the heart of the proposed pipeline for achieving the state-of-the-art result; it consists of outlier rejection, filling of missing values, data standardization, feature selection, and K-fold cross-validation. We fill each missing position of an attribute with the mean value rather than the median value, as it has a more central tendency toward the mean of that attribute's distribution. The folds for cross-validation are built carefully to preserve the class proportions of the original dataset. Different ML classifiers (k-Nearest Neighbour (k-NN), RF, DT, NB, AB, and XGBoost (XB)) and an MLP were implemented in our proposed pipeline. We apply the grid search technique to select the hyperparameters of the ML models and, for the MLP, the number of hidden layers, number of neurons in each hidden layer, activation function, neuron initializer, batch size, learning rate, number of epochs, percentage of dropped neurons, loss function, and optimizer. Extensive experiments are performed on different combinations of preprocessing and ML classifiers to maximize the AUC of diabetes prediction under the same experimental conditions and dataset. The best ML classifier is then set as a baseline model to evaluate our proposed classifier quantitatively. Moreover, we propose an ensembling classifier that combines the ML models to boost the diabetes prediction. To ensemble the ML models, weighted soft voting is employed, where the weight of each individual model is estimated from its AUC. The AUC of an ML model is chosen as its voting weight rather than accuracy, since AUC is unbiased to the class distribution.
Extensive experiments on different combinations of the ML models are carried out to search for the best ensemble classifier, using the best performing preprocessing from the previous experiments.
The remainder of the paper is organized as follows: Section II presents the dataset, proposed methodology, and evaluation metrics. In Section III, the different experimental results are reported with their interpretation. Finally, the paper is concluded with future works in Section IV.

II. MATERIALS AND METHODS
This section focuses on the materials and methods used in this study, where subsections II-A, II-B, and II-C respectively explain the dataset, the proposed framework, and the hardware and metrics used to evaluate the framework.

A. DATASET
The ML models were trained and tested on the publicly available PIMA Indians Diabetes (PID) dataset of 768 female patients from the Pima Indian population near Phoenix, Arizona [6]. This dataset consists of 268 diabetic patients (positive) and 500 non-diabetic patients (negative) with eight different attributes. The descriptions of the attributes and a brief statistical summary are shown in Table 1. The Pedigree (Diabetes Pedigree Function) was calculated [6] as in (1).
where $i$ and $j$ respectively denote the relatives who had developed and had not developed diabetes. $K$ is the percentage of genes shared by the relatives ($K = 0.500$ for a parent or full sibling, $K = 0.250$ for a half-sibling, grandparent, aunt, or uncle, and $K = 0.125$ for a half-aunt, half-uncle, or first cousin). $ADM_i$ and $ACL_j$ are the ages of the relatives, in years, at the time of diagnosis and at the last non-diabetic test, respectively.
FIGURE 1. The population distribution of all attributes in the PIMA Indian Diabetes Dataset [6], where the blue and orange distributions respectively denote the non-diabetes and diabetes classes.

B. PROPOSED FRAMEWORK
The proposed framework is illustrated in Fig. 2, where the preprocessing of raw data is an integral step of the pipeline, as the quality of the data directly affects what the classifiers can learn.

1) PREPROCESSING
In the proposed framework, the preprocessing step includes outlier rejection (P), filling of missing values (Q), standardization (R), and feature selection of the attributes, which are briefly described as follows. An outlier [23] is an observation that deviates markedly from other observations. It needs to be rejected from the data distribution, as the classifiers are very sensitive to the range and distribution of the attributes. The mathematical formulation of the outlier rejection in this paper can be written as in (2).
where $x$ is an instance of the feature vector that lies in $n$-dimensional space, $x \in \mathbb{R}^n$. $Q_1$, $Q_3$, and $IQR$ are the first quartile, third quartile, and interquartile range of the attributes, respectively, where $Q_1, Q_3, IQR \in \mathbb{R}^n$. After outlier rejection, the attributes were processed to fill the missing or null values [24], as these could lead any classifier to a wrong prediction. In the proposed framework, the missing or null values were imputed with the mean values of the attributes rather than dropped, which can be formulated as in (3). Imputation with the mean is beneficial, as it imputes the continuous data without introducing outliers.
$$Q(x) = \begin{cases} \operatorname{mean}(x), & \text{if } x \text{ is null or missing} \\ x, & \text{otherwise} \end{cases} \tag{3}$$
where $x$ is an instance of the feature vector that lies in $n$-dimensional space, $x \in \mathbb{R}^n$. Standardization, or Z-score normalization, is the technique of rescaling the attributes to zero mean and unit variance. The standardization (R) is shown in (4); being a linear rescaling, it puts all attributes on a common scale but does not by itself change the skewness of a distribution.
$$R(x) = \frac{x - \bar{x}}{\sigma} \tag{4}$$
where $x$ is an $n$-dimensional instance of the feature vector, $x \in \mathbb{R}^n$, and $\bar{x} \in \mathbb{R}^n$ and $\sigma \in \mathbb{R}^n$ are the mean and standard deviation of the attributes. However, for many ML models, such as tree-based models, feature standardization does not guarantee a significant improvement. The accuracy of classifiers tends to increase as the number of attributes increases. However, the performance of the classifiers tends to degrade when the number of attributes grows without a corresponding increase in the number of samples. Such a scenario is referred to in machine learning as the curse of dimensionality. Due to the curse of dimensionality, the feature space becomes sparser and sparser, which forces the classifiers to overfit and lose their generalizing capability. In this paper, three commonly used methods for feature selection, namely Principal Component Analysis (PCA) [25], Independent Component Analysis (ICA) [26], and the Correlation-based [27] technique, were used to compare their performance on the PID dataset. The detailed algorithms of PCA, ICA, and the Correlation-based technique are given in Appendix A, Appendix B, and Appendix C, respectively.
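The outlier rejection (P), mean imputation (Q), and standardization (R) steps described in this subsection can be sketched for a single attribute as follows. This is a minimal illustration under stated assumptions, not the released implementation: the 1.5 × IQR fence, the use of `None` to mark a missing entry, the function names, and the sample readings are all assumptions made for the example.

```python
import statistics

def reject_outliers(values, k=1.5):
    """P: keep observed values inside [Q1 - k*IQR, Q3 + k*IQR] (k = 1.5 is assumed)."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Missing entries are passed through here so that step Q can fill them.
    return [v for v in values if v is None or low <= v <= high]

def fill_missing(values):
    """Q: replace each missing entry with the mean of the observed entries."""
    mean = statistics.fmean(v for v in values if v is not None)
    return [mean if v is None else v for v in values]

def standardize(values):
    """R: z-score, (x - mean) / std, using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical readings with one missing entry and one extreme value.
raw = [85, 90, 95, 100, None, 105, 110, 115, 400]
cleaned = fill_missing(reject_outliers(raw))
scaled = standardize(cleaned)
```

In this toy run the extreme reading is dropped, the missing entry is filled with the mean of the remaining readings, and the standardized attribute has zero mean and unit variance.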

2) CROSS-FOLD VALIDATION
The K-fold cross-validation (KCV) technique is one of the most widely used approaches by practitioners for model selection and error estimation of classifiers [28]. A pictorial presentation of the data splitting (5-fold cross-validation) used in this paper is shown in Fig. 3. The PID dataset is partitioned into K folds. In the inner loop, K − 1 folds are used to train the model and fine-tune the hyperparameters with the grid search algorithm [29]. In the outer loop (K times), the best hyperparameters and the held-out test fold are used to evaluate the model. Since the PID dataset contains imbalanced positive and negative samples, stratified KCV [30] is used to preserve, in each fold, the same percentage of samples of each class as in the original dataset. The final performance metric was estimated using the equation in (5).
where $M$ is the final performance metric for the classifiers and $P_n \in \mathbb{R}$, $n = 1, 2, \ldots, K$, is the performance metric for each fold.
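The stratified splitting and the per-fold aggregation can be sketched as below. This is only an illustration: the round-robin assignment, the function name `stratified_folds`, the per-fold AUC values, and the choice of reporting the mean together with the sample standard deviation are assumptions, and the inner grid-search loop is omitted.

```python
import random
import statistics

def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds that keep the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the indices of each class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

# Class counts of the PID dataset: 500 negative and 268 positive samples.
labels = [0] * 500 + [1] * 268
folds = stratified_folds(labels, k=5)

for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # Inner loop (not shown): grid search on train_idx, then score on test_idx.

# Hypothetical per-fold AUC values, aggregated over the K outer folds.
fold_auc = [0.94, 0.96, 0.93, 0.95, 0.97]
M = statistics.fmean(fold_auc)
M_std = statistics.stdev(fold_auc)
```

Each fold here holds 100 negative samples and 53 or 54 positive samples, so the roughly 65:35 class ratio of the full dataset is preserved in every split.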

3) ML MODEL AND ENSEMBLING
Different ML models such as k-NN [31], DT, AB, RF, NB, and XB [32] have been trained (see Appendix D, Appendix E, Appendix F, Appendix G, Appendix H, and Appendix I, respectively) and tested in the proposed framework. The hyperparameters that are tuned in the inner loop are shown in Table 2. Ensembling of ML models is a well-known technique to boost performance using a group of classifiers [33], [34]. In ensembling, the aggregation of the outputs of different models can improve the precision of the prediction. The output from each model, $Y_j \in \mathbb{R}^C$ ($j = 1, 2, 3, \ldots, m = 6$), assigns $C = 2$ confidence values $P_i$ (either having diabetes, $C_1$, or not, $C_2$) to the unseen test data, where $P_i \in [0, 1]$ and $\sum_{i=1}^{C} P_i = 1$. The weighted aggregation of the different ML models in this paper was performed using the equation in (6).
where the weight $W_j$ is the AUC of the $j$-th classifier. Since we are proposing a weighted soft voting ensemble, we need a weight metric that is unbiased under class imbalance, as found in the PID dataset. That is why we choose AUC as the weight for the proposed ensembling classifier. The output of the ensembled model, $Y \in \mathbb{R}^C$, has the confidence values $P_i^{en} \in [0, 1]$. The final class label of the unseen data, $X \in \mathbb{R}^n$, from the ensembled model will be $C_i$ if $P_i^{en} = \max(Y(X))$.
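A minimal sketch of AUC-weighted soft voting is given below. The function name, the example probabilities, and the example AUC weights are hypothetical, and dividing by the sum of the weights is an assumption that keeps the ensemble output a probability vector; it does not change which class receives the highest confidence.

```python
def weighted_soft_vote(probabilities, weights):
    """Combine per-model class probabilities with per-model weights.

    probabilities: one [P(non-diabetic), P(diabetic)] pair per model.
    weights: one weight per model (here, that model's AUC).
    """
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]

# Hypothetical outputs of two base models for one unseen sample.
model_probs = [[0.40, 0.60],   # e.g. AB
               [0.55, 0.45]]   # e.g. XB
model_auc = [0.90, 0.94]       # hypothetical AUC weights

ensemble = weighted_soft_vote(model_probs, model_auc)
# The predicted class is the index with the highest ensemble confidence.
predicted_class = max(range(len(ensemble)), key=ensemble.__getitem__)
```

Because the two models disagree in this example, the final label depends on both their confidence values and their weights, which is the behaviour that distinguishes weighted soft voting from a simple majority vote.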

4) MULTILAYER PERCEPTRON (MLP)
A neural network consists of processing units, called neurons, where each neuron is connected to other neurons by unidirectional connections of different weights [35]. The feed-forward neural network, or MLP, used in this paper is shown in Fig. 4; it consists of an input layer, an output layer, and several hidden layers. Any layer of the MLP maps a $D$-dimensional input vector to an $N$-dimensional output vector, $f(x): \mathbb{R}^D \rightarrow \mathbb{R}^N$. The output of each processing unit can be expressed as in (7).
where $x_j$, $w_j$, and $b$ are the inputs, weights, and bias of the neuron, respectively, and the output is obtained by applying a nonlinear activation function to the weighted sum. The parameters of the neuron are updated as in (8) during training using back-propagation [36] to minimize the error, $\gamma = y_{true} - y_{output}$.
where η is the learning rate, which controls the amount by which the weights are updated during training. However, it is difficult to guess the number of hidden layers ($H_M$) and the number of neurons ($N_M$) in each hidden layer, as they depend highly on the dataset. More layers and neurons bring more parameters, which does not guarantee better performance: the more parameters there are, the more samples are required in the training dataset. In this paper, we therefore learn those hyperparameters from the PID dataset. The hyperparameters (the number of hidden layers, number of neurons in each hidden layer, activation function, neuron initializer, batch size, learning rate, number of epochs, percentage of dropped neurons, loss function, and optimizer) are tuned in the grid search to maximize the AUC.
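The forward pass of a single neuron and one learning-rate-scaled update can be illustrated as follows. This is a sketch under assumptions that are not taken from the paper: a squared-error loss on γ, plain gradient descent, a single ReLU neuron, and hypothetical inputs, initial parameters, and learning rate.

```python
def relu(z):
    return max(0.0, z)

def neuron_output(x, w, b):
    """Weighted sum of the inputs plus bias, passed through the activation."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return relu(z), z

def gradient_step(x, w, b, y_true, lr):
    """One update of (w, b) that reduces 0.5 * (y_true - y_output) ** 2."""
    y_out, z = neuron_output(x, w, b)
    error = y_true - y_out             # gamma in the text
    slope = 1.0 if z > 0 else 0.0      # derivative of ReLU at z
    new_w = [wj + lr * error * slope * xj for wj, xj in zip(w, x)]
    new_b = b + lr * error * slope
    return new_w, new_b

x = [0.5, -1.0, 2.0]           # hypothetical standardized inputs
w, b = [0.2, 0.1, 0.1], 0.1    # hypothetical initial parameters
before, _ = neuron_output(x, w, b)
for _ in range(50):
    w, b = gradient_step(x, w, b, y_true=1.0, lr=0.05)
after, _ = neuron_output(x, w, b)
```

With these values the output starts near 0.3 and approaches the target of 1.0 as the error shrinks by a constant factor at every step; a larger learning rate would shrink it faster but can overshoot.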

C. EVALUATION METRICS
The models were implemented in the Python programming language with different Python and Keras APIs, and the experiments were carried out on a machine running the Windows 10 operating system with the following hardware configuration: Intel® Core™ i7-7700HQ CPU @ 2.80 GHz, 16.0 GB of installed memory (RAM), and a GeForce GTX 1060 GPU with 6 GB of GDDR5 memory. All the experiments were evaluated using several metrics, each of which captures a different aspect of performance. The confusion matrix of True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN) is reported along with different metrics, e.g. Sensitivity (Sn), Specificity (Sp), Precision (Pr), False Omission Rate (FOR), and Diagnostic Odds Ratio (DOR) [37]. Sn and Sp are respectively used to quantify the type-II error (a patient with the positive condition who is erroneously classified as negative) and the type-I error (a patient with the negative condition who is detected as positive). Pr, FOR, and DOR are used to evaluate, respectively, the fraction of positive predictions that are truly positive, the proportion of individuals with a negative test result for whom the true condition is positive, and the effectiveness of a diagnostic test. Additionally, the Receiver Operating Characteristics (ROC) curve with the Area Under the ROC Curve (AUC) is reported to measure how well predictions are ranked, rather than their absolute values.
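The metrics named above follow directly from the four confusion-matrix counts, and AUC can be computed as a ranking statistic. The sketch below uses their standard definitions; the counts, labels, and scores are hypothetical and are not results from this study.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard diagnostic metrics from confusion-matrix counts."""
    sn = tp / (tp + fn)              # sensitivity (true-positive rate)
    sp = tn / (tn + fp)              # specificity (true-negative rate)
    return {
        "Sn": sn,
        "Sp": sp,
        "Pr": tp / (tp + fp),        # precision
        "FOR": fn / (fn + tn),       # false omission rate
        "DOR": (tp * tn) / (fp * fn),  # diagnostic odds ratio
        "BA": (sn + sp) / 2,         # balanced accuracy
    }

def auc_score(y_true, scores):
    """AUC as the probability that a positive sample is ranked above a
    negative one, with ties counted as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical confusion-matrix counts and prediction scores.
m = confusion_metrics(tp=80, fp=10, fn=20, tn=90)
auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Note that AUC depends only on the ordering of the scores, which is why it is insensitive to the decision threshold and to the class proportions.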

III. RESULTS AND DISCUSSION
This section presents the experiments and the corresponding results in several subsections. The results for preprocessing and the ML models are described in subsections III-A and III-B, respectively. Subsections III-C and III-D present the results for the MLP and the ensembling classifiers, respectively, and subsection III-E compares the results.

A. RESULTS FOR PREPROCESSING
The class-wise distribution of the attributes (see Fig. 1) demonstrates the complexity of distinguishing positive and negative diabetes cases in the PID dataset. Most of the attributes also have skewed (positive or negative) and leptokurtic distributions. The presence of outliers introduces skewness and kurtosis (see Fig. 5 (a)) in the attributes' distributions, where high kurtosis is an indicator of heavy tails or outliers in the PID dataset. The presence of skewness and kurtosis will tend to underestimate and overestimate the expected value, respectively. The result of the outlier rejection (see Fig. 5) demonstrates that the skewness of the distributions moves toward zero, which indicates that the mean and median of each attribute approximately coincide (see Fig. 5 (b)). The leptokurtic (kurtosis > 3) distributions of the PID dataset also move toward a mesokurtic distribution (kurtosis = 3). The correlation matrix (see Fig. 6) presents the result of outlier rejection and filling of missing values together. The qualitative and quantitative analysis of Fig. 6 (a) and Fig. 6 (b) demonstrates that the correlation of the attributes with the target outcome improves after applying outlier rejection and filling the missing values, where the correlation coefficients, especially for F3, F4, and F5, improve significantly. The improved correlation is beneficial for the correlation-based feature selection (see Appendix C).

B. RESULTS FOR ML MODEL
Table 3 shows the quantitative results for the selection of the best performing preprocessing and ML model, where the AUC with its standard deviation is reported for comparison. Table 4 summarizes the best AUC that each model achieves in the proposed pipeline, together with the corresponding best preprocessing, attribute selection algorithm, and number of selected attributes.
The best hyperparameters found by the grid search are also shown in Table 4. Table 3 provides evidence that the different models achieve better results when suitable preprocessing is employed for them.
All the classifiers achieve their respective best results with outlier rejection and filling of missing values when the correlation-based feature selection is employed (see Table 3 and Table 4). The first two experiments, as shown in Table 3, show that the boosting classifiers (AB and XB) beat all the other classifiers in AUC. AB performs better on the raw data ($x \in \mathbb{R}^8$), and XB performs better when only the outliers are rejected ($x \in \mathbb{R}^8$) from the PID dataset. The performance of XB improves by a 0.6 % margin when only the outliers are rejected (P). These two experiments show that XB is affected by the outliers in the PID dataset more than AB, although XB has extreme gradient boosting capabilities. There is a possibility of overfitting in XB, as it assigns equal weight to all the weak base-learners, whereas AB assigns more weight to the weak base-learners with better performance. The building of a new tree depends on the residuals of the previous tree, where the outliers will have much larger residuals than non-outliers. XB does not penalize those residuals as AB does. Moreover, after applying PCA and ICA to the outlier-rejected data, the NB classifier yields the best AUC, outperforming all other classifiers (k-NN, DT, and RF), even the boosting classifiers (AB and XB). A plausible reason is that PCA and ICA return feature vectors whose features are uncorrelated and mutually independent, which matches the independence assumption of NB, so NB performs better than the others. However, with correlation-based feature selection and the preprocessing P, XB outperforms the other classifiers, including NB. The features from the correlation-based selection are correlated with the outcome and are no longer uncorrelated with each other, as they are in PCA- and ICA-based feature selection, so NB fails to be the winner in this experiment.
When the missing values are filled (Q) with the mean rather than rejected, along with outlier rejection (P), the classification performance is boosted significantly. XB wins for all the cases of feature selection when both P and Q are employed. For P + Q with PCA or ICA, XB outperforms NB, whereas NB was the best classifier for P with PCA or ICA. The preprocess (P + Q) retains more samples than P alone, since in P alone a sample was rejected when it was an outlier or had a missing value. For the preprocess (P + Q) with correlation-based feature selection, all the classifiers perform very well, as there are no missing values or outliers, and RF and XB outperform the state-of-the-art by a 0.9 % and 1.6 % margin in AUC, respectively.
Further addition of standardization as a preprocessing step did not increase the performance of the classifiers, as standardization is not guaranteed to improve performance. Tree-based models are not distance-based models, and hence standardization could not improve the performance of most of the ML models in this paper (see Table 3). Moreover, standardizing a small dataset with few instances, as used in this paper, can increase the possibility of losing information regarding the mean and standard deviation, since the variability is low.
Remarkably, employing correlation-based feature selection rather than PCA- or ICA-based techniques improves the AUC of all ML models when we apply the preprocessing P and Q. PCA transforms the higher-dimensional space into a lower-dimensional space based on the orthogonal projections that contain the highest variance. The higher variance between the features will have lower covariance, whereas uncorrelated data is only partially independent according to ICA theory. The performance of the PCA algorithm depends on the number of principal components (PCs) used, and the separation of the classes may be more pronounced in a direction of smaller variance. Since ICA finds new mutually independent components, there is a possibility of losing correlation with the target outcome. Both PCA and ICA find the new components in an unsupervised manner, so there is no guarantee of better performance on the PID dataset using PCA or ICA. On the other hand, correlation-based feature selection uses the correlation between each feature and the target outcome to select the features.
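The supervised nature of correlation-based selection can be illustrated with a short sketch. The selection rule used here (keep the k attributes with the largest absolute Pearson correlation with the outcome), the function names, and the toy data are assumptions for illustration; the exact procedure of the paper is the one given in its Appendix C.

```python
import statistics

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def select_by_correlation(columns, target, k):
    """Names of the k attributes most correlated (in absolute value) with the target."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy attributes and binary outcome (1 = diabetic).
columns = {
    "glucose": [85, 168, 90, 150, 99, 180],
    "bmi": [22.0, 33.5, 25.1, 24.0, 27.3, 35.2],
    "noise": [3, 1, 1, 3, 2, 2],
}
outcome = [0, 1, 0, 1, 0, 1]
selected = select_by_correlation(columns, outcome, k=2)
```

Unlike PCA or ICA, the ranking uses the outcome labels, so an attribute that carries no information about the outcome (the `noise` column here) is discarded regardless of its variance.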
From Table 4, it is also noticed that most of the classifiers performed better with 6 attributes (F1, F2, F4, F5, F6, and F8) than with 4 or 8 attributes. This experiment also shows that features such as diastolic blood pressure and diabetes pedigree function can be discarded from the PID dataset for diabetes prediction, as they carry less information about diabetes than the other features in the PID dataset. Comparing all the ML models in Table 3 and Table 4, XB provides the best performance, with an AUC (± std.) of 0.946 ± 0.020, as it has extreme gradient boosting capability to minimize the loss when adding new models in parallel. The best diabetes prediction performance of the XB model in the proposed pipeline is achieved with a tree depth of 5 when further partitioning stops once the sum of instance weights in a leaf node is less than 5. The minimum loss reduction required to make a further partition on a leaf node of the tree and the subsample ratio of the training instances used to construct each tree were 1.5 and 0.6, respectively, to obtain the highest possible results using the XB model in the proposed pipeline.
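For readers who want to reproduce the reported XB setting, the four values above can be written as a parameter dictionary. The mapping from the prose descriptions to the parameter names of the XGBoost Python package is an assumption made here; all other parameters are left at their defaults.

```python
# Reported XB settings, expressed with XGBoost parameter names.
# The prose-to-name mapping is assumed, not taken from the paper.
xgb_params = {
    "max_depth": 5,          # tree depth
    "min_child_weight": 5,   # minimum sum of instance weight in a leaf
    "gamma": 1.5,            # minimum loss reduction to split a leaf
    "subsample": 0.6,        # fraction of training rows sampled per tree
}

# With the xgboost package installed, these could be passed as
#   xgboost.XGBClassifier(**xgb_params)
```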

C. RESULTS FOR MLP
Extensive experiments were conducted on the PID dataset for diabetes prediction to obtain the best MLP architecture. Eight different MLP models, with 1 to 8 hidden layers, were implemented and tested, where the number of neurons in each layer was a hyperparameter to be optimized. The experimental results are shown in Fig. 7, which shows that the MLP architecture with $M = 3$ hidden layers ($H_1$, $H_2$, and $H_3$) with $N_1 = 16$, $N_2 = 64$, and $N_3 = 64$ neurons was chosen as the best architecture. The addition of more hidden layers with few samples, as in the PID dataset, will tend to limit the generalizing capability of the MLP model, as depicted in Fig. 7. Excessive depth in the MLP model may also lead the model to overfit, and it often causes vanishing gradient problems, due to the limited amount of data in the PID dataset.

TABLE 3. The summary of all extensive experiments for the selection of the best performing preprocessing, feature selection methods with selected attribute numbers, and classifier. The last column represents the best performing classifier for any preprocessing, whereas the underlined blue color denotes the best preprocessing for each classifier.

TABLE 4. The best performing ML model and preprocessing, along with the tuned hyperparameters and the highest AUC achieved.
The results of the best MLP architecture for different preprocessing are shown in Table 6, where all the neurons were initialized from a normal distribution and activated by the ReLU function [38]. We use a dropout layer [39] that randomly drops 60 % of the neurons to tackle overfitting. We trained our MLP model for 200 epochs with a learning rate of 0.001 and a batch size of 8. The results in Table 6 demonstrate that outlier rejection and filling of missing values improve the performance of the MLP model by a 7.1 % margin in AUC over the raw data. The preprocess P alone cannot improve the performance because of the fewer samples, as both outliers and missing values are rejected in P. The highest AUC from the MLP model is 0.902, with a standard deviation of 0.020, when we perform both outlier rejection and filling of missing values (P + Q). It is also demonstrated that correlation-based feature selection works better on the PID dataset for diabetes prediction, similar to the previous experiments on ML models (see subsection III-B). Although ICA performed the same as correlation-based feature selection in AUC, the standard deviation of the latter is much lower than that of the former, so the latter has less inter-fold variation. Further addition of standardization to outlier rejection and filling of missing values cannot improve the results, as there is a possibility of losing information regarding the mean and standard deviation due to the low variability in the PID dataset.

D. RESULTS FOR ENSEMBLING MODEL
Since the ML models are ensembled to boost the performance of diabetes prediction, the best preprocessing from subsection III-B and Tables 3 and 4 is used in this experiment. The combination of the above ML models ($N = 6$) provides $\sum_{i=1}^{N} {}^{N}C_{i} = 63$ ensemble models. Among them, only the best performing ensemble models with 2, 3, 4, 5, and 6 baseline models are reported in Table 7 with their corresponding results. The combination of AB and XB provides the best results for diabetes prediction in three of the five metrics, as shown in Table 7, beating the other combinations by margins of 1.20 %, 14.81 %, and 0.90 % in Sp, DOR, and AUC, respectively. The prevalence-independent measure (DOR) of AB+XB (see Table 7) has a greater value than that of the other combinations, which is considered to indicate a very good test [40] for diabetes prediction. The confusion matrix and ROC curve of the best ensemble model (AB+XB) are shown in Fig. 8 (a) and Fig. 8 (b), respectively. The fraction of correctly classified patients among all the positive predictions is 84.2 % using the combination of AB and XB. From the ROC curve (see Fig. 8 (b)), it is seen that at a false-positive rate of 0.066 the true-positive rate is 0.788 at the model's operating point (see the red star in Fig. 8 (b)). From the ROC curve, it is also observed that the inter-fold variation of the AUC is small, which supports the robustness of the best ensembling classifier (AB+XB). The performance of AB+XB for diabetes prediction on the PID dataset is superior, as both AB and XB are boosting-type classifiers, where AB is sequential boosting and XB is parallel boosting. The combination of other ML models with the boosting-type models (AB and XB) cannot predict diabetes as well as the boosting types alone, as shown in Table 7 (2nd to 5th rows).
Although the combination of all 6 models (see Table 7 (5th row)) beats the best combination (AB+XB) in two metrics out of five, it is defeated in the unbiased measurement (AUC) by a margin of 1.0 %. As a consequence, we can claim that, for diabetes prediction from the PID dataset, the weighted soft voting of serial and parallel boosting classifiers performs better than a serial or parallel boosting classifier alone.
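The count of 63 candidate ensembles quoted in this subsection can be checked by enumerating every non-empty subset of the six base models. The sketch below only enumerates the candidates; the model labels are the abbreviations used in the text, and no training or evaluation is performed.

```python
from itertools import combinations

models = ["k-NN", "DT", "RF", "NB", "AB", "XB"]

# Every non-empty subset of the six models is one candidate ensemble.
candidates = [
    subset
    for size in range(1, len(models) + 1)
    for subset in combinations(models, size)
]

# Number of candidates for each ensemble size.
per_size = {
    size: sum(1 for c in candidates if len(c) == size)
    for size in range(1, len(models) + 1)
}
```

The total equals 2^6 − 1 = 63, of which the six single-model subsets correspond to the standalone classifiers and the remaining 57 are true combinations.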

E. RESULTS COMPARISON
In this subsection, the three experiments (see subsections III-B, III-C, and III-D) are compared and summarized. Finally, the best experiment is compared with the state-of-the-art to validate our contributions in this paper. Table 8 demonstrates that the proposed weighted ensemble of AB and XB produces the best prediction in three of the five metrics, whereas it is the second highest with respect to Sp and the prevalence-independent measurement (DOR). The proposed ensemble model (AB+XB) yields the best performance in Sn, FOR, and AUC, improving on XB by margins of 2.1 %, 0.8 %, and 0.6 %, respectively. It also beats the MLP model in Sn, Sp, FOR, and AUC by margins of 3.2 %, 3.4 %, 1.5 %, and 4.8 %, respectively. The ensembling model (AB+XB) improves the true-positive rate compared to the XB model alone, as there is less possibility of misclassification in the ensembling model. The lower FOR value of the ensembling model (see Table 8) indicates that its negative predictive value is high, with less type-II error in diabetes prediction. Furthermore, it is also observed that the proposed ensembling model (AB+XB) yields the best balanced accuracy (average of Sn and Sp), improving on the XB and MLP results by 0.6 % and 3.3 %, respectively, when the proposed preprocessing (P+Q and correlation-based feature selection) is employed. From the above discussions in subsections III-B, III-C, III-D, and III-E, the following can be concluded: The proposed ensembling classifier (AB+XB) appears better suited for diabetes prediction from the PID dataset. For ensembling, the base classifiers should have a minimum correlation between them to achieve higher precision in diabetes prediction (see Table 7). The ensembling of two boosting-type classifiers (adaptive (AB) and gradient (XB)) is the best combination for diabetes prediction.
The best combination (AB+XB), along with our proposed preprocessing (P+Q and correlation-based feature selection), can achieve tremendous success for diabetes prediction in the PID dataset. Comparative performance of our proposed method against the state-of-the-art works on the same dataset as shown in Table 1.
From Table 9, it is observed that each of the existing models performs better in either positive or negative diabetes prediction, whereas the proposed model beats them with improved balanced accuracy, AUC, or both. The frameworks proposed in [43], [44] used the k-NN technique to impute the missing values, where the algorithm uses the k-th nearest neighbor to fill in the missing value. With such a technique, the imputed value could be far from the central tendency of the population distribution. The performance of the pipelines (see Table 9) employed in [18], [20], [41], [42], [46] is lower than that of the proposed framework and of those in [7], [44], [45]. This lower performance indicates the role of outlier rejection and missing-value filling in the PID dataset. Manual feature selection [42], without considering the correlation and covariance between the features and the target label, is a possible reason for the lower true-positive rates. The above discussion and Table 9 confirm that our proposed ensembling classifier (AB+XB) provides a better diabetes prediction than the others, with an AUC of 0.950, when the AUC-weighted soft voting and the proposed preprocessing pipeline are employed.
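The threshold-based metrics compared in this subsection (Sn, Sp, FOR, DOR, and balanced accuracy) all follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name and the confusion-matrix counts are hypothetical and are not taken from Table 8 or Table 9.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from binary confusion-matrix counts."""
    sn = tp / (tp + fn)            # sensitivity (true-positive rate)
    sp = tn / (tn + fp)            # specificity (true-negative rate)
    f_or = fn / (fn + tn)          # false omission rate = 1 - negative predictive value
    dor = (tp * tn) / (fp * fn)    # diagnostic odds ratio; undefined if fp * fn == 0
    ba = (sn + sp) / 2             # balanced accuracy (average of Sn and Sp)
    return {"Sn": sn, "Sp": sp, "FOR": f_or, "DOR": dor, "BA": ba}


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
m = diagnostic_metrics(tp=40, fp=10, tn=90, fn=10)
```

A low FOR means that few of the patients predicted as non-diabetic are actually diabetic, which is why it is read together with the Type II error in the discussion above.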

IV. CONCLUSION AND FUTURE WORK
In this work, diabetes prediction from the PID dataset has been accomplished using the proposed ensemble model, where preprocessing plays a crucial role in robust and precise prediction. The quality of the dataset was improved by the proposed preprocessing scheme, in which outlier rejection and filling missing values were the core concerns. Such preprocessing can improve the kurtosis and skewness of the attribute distribution in the PID dataset. The correlation-based attribute selection can improve the correlation between attribute and target outcome, whereas PCA and ICA care

Algorithm 1 The Steps of Implementing the PCA-Based Feature Selection
Input: The original n-dimensional data, $X \in \mathbb{R}^n$, with $N$ samples, and the variance threshold, $T_{variance}$
Output: The reduced k-dimensional data, $Y \in \mathbb{R}^k$
1 Load $X \in \mathbb{R}^n$ and compute its mean, $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$, where $\bar{X} \in \mathbb{R}^n$
2 Compute the $n \times n$ covariance matrix
3 Compute the eigendecomposition of the covariance matrix, where $P$ is the matrix of eigenvectors and $D_{n \times n}$ is the diagonal matrix with the eigenvalues on the diagonal
4 Sort the eigenvectors in descending order to choose the first $k$ eigenvectors that have variance $\geq T_{variance}$ and form a new projection matrix, $W_{n \times k}$

Algorithm 2 The Steps of Implementing the ICA-Based Feature Selection
Input: The original n-dimensional data, $X \in \mathbb{R}^n$
Output: The reduced k-dimensional data, $Y \in \mathbb{R}^k$
1 Set a non-quadratic nonlinear function, $G$, for the approximation of negentropy
2 Initialize $W$ of $W \times H = X$, where $W$, $H$, and $X$ are the ratios of the sources during mixing, the matrix containing the different components, and the mixed output respectively
3 Perform PCA on $X$ by $X = \mathrm{PCA}(X)$ as in IV-A
4 while $W$ changes do

Algorithm 3 The Steps of Implementing the Correlation-Based Feature Selection
Input: The original n-dimensional data, $X \in \mathbb{R}^n$, and the expected outcome
3 Sort the correlations, $r_{iT}$, in descending order to choose the first $k$ features for $Y \in \mathbb{R}^k$

Algorithm 4 The Steps of Implementing k-Nearest Neighbour (k-NN)
Input: The n-dimensional data, $X \in \mathbb{R}^n$, and target outcome, $Y \in \mathbb{R}$
Output: The posterior probability, $P \in [0, 1]$, of unseen test data, $x$, where $\sum_{i=1}^{C} P_i = 1$ and $C = 2$ (diabetes present ($C_1$) or not ($C_2$))
1 Calculate the geometric distances, $D_h = \left(\sum_{i} |X_i - x_i|^q\right)^{1/q}$, for $k$ query points, where $X_i$ = current instance, $x_i$ = query instance, and $q$ = order [47]
2 Form a set, $S$, with the closest $k$ points
3 Estimate the posterior probability, $P$, for each class

Algorithm 5 The Steps of Implementing Decision Tree (DT)
Input: The n-dimensional data, $X \in \mathbb{R}^n$, and target outcome, $Y \in \mathbb{R}$
Output: The posterior probability, $P \in [0, 1]$, of unseen test data, $x$, where $\sum_{i=1}^{C} P_i = 1$ and $C = 2$ (diabetes present ($C_1$) or not ($C_2$))
1 Split the data, using $\theta = (j, t_m)$, into $Q_{left}(\theta)$ and $Q_{right}(\theta)$ subsets, where $\theta$ consists of a feature, $j$, and a threshold, $t_m$
2 Compute the impurity, $G(Q, \theta)$, at the k-th node using an impurity function ($H$)
3 Minimise the impurity by selecting the parameters, $\theta^* = \arg\min_{\theta} G(Q, \theta)$
4 Repeat the above processes for the subsets $Q_{left}(\theta^*)$ and $Q_{right}(\theta^*)$ until the stopping condition, $N_m < \mathrm{min\_samples}$ or $N_m = 1$, is reached
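The impurity-based node splitting in the decision-tree steps above can be sketched as an exhaustive search over features and thresholds. This is a minimal illustration under stated assumptions: Gini impurity is used for the impurity function H, the split score G is taken as the sample-weighted average of the two child impurities, and the function names and toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity H(Q) of a list of 0/1 class labels (assumed choice of H)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(X, y):
    """Return (G, j, t): the lowest weighted child impurity and its theta = (j, t)."""
    n, best = len(y), None
    for j in range(len(X[0])):                      # candidate feature j
        for t in sorted({row[j] for row in X}):     # candidate threshold t
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            if not left or not right:
                continue                            # skip degenerate splits
            g = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or g < best[0]:
                best = (g, j, t)
    return best


# Toy data with two features per sample (hypothetical values).
X = [[148, 33.6], [85, 26.6], [183, 23.3], [89, 28.1], [137, 43.1], [116, 25.6]]
y = [1, 0, 1, 0, 1, 0]
g, j, t = best_split(X, y)
```

A full tree would apply `best_split` recursively to the left and right subsets until the stopping condition in step 4 is met.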

Algorithm 6 The Steps of Implementing AdaBoost (AB)
Input: The n-dimensional data, $X \in \mathbb{R}^n$, with $N$ samples, and target outcome, $Y \in \mathbb{R}$
Output: The posterior probability, $P \in [0, 1]$, of unseen test data, $x$, where $\sum_{i=1}^{C} P_i = 1$ and $C = 2$ (diabetes present ($C_1$) or not ($C_2$))
1 Initialize the sample weights, $D_1(i) = \frac{1}{N}$, where $i = 1, 2, \ldots, N$
2 for $t \leq T$ (n_Classifiers) do
3 Train a weak learner using distribution $D_t$ [48]
4 Select a weak hypothesis, $h_t : X \rightarrow \{-1, +1\}$, with weight $\alpha_t = \frac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}$, where $\epsilon_t$ is the weighted error of $h_t$ under $D_t$
5 Update the distribution, $D_{t+1}(i) = \frac{D_t(i)\exp(-\alpha_t Y_i h_t(X_i))}{z_t}$, where $i = 1, \ldots, N$ and $z_t$ is the normalization factor
6 Output posterior probability: $P(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

XB, MLP, and the proposed ensemble classifier was verified using 5-fold cross-validation. The hyperparameters of the different classifiers, which drive their learning capability, were optimized using a grid search technique in our proposed framework. Using the AUC as the weight to build a generic ensembling classifier is preferable, as it gives higher priority to the models with higher AUC. Random tree-based classifiers are well suited for the data to be classified when

Algorithm 7 The Steps of Implementing Random Forest (RF)
Input: The n-dimensional data, $X \in \mathbb{R}^n$, and target outcome, $Y \in \mathbb{R}$
Output: The posterior probability, $P \in [0, 1]$, of unseen test data, $x$, where $\sum_{i=1}^{C} P_i = 1$ and $C = 2$ (diabetes present ($C_1$) or not ($C_2$))
Grow a random-forest tree, $T_b$, using $X_b$ and $Y_b$ by recursively repeating the following steps until the minimum node size, $n_{min}$, is reached:
1) Randomly select $m$ variables from the given $n$ variables
2) Pick the best variable or split-point among the $m$ variables
3) Split the node into two daughter nodes
Output the ensemble of trees, $\{T_b\}_1^N$
4 The posterior probability, $\hat{P}^N_{RF}(x) = \mathrm{Voting}\{\hat{P}_k(x)\}_1^N$, where $\hat{P}_k(x)$ is the class prediction of the k-th random-forest tree.

Algorithm 8 The Steps of Implementing Naive Bayes (NB)
Input: The n-dimensional data, $X \in \mathbb{R}^n$, and target outcome, $Y \in \mathbb{R}$
Output: The posterior probability, $P \in [0, 1]$, of unseen test data, $x$, where $\sum_{i=1}^{C} P_i = 1$ and $C = 2$ (diabetes present ($C_1$) or not ($C_2$))
1 Compute the prior probabilities for each class [49], $P(Y = C_1) = \frac{N_{C_1}}{N}$ and $P(Y = C_2) = \frac{N_{C_2}}{N}$, where $N$ is the number of samples
2 The output posterior probability of the class for the given predictor (attributes), $P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$, where $P(X \mid C_i)$ is the likelihood of the predictor for a given class and $P(X)$ is the prior probability of the predictor.
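The two steps of Algorithm 8 can be traced in a few lines of code. The sketch below assumes a Gaussian likelihood for each attribute, which the algorithm listing does not specify; the function names and the one-dimensional toy data are likewise hypothetical.

```python
import math


def fit_gaussian_nb(X, y):
    """Estimate the class priors N_c / N and per-feature mean and variance."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        n_features = len(rows[0])
        means = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
        variances = [
            sum((r[j] - means[j]) ** 2 for r in rows) / len(rows) + 1e-9
            for j in range(n_features)
        ]
        model[c] = (len(rows) / len(y), means, variances)
    return model


def posterior(model, x):
    """Return P(C_i | x) for every class via Bayes' rule."""
    log_joint = {}
    for c, (prior, means, variances) in model.items():
        log_p = math.log(prior)
        for xj, mu, var in zip(x, means, variances):
            # Gaussian log-likelihood of one attribute (assumed form of P(X | C_i)).
            log_p += -0.5 * math.log(2 * math.pi * var) - (xj - mu) ** 2 / (2 * var)
        log_joint[c] = log_p
    top = max(log_joint.values())                 # shift for numerical stability
    unnorm = {c: math.exp(v - top) for c, v in log_joint.items()}
    evidence = sum(unnorm.values())               # plays the role of P(X)
    return {c: p / evidence for c, p in unnorm.items()}


# Toy one-feature data (hypothetical): class 0 clusters near 1, class 1 near 3.
X = [[1.0], [1.2], [0.8], [3.0], [3.2], [2.8]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
post = posterior(model, [2.9])
```

Dividing by the summed joint probabilities is what makes the two class posteriors add up to one, matching the output constraint stated in the algorithm.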
inter-class redundancy is much higher (not linearly separable), as in the PID dataset. The comparative results demonstrate that our proposed framework has outperformed other frameworks on AUC, which shows great potential for diabetes prediction from the PID dataset. The ensembling of two boosting-type classifiers (AB and XB) is the best combination for diabetes prediction, as the base classifiers should have a minimum correlation between them. Higher precision in diabetes prediction from the PID dataset using the best combination (AB+XB) can be achieved when our proposed preprocessing (P+Q and correlation-based feature selection) is applied. In the future, the proposed trained model will be used to build a web app with a user-friendly interface. Additionally, the proposed framework will be applied to other medical contexts to verify its generality and versatility in predicting disease classes.

Algorithm 9 The Steps of Implementing XGBoost (XB)
Input: The n-dimensional data, $X \in \mathbb{R}^n$, and target outcome, $Y \in \mathbb{R}$
Output: The posterior probability, $P \in [0, 1]$, of unseen test data, $x$, where $\sum_{i=1}^{C} P_i = 1$ and $C = 2$ (diabetes present ($C_1$) or not ($C_2$))
1 Initialize the model with a constant value: $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(Y_i, \gamma)$, where $L(Y, F(x))$ is the differentiable loss function and $N$ is the number of samples
2 for $m = 1$ to $M$ (n_Iterations) do
3 Compute pseudo-residuals, $r_{im} = -\left[\frac{\partial L(Y_i, F(X_i))}{\partial F(X_i)}\right]_{F(x) = F_{m-1}(x)}$, for $i = 1, 2, \ldots, N$
4 Fit a base tree, $h_m(x)$, using the training set $(X_i, r_{im})$ for $i = 1, 2, \ldots, N$
5 Compute the multiplier, $\gamma_m$, by $\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{N} L(Y_i, F_{m-1}(X_i) + \gamma h_m(X_i))$
6 Update the model, $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$
7 Output $F_M(x)$, which is the desired posterior probability, $P \in [0, 1]$
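The boosting loop of Algorithm 9 (initialize with a constant, fit a base tree to the pseudo-residuals, add it to the model) can be sketched in plain Python. This is a simplified illustration and not the XGBoost library: it assumes the logistic loss, uses depth-one regression stumps as base trees, and replaces the per-iteration multiplier with a fixed learning rate. All function names and the toy data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, r):
    """Least-squares regression stump fitted to the pseudo-residuals r."""
    n, best = len(r), None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X})[:-1]:   # keep both sides non-empty
            left = [r[i] for i in range(n) if X[i][j] <= t]
            right = [r[i] for i in range(n) if X[i][j] > t]
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - lv) ** 2 for v in left) + sum((v - rv) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lv, rv)
    _, j, t, lv, rv = best
    return lambda x: lv if x[j] <= t else rv


def fit_gradient_boosting(X, y, n_iterations=20, learning_rate=0.5):
    """Return a function mapping a sample x to P(class 1 | x)."""
    p = sum(y) / len(y)
    f0 = math.log(p / (1.0 - p))          # constant that minimizes the logistic loss
    scores = [f0] * len(y)
    trees = []
    for _ in range(n_iterations):
        # Pseudo-residuals: negative gradient of the logistic loss w.r.t. the score.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        h = fit_stump(X, residuals)
        trees.append(h)
        # Assumption: fixed learning rate instead of a line-searched multiplier.
        scores = [s + learning_rate * h(x) for s, x in zip(scores, X)]
    return lambda x: sigmoid(f0 + learning_rate * sum(h(x) for h in trees))


# Toy one-feature data (hypothetical): the class changes between 3 and 4.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [0, 0, 0, 1, 1, 1]
predict = fit_gradient_boosting(X, y)
```

The final sigmoid maps the additive score to a value in [0, 1], which is what allows the boosted model to supply a posterior probability for soft voting.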

CONFLICTS OF INTEREST
The authors have no conflicts of interest to disclose regarding this research.

He is currently working as an Assistant Professor with the EEE Department, KUET. During his MAIA studies, he focused on different modalities of medical image analysis and machine learning to build a generic computer-aided diagnosis system. He is also supervising several undergraduate students on different modalities of medical image classification, segmentation, and registration. He has published several international journal articles and conference papers on medical image and signal processing. His research interests include medical image and data analysis, machine learning, deep convolutional neural networks, medical image reconstruction, and surgical robotics. He received the University Gold Medal for securing the first position in his class at KUET.
MD. ASHRAFUL ALAM is currently pursuing the degree in electrical and electronic engineering (EEE) with the Khulna University of Engineering & Technology (KUET). His research interests include medical image and data processing, computer vision, and deep learning. He is also working on medical data analysis, skin cancer classification, and multilabel whole heart segmentation from CT and MRI as a B.Sc. thesis.

She is also working as a Lecturer with the CSE Department, KUET. Her research interests are in machine learning, deep neural networks, data mining, and biomedical engineering. She has published some conference papers in these domains. She is also working on some articles about these topics with her students and colleagues.

Since 2015, he has been an Assistant Professor with the Department of Electrical Engineering and Renewable Energy, Oregon Tech, where he is involved with several research projects on renewable energy and grid-tied microgrid systems. He is currently working as an Associate Researcher with the Oregon Renewable Energy Center (OREC). He is also a Registered Professional Engineer (PE) in the state of Oregon, USA. He is also a Certified Energy Manager (CEM) and Renewable Energy Professional (REP). He has been working in the areas of distributed power systems and renewable energy integration for the last ten years. He is looking forward to exploring methods to make electric power systems more sustainable, cost-effective, and secure through extensive research and analysis on energy storage, microgrid systems, and renewable energy sources, with his dedicated research team. He has published several research articles and posters in this field. His research interests include modeling, analysis, design, and control of power electronic devices; energy storage systems; renewable energy sources; integration of distributed generation systems; microgrid and smart grid applications; robotics; and advanced control systems. He is a Senior Member of the Association of Energy Engineers (AEE). He is the winner of the Rising Faculty Scholar Award, in 2019, from the Oregon Institute of Technology for his outstanding contribution to teaching. He is also serving as an Associate Editor for IEEE ACCESS.

Since 2018, he has been a Lecturer with KUET. He is also a Graduate Teaching Assistant with Stony Brook University. His previous research interest was applicable to machine learning and deep learning. His current research interest is in computer vision. From 2017 to 2018, he also worked as the IEEE Student Branch President.