Average Weighted Objective Distance-Based Method for Type 2 Diabetes Prediction

Early detection of type 2 diabetes is necessary for its prevention. Prediction models for detection systems normally employ common factors that may not properly fit all persons, who have different health conditions. Therefore, this study proposes a method for type 2 diabetes prediction with factors representing personal health conditions. More specifically, this study proposes a novel prediction method named Average Weighted Objective Distance (AWOD), based on the assumption that individuals have diverse health conditions resulting from different individual factors, which an effective prediction model must take into account. AWOD is a modification of Weighted Objective Distance (WOD) that applies information gain to reveal significant and insignificant individual factors with different priorities, which are represented by different weights. For AWOD, the data set is divided into a training set, used to determine all relevant thresholds and constant values required for the AWOD calculation, and a testing set. In particular, AWOD is designed for binary classification problems with a relatively small dataset. To validate the proposed method, two datasets from open sources, Pima Indians Diabetes (Dataset 1) and Mendeley Data for Diabetes (Dataset 2), each containing 392 records, were studied. The prediction performance for both datasets is compared with machine learning-based prediction methods, including K-Nearest Neighbors, Support Vector Machines, Random Forest, and Deep Learning. The comparison results showed that the proposed method provided 93.22% and 98.95% accuracy for Dataset 1 and Dataset 2, respectively, which are higher than those provided by the other machine learning-based methods.


I. INTRODUCTION
Diabetes, formally called diabetes mellitus, is a group of chronic metabolic diseases characterized by prolonged high blood sugar levels [1]. Elevated blood sugar levels can lead to increased urination, thirst, and hunger, especially for sweets. They also lead to severe damage to the blood vessels, heart, kidneys, eyes, and nerves. Without urgent treatment, diabetes mellitus can cause many other complications, including diabetic ketoacidosis, nonketotic hyperosmolar coma, heart disease, stroke, kidney failure, foot ulcers, vision loss, and blindness [2], [3]. In addition, individuals with diabetes mellitus are more likely to be infected and are at a higher risk of complications and death from COVID-19 [4]. Recently, diabetes has become a leading cause of mortality and morbidity in the world. According to the International Diabetes Federation, approximately 463 million people had diabetes worldwide in 2019. This number is expected to increase by 51% over the following 26 years, to around 700 million people living with diabetes worldwide in 2045 [5]. Early detection and treatment of diabetes are a major step in caring for diabetic patients and can reduce the risk of serious complications [6].
The associate editor coordinating the review of this manuscript and approving it for publication was Yiming Tang.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
There are three main types of diabetes: type 1, type 2, and gestational diabetes. Type 1 diabetes develops when the body cannot produce insulin because the pancreatic cells responsible for producing it are destroyed. Type 2 diabetes develops when the body becomes resistant to insulin. Gestational diabetes develops when insulin-blocking hormones are produced during pregnancy. In 2019, there were over 1.1 million children and adolescents living with type 1 diabetes. For gestational diabetes, over 20 million live births were affected by diabetes during pregnancy. However, type 2 diabetes is the most common type and is usually found in adults. Its prevalence has increased dramatically in most countries worldwide, with approximately 374 million people at increased risk of developing it in 2019. Diabetes caused about 4.2 million deaths. Notably, 1 in 2 people with diabetes went undiagnosed, which may increase the number of deaths in the future. In general, many people with type 2 diabetes rarely show any symptoms, which increases the risk of complications [7]. People with this condition should undergo several tests to diagnose diabetes in advance. The increasing number of patients, combined with an inadequate number of health care providers, exacerbates the problem of diabetes diagnosis and care [8]. The primary objective of this study is to support health care providers with a preventive diabetes prediction method, particularly for early-stage type 2 diabetes. For early disease detection systems, existing studies employ different machine learning algorithms, which normally require large and diverse amounts of data, to predict the presence of type 2 diabetes.
Machine learning-based methods can easily fail to capture the diversity of individuals when only a relatively small set of data is available. To address this issue, this study proposes a binary classification method based on distance measurement to predict the presence of type 2 diabetes. Furthermore, existing prediction methods normally employ a common set of factors for constructing the model. According to the principles that health care professionals follow in the diagnostic process, a wide range of health conditions results in different disease diagnoses and treatment decisions [9]. Thus, this study proposes a novel prediction method to predict the presence of type 2 diabetes based on individual factors instead of the common factors used in general prediction methods.
The proposed method, the Average Weighted Objective Distance (AWOD), is a modification of the Weighted Objective Distance (WOD) [10]. Both methods are designed particularly for constructing prediction models from relatively small datasets, because clinical data are often infrequent and rare. To calculate the weight of each factor, WOD requires pre-defined thresholds and constant values, which need to be assigned by a health care professional according to individual health diagnosis records. This process cannot be applied to a dataset without individual health diagnosis records. AWOD is thus designed to deal with this limitation of WOD. More specifically, AWOD derives all thresholds and constant values directly from the training dataset.
This paper is organized as follows. Section II reviews the literature. Section III presents the research methodology. Section IV describes the experiment with the proposed AWOD based method. Section V presents the results and discussion. Section VI concludes the paper.

II. LITERATURE REVIEW
The literature review below discusses existing works in two research areas, namely factors for prediction methods and prediction methods.

A. FACTORS FOR PREDICTION METHOD
The prediction of diabetes normally considers a common set of data for constructing the model. To take into account the diversity of individuals, risk factors are considered. Novel feature extraction methods have been proposed to select the relevant factors required for effective prediction. The most significant risk factors are extracted from the whole dataset based on attribute scores [11], [12]. Some works account for both risk factors and symptom-oriented variables when constructing the model [13]. Complex attributes containing several categorical attributes are also employed to provide better prediction performance [14]. However, reducing the number of input factors is also widely encouraged [15] to decrease model complexity. In previous studies of early diabetes prediction, a common set of factors is given the most consideration for model construction. Conversely, this study considers an individual set of factors derived from the different health conditions of each person.

B. PREDICTION METHOD
As mentioned before, machine learning algorithms have been widely used in early diabetes prediction for decades. The literature has shown that they perform effectively compared with traditional statistical methods [16]. Several works on early diabetes detection using machine learning techniques such as K-Nearest Neighbor (K-NN), Support Vector Machine (SVM), Artificial Neural Network (ANN) [17], [18], and Random Forest (RF) [19] have demonstrated cost-effectiveness and time savings for diabetic patients and doctors. In particular, the algorithms were used to classify different types of datasets. For example, the RF classifier exhibited high accuracy for predicting type 2 diabetes of individuals based on their lifestyle and family background [19]. K-NN, SVM, Logistic Regression (LR), and ANN were applied for type 2 diabetes prediction by using long non-coding RNAs and demographic data [20]. ANN, RF, and Decision Trees were used to predict diabetes from physical examination data randomly selected for healthy people and diabetic patients [21]. For such clinical data, machine learning-based predictive models usually fail to characterize the diversity among individuals when there is not enough data. Recent works have attempted to enable machine learning-based classifiers to deal with classification problems with a small training sample size, such as a generalized mean distance-based K-NN classifier [22], a local mean representation-based K-NN classifier [23], and a locality constrained representation-based K-NN classifier [24]. Similarly, this study proposes a binary classification method for a relatively small data set.
The binary classification method is based on distance measurements such as the Hamming, Euclidean, Manhattan, and Minkowski distances [25], [26]. Generally, a distance measure is an objective score describing the relative difference between two objects. Hamming distance calculates the distance between two binary vectors or binary strings. It is the number of bit positions in which the two bits are different. Euclidean and Manhattan distances determine the distance between two real-valued vectors. Euclidean distance is the straight-line distance between two data points in a plane, which can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem. Manhattan distance is preferred for vectors describing objects on a uniform grid, like city blocks, and is thus normally used to calculate the distance between two data points along a grid-like path. Minkowski distance is a generalization of the other distances, with a parameter called the order that selects the distance measure. More specifically, it gives control over the type of distance measure for real-valued vectors through a hyperparameter that can be tuned. For example, when the order is 1, 2, or infinity, it is the Manhattan, Euclidean, or Chebyshev distance, respectively. For this study, the proposed measurement follows the principle of Euclidean distance, which measures the distance between two real-valued vectors in the plane, and adds a scalar factor that changes their lengths but not their directions.
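The relationship between these distance measures can be illustrated with a short sketch. It is only a minimal example under stated assumptions: the function names `minkowski_distance` and `hamming_distance` and the sample vectors are hypothetical and are not taken from the cited works.

```python
import math


def minkowski_distance(a, b, order):
    """Minkowski distance of the given order between two real-valued vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(order):
        # Limit of the Minkowski distance as the order grows: Chebyshev distance.
        return max(diffs)
    return sum(d ** order for d in diffs) ** (1.0 / order)


def hamming_distance(a, b):
    """Number of positions at which two equal-length binary strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical sample points (assumption, for illustration only).
p, q = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski_distance(p, q, 1)         # order 1
euclidean = minkowski_distance(p, q, 2)         # order 2
chebyshev = minkowski_distance(p, q, math.inf)  # order infinity
bit_difference = hamming_distance("10110", "10011")
```

For the sample points, orders 1, 2, and infinity give 7.0, 5.0, and 4.0, which shows how a single tunable order parameter recovers the Manhattan, Euclidean, and Chebyshev distances.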
Distance measurements have been used for solving various classification and prediction problems. For example, a combination of Jaccard and weighted Euclidean distances was presented for noise prediction [27]. A weighted Chebyshev distance method was proposed for the classification of hyperspectral imagery [28]. In personalized learning application domains, objective distance (OD) [29] was initially proposed to measure the distance between the current competency of a student and the expected level to attain learning expectations. Similarly, the OD was applied in the health care domain to classify a group of older people with hypertension by using the distance between the current and expected health status of all risk factors for developing hypertension [30]. It was also used to provide appropriate individual recommendations based on each risk factor [31].
Later, WOD [10], a modification of the original OD, was proposed to improve the group classification performance for older people with hypertension by using prioritized individual factors instead of considering all of them. Following the principle of distance measurement, WOD has a scalar factor, or weight, that represents the priority. For WOD, priority is represented only for significant individual factors, by different weights obtained from the information gain principle. For weight calculation, WOD requires predefined thresholds, namely the expected and acceptable levels of all factors, and some constant values. Therefore, WOD cannot be obtained directly from a dataset that provides no pre-defined thresholds and constant values. AWOD is therefore proposed in this study to generalize to different datasets and different diseases, as illustrated by the two diabetes datasets used here.
Information gain is still applied to prioritize factors for AWOD. Generally, information gain measures reductions in entropy [32] and identifies irrelevant attributes of a dataset [33]-[36], including individual factors [37], by considering the information gain levels after reducing entropy. For AWOD, information gain is applied to determine both significant and insignificant factors for individuals. The former are defined as the factors that the individual cannot control properly, so that they have a noticeable effect on the individual's health condition, while the latter are those controlled by the individual [10]. The limitation of WOD is that having fewer significant individual factors decreases classification performance. For this study, the assumption is thus made that the significant and insignificant factors are different for each person and, with proper priority settings, can be used effectively to construct a model that provides higher prediction performance.
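As a point of reference, the general notion of information gain as a reduction in entropy can be sketched as below. This is the standard Shannon formulation over categorical attribute values, not the distance-based probabilities that AWOD itself uses; the function names and the toy labels are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, attribute_values):
    """Entropy of the labels minus the weighted entropy after splitting by an attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data (assumption): two classes, two candidate attributes.
labels = ["PD", "PD", "AD", "AD"]
informative = ["high", "high", "low", "low"]    # separates the classes perfectly
uninformative = ["high", "low", "high", "low"]  # tells nothing about the class

class_entropy = entropy(labels)
gain_informative = information_gain(labels, informative)
gain_uninformative = information_gain(labels, uninformative)
```

With two equally likely classes the entropy is 1 bit. An attribute that separates the classes perfectly recovers the full bit as gain, whereas an irrelevant attribute yields a gain of 0, which is the property used to tell relevant from irrelevant attributes.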
Therefore, this study proposes the AWOD based method for type 2 diabetes prediction based on individual health conditions employing both significant and insignificant individual factors. Accordingly, the AWOD method is modified to be more generalized for a relatively small dataset for a binary classification problem. To represent priority, the weight is calculated from the obtained thresholds and constant values from the training dataset of both classes.
To validate the proposed prediction method, two open data sets, namely Pima Indians Diabetes (Dataset 1) [38] and Mendeley Data for Diabetes (Dataset 2) [39], are used in this study. As mentioned before, the proposed AWOD method cannot be compared with WOD because these datasets have no pre-defined thresholds and constant values required for weight calculation. Instead, the prediction results are compared with existing machine learning-based methods: K-NN and SVM from the distance-based group, RF from the implicit feature selection group, and the deep learning (DL) method, which is the state-of-the-art machine learning method.
K-NN calculates the distance from the data point of interest to every other data point in the dataset to find the closest ones. The obtained distances are sorted to find the nearest neighbors, where k is the number of closest data points used to predict the result. The principle of SVM for classification is that the algorithm creates a line or a hyperplane that separates the data into two classes. To perform classification, SVM finds the hyperplane that maximizes the margin between the two classes. RF is an ensemble method that is more effective than a single decision tree because it can reduce over-fitting by averaging the results. RF can also support dimensionality reduction because it identifies the most significant variables among the input variables. DL can learn without human supervision by drawing on data that is both unstructured and unlabeled. Learning can be supervised, semi-supervised, or unsupervised. This algorithm is essentially a neural network with three or more layers that attempts to mimic the human brain based on a combination of data inputs, weights, and biases.
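The K-NN procedure described above (compute all distances, sort them, and let the k closest points decide) could be sketched as follows. This is a minimal illustration that assumes Euclidean distance and a plain majority vote; the function name `knn_predict` and the toy points are hypothetical and do not reproduce the classifier configuration used in the comparison.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Distance from the query to every training point, sorted ascending.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-dimensional toy data (assumption, for illustration only).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["AD", "AD", "AD", "PD", "PD", "PD"]

near_first_cluster = knn_predict(points, labels, (1.1, 1.0), k=3)
near_second_cluster = knn_predict(points, labels, (4.0, 4.0), k=3)
```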

III. RESEARCH METHODOLOGY
The research methodology of this study consists of two main procedures as shown in Figure 1, AWOD determination and evaluation of AWOD based prediction method. The details of each procedure are described in the following sections.

A. AWOD DETERMINATION
1) AWOD CONCEPT
The principle underlying the AWOD based method rests on the number of significant and insignificant factors, which represent real effects on the prediction. This principle can represent the different individual health conditions and the general diagnostic procedures performed by health care professionals. The AWOD concept is illustrated in Figure 2. As Figure 2 shows, three main steps are required for AWOD determination. The first step is to determine the levels needed for weight calculation, namely an expected level and an acceptable level. The expected level is the health status level of each factor that is recommended for an individual with no diabetes, while the acceptable level is the health status level of each factor that is acceptable for an individual with no diabetes. Next, the weights for both significant and insignificant factors are calculated. Finally, AWOD is determined for prediction.
To determine the expected and acceptable levels used to calculate the current and the acceptable distance for each factor, the dataset is split into two sets, dataset U and dataset V. Dataset U is the training set, used for determining the expected and acceptable levels of all factors by averaging scores. Both levels are therefore representative of those in the training set. Then, both levels are used to find the weight of each factor. Dataset V is the testing set, applied for weighting factors and determining AWOD. The values of the weights indicate significant and insignificant factors. They represent the real effects of each factor for each individual based on the set of significant and insignificant factors, which is denoted by a constant value. Next, Dataset V is used to calculate the difference between the acceptable distance and the current distance of each factor. Then, the obtained distances, their associated weights, and the constant value are combined to derive an individual AWOD. Since the values of the factors are on different scales, min-max normalization is required to rescale them to the range [0, 1].
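The random split into a training set U and a testing set V could be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `split_dataset`, the fixed seed, the 70% training fraction used as the default, and the placeholder records are assumptions.

```python
import random


def split_dataset(records, train_fraction=0.7, seed=0):
    """Randomly split records into a training set (U) and a testing set (V)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder records standing in for a dataset of 392 rows (assumption).
records = list(range(392))
u_set, v_set = split_dataset(records)
```

For a dataset of 392 records, a 70:30 split of this kind yields 274 training and 118 testing samples.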

2) TARGET CLASS DETERMINATION WITH AWOD BASED METHOD
The algorithm for determining the target class with the AWOD based prediction method is given in the following pseudocode.

BEGIN
//Variable initialization
Step 0: Input all variables
    u_j+ ← value of factor j that is in the positive class for the U set
    nu_j+ ← total number of factor j that is in the positive class for the U set
    u_j− ← value of factor j that is in the negative class for the U set
    nu_j− ← total number of factor j that is in the negative class for the U set
    Tn ← total number of factors
    T ← total number of target classes
    Z_j ← the current level of factor j
    x_j ← value of the expected level of factor j
    y_j ← value of the acceptable level of factor j
    z_j ← value of the current level of factor j
    lTn ← the minimum number of factors that can affect identifying the negative class
    hTn ← the maximum number of factors that can affect identifying the negative class
    MAXnu_j+ ← the factor with the maximum number of the positive class among all factors
    nND_(v=0) ← total number of factors with the normalized average-based weighted objective distance that is equal to 0
    ND_(v=0) ← the factor with the normalized average-based weighted objective distance that is equal to 0
    N ← total number of individuals
//Split data to Training Set (U set) and Testing Set (V set)
Step 1: Read Dataset
    Random data for U set (70%)
    Random data for V set (30%)
//Determine expected levels and acceptable levels of all factors from U set.
//Stopping Condition is the number of factors that are reached.
Step 2: WHILE Stopping Condition is False
Step 3:     Calculate the expected level of each factor for all samples of positive class
Step 4:     Calculate the average number of each factor for all samples of negative class
Step 5:     Calculate the acceptable level of each factor for all samples
//Determine the entropy of each factor in V set.
//Stopping Condition is the number of factors that are reached.
Step 9: WHILE Stopping Condition is False
Step 10:    IF x_j > y_j > z_j or x_j < y_j < z_j THEN
Step 11:        Calculate acceptable distance for each factor that can identify as the positive class (dXY_j)
Step 12:        Calculate the current distance that must be considered to identify as the positive class or the negative class (dXZ_j)
            ENDIF
Step 13:    Calculate the probability of the acceptable distance (pXY_j+) and the probability of the current distance (pXZ_j−)
Step 17: WHILE Stopping Condition is False
Step 18:    Determine the significant gain for each factor (S_j)
Step 19:    Determine the weight of each factor (W_j): $W_j = S_j / \sum_{j=1}^{Na} S_j$
ENDWHILE
//Determining AWOD in V set.
//Stopping Condition is the number of factors that are reached.
ENDCASE
Step 24: Determine the average-based weighted objective distance for all factors for each individual

Figure 3 shows the evaluation process of the proposed AWOD based method. In it, the predicted and the observed class were applied to evaluate the prediction performance using precision, specificity, and accuracy. The observed class refers to individuals' actual condition, specifically type 2 diabetes presence or absence. The predicted class refers to the prediction of either absence of type 2 diabetes (AD) or the presence of type 2 diabetes (PD), using the AWOD based method. In addition, K-NN, SVM, RF, and DL classifiers were employed with all original factors to compare their performance to the AWOD based prediction method. The classifiers used for evaluating results are described below. The details and results of prediction performance are explained in the next section.

IV. EXPERIMENT
A. DATA COLLECTION
The two datasets used for experimenting with type 2 diabetes prediction were designated Dataset 1 and Dataset 2. The former was collected from the Kaggle website, and the latter from the Mendeley Data website. Dataset 1 is originally from the National Institute of Diabetes and Digestive and Kidney Diseases, and all patients are females at least 21 years old of Pima Indian heritage. The dataset contains 392 records after removing records with missing values, and 8 factors: pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, and age. Dataset 2 is originally from Iraqi society and was acquired from the laboratory of Medical City Hospital and the Specializes Center for Endocrinology and Diabetes-Al-Kindy Teaching Hospital. For this dataset, the data attributes used in this study include 10 factors: age, urea, creatinine ratio, hemoglobin A1c (HBA1C), cholesterol, triglycerides, high-density lipoprotein (HDL), low-density lipoprotein (LDL), very-low-density lipoprotein (VLDL), and body mass index (BMI). Only data in the Diabetic and Non-Diabetic classes were employed for predicting type 2 diabetes. From this dataset, 392 records were randomly selected for this study, the same number as in Dataset 1. Abbreviations of the diagnostic factors for Dataset 1 and Dataset 2 are presented in Table 1 and Table 2, respectively. The abbreviations of the diagnostic factors for Dataset 1 are used in the sample calculation for predicting type 2 diabetes in the next section. Examples of the gathered data for Dataset 1 and Dataset 2, based on factors associated with type 2 diabetes diagnoses, are shown in Table 3 and Table 4, respectively. In these tables, "Yes" and "No" for the diabetes factor (Di) represent presence and absence, respectively.

B. DEMONSTRATION OF SAMPLE CALCULATION
This section presents the sample AWOD calculation to predict the presence or absence of type 2 diabetes using Dataset 1. The type 2 diabetes prediction using the proposed measurement method can be illustrated by applying data no. 1. In this study, the presence of type 2 diabetes was denoted as PD, whereas the absence as AD. The sample calculation performed to calculate the AWOD and predict which class the data no. 1 belongs to is shown below.
To determine the levels by averaging scores, the dataset of 392 samples shown in Table 3 was divided into two sets, U and V, in a 70:30 ratio because of its relatively small size. That is, 70% of the data, the U set, was used for determining the values of the expected and the acceptable levels, and the remaining 30%, the V set, was used for weighting factors. The U set includes 274 samples, and the V set 118 samples. An example of the expected and acceptable level calculation for the Dpf factor is provided. X_j for the Dpf factor (X_Dpf) and Y_j for the Dpf factor (Y_Dpf) were determined from the U set. The levels obtained in this sample calculation are shown in Table 5.
Following the algorithm above, the first step in determining the weight of a factor is to find the entropy of the target class. The equal probability of the target class was determined first, that is, the equal probability between AD, representing the positive target class (EP+), and PD, representing the negative target class (EP−). The values of EP+ and EP− were both equal to 4. Next, the positive-target class fraction (F+) and the negative-target class fraction (F−) with respect to all factors were determined. The entropy of the target class with respect to all factors, E(C), was then determined and was equal to 1.
The second step is to determine the entropy of each factor. An example of the entropy calculation for the Dpf factor is provided. The acceptable distance (dXY_Dpf) and the current distance (dXZ_Dpf) for the Dpf factor were measured. From Table 3, the current level of data no. 1 for the Dpf factor (Z_Dpf) is 0.692, which belongs to the V set. From Table 5, X_Dpf = 0.484 and Y_Dpf = 0.550, calculated from the U set. dXY_Dpf and dXZ_Dpf were equal to 0.066 and 0.208, respectively. The entropy of the Dpf factor, E(C_Dpf), was then computed and was equal to 0.79. The entropy calculated for each factor is shown in Table 6.
The third step is to determine the information gain of the target class with respect to all factors. The entropy of all factors, E(Ct), was calculated and was equal to 0.54. Then, the information gain of the target class with respect to all factors, Gain(C, t), was determined and was equal to 0.46, consistent with E(C) − E(Ct) = 1 − 0.54.
The fourth step is to determine the weight of each factor, using the Dpf factor as an example. The significant gain for the Dpf factor (S_Dpf) was calculated and was equal to 1.72. The significant gain calculated for each factor is shown in Table 6. The weight of the Dpf factor (W_Dpf) was then determined as
$W_{Dpf} = \frac{1.72}{1.58 + 1.86 + 1.10 + 0 + 0 + 2.18 + 1.72 + 1.08} = 0.18$
The weight of each factor is displayed in Table 6.
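The arithmetic for the Dpf factor can be checked with a short sketch. It assumes that the distances are absolute differences between levels, which is consistent with the reported values 0.066 and 0.208 but is not stated explicitly in the text; the variable and function names are hypothetical.

```python
def absolute_distance(level_a, level_b):
    """Distance between two levels, assumed here to be their absolute difference."""
    return abs(level_a - level_b)


# Levels for the Dpf factor of data no. 1, taken from the sample calculation.
x_dpf, y_dpf, z_dpf = 0.484, 0.550, 0.692

# Step 10 of the pseudocode: the three levels must be strictly ordered.
ordered = x_dpf > y_dpf > z_dpf or x_dpf < y_dpf < z_dpf

d_xy = absolute_distance(x_dpf, y_dpf)  # acceptable distance
d_xz = absolute_distance(x_dpf, z_dpf)  # current distance

# Significant gains in the order they appear in the weight equation;
# 1.72 is the value reported for Dpf.
significant_gains = [1.58, 1.86, 1.10, 0.0, 0.0, 2.18, 1.72, 1.08]
s_dpf = 1.72
w_dpf = s_dpf / sum(significant_gains)  # Step 19: W_j = S_j / sum of all S_j
```

Rounded to the precision used in the paper, this gives distances of 0.066 and 0.208 and a weight of 0.18 for Dpf. Because each weight is a share of the total significant gain, a factor with a significant gain of 0 automatically receives a weight of 0.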
To determine the AWOD of all factors, the average-based weighted objective distance for the Dpf factor (D_Dpf) was first determined. D_Dpf was equal to 0.03. The average-based weighted objective distance of each factor is shown in Table 6.
Among all factors, D_max was 4.54 and D_min was 0. Then, the normalized average-based weighted objective distance for the Dpf factor (ND_Dpf) was determined. ND_Dpf was equal to 0.01. The normalized average-based weighted objective distance for each factor is shown in Table 6.
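The normalization step can be reproduced with the standard min-max formula, ND = (D − D_min) / (D_max − D_min). The sketch below assumes this standard formula, which agrees with the reported value after rounding; the function name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(value, minimum, maximum):
    """Rescale a value to the range [0, 1] using min-max normalization."""
    if maximum == minimum:
        return 0.0  # degenerate range: every value is identical
    return (value - minimum) / (maximum - minimum)


# Values reported for data no. 1: D_Dpf = 0.03, D_min = 0, D_max = 4.54.
nd_dpf = min_max_normalize(0.03, 0.0, 4.54)
```

The unrounded result is about 0.0066, which rounds to the reported 0.01.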
The AWOD for all factors for data no. 1 (AWOD_1) was determined. In this study, lTn = 5 and hTn = 2 were derived by observing the dataset. According to Step 23 of the AWOD algorithm, b = 1 if nND_(v=0) < lTn. For data no. 1, nND_(v=0) = 2, comprising the St and In factors, so b = 1. AWOD_1 was equal to 0.25. The different weights (W_i) representing significant and insignificant factors for data no. 1 in Table 6 are W_Pr = 0.17, W_Gl = 0.20, W_Bp = 0.12, W_St = 0, W_In = 0, W_Bm = 0.22, W_Dpf = 0.19, and W_Ag = 0.11. A weight of 0 indicates an insignificant factor, whereas a weight greater than 0 indicates a significant factor. Accordingly, Pr, Gl, Bp, Bm, Dpf, and Ag were deemed significant factors, and St and In insignificant factors. Based on the obtained AWOD_1 = 0.25, sample data no. 1 was placed in the negative class. Therefore, sample data no. 1 is predicted as an individual with type 2 diabetes.

V. RESULTS AND DISCUSSION
The proposed AWOD based method for type 2 diabetes prediction categorizes each sample in Datasets 1 and 2 as either AD (AWOD = 0) or PD (0 < AWOD ≤ 1). To evaluate the prediction accuracy of the AWOD based method, the result was compared with the observed value for each data sample.
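The decision rule on the AWOD value is simple enough to state as a short sketch. The function name `predict_class` and the range check are assumptions added for illustration; only the two conditions, AWOD = 0 for AD and 0 < AWOD ≤ 1 for PD, come from the text.

```python
def predict_class(awod):
    """Map an AWOD value to the predicted class: AD if 0, PD if in (0, 1]."""
    if not 0.0 <= awod <= 1.0:
        raise ValueError("AWOD is expected to lie in the range [0, 1]")
    return "AD" if awod == 0 else "PD"


prediction_sample_1 = predict_class(0.25)  # AWOD reported for sample data no. 1
prediction_sample_5 = predict_class(0.00)  # AWOD reported for sample data no. 5
```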
A. PREDICTION RESULTS OF THE AWOD BASED METHOD
Table 7 and Table 8 show examples of the type 2 diabetes prediction results for Dataset 1 and Dataset 2, respectively, using the proposed AWOD based method. The tables present weights, constant values, AWOD values, and predicted classes. "Y" means that a class was correctly predicted, and "N" means an incorrect prediction, determined by matching the predicted class against the observed class. The predicted class "AD" corresponds to "No" for the observed class, whereas the predicted class "PD" corresponds to "Yes". The weight represents how strongly a factor can affect the type 2 diabetes prediction. A constant value was used to calculate the AWOD value; it represents real effects on the prediction based on the number of both significant and insignificant factors, and on the factor with the maximum number of the AD class among all factors. For example, the significant factors for sample data no. 1 include Pr, Gl, Bp, Dpf, and Ag. The insignificant factors are St, In, and Bm. According to Step 23 of the AWOD algorithm, because the number of insignificant factors is less than the minimum number of factors affecting the prediction as the PD class, which represents significant factors, the constant value was equal to 1. Therefore, the AWOD value for sample data no. 1 was equal to 0.25, which was predicted as the PD class (0 < AWOD ≤ 1). In contrast, the significant factors for sample data no. 5 include Bp, In, and Bm. The insignificant factors include Pr, Gl, St, Dpf, and Ag. According to Step 23 of the pseudocode for the proposed AWOD based method, the number of insignificant factors is more than the maximum number of factors that can affect identifying the negative class, that is, the PD class. In other words, the number of insignificant factors is more than the number of significant factors, so the constant value was equal to 0. Therefore, the AWOD value for sample data no. 5 was equal to 0.00, which was predicted as the AD class (AWOD = 0). Based on these two examples, the two individuals have different sets of significant and insignificant factors that can be used to predict type 2 diabetes. In addition, each individual has specific health conditions influencing type 2 diabetes diagnosis, so besides the weights, a constant value is necessary to represent real effects on the prediction and obtain the accurate class.

B. AN EVALUATION OF PREDICTION PERFORMANCE FOR THE AWOD BASED METHOD
To evaluate the prediction performance of the AWOD based method, precision, specificity, and accuracy were used. Precision measures how frequently the proposed AWOD based method correctly predicts a true positive (TP) out of the total number of predicted positives. TP represents the individuals who were correctly predicted in the AD class. Specificity measures the proportion of true negatives (TN) that are correctly predicted out of the total number of negatives. TN represents the individuals who were correctly predicted in the PD class. Accuracy measures the overall prediction performance of the proposed AWOD based method, that is, the proportion of all predictions that are correct (TP and TN). Calculating precision, specificity, and accuracy also requires the false positives (FP) and false negatives (FN). FP represents the individuals who were incorrectly predicted as the AD class although the observed class is PD. FN represents the individuals who were incorrectly predicted as the PD class although the observed class is AD. Table 9 presents the performance of type 2 diabetes prediction for Dataset 1 and Dataset 2 using the AWOD based method. Each dataset used for type 2 diabetes prediction includes 392 records. The prediction results for Dataset 1 were TP = 35, FP = 5, TN = 75, and FN = 3. For Dataset 2, the results were TP = 88, FP = 1, TN = 28, and FN = 1. The precision was 87.50% for Dataset 1 and 98.88% for Dataset 2, indicating that the proposed method has high potential for predicting individuals with type 2 diabetes absence. The specificity was 93.75% for Dataset 1 and 96.55% for Dataset 2, indicating that the proposed method is able to correctly predict individuals with type 2 diabetes presence. In particular, the overall prediction accuracy showed that the proposed AWOD based method provides high accuracy, with 93.22% for Dataset 1 and 98.95% for Dataset 2.
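As a check on the arithmetic, the three measures can be recomputed from the confusion-matrix counts quoted above. The sketch uses the standard definitions (precision = TP / (TP + FP), specificity = TN / (TN + FP), accuracy = (TP + TN) / (TP + FP + TN + FN)), which are assumed to match those used in the study; the function names are illustrative. With these definitions the Dataset 1 figures and the Dataset 2 precision and specificity are reproduced, whereas the Dataset 2 counts as listed give an accuracy of about 98.31% rather than the reported 98.95%; this section does not state the source of that difference.

```python
# Recompute precision, specificity, and accuracy from the quoted counts.
# Assumption: the standard textbook definitions are the ones used in the study.

def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def specificity(tn, fp):
    """Share of observed negatives that are predicted negative."""
    return tn / (tn + fp)


def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


# Counts as reported in the text (Table 9).
counts = {
    "Dataset 1": dict(tp=35, fp=5, tn=75, fn=3),
    "Dataset 2": dict(tp=88, fp=1, tn=28, fn=1),
}

for name, c in counts.items():
    p = 100 * precision(c["tp"], c["fp"])
    s = 100 * specificity(c["tn"], c["fp"])
    a = 100 * accuracy(**c)
    print(f"{name}: precision={p:.2f}% specificity={s:.2f}% accuracy={a:.2f}%")
```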
The evaluation with precision, specificity, and accuracy indicates that the AWOD based method has high potential to predict the presence or absence of type 2 diabetes. Significant and insignificant factors, determined by applying information gain based on an average value of the acceptable level and the expected level and represented as weighting factors, can be used in the prediction process. Prioritizing those factors with different weights, together with a constant value that affects the actual class prediction, appears to be a workable method for prediction.

C. COMPARATIVE PREDICTION RESULTS
The prediction accuracy obtained from the AWOD based method was compared against K-NN, SVM, RF, and DL classifiers, as shown in Table 10. K-NN and SVM were employed in this study because both classifiers measure distances to obtain the prediction, which is similar to the proposed AWOD based method. For this study, the K-NN classifier performed the prediction with K = 5. The concept of the AWOD based method is to consider significant individual factors and set insignificant factors to zero in the prediction process. Similarly, the RF classifier is widely used for feature extraction to identify the most significant features; therefore, RF was employed to compare the prediction performance against the proposed method. Additionally, DL, a state-of-the-art machine learning approach, was applied in this study to evaluate the prediction performance of the AWOD based method. A K-fold cross-validation technique was used for validating the prediction performance. This technique is appropriate for a limited dataset and yields minimum bias during the training process [42]. To obtain the prediction performance, the dataset was divided into 10 folds (K = 10) for use in the training and testing process.
From Table 10, using the K-NN and SVM classifiers to predict type 2 diabetes presence or absence resulted in poor accuracy for Dataset 1, with 71.68% and 77.30%, respectively. Although the K-NN and SVM classifiers provided good accuracy for Dataset 2, with 92.08% and 93.45%, respectively, the proposed AWOD based method still provided better accuracy than those classifiers for both datasets, namely 93.22% for Dataset 1 and 98.95% for Dataset 2. The lower accuracy of K-NN and SVM is attributed to their use of all factors, significant and insignificant alike, to calculate the distance for all patients, even though some factors may not affect some individuals. Moreover, DL provided a prediction accuracy of 74.74% for Dataset 1 and 94.72% for Dataset 2; however, the results obtained from the AWOD based method were still better. This is attributed to DL requiring a large training dataset to train a model based on a combination of data inputs, weights, and biases to derive better accuracy. In addition, the RF classifier provided higher accuracy for both datasets than the other machine learning classifiers because this method chose only the most significant factors for prediction. According to the comparison results, the AWOD based method provided higher prediction performance than the other machine learning classifiers because this method works well with relatively small datasets, while larger datasets are required for those classifiers.
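The validation setup for the baseline classifiers, 10-fold cross-validation with a K-NN classifier using K = 5, can be illustrated with a minimal standard-library sketch. It is not the study's implementation: the data are synthetic two-dimensional points generated only for illustration, the distance is assumed to be Euclidean, and the helper names (`k_fold_indices`, `knn_predict`, `cross_val_accuracy`) are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n, n_folds, seed=0):
    """Shuffle indices 0..n-1 and split them into n_folds nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


def knn_predict(train_x, train_y, x, k=5):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cross_val_accuracy(xs, ys, n_folds=10, k=5, seed=0):
    """Hold out each fold once, train on the rest, and pool the correct predictions."""
    correct = 0
    for fold in k_fold_indices(len(xs), n_folds, seed):
        held_out = set(fold)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        correct += sum(knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in fold)
    return correct / len(xs)


# Synthetic, well-separated data (an assumption for illustration only;
# these are not records from Dataset 1 or Dataset 2).
rng = random.Random(42)
xs, ys = [], []
for label, centre in (("AD", 0.0), ("PD", 3.0)):
    for _ in range(100):
        xs.append((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)))
        ys.append(label)

print(f"10-fold CV accuracy of 5-NN: {cross_val_accuracy(xs, ys):.3f}")
```

Because every record is used for testing exactly once, the pooled accuracy uses all of a small dataset for evaluation, which is the property that makes K-fold cross-validation attractive for limited data.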
The proposed AWOD based method has the potential to predict whether patients have type 2 diabetes. Therefore, the assumption made by this study, that the proposed AWOD based method can provide higher accuracy than those machine learning classifiers, is supported by the results on these two datasets. In particular, the AWOD based method can determine significant factors and insignificant factors for the prediction process, which results in high prediction accuracy. Because patients have different health conditions, factors that are insignificant in general can still affect type 2 diabetes prediction for particular individuals. Some insignificant factors may influence the presence of type 2 diabetes for some individuals. Thus, applying those factors in the AWOD based method can enhance prediction performance.
However, the proposed AWOD based method still produced an approximate error of 6.78% for Dataset 1 and 1.05% for Dataset 2. In these cases, the individuals were predicted in the wrong class. Most such cases may be caused by an individual having specific health conditions, which resulted in the current level of those individuals falling in an improper range, either below or above the acceptable level calculated by the average score. Additionally, inaccurate predictions may result from the constant values. The minimum and maximum numbers of factors used for determining constant values may not be workable for predicting type 2 diabetes in those individuals. Therefore, the determination of the expected level, the acceptable level, and the constant values will be considered in future studies, by modifying the AWOD based method or applying different analytical points of view, to obtain better prediction performance.
The proposed AWOD based method can benefit the diagnosis of chronic diseases for which only relatively small datasets are available, because the data are collected infrequently and many of the values change slowly. As for limitations, computational complexity should be investigated in future work. Processing large datasets using the AWOD based method may encounter computational complexity problems because several computational stages are involved and many parameter settings are required. In addition, the proposed AWOD method requires complicated parameter settings in order to be applied to multi-category classification. It is also worth generalizing AWOD to multi-category classification in future work.

VI. CONCLUSION
This study proposes a novel prediction method, called Average Weighted Objective Distance (AWOD), for type 2 diabetes prediction. The AWOD based method follows the principle used by health care professionals of considering individual health conditions for diagnosis. The proposed method employed information gain based on average values of expected levels and acceptable levels to prioritize factors, referred to as weighting factors. The prioritized factors indicate the significant and insignificant factors for individuals. Those factors can represent real effects on the prediction based on diverse individual health conditions. Two open datasets, the Pima Indians Diabetes dataset and the Mendeley Data for Diabetes dataset, each containing 392 records, were studied in the experiment. The prediction performance obtained from the AWOD based method was evaluated by precision, specificity, and accuracy. The comparison results for prediction performance revealed that the AWOD based method provided 93.22% and 98.95% accuracy for Dataset 1 and Dataset 2, respectively, which is higher than the accuracy of the machine learning-based prediction methods, including K-NN, SVM, RF, and DL. Table 11.