A Machine Learning Methodology for Diagnosing Chronic Kidney Disease

Chronic kidney disease (CKD) is a global health problem with high morbidity and mortality rates, and it induces other diseases. Since there are no obvious symptoms during the early stages of CKD, patients often fail to notice the disease. Early detection of CKD enables patients to receive timely treatment to ameliorate the progression of this disease. Machine learning models can effectively help clinicians achieve this goal owing to their fast and accurate recognition performance. In this study, we propose a machine learning methodology for diagnosing CKD. The CKD data set, which has a large number of missing values, was obtained from the University of California Irvine (UCI) machine learning repository. Missing values are common in real-life medical situations because patients may miss some measurements for various reasons. KNN imputation, which selects several complete samples with the most similar measurements to process the missing data for each incomplete sample, was used to fill in the missing values. After the incomplete data set was filled in, six machine learning algorithms (logistic regression, random forest, support vector machine, k-nearest neighbor, naive Bayes classifier and feed forward neural network) were used to establish models. Among these machine learning models, random forest achieved the best performance with 99.75% diagnosis accuracy. By analyzing the misjudgments generated by the established models, we proposed an integrated model that combines logistic regression and random forest by using a perceptron, which achieved an average accuracy of 99.83% over ten simulation runs. Hence, we speculate that this methodology could be applicable to more complicated clinical data for disease diagnosis.


I. INTRODUCTION
Chronic kidney disease (CKD) is a global public health problem affecting approximately 10% of the world's population [1], [2]. The prevalence of CKD is 10.8% in China [3] and ranges from 10% to 15% in the United States [4]. According to another study, the prevalence has reached 14.7% in the Mexican adult general population [5]. This disease is characterised by a slow deterioration in renal function, which eventually causes a complete loss of renal function. CKD does not show obvious symptoms in its early stages. Therefore, the disease may not be detected until the kidney loses about 25% of its function [6]. In addition, CKD has high morbidity and mortality, with a global impact on the human body [7]. It can induce the occurrence of cardiovascular disease [8], [9]. CKD is a progressive and irreversible pathologic syndrome [10]. Hence, the prediction and diagnosis of CKD in its early stages is essential, as it may enable patients to receive timely treatment to ameliorate the progression of the disease.
The associate editor coordinating the review of this manuscript and approving it for publication was Hao Ji.
Machine learning refers to a computer program that calculates and deduces the information related to a task and obtains the characteristics of the corresponding pattern [11]. This technology can achieve accurate and economical diagnoses of diseases; hence, it might be a promising method for diagnosing CKD. It has become a new kind of medical tool with the development of information technology [12] and has broad application prospects because of the rapid development of electronic health records [13]. In the medical field, machine learning has already been used to detect human body status [14], analyze the relevant factors of a disease [15] and diagnose various diseases. For example, models built by machine learning algorithms were used to diagnose heart disease [16], [17], diabetes and retinopathy [18], [19], acute kidney injury [20], [21], cancer [22] and other diseases [23], [24]. In these models, algorithms based on regression, trees, probability, decision surfaces and neural networks were often effective. In the field of CKD diagnosis, Hodneland et al. utilized image registration to detect renal morphologic changes [25]. Vasquez-Morales et al. established a classifier based on a neural network using large-scale CKD data, and the accuracy of the model on their test data was 95% [26].
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
In addition, most of the previous studies utilized the CKD data set that was obtained from the UCI machine learning repository. Chen et al. used k-nearest neighbor (KNN), support vector machine (SVM) and soft independent modelling of class analogy to diagnose CKD; KNN and SVM achieved the highest accuracy of 99.7% [27]. In addition, they used a fuzzy rule-building expert system, fuzzy optimal associative memory and partial least squares discriminant analysis to diagnose CKD, and the accuracy of those models ranged from 95.5% to 99.6% [1]. Their studies have achieved good results in the diagnosis of CKD. In the above models, mean imputation is used to fill in the missing values, and it depends on the diagnostic categories of the samples. As a result, their method could not be used when the diagnostic results of the samples are unknown. In reality, patients might miss some measurements for various reasons before diagnosis. In addition, for missing values in categorical variables, data obtained using mean imputation might deviate considerably from the actual values. For example, for variables with only two categories, we set the categories to 0 and 1, but the mean of the variables might be between 0 and 1. Polat et al. developed an SVM based on feature selection technology; the proposed models reduced the computational cost through feature selection, and the accuracy of those models ranged from 97.75% to 98.5% [6]. J. Aljaaf et al. used novel multiple imputation to fill in the missing values, and then a multilayer perceptron (MLP) neural network achieved an accuracy of 98.1% [28]. Subas et al. used MLP, SVM, KNN, C4.5 decision tree and random forest (RF) to diagnose CKD, and the RF achieved an accuracy of 100% [2]. In the models established by Boukenze et al., MLP achieved the highest accuracy of 99.75% [29]. The studies of [2], [29] focus mainly on the establishment of models and achieve an ideal result. However, a complete process of filling in the missing values is not described in detail, and no feature selection technology is used to select predictors either. Almansour et al. used SVM and a neural network to diagnose CKD, and the accuracy of the models was 97.75% and 99.75%, respectively [30]. In the models established by Gunarathne et al., zero was used to fill in the missing values, and decision forest achieved the best performance with an accuracy of 99.1% [31].
To summarize the previous CKD diagnostic models, we find that most of them suffer from either a limited application range of the method used to impute missing values or relatively low accuracy. Therefore, in this work, we propose a methodology to extend the application range of the CKD diagnostic models. At the same time, the accuracy of the model is further improved. The contributions of the proposed work are as follows.
1) We used KNN imputation to fill in the missing values in the data set, which could be applied to data sets in which the diagnostic categories are unknown.
2) Logistic regression (LOG), RF, SVM, KNN, naive Bayes classifier (NB) and feed forward neural network (FNN) were used to establish CKD diagnostic models on the complete CKD data sets. The models with better performance were extracted for misjudgment analysis.
3) An integrated model that combines LOG and RF by using perceptron was established and it improved the performance of the component models in CKD diagnosis after the missing values were filled by KNN imputation.
To our knowledge, this is the first time that KNN imputation has been used to fill in the missing values for the diagnosis of CKD. In addition, building an integrated model is also a good way to improve the performance of separate individual models. The proposed methodology might effectively deal with the scenario where patients are missing certain measurements before being diagnosed. In addition, the resulting integrated model shows a higher accuracy. Therefore, it is speculated that this methodology might be applicable to the clinical data in actual medical diagnosis.
The rest of the paper is organized as follows. In Section II, we describe the preliminaries. The establishments of the individual model and the integrated model are described in Section III. In Section IV, we evaluate and discuss the performance of the integrated model. In Section V, we summarize the work and its contributions, including future works.

II. PRELIMINARIES
In this section, we describe the preliminaries before establishing the models, including the description of the data set and the operating environment, the imputation of the missing values and the extraction of the feature vector.

A. DATA DESCRIPTION AND OPERATING ENVIRONMENT
The CKD data set used in this study was obtained from the UCI machine learning repository [32]; it was collected from a hospital and donated by Soundarapandian et al. on 3rd July, 2015. The data set contains 400 samples. In this CKD data set, each sample has 24 predictive variables or features (11 numerical variables and 13 categorical (nominal) variables) and a categorical response variable (class). The class has two values, namely, ckd (sample with CKD) and notckd (sample without CKD). Of the 400 samples, 250 belong to the category of ckd, whereas 150 belong to the category of notckd. It is worth mentioning that there is a large number of missing values in the data. The details of each variable are listed in Table 1. All of the algorithms were conducted in R (version 3.5.2), and the packages used included

B. DATA PROCESSING
Each categorical (nominal) variable was coded to facilitate the processing in a computer. For the values of rbc and pc, normal and abnormal were coded as 1 and 0, respectively. For the values of pcc and ba, present and notpresent were coded as 1 and 0, respectively. For the values of htn, dm, cad, pe and ane, yes and no were coded as 1 and 0, respectively. For the value of appet, good and poor were coded as 1 and 0, respectively. Although the original data description defines three variables sg, al and su as categorical types, the values of these three variables are still numeric based, thus these variables were treated as numeric variables. All the categorical variables were transformed into factors. Each sample was given an independent number that ranged from 1 to 400. There is a large number of missing values in the data set, and the number of complete instances is 158. In general, the patients might miss some measurements for various reasons before making a diagnosis. Thus, missing values will appear in the data when the diagnostic categories of samples are unknown, and a corresponding imputation method is needed.
After encoding the categorical variables, the missing values in the original CKD data set were first processed and filled in. KNN imputation was used in this study; for each sample with missing values, it selects the K complete samples with the shortest Euclidean distance. For the numerical variables, the missing values are filled using the median of the corresponding variable in the K complete samples, and for the categorical variables, the missing values are filled using the category that has the highest frequency in the corresponding variable in the K complete samples. For physiological measurements, people with similar physical conditions should have similar physiological measurements, which is the reason for using a KNN-based method to fill in the missing values. For example, the physiological measurements should be stable within a certain range for healthy individuals. For diseased individuals, the physiological measurements of people with a similar degree of the same disease should be similar. In particular, the differences in physiological measurement data should not be large for people in similar situations. This method should also be adaptable to the diagnostic data of other diseases, as it has been applied in the area of hyperuricemia [33].
When the median of the corresponding variable in the K complete samples is taken, K is preferably an odd number, because the middle value is then naturally the median when the values of the numeric variable in the K complete samples are sorted. K should be neither too large nor too small. An excessively large K may ignore inconspicuous patterns, which might be important. Conversely, an excessively small K lets noise and abnormal data affect the filling of the missing values excessively. Therefore, the values of K in this work were chosen as 3, 5, 7, 9 and 11. As a result, five complete CKD data sets were generated. In addition, we also verified the effectiveness of KNN imputation by comparing it with two other methods in Section III. One is to use random values to fill in the missing values; the other is to use the mean and mode of the corresponding variables to fill in the missing values of continuous and categorical variables, respectively.
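The imputation rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the R implementation used in this study: the function name `knn_impute`, the toy rows and the choice to measure the Euclidean distance only over the variables observed in the incomplete sample are all assumptions made for the example.

```python
import math
from statistics import median, mode

def knn_impute(rows, categorical, k=3):
    """Fill None entries using the k nearest complete rows.

    rows: list of lists holding numbers or None.
    categorical: set of column indices treated as categorical.
    """
    complete = [r for r in rows if None not in r]
    filled = []
    for row in rows:
        missing = [j for j, v in enumerate(row) if v is None]
        if not missing:
            filled.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        # Assumption: distance uses only the columns observed in this row.
        def dist(candidate):
            return math.sqrt(sum((row[j] - candidate[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=dist)[:k]
        new_row = list(row)
        for j in missing:
            values = [n[j] for n in neighbours]
            # Median for numerical variables, most frequent category otherwise.
            new_row[j] = mode(values) if j in categorical else median(values)
        filled.append(new_row)
    return filled

# Hypothetical toy data: two numerical columns and one binary categorical column.
toy = [
    [1.0, 10.0, 1],
    [1.2, 11.0, 1],
    [0.9, 12.0, 1],
    [5.0, 50.0, 0],
    [5.1, 52.0, 0],
    [1.1, None, None],
]
print(knn_impute(toy, categorical={2}, k=3)[-1])
```

With an odd K, the median of a numerical variable is always one of the observed neighbour values, and a binary categorical variable cannot produce a tied vote.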

C. EXTRACTING FEATURE VECTORS OR PREDICTORS
Extracting feature vectors or predictors could remove variables that are neither useful for prediction nor related to the response variable and thus prevent these unrelated variables from interfering with the model construction, which helps the models make accurate predictions [34]. Herein, we used optimal subset regression and RF to extract the variables that are most meaningful to the prediction. Optimal subset regression evaluates the model performance of all possible combinations of predictors and selects the best combination of variables. RF measures the contribution of each variable to the reduction in the Gini index. The larger the Gini index, the higher the uncertainty in classifying the samples. Therefore, the variables with a contribution of 0 are treated as redundant variables. The step of feature extraction was run on each complete data set. Images obtained on one complete data set are shown in Figs. 1 and 2; this data set was obtained by KNN imputation with K equal to 9. Fig. 1 shows the optimal combination of variables for each subset size, from one variable to all variables, when optimal subset regression was used. The vertical axis represents the variables. The horizontal axis is the adjusted r-squared, which represents the degree to which the combination of variables explains the response variable. To make it easy to distinguish each combination of variables, we used four colors (red, green, blue and black) to mark the selected variables. The combinations are ranked from left to right by the degree to which they explain the response variable, and the right-most combination explains the response variable best. Since the space is limited, the values on the horizontal axis in Fig. 1 are rounded to two decimal places. The right-most combinations of variables in the images obtained by optimal subset regression on each complete data set are shown in Table 2.
For the complete data sets obtained by the KNN imputation, we selected the intersection of the optimal combinations on all complete data sets as the extracted combination of variables to obtain a uniform combination. In Table 2, for the complete data sets obtained by the KNN imputation, we used the intersection (bp, sg, al, bu, hemo, htn, dm, appet) to establish the models. For the complete data set obtained by the mean and mode imputation, the combination of the last row in Table 2 was used. For the complete data sets obtained by random imputation, we used the corresponding optimal combination obtained from each complete data set.
The result of feature extraction by RF is shown in Fig. 2, where the vertical axis represents the variables and the horizontal axis represents the reduced Gini index. The larger the reduced Gini index, the stronger the predictive power of the variable for the response variable. When the RF was used to remove the variables with a contribution of zero, these variables were the same (pcc, ba and cad) no matter which method was used to fill in the missing values. Therefore, when the RF was used to extract the variables, all variables were selected except pcc, ba and cad.
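To make the notion of a reduced Gini index concrete, the following sketch computes the Gini index of a set of labels and the reduction obtained by a single threshold split. It is an illustration only: an actual RF accumulates such reductions over all splits in all trees, and the function names and the toy values (which are not taken from the CKD data set) are assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels (0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(values, labels, threshold):
    """Reduction in the Gini index when splitting on values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

# Hypothetical values: one informative variable and one uninformative variable.
labels = ["ckd", "ckd", "notckd", "notckd"]
informative = [9.0, 10.0, 15.0, 16.0]
uninformative = [1, 2, 1, 2]
print(gini_decrease(informative, labels, 12.0))
print(gini_decrease(uninformative, labels, 1))
```

A variable whose splits never reduce the Gini index, like the second one here, contributes zero and would be treated as redundant.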

D. PERFORMANCE INDICATORS
In this study, ckd was set to be positive and notckd was set to be negative. The confusion matrix was used to show the specific results and evaluate the performance of the machine learning models. The template of the confusion matrix is shown in Table 3.
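True positive (TP) indicates the ckd samples that were correctly diagnosed. False negative (FN) indicates the ckd samples that were incorrectly diagnosed. False positive (FP) indicates the notckd samples that were incorrectly diagnosed. True negative (TN) indicates the notckd samples that were correctly diagnosed. Accuracy, sensitivity, specificity, precision, recall and F1 score were used to evaluate the performance of the model. They are calculated using the following equations:

$$\text{accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \quad (1)$$

$$\text{sensitivity} = \frac{TP}{TP + FN} \quad (2)$$

$$\text{specificity} = \frac{TN}{TN + FP} \quad (3)$$

$$\text{precision} = \frac{TP}{TP + FP} \quad (4)$$

$$\text{recall} = \frac{TP}{TP + FN} \quad (5)$$

$$F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \quad (6)$$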
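As a worked example of these indicators, the sketch below computes them from the four cells of a confusion matrix using their standard definitions. The function name `evaluate` and the counts in the usage example are hypothetical; the counts simply illustrate one misjudged ckd sample among 400 and are not taken from the tables of this paper.

```python
def evaluate(tp, fn, fp, tn):
    """Standard performance indicators from confusion-matrix counts.

    ckd is the positive class and notckd is the negative class.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts: 250 ckd samples with one missed, 150 notckd all correct.
for name, value in evaluate(tp=249, fn=1, fp=0, tn=150).items():
    print(f"{name}: {value:.4f}")
```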

III. PROPOSED MODEL
In this section, the classifiers were first established by different machine learning algorithms to diagnose the data samples. Among these models, those with better performance were selected as potential components. By analyzing their misjudgments, the component models were determined. An integrated model was then established to achieve higher performance.

A. ESTABLISHING AND EVALUATING INDIVIDUAL MODELS
The following machine learning models were established on the complete CKD data sets, using the corresponding subset of features or predictors, for diagnosing CKD. Samples in the feature space cluster in different regions according to their categories. Therefore, there is a boundary between the two categories, and the distances between samples in the same category are smaller. Given their effectiveness in classification, we chose the aforementioned methods for disease diagnosis. LOG is based on linear regression, and it obtains the weight of each predictor and a bias. The sample is classified as ckd or notckd according to whether the sum of the effects of all predictors exceeds a threshold. RF generates a large number of decision trees by randomly sampling training samples and predictors. Each decision tree is trained to find a boundary that maximises the difference between ckd and notckd. The final decision is determined by the predictions of all trees. SVM divides different kinds of samples by establishing a decision surface in a multidimensional space that comprises the predictors of the samples. KNN finds the nearest training samples by calculating the distances between the test sample and the training samples and then determines the diagnostic category by voting. NB calculates the conditional probabilities of a sample from the numbers of ckd and notckd samples in each measurement interval. FNN can analyse non-linear relationships in the data sets due to its complex structure, and the sigmoid activation function was used in the hidden layer and the output layer.
To evaluate model performance comprehensively, each complete data set was divided evenly into four subsets while retaining the class distribution of the original data. For all of the above models, each subset was utilized once for testing while the other subsets were utilized for training, and the overall result was taken as the final performance. With the exception of RF, the models were established using the variables selected by feature extraction. RF does not require prior feature extraction, because predictors are selected randomly when each decision tree is established. In addition, when using KNN and FNN, all the categorical variables were converted into numeric types: categories 0 and 1 were converted to values 0 and 1, respectively, and the complete data sets were then standardised to a mean of 0 and a standard deviation of 1. Details of all the models are as follows:
1) The output of LOG was the probability that the sample belongs to notckd, and the threshold was set to 0.5.
2) RF was established using all variables. Two strategies were used to determine the number of decision trees generated. One is to use the default 500 trees, and the other is to use the number of trees corresponding to the minimum error in the training stage. The RF was established using both strategies and evaluated on the data sets obtained by KNN imputation. The same random number seed 1234 was used to divide the data and establish the model, and the accuracy is shown in Table 4. It can be seen that the default number of trees is the better choice; therefore, we selected the default 500 trees to establish RF.
3) The SVM models were generated by using the RBF kernel function, which is described as follows:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$

where γ was set to [0.1, 0.5, 1, 2, 3, 4]. Parameter C represents the weight of the misjudgment loss, and it was set to [0.5, 1, 2, 3]. In each calculation of the model training, the algorithm selects the best combination of parameters to establish the model by grid search.
4) For the NB, the value of Laplace was equal to 1.
5) For the KNN, when the number of nearest training samples selected by Euclidean distance to the test sample is even, the numbers of selected samples belonging to ckd and notckd may be equal, in which case the algorithm randomly selects a category as the output for the test sample. To avoid this, the nearest neighbor parameter was set to the odd values [1, 3, 5, . . . , 19]. In each calculation of model training, the algorithm selected the best parameter to establish the model by grid search.
6) For the FNN, the network had one hidden layer. Presently, there is no clear theory for determining the best number of hidden layer nodes in a neural network. This study used a method proposed in a previous study [35], which evaluates the performance of neural networks by increasing the number of hidden layer nodes one by one. The number of hidden layer nodes was increased one by one from 1 to 30. Then, the best result was selected.
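The parameter grids in items 3) and 5) can be written down explicitly. The sketch below defines the standard RBF kernel, enumerates the 24 (γ, C) combinations and the ten odd neighbor counts, and shows a generic grid search that keeps the best-scoring candidate. The helper `best_params` and the toy scoring function are assumptions for illustration; in practice the score would be a model's accuracy on held-out training data.

```python
import math
from itertools import product

def rbf_kernel(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2) for two equal-length vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

GAMMAS = [0.1, 0.5, 1, 2, 3, 4]        # candidate values of gamma
COSTS = [0.5, 1, 2, 3]                 # candidate values of C
grid = list(product(GAMMAS, COSTS))    # every (gamma, C) combination
K_CANDIDATES = list(range(1, 20, 2))   # odd neighbor counts 1, 3, ..., 19

def best_params(score, candidates):
    """Return the candidate tuple with the highest score."""
    return max(candidates, key=lambda params: score(*params))

# Hypothetical score that happens to peak at gamma = 1 and C = 2.
def toy_score(gamma, cost):
    return -(abs(gamma - 1) + abs(cost - 2))

print(len(grid), best_params(toy_score, grid))
```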
To ensure the repeatability and comparability of the results, the same seed of 1234 was used in the division of the data, the establishment of RF and FNN, and the selection of the best parameters of SVM and KNN. For the random imputation, the step of feature extraction was run on the complete data set obtained. Then, the models were established and evaluated by using the extracted features. Because of the randomness of the random imputation, the whole process was repeated five times to get the average result. For the KNN imputation and the mean and mode imputation, the data are deterministic, so the evaluation of the models was executed once. After the feature extraction methods of optimal subset regression and RF were run, the accuracies of the basic models on the complete data sets are shown in Table 5 and Table 6, respectively.
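The division of a data set into four subsets that retain the class distribution can be sketched as follows. This is an assumed implementation for illustration: the function name `stratified_folds` is hypothetical, and Python's random number generator seeded with 1234 does not reproduce the partition produced by the same seed in R.

```python
import random

def stratified_folds(labels, n_folds=4, seed=1234):
    """Split sample indices into folds that keep the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds

# Class sizes of the CKD data set: 250 ckd and 150 notckd samples.
labels = ["ckd"] * 250 + ["notckd"] * 150
folds = stratified_folds(labels)
for test_fold in folds:
    # Each fold is used once for testing; the rest are used for training.
    train = [i for fold in folds if fold is not test_fold for i in fold]
    print(len(train), len(test_fold))
```

Because 250 and 150 are not multiples of four, the fold sizes differ by up to two samples in this sketch.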
It can be seen from Tables 5 and 6 that the optimal subset regression is more suitable for LOG and SVM when the KNN imputation is used, and the feature extraction method of RF is more suitable for FNN and KNN. When the KNN imputation is used, the accuracy of LOG and SVM is significantly improved (Table 5). In Table 6, the accuracy of LOG and SVM is relatively low, which might be because more redundant variables remain than with the optimal subset regression. The accuracy of FNN is slightly improved and RF shows better performance when the KNN imputation is used in both Tables 5 and 6. For the NB and the KNN, the performance of the models when using KNN imputation is not ideal compared to using random imputation or mean and mode imputation in Tables 5 and 6. The above result also supports the validity of the KNN imputation, since KNN imputation does improve the accuracy of some models, such as LOG, RF and SVM (Table 5). From Tables 5 and 6, LOG and SVM with optimal subset regression, KNN and FNN with the feature extraction of RF, and RF itself have better performance. Therefore, they are selected as the potential component models.

B. MISJUDGMENT ANALYSIS AND SELECTING COMPONENT MODELS
After evaluating the above models, the potential component models were extracted for misjudgment analysis to determine which would be used as the components. The misjudgment analysis here refers to finding and comparing the samples misjudged by different models and then determining which models are suitable for establishing the final integrated model. The prerequisite for generating an integrated model is that the misjudged samples from each component model are different. If each component model misjudges the same samples, the generated integrated model would not make a correct judgement for those samples either. When the data were read, each sample was given a unique number ranging from 1 to 400. The numbers of the misjudged samples for the extracted models on each complete data set are shown in Table 7, and the black part indicates that the samples were misjudged by other models except FNN.
In Table 7, it can be seen that most of the samples misjudged by the FNN are simultaneously misjudged by other models. In addition, the performance of FNN is affected by the number of nodes in the hidden layer, so it is not easy to establish a unified model for different data. Therefore, the FNN was excluded first. For the best model (RF), when K is equal to 7, only one of its misjudged samples is simultaneously misjudged by the LOG. In other cases, all the samples that are misjudged by RF can be correctly judged by the rest of the models. Hence, the combinations of the RF with the rest of the models could be used to establish an integrated model. Next, we investigated which specific model combination could generate the best integrated model for diagnosing CKD. From Tables 5 and 6, it can be seen that there is no significant difference between LOG, SVM and KNN. Where the performance of the models is similar, the models are evaluated by the complexity of the algorithm, the running time and the computational resources consumed. LOG, RF, SVM and KNN were run five times on each complete data set, and the average times taken are summarized in Table 8. It can be seen that the SVM and KNN take more time than the LOG and RF. In addition, SVM and KNN are also affected by their respective model parameters, so the parameters need to be adjusted before the models are established, which means more manual intervention is needed. For the LOG, there was no additional parameter that needed to be adjusted. For the RF, the default parameters of the model were used. Hence, a combination of the LOG and the RF was selected to generate the final integrated model.
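The comparison of misjudged samples amounts to intersecting sets of sample numbers. The sketch below shows this under stated assumptions: the function name `shared_misjudgments` is hypothetical, and the sample numbers are invented for illustration and are not the entries of Table 7.

```python
from itertools import combinations

def shared_misjudgments(errors_by_model):
    """For each pair of models, return the sample numbers both misjudge."""
    return {
        (a, b): errors_by_model[a] & errors_by_model[b]
        for a, b in combinations(sorted(errors_by_model), 2)
    }

# Hypothetical misjudged sample numbers (1 to 400) for three models.
errors = {
    "LOG": {12, 87, 301},
    "RF": {87},
    "FNN": {12, 87, 301, 350},
}
for pair, shared in shared_misjudgments(errors).items():
    print(pair, sorted(shared))
```

Two models are useful partners in an integrated model when the shared set is small relative to each model's own errors, since each model can then correct samples that the other misjudges.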

C. ESTABLISHING THE INTEGRATED MODEL
LOG and RF were selected as the underlying components of the integrated model to improve the diagnostic performance. The probabilities that each sample was judged as notckd by LOG and RF were used as the outputs of the underlying components. These two probabilities of each sample were obtained and could be expressed in a two-dimensional plane. In the complete CKD data sets, the probability distributions of the samples in the two-dimensional plane are similar. Therefore, only the probability distribution of samples for K equal to 11 is shown in Fig. 3.
It can be seen from Fig. 3 that the samples have different aggregation regions in the two-dimensional plane due to the different categories (ckd or notckd). In general, samples with ckd are concentrated in the lower left part, while the notckd samples are distributed in the top right part. Due to the fact that the results in the two models are different, some samples are located at the top left and lower right, and one of the two models makes the misjudgments. Perceptron can be used to separate samples of two categories by plotting a decision line in the two-dimensional plane of the probability distribution. Ciaburro and Venkateswaran defined perceptron as the basic building block of a neural network, and it can be understood as anything that requires multiple inputs and produces an output [36]. The perceptron used in this study is shown in Fig. 4.
In Fig. 4, prob_1 and prob_2 are the probabilities that a sample was judged as notckd by LOG and RF, respectively. w_0, w_1 and w_2 are the weights of the input signals: w_0 corresponds to the constant input 1 and acts as a bias, w_1 corresponds to prob_1 and w_2 corresponds to prob_2. y is calculated according to (7):

$$y = w_0 + w_1 \, prob_1 + w_2 \, prob_2 \quad (7)$$

The signum function is used to calculate the output from the value of y: if y > 0, then the output = 1, whereas if y < 0, then the output = −1. For the output, 1 corresponds to notckd, whereas −1 corresponds to ckd. A single perceptron is a linear classifier that can be used to detect binary targets. The weights are the core of the perceptron and are adjusted in the training stage. y = 0 is the decision line, and this line can be described as (8):

$$w_0 + w_1 \, prob_1 + w_2 \, prob_2 = 0 \quad (8)$$

In the training stage, the models of LOG and RF were first established on the training data. Then, a new training data set was generated by combining the output probabilities of the two component models on the training data with the labels of the samples. This new training data set was used to establish the perceptron. For the binary classification, the samples have two types of labels, i.e. Y = ±1. The output of the perceptron is calculated according to (7); we use g(X) to represent the matrix form of this calculation:

$$g(X) = WX + b$$

where W = [w_1, w_2], X = [prob_1, prob_2]^T, and b = w_0. When g(X) > 0, the output = 1, whereas when g(X) < 0, the output = −1. Therefore, for all samples correctly judged by the model, the following inequality is valid:

$$Y_i (W X_i + b) > 0 \quad (9)$$

For all misjudgments, the left-hand side of (9) is less than zero, and the larger its absolute value, the more seriously the model misjudges the sample. Hence, for a misjudged sample (X_i, Y_i), the loss of the perceptron can be expressed as (10):

$$L(W, b) = -Y_i (W X_i + b) \quad (10)$$

The perceptron is trained by the gradient descent method to adjust the weight and bias. The partial derivatives of the loss function with respect to the weight and bias are expressed as follows:

$$\frac{\partial L}{\partial W} = -Y_i X_i^T \quad (11)$$

$$\frac{\partial L}{\partial b} = -Y_i \quad (12)$$

Therefore, in the training stage, for each misjudgment, the weight and bias are updated by (13) and (14):

$$W \leftarrow W + \eta Y_i X_i^T \quad (13)$$

$$b \leftarrow b + \eta Y_i \quad (14)$$

where η is the learning rate. However, when the bias was updated by (14), the obtained decision line could classify the samples, but the line was located at the edge of the solution area, so it is not reliable. To solve this problem, we adopted a bias adjustment strategy proposed in chapter 4 of the previous literature [36], which is expressed in (15):

$$b \leftarrow b + \eta Y_i R^2 \quad (15)$$

where R is the maximum L2 norm of the feature vectors over all training samples. When (15) was used, the obtained decision line could correctly classify the samples, and the line was located in the middle of the solution area, so it is more reliable than the line obtained with (14). When the second subset was utilized for testing (at K = 11), the above phenomenon was obvious. Figs. 5(a) and (b) plot the decision lines constructed by the perceptron on the training data set when the updating strategies of (14) and (15) were used, respectively. It can be seen that the updating strategy of (15) in Fig. 5(b) is more reliable than that of (14) in Fig. 5(a). Therefore, (13) and (15) were used to update the weight and bias.

4. Build a new training data set: the predictors are the recorded probabilities, and the response variable is the label of the training data.
5. Initialize the perceptron: W is randomly generated and b is set to 0.
6. Traverse the samples in the new training data set. If (9) is not satisfied, update W and b using (13) and (15).
7. Repeat step 6 until all of the samples satisfy (9).
8. Return LOG, RF and the perceptron.

Testing stage
Input: Test data
Output: Sample category
Procedure
1. Input the data into LOG and RF, and record the probabilities that the samples are judged as notckd by them.
2. Input the probabilities into the perceptron to obtain the result.
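
The data flow of the two stages can be sketched as follows. The component models are represented by hypothetical stand-in functions that return a notckd probability; in practice they would be the fitted LOG and RF. The single made-up feature, the thresholds and the perceptron parameters are invented for illustration only.

```python
def build_meta_dataset(prob_fns, samples, labels):
    """Training step 4: pair each sample's component probabilities with its label."""
    return [([fn(x) for fn in prob_fns], y) for x, y in zip(samples, labels)]


def integrated_predict(prob_fns, weights, bias, sample):
    """Testing stage: component probabilities -> perceptron -> category.

    A score of exactly 0 is mapped to "ckd" here; the text leaves that
    case undefined, so this is an assumption of the sketch.
    """
    probs = [fn(sample) for fn in prob_fns]
    score = sum(w * p for w, p in zip(weights, probs)) + bias
    return "notckd" if score > 0 else "ckd"


# Hypothetical stand-ins for the fitted LOG and RF models. Each looks at one
# made-up numeric feature; real models would use the full feature vector.
def log_prob(sample):
    return 0.9 if sample["feature"] < 1.2 else 0.2


def rf_prob(sample):
    return 0.8 if sample["feature"] < 1.5 else 0.1


train_samples = [{"feature": 0.8}, {"feature": 1.3}, {"feature": 4.0}]
train_labels = [1, 1, -1]  # +1 = notckd, -1 = ckd

meta = build_meta_dataset([log_prob, rf_prob], train_samples, train_labels)
print(meta)

# Hypothetical perceptron parameters, as if returned by the training stage.
weights, bias = [1.0, 1.0], -0.6
predictions = [integrated_predict([log_prob, rf_prob], weights, bias, s)
               for s in train_samples]
print(predictions)
```

In this toy case the second sample is misjudged by the LOG stand-in alone (probability 0.2) but is still classified as notckd once both probabilities are combined, which is the situation the integrated model is designed to exploit.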

IV. EXPERIMENTS AND EVALUATIONS
To verify whether the integrated model improves on the component models, we first used the same random number seed, 1234, to establish and evaluate the integrated model on each complete data set; the resulting confusion matrices are shown in Table 9. Comparing Tables 9 and 5 shows that the integrated model improves on the component models and achieves an accuracy of 100% when K equals 3 and 11. When K equals 5, 7 and 9, the integrated model improves on LOG and has the same accuracy as RF. Next, for a comprehensive evaluation, we removed the random number seed 1234, which was used to divide the data into four subsets and to establish the RF. The integrated model was then run 10 times on the complete data sets. The average results of the integrated model and the two component models are shown in Table 10. The integrated model has the best performance in detecting the two categories, because it achieves the highest accuracy and F1 scores under almost all conditions. Its accuracy and F1 scores improve on those of the component models to varying degrees, and it also improves on their sensitivity. Thus, the integrated model does improve on the component models and achieves a good result. We also compared the methodology in this study (LOG, RF and the integrated model) with the models built on the same data in previous studies (called contrast models); the comparison is shown in Table 11. Although the performance of the LOG established in this work is relatively low compared with some models from previous studies, it is still better than half of the contrast models. The performance of the RF is superior to that of most models built in previous works and matches that of some models [29], [30].
The proposed integrated model improves on the individual component models and is superior to almost all of the contrast models; its highest accuracy and F1 score reach 100% in Table 9.
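
For reference, the accuracy, sensitivity and F1 score used in this comparison can be computed from a binary confusion matrix as sketched below. The counts are invented for illustration and are not taken from Tables 9 to 11; treating ckd as the positive class is also an assumption of this sketch.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common metrics from confusion-matrix counts.

    tp/fn: positive samples judged correctly/incorrectly,
    tn/fp: negative samples judged correctly/incorrectly.
    Assumes every denominator below is non-zero.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # also called recall
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
    }


# Invented counts for a 100-sample test fold (not results from this study).
metrics = binary_metrics(tp=61, fp=1, fn=2, tn=36)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because F1 combines precision and sensitivity, it penalizes a model that misses positive samples even when its overall accuracy remains high.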
Our results show the feasibility of the proposed methodology. With KNN imputation, LOG, RF, SVM and FNN achieved better performance than with random imputation or mean and mode imputation. KNN imputation can fill in the missing values in the data set when the diagnostic categories are unknown, which is closer to the real-life medical situation. Through the misjudgment analysis, LOG and RF were selected as the component models. The LOG achieved an accuracy of around 98.75%, which indicates that most samples in the data set are linearly separable. The RF performed better than the LOG, with an accuracy of around 99.75%. From Tables 7 and 8, the misjudgments produced by LOG and RF are different in almost all cases, and both models are relatively fast to compute. Therefore, an integrated model combining LOG and RF was established to improve on the component models. The simulation results show that integrating several different classifiers is feasible and effective. We speculate that this methodology could be extended to more complex situations. When processing more complex data, various algorithms are first tried to establish models. After misjudgment analysis, the better algorithms that produce different misjudgments are extracted as component models. An integrated model is then established to improve the performance of the classifier. From Tables 10 and 11, it can be seen that the proposed methodology improves on the otherwise independent models and achieves comparable or better performance than the models proposed in previous studies. In addition, the CKD data set is composed of mixed variables (numeric and categorical), so similarity evaluation methods for mixed data, such as the general similarity coefficient [37], could be used to calculate the similarity between samples.
In this study, we used the Euclidean distance to evaluate the similarity between samples, and KNN obtained a good result based on it, with a highest accuracy of 99.25%. Therefore, we did not use other methods to evaluate the similarity between samples.
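
To make the two options concrete, the sketch below computes a Euclidean distance over numeric variables and a Gower-style general similarity coefficient over mixed variables. The sample values, the variable ranges and the choice of three variables are invented for illustration; the study itself reports using only the Euclidean distance.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def general_similarity(a, b, numeric_ranges):
    """Gower-style similarity in [0, 1] for mixed variables.

    numeric_ranges[j] is the (positive) range of variable j over the data
    set if it is numeric, or None if the variable is categorical.
    Numeric variables score 1 - |x - y| / range; categorical variables
    score 1 on a match and 0 otherwise; the result is the mean score.
    """
    scores = []
    for x, y, value_range in zip(a, b, numeric_ranges):
        if value_range is None:
            scores.append(1.0 if x == y else 0.0)
        else:
            scores.append(1.0 - abs(x - y) / value_range)
    return sum(scores) / len(scores)


# Two invented samples: two numeric variables and one categorical variable.
sample_a = (50.0, 1.2, "normal")
sample_b = (60.0, 1.7, "abnormal")
ranges = (40.0, 2.0, None)  # hypothetical ranges of the numeric variables

distance = euclidean_distance(sample_a[:2], sample_b[:2])  # numeric part only
similarity = general_similarity(sample_a, sample_b, ranges)
print(distance, similarity)
```

The Euclidean distance here has to ignore or numerically encode the categorical variable, whereas the general similarity coefficient scores it directly, which is the trade-off noted above.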

V. CONCLUSION
The proposed CKD diagnostic methodology is feasible in terms of data imputation and sample diagnosis. After unsupervised imputation of the missing values in the data set by KNN imputation, the integrated model achieved a satisfactory accuracy. Hence, we speculate that applying this methodology to the practical diagnosis of CKD would achieve a desirable effect. In addition, this methodology might be applicable to clinical data of other diseases in actual medical diagnosis. However, owing to limited conditions, the data available for establishing the model are relatively few, comprising only 400 samples. Therefore, the generalization performance of the model might be limited. In addition, because the data set contains only two categories (ckd and notckd), the model cannot diagnose the severity of CKD. In the future, a large number of more complex and representative data will be collected to train the model, to improve its generalization performance while enabling it to detect the severity of the disease. We believe that this model will keep improving as the size and quality of the data increase.
JIONGMING QIN received the B.S. degree in electronic information engineering from Shanxi University, China, in 2018. He is currently pursuing the master's degree with the Graduate School of Electronic Information Engineering, Southwest University, where his research involves machine learning and deep learning.
LIN CHEN received the B.S. degree in electronic and information engineering from the School of Electronics and Information Engineering, Southwest University, Chongqing, in 2018. He is currently pursuing the master's degree with Kyushu University, Fukuoka, Japan, in the research field of gas and odor sensors.