A machine learning analysis of health records of patients with chronic kidney disease at risk of cardiovascular disease

Chronic kidney disease (CKD) describes a long-term decline in kidney function and has many causes. It affects hundreds of millions of people worldwide. It can have a strong negative impact on patients, especially when combined with cardiovascular disease (CVD): patients with both conditions have lower survival chances. In this context, computational intelligence applied to electronic health records can provide physicians with insights that can help them make better decisions about prognoses or therapies. In this study, we applied machine learning to the medical records of patients with CKD and CVD. First, we predicted whether patients would develop severe CKD, both excluding and including information about the year it occurred or the date of the last visit. Our methods achieved a top mean Matthews correlation coefficient (MCC) of +0.499 in the former case and a mean MCC of +0.469 in the latter. Then, we performed a feature ranking analysis to understand which clinical factors are most important: age, eGFR, and creatinine when the temporal component is absent; hypertension, smoking, and diabetes when the year is present. We then compared our results with the current scientific literature and discussed the different results obtained when the time feature is excluded or included. Our results show that our computational intelligence approach can provide insights about diagnosis and the relative importance of different clinical variables that would otherwise be impossible to observe.

generate better predictions on data of patients with CKD. To the best of our knowledge, no study published before involves the usage of machine learning methods to investigate a dataset of patients with both CKD and CVD.

In this manuscript, we analyzed a dataset of 491 patients from the United Arab Emirates, released by Al-Shamsi and colleagues [28] in 2018 (section II). In their original study, the authors employed multivariable Cox proportional hazards models to identify the independent risk factors causing CKD at stages 3-5. Although this analysis was interesting, it did not involve a data mining step, which could instead retrieve additional information or unseen patterns in these data.

To fill this gap, we perform two analyses here: first, we apply machine learning methods to binarily classify serious CKD development, and then to rank the clinical features by importance. In addition to what Al-Shamsi and colleagues [28] did, we also performed the same analysis excluding the year when the disease happened to each patient (Figure 1).

As major results, we show that computational intelligence is capable of predicting a serious CKD development with or without the time information, and that the most important clinical features change depending on whether the temporal component is considered.

We organize the rest of the paper as follows. After this Introduction, we describe the dataset we analyzed (section II) and the methods we employed (section III). We then report the binary classification and feature ranking results (section IV) and discuss them afterwards (section V). Finally, we recap the main points of this study and mention limitations and future developments (section VI).

When the EventCKD35 variable has value 0, the patient's kidney condition is at stage 1 or 2. Instead, when EventCKD35 equals 1, the patient's kidney is at stage 3, 4, or 5 (Table 1).
Even if the value of eGFR has a role in the definition of […]. More information about this dataset can be found in the original article [28].

The problem described earlier (section I) can be addressed as a conventional binary classification problem, where the goal is to predict EventCKD35 using the data described earlier (section II). This target feature indicates if the patient has chronic kidney disease at stage 3 to 5, which represents an advanced stage.

In binary classification, the problem is to identify the unknown relation R between the input space X (in our case, the features described in section II) and an output space Y ⊆ {0, 1} (in our case, the EventCKD35 target) [32]. Once a relation is established, one can find a way to discover which factors in the input space are most influential for predicting the associated element in the output space, namely to determine the feature importance [33].

Note that X can be composed of categorical features (the values of the feature belong to a finite unsorted set) and numerical features (the values of the feature belong to a possibly infinite sorted set). In the case of categorical features, one-hot encoding [34] can map them into a series of numerical features. The resulting feature space is X ⊆ R^d.
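The one-hot encoding step described above can be sketched as follows. This is a minimal illustration with a hypothetical smoking-status feature, not the study's actual preprocessing code:

```python
def one_hot_encode(values):
    """Map each categorical value to a binary indicator vector,
    one column per category in a fixed, sorted vocabulary."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # a single 1 marks this sample's category
        encoded.append(row)
    return categories, encoded

cats, rows = one_hot_encode(["smoker", "non-smoker", "smoker"])
# cats → ["non-smoker", "smoker"]; rows → [[0, 1], [1, 0], [0, 1]]
```

Each categorical feature thus becomes a small block of numerical columns, which is what makes the resulting feature space a subset of R^d.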

A set of data D_n = {(x_1, y_1), . . . , (x_n, y_n)}, with x_i ∈ X and y_i ∈ Y, is available in a binary classification framework. Moreover, some values of x_i might be missing [35]. In this case, if the missing value is categorical, we introduce an additional category for missing values of the specific feature. Instead, if the missing value is associated with a numerical feature, we replace it with the mean value of the specific feature, and we introduce an additional logical feature to indicate whether the value is missing for a particular sample [35].
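The numerical-imputation rule above (mean replacement plus a logical missingness indicator) can be sketched as follows; a minimal example, where `None` stands for a missing entry:

```python
def impute_numeric(values):
    """Replace missing numeric values (None) with the feature mean
    computed on the observed entries, and return an additional
    logical feature flagging which samples were originally missing."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    imputed = [mean if v is None else v for v in values]
    missing_flag = [1 if v is None else 0 for v in values]
    return imputed, missing_flag

# impute_numeric([2.0, None, 4.0]) → ([2.0, 3.0, 4.0], [0, 1, 0])
```

The extra flag column lets the classifier distinguish a genuinely average value from an imputed one.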

Our goal is to identify a model M : X → Y which best approximates R, through an algorithm A_H characterized by its set of hyper-parameters H. The accuracy of the model M in representing the unknown relation R is measured using different indices of performance (Supplementary information). Since the hyper-parameters H influence the ability of A_H to estimate R, we need to adopt a proper Model Selection (MS) procedure [36]. In this work, we exploited the Complete Cross Validation (CCV) procedure [36]. CCV relies on a simple idea: we resample the original dataset D_n many (n_r = 500) times without replacement to build a training set L_r of size l, while the remaining samples are kept in the validation set V_r, with r ∈ {1, · · · , n_r}. To perform the MS phase, namely to select the best combination of the hyper-parameters H in the set of possible ones H = {H_1, H_2, · · · } using the algorithm A_H, we choose the hyper-parameters which minimize the average error of the model, trained on the […]

VOLUME X, 2021

Note that these methods have been shown to be among the simplest yet best performing methods available in the scientific literature [56], [57]. The difference between the methods is just the functional form of the model which tries to best approximate a learning principle.
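The resampling scheme underlying CCV can be sketched as follows. This is a minimal illustration: the training fraction and seed are assumptions for the example, not the authors' exact settings:

```python
import random

def resample_splits(n, n_r=500, train_fraction=0.8, seed=0):
    """Repeatedly split sample indices into a training set L_r and a
    validation set V_r without replacement, n_r times, as in CCV."""
    rng = random.Random(seed)
    l = int(n * train_fraction)  # size of each training set L_r
    splits = []
    for _ in range(n_r):
        idx = list(range(n))
        rng.shuffle(idx)
        # first l shuffled indices train; the rest validate
        splits.append((idx[:l], idx[l:]))
    return splits
```

Model selection then trains each hyper-parameter combination on every L_r, scores it on the corresponding V_r, and keeps the combination with the best average validation performance.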

For example, Random Forests and XGBoost implement the wisdom-of-the-crowd principle, Support Vector Machines are robust maximum-margin classifiers, and Decision Tree and One Rule represent very easy to interpret models. In this paper we tested multiple algorithms, since the no-free-lunch theorem [58] assures us that it is not possible to know, a priori, which algorithm will perform best on a specific task. We therefore tested the algorithms which have shown good performance on many tasks in the past, and identified the best one for our application.

Feature ranking methods based on Random Forests are among the most effective techniques [59], [60], particularly in the context of bioinformatics [61], [62] and health informatics [63]. Since Random Forests obtained the top prediction scores for binary classification, we focus on this method for feature ranking.

Several measures are available for feature importance in Random Forests. A powerful approach is the one based on the Permutation Importance or Mean Decrease in Accuracy (MDA), where the importance is assessed for each feature by removing the association between that feature and the target. This effect is achieved by randomly permuting [64] the values of the feature and measuring the resulting increase in error. The influence of correlated features is also removed.

In detail, for every tree, the method computes two quantities: the first one is the error on the out-of-bag samples as they are used during prediction, while the second one is the error on the out-of-bag samples after a random permutation of the values of a variable. These two values are then subtracted, and the average of the result over all the trees in the ensemble is the raw importance score for the variable under exam.
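The permutation-importance idea can be sketched as follows. This simplified version permutes one feature column and measures the increase in error of an arbitrary scoring callable, rather than working per tree on out-of-bag samples as the Random Forests MDA does:

```python
import random

def permutation_importance(model_error, X, y, feature, n_repeats=10, seed=0):
    """Raw importance of one feature: the increase in error after
    randomly permuting that feature's column, averaged over repeats.
    `model_error(X, y)` is any callable returning an error rate."""
    rng = random.Random(seed)
    baseline = model_error(X, y)
    increases = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)  # break the feature-target association
        X_perm = [row[:feature] + [c] + row[feature + 1:]
                  for row, c in zip(X, column)]
        increases.append(model_error(X_perm, y) - baseline)
    return sum(increases) / n_repeats
```

A feature whose permutation leaves the error unchanged receives an importance of zero, matching the intuition that the model never relied on it.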

Despite the effectiveness of MDA, when the number of samples is small these methods might be unstable [65]–[67]. For this reason, in this work, instead of running the Feature Ranking (FR) procedure just once, analogously to what we have done for MS and EE, we sub-sample the original dataset and repeat the procedure many times. The final rank of a feature is the aggregation of the different rankings using Borda's method [68].
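The Borda aggregation step can be sketched as follows; a minimal illustration on hypothetical feature names, not the study's actual rankings:

```python
def borda_aggregate(rankings):
    """Aggregate several rankings (each a best-to-worst list of feature
    names) with Borda's method: position i in a ranking of n features
    awards n - i points, and features are sorted by total points."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, feature in enumerate(ranking):
            scores[feature] = scores.get(feature, 0) + (n - pos)
    return sorted(scores, key=lambda f: -scores[f])

# Three sub-sample rankings of three hypothetical features:
final = borda_aggregate([["age", "eGFR", "bmi"],
                         ["eGFR", "age", "bmi"],
                         ["age", "bmi", "eGFR"]])
# final → ["age", "eGFR", "bmi"]
```

Averaging positions this way smooths out the instability of any single sub-sample's ranking.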

Before employing machine learning algorithms, we applied traditional univariate biostatistics techniques to evaluate the relationship between the EventCKD35 target and each feature. We made use of the Mann-Whitney U test (also known as the Wilcoxon rank-sum test) [69] for the numerical features and of the chi-square test [70] for the binary features.
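The statistic at the heart of the Mann-Whitney U test can be sketched as follows (the U statistic only; computing the p-value additionally requires its null distribution or a normal approximation):

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for group_a against group_b: the number of
    (a, b) pairs where a exceeds b, counting ties as half a point."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Feature values split by target class, e.g. eGFR in CKD vs non-CKD
# patients (hypothetical numbers): mann_whitney_u([5, 6], [1, 2]) → 4.0
```

In practice a library routine such as `scipy.stats.mannwhitneyu` would be used; this sketch only shows what the test measures.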

In this section, we report the results for the prediction of chronic kidney disease (subsection IV-A) and its feature ranking (subsection IV-B).

CKD prediction. We report the results obtained for the static dataset version in Table 4. We rank our results by the Matthews correlation coefficient (Table 4); the support vector machine with Gaussian kernel achieved the top specificity and precision.

Because of the imbalance of the dataset (section II), all the classifiers attained better results among the negative data instances (specificity and NPV) than among the positive elements (sensitivity and precision). This happens because each classifier can observe and learn to recognize more individuals without CKD during training, and is therefore more capable of recognizing them than of recognizing patients with CKD during testing.

XGBoost and One Rule obtained Matthews correlation coefficients close to 0, meaning that their performance was similar to random guessing. Random Forests, linear SVM, and Decision Tree were the only methods able to correctly classify most of the true positives (TP rate = 0.792, 0.6, and 0.588, respectively). No technique was capable of making most of its positive predictions correctly: all PPVs are below 0.5 (Table 4).
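The Matthews correlation coefficient used throughout these comparisons can be computed from the confusion-matrix counts; a minimal sketch (the counts in the usage line are hypothetical, not the study's results):

```python
import math

def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts;
    ranges from -1 (total disagreement) through 0 (random guessing)
    to +1 (perfect prediction)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical balanced example: matthews_cc(10, 10, 0, 0) → 1.0
```

Unlike accuracy, the MCC stays near 0 for a classifier that merely predicts the majority class, which is why it is a sensible headline metric on an imbalanced dataset such as this one.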

Regarding negatives, SVM with Gaussian kernel obtained an almost perfect specificity (0.940), while Random Forests achieved an almost perfect NPV of 0.968 (Table 4).

These results show that the machine learning classifiers Random Forests and SVM with Gaussian kernel can efficiently predict patients with CKD and patients without CKD from their electronic health records, with high prediction scores, in a few minutes.

Since Random Forests resulted the best performing classifier, we also included the calibration curve plot [78] of its predictions (Figure 2), for the sake of completeness. The curve follows the trend of the perfect y = x line, translated along the x axis, between approximately 5% and approximately 65%, indicating well calibrated predictions in this interval.
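The points of a calibration curve like the one in Figure 2 can be computed as follows; a minimal sketch that bins predicted probabilities and compares them to observed outcome frequencies (the probabilities in the test are hypothetical):

```python
def calibration_points(probs, labels, n_bins=10):
    """For each probability bin, return (mean predicted probability,
    observed fraction of positives): the points of a calibration
    curve. A well calibrated model lies near the y = x line."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        i = min(int(p * n_bins), n_bins - 1)  # clamp p = 1.0 to last bin
        bins[i].append((p, y))
    points = []
    for b in bins:
        if b:  # skip empty bins
            mean_p = sum(p for p, _ in b) / len(b)
            frac_pos = sum(y for _, y in b) / len(b)
            points.append((mean_p, frac_pos))
    return points
```

Plotting these points against the diagonal shows, as in the figure, over which probability range the classifier's confidence can be read as a frequency.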

CKD prediction including the temporal component. To show a scenario where no previous disease history of a patient is available, we did not include any temporal component providing information about the progress of the disease in the previous analysis. We then decided to perform a stratified prediction including a time feature indicating the year when the patient developed chronic kidney disease, or the year of the last visit for non-CKD patients (Supplementary information). After having included the year information in the dataset, we applied a Stratified Logistic Regression [73], [79], as described earlier (section III).

The presence of the temporal feature actually improved the prediction, allowing the regression to obtain an MCC of +0.469, better than all the MCCs achieved by the classifiers applied to the static dataset version except Random Forests (Table 5). Also in this case, because of the imbalance of the dataset, specificity and NPV result much higher than sensitivity and precision.

This result comes as no surprise: it makes complete sense that the inclusion of a temporal feature describing the trend of a disease could improve the prediction quality.

To better understand the prediction obtained by the Stratified Logistic Regression, we plotted a calibration curve [78] of its predictions (Figure 3). As one can notice, the Stratified […] information does not help us to detect the relevance of the features with enough precision. For this reason, we decided to calculate the feature ranking with machine learning, by employing Random Forests, which is the method that achieved the top performance results in the binary classification earlier (subsection IV-A). We therefore applied the Random Forests feature ranking, and ranked the results by mean accuracy decrease position (Table 7 and Figure 4).

The two rankings show some common aspects, both listing AgeBaseline and eGFRBaseline in top positions, but they also show some significant differences. The biostatistics ranking, for example, lists dBPBaseline as an irrelevant predictive feature (Table 6), while Random Forests puts it in the 4th position out of 19 (Table 7). Also, the biostatistics tests stated that HistoryDiabetes is one of the most significant factors, with a p-value of 0.0005 (Table 6), while the machine learning approach put the same feature in the last position of its ranking.

The two rankings contain other minor differences that we 506 consider unimportant.

FIGURE 2: Calibration curve for the results obtained by Random Forests predictions applied to the dataset excluding the temporal component (Table 4). FIGURE 3: […] (Table 5).

CKD predictive feature ranking considering the temporal component. As we did earlier for the CKD prediction, we decided to re-run the feature ranking procedure by including the temporal component, and to compare the resulting ranking with the previously described ranking generated without it (Table 7). The most relevant differences in ranking positions are the following: […], while it is 9th without considering time.

We also decided to measure the difference between these two rankings through two traditional metrics such as Spearman's rank correlation coefficient […]

The comparison between the ranking without time (Table 7) and the ranking considering time (Table 8) […] kidney disease, in a few minutes, and then use this information to establish the urgency of the case. Our techniques, of course, do not replace laboratory exams and tests, which will still be needed to further verify and understand the prognosis of the disease. However, if used efficiently, our methods can provide quick, reliable information to physicians to help them with medical decision making.
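The Spearman rank-correlation comparison between two rankings mentioned above can be sketched as follows; a minimal illustration on hypothetical position lists, assuming no tied positions:

```python
def spearman_rho(rank_a, rank_b):
    """Spearman correlation between two rankings given as position
    lists (rank_a[i] and rank_b[i] are the positions of feature i
    in the two rankings), assuming no ties: rho = 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Identical rankings give +1; fully reversed rankings give -1:
# spearman_rho([1, 2, 3], [3, 2, 1]) → -1.0
```

A value near zero, as one would expect when the temporal and non-temporal rankings disagree strongly, indicates that the two analyses emphasize largely different clinical features.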

Feature ranking. As mentioned earlier (subsection IV-B), some significant differences emerge between the feature ranking obtained without the time component and generated through Random Forests (Table 7) and the feature ranking obtained considering the year when the patient had the serious CKD development and generated through Stratified Logistic Regression (Table 8).

The features HTNmeds, ACEIARB, and HistoryDiabetes had an increase of 13 positions in the year standing (Table 8), compared to their original positions in the static ranking (Table 7). The feature BMIBaseline also had an increase, of 10 positions. The AgeBaseline variable, instead, had the biggest position drop possible: it moved from the most important feature in the static standing (Table 7) to the least relevant position in the year standing (Table 8). The other variables in the year standing did not show such large position changes.

These results show that taking medication for hypertension, taking ACE inhibitors, having a personal history of diabetes, and body-mass index play an important role in predicting whether a patient will develop serious CKD when the information about the disease event is included. The age of the patient is very important when the CKD year is unknown, but becomes irrelevant here.

Difference between temporal feature ranking and non-temporal feature ranking. The significant differences that emerge suggest strong overlap between the information contained […] clinical factors ranked by importance [10], [12], [21]. Hypertension resulted the 4th most important factor in Salekin's study [6], confirming the importance of the HistoryHTN variable, which is ranked at the 3rd position in our Stratified Logistic Regression ranking (Table 8). Diabetes history also ranks high in both standings: 3rd position in the ranking of Salekin's study [6], and 6th in importance in our Stratified Logistic Regression ranking, as HistoryDiabetes (Table 8).

Chronic kidney disease affects more than 700 million people in the world, and kills approximately 1.2 million of them annually. Computational intelligence can be an effective means to quickly analyze the electronic health records of patients affected by this disease, providing information about how likely they are to develop severe stages of the disease, or stating which clinical variables are the most important for diagnosis.

In this article, we analyzed a medical record dataset of 491 patients from the UAE with CKD and at risk of cardiovascular disease, and developed machine learning methods able to predict, with high accuracy, the likelihood that they will develop CKD at stages 3-5. Afterwards, we employed machine learning to detect the most important variables contained in the dataset, first excluding the temporal component indicating the year when the CKD happened or the patient's last visit, and then including it. Our results confirmed the effectiveness of our approach.

Regarding limitations, we have to report that we performed our analysis on a single dataset only. We looked for alternative public datasets to use as validation cohorts, but unfortunately we could not find any with the same clinical features.

In the future, we plan to further investigate the probability of diagnosis prediction in this dataset through classifier calibration and calibration plots [84], and to perform the feature ranking with a different feature ranking method such as SHapley Additive exPlanations (SHAP) [85]. Moreover, we also plan to study chronic kidney disease by applying our methods to CKD datasets of other types, such as microarray gene expression [86].

[…] Engineering at Università di Genova from 2016 to 2019. In 2018 he was co-founder of the ZenaByte s.r.l. spin-off company. In 2019 he obtained the Italian National Scientific Qualification for the role of full professor in computer science and computer engineering. In 2019 he became associate professor in computer science at Università di Pisa and currently is associate professor in computer engineering at Università di Genova. He has been involved in several Horizon 2020 projects (S2RJU, ICT, DS) and he has been awarded the Amazon AWS Machine Learning and Somalvico (best Italian young AI researcher) Awards. His first main topic of research is statistical learning theory, with particular focus on the theoretical aspects of the problems of (semi-)supervised model selection and error estimation. His second main topic of research is data science, with particular reference to the problem of trustworthy AI and the solution of real-world problems by exploiting and improving the most recent learning algorithms and theoretical results in the fields of machine learning and data mining.

Precision-Recall (PR) curve: true positive rate on the x axis, precision on the y axis (8) (worst value = 0; best value = 1)

ROC curve: false positive rate on the x axis, true positive rate on the y axis (9) (worst value = 0; best value = 1)