Interpretable Machine Learning for COVID-19: An Empirical Study on Severity Prediction Task

The black-box nature of machine learning models hinders the deployment of some high-accuracy medical diagnosis algorithms. It is risky to put one's life in the hands of models that medical researchers do not fully understand or trust. However, through model interpretation, black-box models can promptly reveal significant biomarkers that medical practitioners may have overlooked due to the surge of infected patients in the COVID-19 pandemic. This research leverages a database of 92 patients with confirmed SARS-CoV-2 laboratory tests between 18th January 2020 and 5th March 2020, in Zhuhai, China, to identify biomarkers indicative of infection severity. We interpret four machine learning models (decision tree, random forests, gradient boosted trees, and neural networks) using permutation feature importance, partial dependence plot, individual conditional expectation, accumulated local effects, local interpretable model-agnostic explanations, and Shapley additive explanation. We find that an increase in N-terminal pro-brain natriuretic peptide, C-reactive protein, and lactic dehydrogenase, and a decrease in lymphocyte count, are associated with severe infection and an increased risk of death, which is consistent with recent medical research on COVID-19 and with other research using dedicated models. We further validate our methods on a large open dataset with 5644 confirmed patients from the Hospital Israelita Albert Einstein, at São Paulo, Brazil from Kaggle, and unveil leukocytes, eosinophils, and platelets as three indicative biomarkers for COVID-19.


I. INTRODUCTION
The sudden outbreak of COVID-19 has caused an unprecedented disruption and impact worldwide. With more than 100 million confirmed cases as of February 2021, the pandemic is still accelerating globally. The disease is transmitted by inhalation or contact with infected droplets with an incubation period ranging from 2 to 14 days [1], making it highly infectious and difficult to contain and mitigate.
With the rapid transmission of COVID-19, the demand for medical supplies goes beyond hospitals' capacity in many countries. Various diagnostic and predictive models are employed to relieve the pressure on healthcare workers. For instance, a deep learning model that detects abnormalities and extracts key features of the altered lung parenchyma using chest CT images is proposed [2]. On the other hand, Caruana et al. exploit intelligible models that use generalized additive models with pairwise interactions to predict the probability of readmission [3]. To maintain both interpretability and complexity, DeepCOVIDNet is presented to achieve predictive surveillance that identifies the most influential features for the prediction of the growth of the pandemic [4] through the combination of two modules. The embedding module takes various heterogeneous feature groups as input and outputs an equidimensional embedding corresponding to each feature group. The DeepFM [5] module computes second and higher-order interactions between them.
Models that achieve high accuracy provide fewer interpretations due to the trade-off between accuracy and interpretability [6]. To be adopted in healthcare systems that require both interpretability and robustness [7], the Multi-tree XGBoost algorithm is employed to identify the most significant indicators in COVID-19 diagnosis [8]. This method exploits the recursive tree-based decision system of the model to achieve high interpretability. On the other hand, a more complex convolutional neural network (CNN) model can discriminate COVID-19 from non-COVID-19 using chest CT images [9]. It achieves interpretability through gradient-weighted class activation mapping, which produces a heat map that visually verifies where the CNN model is focusing. Besides, several model-agnostic methods have been proposed to peek into black-box models, such as Partial Dependence Plot (PDP) [10], Individual Conditional Expectation (ICE) [11], Accumulated Local Effects (ALE) [12], Permutation Feature Importance [13], Local Interpretable Model-agnostic Explanations (LIME) [14], Shapley Additive Explanation (SHAP) [15], and Anchors [16]. Most of these model-agnostic methods are reasoned about qualitatively through illustrative figures and human experience. To quantitatively measure their interpretability, metrics such as faithfulness [17] and monotonicity [18] are proposed.
In this paper, instead of targeting a high-accuracy model, we interpret several models to help medical practitioners promptly discover the most significant biomarkers in the pandemic.
Overall, this paper makes the following contributions:
1) Evaluation: A systematic evaluation of the interpretability of machine learning models that predict the severity level of COVID-19 patients. We experiment with six interpretation methods and two evaluation metrics on our dataset and obtain the same result as research that uses a dedicated model. We further validate our approach on a dataset from Kaggle.
2) Implication: Through the interpretation of models trained on our dataset, we reveal N-Terminal pro-Brain Natriuretic Peptide (NTproBNP), C-Reactive Protein (CRP), lactic dehydrogenase (LDH), and lymphocyte (LYM) as the most indicative biomarkers in identifying patients' severity level. Applying the same approach to the Kaggle dataset, we further unveil three significant features: leukocytes, eosinophils, and platelets.
3) Implementation: We design a system in which healthcare professionals can interact with AI models to combine model insights with medical knowledge. We release our implementation and models for future research and validation. We also summarize two evaluation metrics, faithfulness and monotonicity.

A. Model-Agnostic Methods
In healthcare, restricting adoption to interpretable models alone brings many limitations, while separating explanations from the model affords several beneficial flexibilities [19]. As a result, model-agnostic methods have been devised to provide interpretations without knowing model details.
Partial Dependence Plot: Partial Dependence Plots (PDP) reveal the dependence between the target function and several target features. The partial function $\hat{f}_{x_S}(x_S)$ is estimated by calculating averages in the training data, also known as the Monte Carlo method. After setting up a grid for the features we are interested in (target features), we set all target features in our training set to the value of each grid point, then make predictions and average them at each grid point. The drawback of PDP is that one target feature produces a 2D plot and two produce a 3D plot, while it can be pretty hard for a human to understand plots in higher dimensions.

$$\hat{f}_{x_S}(x_S) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}\left(x_S, x_C^{(i)}\right)$$

Here $x_S$ denotes the target features, $x_C^{(i)}$ the values of the remaining features for the $i$-th training instance, and $n$ the number of training instances.
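The Monte Carlo estimate described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in this study: the function name `partial_dependence`, the toy model `toy_predict`, and the three-row dataset are all hypothetical.

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """Monte Carlo estimate of the partial dependence on one target feature."""
    X = np.asarray(X, dtype=float)
    pd_values = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value                 # force the target feature to the grid value
        pd_values.append(predict(X_mod).mean())   # average predictions over all training rows
    return np.array(pd_values)

# Hypothetical toy model standing in for a trained black box.
def toy_predict(X):
    return 2.0 * X[:, 0] + X[:, 1]

X_train = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]])
grid = np.array([0.0, 1.0, 2.0])
pdp = partial_dependence(toy_predict, X_train, feature=0, grid=grid)
print(pdp)
```

For this toy model the curve rises by 2 per unit of the target feature, offset by the mean of the other feature.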
Individual Conditional Expectation: Individual Conditional Expectation (ICE) is similar to PDP. The difference is that PDP calculates the average over the marginal distribution while ICE keeps the prediction for every instance. Each line in the ICE plot represents the predictions for one individual. Without averaging over all instances, ICE unveils heterogeneous relationships but is limited to only one target feature, since two features result in overlaid surfaces that the human eye cannot distinguish [20].
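A small sketch can show why keeping individual curves matters. The names `ice_curves` and `toy_predict` and the two-row dataset are hypothetical and chosen so that the two individuals respond in opposite directions.

```python
import numpy as np

def ice_curves(predict, X, feature, grid):
    """Return one prediction curve per instance (rows) over the grid (columns)."""
    X = np.asarray(X, dtype=float)
    curves = np.empty((X.shape[0], len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value       # vary only the target feature
        curves[:, j] = predict(X_mod)   # keep every individual prediction
    return curves

# Hypothetical model with an interaction: the effect of x0 depends on the sign of x1.
def toy_predict(X):
    return X[:, 0] * X[:, 1]

X_train = np.array([[0.0, -1.0], [0.0, 1.0]])
grid = np.array([0.0, 1.0, 2.0])
curves = ice_curves(toy_predict, X_train, feature=0, grid=grid)
pdp = curves.mean(axis=0)   # averaging the ICE curves gives the PDP
print(curves)
print(pdp)
```

Here one curve decreases and the other increases, yet their average (the PDP) is flat, which is the kind of heterogeneity that ICE exposes and PDP hides.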
Accumulated Local Effects: Accumulated Local Effects (ALE) averages the changes in the predictions and accumulates them over the local grid. The difference from PDP is that the value at each point of the ALE curve is the difference to the mean prediction calculated in a small window rather than over the whole grid. Thus ALE eliminates the effect of correlated features [20], which makes it more suitable in healthcare because it is usually unreasonable to assume that young people have physical conditions similar to those of the elderly.
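A simplified first-order ALE computation for a single feature might look as follows. This is a sketch under stated assumptions: quantile-based bins, a non-constant feature, and centring by the count-weighted mean over bins. The function `ale_1d` and the toy data are hypothetical and not taken from any particular library.

```python
import numpy as np

def ale_1d(predict, X, feature, n_bins=4):
    """Simplified first-order ALE: accumulate average local prediction changes per bin."""
    X = np.asarray(X, dtype=float)
    x = X[:, feature]
    edges = np.unique(np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)))
    # Assign each row to a bin; rows equal to the lowest edge fall into bin 0.
    bin_idx = np.clip(np.searchsorted(edges, x, side="left") - 1, 0, len(edges) - 2)
    local_effects = np.zeros(len(edges) - 1)
    counts = np.zeros(len(edges) - 1)
    for k in range(len(edges) - 1):
        rows = X[bin_idx == k]
        if len(rows) == 0:
            continue
        lower, upper = rows.copy(), rows.copy()
        lower[:, feature] = edges[k]        # move only within the bin, so the other
        upper[:, feature] = edges[k + 1]    # features keep realistic, observed values
        local_effects[k] = np.mean(predict(upper) - predict(lower))
        counts[k] = len(rows)
    ale = np.cumsum(local_effects)
    ale -= np.sum(ale * counts) / np.sum(counts)   # centre the curve
    return edges[1:], ale

# Hypothetical model and data with two perfectly correlated features.
def toy_predict(X):
    return 3.0 * X[:, 0] + X[:, 1]

x0 = np.arange(8, dtype=float)
X_demo = np.column_stack([x0, x0])
upper_edges, ale = ale_1d(toy_predict, X_demo, feature=0, n_bins=4)
print(upper_edges, ale)
```

Because each row is only shifted to the edges of its own bin, the second feature is never paired with values of the first feature that do not occur in the data.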
Permutation Feature Importance: The idea behind Permutation Feature Importance is intuitive. A feature is significant for the model if there is a noticeable increase in the model's prediction error after its values are permuted. On the other hand, the feature is less important if the prediction error remains nearly unchanged after shuffling.
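The shuffling procedure can be sketched as below, using the misclassification rate as the prediction error. The function `permutation_importance`, the toy classifier, and the synthetic data are hypothetical; the number of repeats and the seeds are arbitrary choices.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Importance of each feature = mean increase in error after shuffling that feature."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    baseline_error = np.mean(predict(X) != y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        errors = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the link between feature j and y
            errors.append(np.mean(predict(X_perm) != y))
        importances[j] = np.mean(errors) - baseline_error
    return importances

# Hypothetical classifier that only looks at feature 0.
def toy_predict(X):
    return (X[:, 0] > 0.5).astype(int)

demo_rng = np.random.default_rng(42)
X_demo = demo_rng.random((200, 2))
y_demo = (X_demo[:, 0] > 0.5).astype(int)
importances = permutation_importance(toy_predict, X_demo, y_demo)
print(importances)
```

Shuffling the unused feature leaves the error unchanged, so its importance is zero, while shuffling the feature the classifier depends on raises the error substantially.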
Local Interpretable Model-agnostic Explanations: Local Interpretable Model-agnostic Explanations (LIME) uses interpretable models to approximate the predictions of the original black-box model in specific regions. LIME works for tabular data, text, and images, but the explanations may not be stable enough for medical applications.
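The core idea of a local surrogate can be sketched with a proximity-weighted linear fit around one instance. This is not the LIME library and omits parts of the published method (interpretable binary representations and feature selection); the function `local_surrogate`, the Gaussian perturbation scale, and the kernel width are assumptions made for illustration.

```python
import numpy as np

def local_surrogate(predict, x, n_samples=500, scale=0.5, kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear model to the black box around the instance x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.size))   # perturbed neighbours of x
    y = predict(Z)                                             # black-box predictions
    dist = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)         # closer samples count more
    A = np.hstack([np.ones((n_samples, 1)), Z - x])            # intercept + centred features
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[1:]                                            # local feature weights

# Hypothetical black box; it is linear here so the expected local weights are known.
def toy_predict(Z):
    return 2.0 * Z[:, 0] - 1.0 * Z[:, 1] + 0.0 * Z[:, 2]

local_weights = local_surrogate(toy_predict, x=[1.0, 2.0, 3.0])
print(local_weights)
```

Different random seeds produce different perturbation samples, which for a nonlinear black box can change the fitted weights; this is one source of the instability mentioned above.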
Shapley Additive Explanation: Shapley Additive exPlanation (SHAP) borrows the idea of the Shapley value from game theory [21], which represents the contribution of each player in a game. Calculating Shapley values is computationally expensive when there are hundreds of features, so Lundberg and Lee proposed a fast implementation for tree-based models to speed up the calculation [15]. SHAP has a solid theoretical foundation but is still computationally slow when explaining a large number of instances.
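To make the notion of a Shapley value concrete, the following sketch computes exact values for a tiny game by averaging each player's marginal contribution over all orderings. The value function `toy_value` is a hypothetical game, and the brute-force enumeration grows factorially with the number of players, which is the cost that fast approximations aim to avoid.

```python
from itertools import permutations

def shapley_values(value, n_players):
    """Exact Shapley values: average marginal contribution over all player orderings."""
    totals = [0.0] * n_players
    orderings = list(permutations(range(n_players)))
    for order in orderings:
        coalition = frozenset()
        for player in order:
            with_player = coalition | {player}
            totals[player] += value(with_player) - value(coalition)
            coalition = with_player
    return [total / len(orderings) for total in totals]

# Hypothetical game: players 0 and 1 each contribute 1, plus a bonus of 2
# when both are present; player 2 contributes nothing.
def toy_value(coalition):
    payoff = 0.0
    if 0 in coalition:
        payoff += 1.0
    if 1 in coalition:
        payoff += 1.0
    if 0 in coalition and 1 in coalition:
        payoff += 2.0
    return payoff

phi = shapley_values(toy_value, 3)
print(phi)
```

The values sum to the payoff of the full coalition, and the interaction bonus is split equally between the two players that create it.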
To summarize, PDP, ICE, and ALE only use graphs to visualize the impact of different features while Permutation Feature Importance, LIME, and SHAP provide numerical feature importance that quantitatively ranks the importance of each feature.

B. Metrics for Interpretability Evaluation
Different interpretation methods try to find out the most important features to provide explanations for the output.But as Doshi-Velez and Kim questioned, "Are all models in all defined-to-be-interpretable model classes equally interpretable?" [6] And how can we measure the quality of different interpretation methods?
Faithfulness: Faithfulness incrementally removes each of the attributes deemed important by the interpretation method and evaluates the effect on the model's performance. It then returns the correlation between the importance weights of the attributes and the corresponding effect on the classifier's performance [17].
Monotonicity: Monotonicity incrementally adds each attribute in order of increasing importance. As each feature is added, the performance of the model should correspondingly increase, thereby resulting in monotonically increasing model performance. The metric returns True or False [18].
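A simplified version of the faithfulness computation for a single instance is sketched below. It assumes that "removing" an attribute means replacing it with a baseline value and that performance is measured by the predicted probability; the function `faithfulness`, the logistic toy model, and the all-zero baseline are hypothetical.

```python
import numpy as np

def faithfulness(predict_proba, x, importances, baseline):
    """Negative correlation between importances and the probability left after removal."""
    x = np.asarray(x, dtype=float)
    importances = np.asarray(importances, dtype=float)
    probs = np.empty(x.size)
    for i in range(x.size):
        x_removed = x.copy()
        x_removed[i] = baseline[i]          # "remove" attribute i by resetting it to the baseline
        probs[i] = predict_proba(x_removed)
    # Removing an important attribute should lower the probability, so a faithful
    # explanation gives a strongly negative correlation; the sign is flipped.
    return -np.corrcoef(importances, probs)[0, 1]

# Hypothetical logistic model with known weights 3.0, 1.0 and 0.2.
def toy_proba(x):
    return 1.0 / (1.0 + np.exp(-(3.0 * x[0] + 1.0 * x[1] + 0.2 * x[2] - 2.0)))

x = np.ones(3)
baseline = np.zeros(3)
score_good = faithfulness(toy_proba, x, [3.0, 1.0, 0.2], baseline)   # matches the model
score_bad = faithfulness(toy_proba, x, [0.2, 1.0, 3.0], baseline)    # reversed ranking
print(score_good, score_bad)
```

An explanation whose ranking matches the model scores close to 1, while the reversed ranking scores below zero.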
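The monotonicity check can be sketched in the same spirit. Starting from a baseline, attributes are restored one at a time in order of increasing importance, and the metric reports whether the predicted probability never decreases. The function `monotonicity`, the toy model, and the baseline are hypothetical.

```python
import numpy as np

def monotonicity(predict_proba, x, importances, baseline):
    """True if the probability never drops as attributes are added by increasing importance."""
    x = np.asarray(x, dtype=float)
    x_current = np.array(baseline, dtype=float)
    probs = []
    for i in np.argsort(importances):       # least important attribute first
        x_current[i] = x[i]                 # add attribute i back
        probs.append(predict_proba(x_current))
    return bool(np.all(np.diff(probs) >= 0))

# Hypothetical logistic model in which the second attribute lowers the probability.
def toy_proba(x):
    return 1.0 / (1.0 + np.exp(-(3.0 * x[0] - 1.0 * x[1] + 0.5 * x[2])))

x = np.ones(3)
baseline = np.zeros(3)
mono_signed = monotonicity(toy_proba, x, [3.0, -1.0, 0.5], baseline)    # signed importances
mono_absolute = monotonicity(toy_proba, x, [3.0, 1.0, 0.5], baseline)   # magnitudes only
print(mono_signed, mono_absolute)
```

With magnitude-based importances, adding an attribute that pushes against the predicted class makes the probability drop, so the check fails even though the ranking by magnitude is reasonable.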
In our experiment, both faithfulness and monotonicity are employed to evaluate the interpretation of different machine learning models.

III. EMPIRICAL STUDY ON COVID
In this section, features in our raw dataset and procedures of data preprocessing are introduced. After preprocessing, four different models (decision tree, random forests, gradient boosted trees, and neural networks) are trained on the dataset. Model interpretation is then employed to understand how different models make predictions, and the patients whom the models misdiagnose are investigated individually.

A. Dataset and Preprocessing
The raw dataset consists of patients with confirmed SARS-CoV-2 laboratory tests between 18th Jan. 2020 and 5th Mar. 2020, in Zhuhai, China. Our Research Ethics Committee waived written informed consent for this retrospective study that evaluated de-identified data and involved no potential risk to patients. All patient data were anonymized before analysis.
Tables in the Appendix list all 74 features in the raw dataset, consisting of Body Mass Index (BMI), Complete Blood Count (CBC), Blood Biochemical Examination, inflammatory markers, symptoms, and anamneses, among others. Whether or not healthcare professionals order a test for a patient depends on various factors such as medical history and physical examination. Thus, there is no standard set of tests that is compulsory for every individual, which introduces data sparsity. For instance, Left Ventricular Ejection Fraction (LVEF) is mostly empty because most patients are not required to take the color Doppler ultrasound test.
After pruning out irrelevant features, such as patients' medical numbers that provide no medical information, and features that have no patient records (no patient took the test), 86 patients' records with 55 features are selected for further investigation. Among those, 77 records are used for training and cross-validation, and 9 are reserved for testing. The target label for classification is Severity01, which indicates normal with 0 and severe with 1. More detailed descriptions of the features in our dataset are listed in the Appendix.
Feature engineering is applied before training and interpreting our models, as some features may not provide valuable information or provide redundant information.
First, constant and quasi-constant features were removed. For instance, the two features PCT2 and Stomachache have the same value for all patients, providing no valuable information for distinguishing normal from severe patients.
Second, correlated features were removed because they provide redundant information. Table I lists all correlated features using Pearson's correlation coefficient.
1) There is a strong correlation between cTnICKMBOrdinal1 and cTnICKMBOrdinal2 because they are the same test taken within a short period of time; the same holds for LYM1 and LYM2.
2) LDH and HBDH levels are significantly correlated with heart diseases, and the HBDH/LDH ratio can be calculated to differentiate between liver and heart diseases.
3) Neutrophils (NEU1/NEU2) are all correlated with the immune system. In fact, most of the white blood cells that lead the immune system's response are neutrophils. Thus, there is a strong correlation between NEU1 and WBC1, and between NEU2 and WBC2.
4) In the original dataset, there is not much information about N2L2, which is correlated with NTproBNP; thus NTproBNP is retained.
5) The correlation between BMI and weight is straightforward because Body Mass Index (BMI) is a person's weight in kilograms divided by the square of height in meters.
Third, statistical methods that calculate mutual information are employed to remove features with redundant information. Mutual information is calculated using equation 2, which determines how similar the joint distribution p(X, Y) is to the product of the individual distributions p(X)p(Y). The univariate test measures the dependence of two variables, and a high p-value indicates a less similar distribution between X and Y.
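For reference, the standard definition of mutual information for discrete variables, which matches the description above, is given below. The symbols follow the text; that this is the exact form the text refers to as equation 2 is an assumption.

```latex
I(X; Y) = \sum_{x} \sum_{y} p(x, y) \, \log \frac{p(x, y)}{p(x)\, p(y)}
```

The quantity is zero exactly when $p(x, y) = p(x)\,p(y)$ for all $x$ and $y$, that is, when $X$ and $Y$ are independent, and it grows as the joint distribution departs from the product of the marginals.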
After feature engineering, there are 37 features left for training and testing.
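The first two filtering steps can be sketched as follows. This is an illustration rather than the preprocessing code used in the study: the function `drop_constant_and_correlated`, the variance tolerance, the correlation threshold of 0.8, and the five-row toy table are all assumptions.

```python
import numpy as np

def drop_constant_and_correlated(X, names, var_tol=1e-12, corr_threshold=0.8):
    """Drop constant columns, then keep one column from each highly correlated group."""
    X = np.asarray(X, dtype=float)
    # Step 1: remove features whose value is (almost) the same for every patient.
    keep = [j for j in range(X.shape[1]) if X[:, j].var() > var_tol]
    # Step 2: greedily keep a feature only if it is not strongly correlated
    # (Pearson, absolute value) with a feature that has already been kept.
    corr = np.abs(np.atleast_2d(np.corrcoef(X[:, keep], rowvar=False)))
    selected = []
    for a in range(len(keep)):
        if all(corr[a, b] < corr_threshold for b in selected):
            selected.append(a)
    return [names[keep[a]] for a in selected]

# Hypothetical values; BMI is derived from weight, Stomachache is constant.
weight = np.array([60.0, 70.0, 80.0, 90.0, 65.0])
bmi = weight / 1.7 ** 2
crp = np.array([5.0, 80.0, 3.0, 40.0, 60.0])
stomachache = np.zeros(5)
X_demo = np.column_stack([weight, bmi, crp, stomachache])
kept = drop_constant_and_correlated(X_demo, ["Weight", "BMI", "CRP", "Stomachache"])
print(kept)
```

Which member of a correlated pair survives depends on column order in this greedy sketch; in practice the choice can be guided by data completeness, as with N2L2 and NTproBNP above.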

B. Training Models
Machine learning models outperform humans in many different areas in terms of accuracy. Interpretable models such as the decision tree are easy to understand but not suitable for large-scale applications. Complex models achieve high accuracy while giving less explanation.
For healthcare applications, both accuracy and interpretability are significant. Four different models are selected to extract information from our dataset: Decision Tree, Random Forests, Gradient Boosted Trees, and Neural Networks.
Decision Tree: Decision Tree (DT) is a widely adopted method for both classification and regression. It is a nonparametric supervised learning method that infers decision rules from data features. The decision tree tries to find decision rules that make the best split as measured by Gini impurity or entropy. More importantly, the generated decision tree can be visualized and is thus easy to understand and interpret [22].
Random Forest: Random Forests (RF) is an ensemble learning method [23] that employs a bagging strategy. Multiple decision trees are trained using the same learning algorithm, and predictions are then aggregated from the individual decision trees. Random forests produce great results most of the time even without much hyperparameter tuning. As a result, the method has been widely accepted for its simplicity and good performance. However, it is rather difficult for humans to interpret hundreds of decision trees, so the model itself is less interpretable than a single decision tree.
Gradient Boosted Trees: Gradient Boosted Trees is another ensemble learning method, which employs a boosting strategy [24]. By sequentially adding one decision tree at a time, gradient boosted trees combine results along the way. With fine-tuned parameters, gradient boosting can result in better performance than random forests. Still, it is tough for humans to interpret a sequence of decision trees, and such models are thus considered black boxes.
Neural Networks: Neural Networks could be the most promising model for achieving high accuracy, and they even outperform humans in medical imaging [25]. Though the whole network is difficult to understand, deep neural networks are stacks of simple layers and can thus be partially understood by visualizing the outputs of intermediate layers [26].
As for the implementation, no hyperparameters are tuned for the decision tree. For random forests, 100 trees are used at initialization. The hyperparameters for gradient boosted trees are selected according to prior experience. The structure of the neural network is listed in Table III. All these methods are implemented using scikit-learn [27], Keras, and Python 3.6.
After training, gradient boosted trees and neural networks achieve the highest precision on the test set. Among the 9 patients in our test set, four are severe. Both the decision tree and random forests fail to identify two severe patients, while gradient boosted trees and neural networks find all of the severe patients.
According to medical knowledge, CRP refers to C-Reactive Protein, which increases when there is inflammation or viral infection in the body. CRP levels are positively correlated with lung lesions and could reflect disease severity [28]. NTproBNP refers to N-Terminal prohormone of Brain Natriuretic Peptide, which is released in response to changes in pressure inside the heart. The CRP level in severe patients rises due to viral infection, and patients with a higher NTproBNP level (above 88.64 pg/mL) had a higher risk of in-hospital death [29].

D. Interpretation (PDP, ICE, ALE)
After recognizing the most important features, PDP, ICE, and ALE are employed to further visualize how CRP and NTproBNP relate to the predicted severity.
In the PDPs, all four models indicate a higher risk of turning severe as NTproBNP and CRP increase, which is consistent with the retrospective study on COVID-19. The difference is that the models have different tolerances to, and dependence on, NTproBNP and CRP. On average, the decision tree has less tolerance for a high level of NTproBNP (>2000 ng/ml), and gradient boosted trees give a much higher probability of death as CRP increases. Since PDPs only calculate an average over all instances, we use ICEs to identify heterogeneity.
ICE reveals individual differences. Though all of the models predict a higher risk of turning severe as NTproBNP and CRP increase, some patients have a much higher initial probability, which indicates that other features have an impact on the overall prediction. For example, elderly people have higher NTproBNP than young people and have a higher risk of turning severe.
In the ALEs, as NTproBNP and CRP get higher, all four models give a more positive prediction of turning severe, which coincides with medical knowledge.

E. Misclassified Patients
Even though the most important features revealed by our models are medically meaningful, some severe patients are not recognized. Both gradient boosted trees and neural networks recognize all severe patients and yield a recall of 1.00, while the decision tree and random forests fail to identify two of them.
Patient No. 2 (normal) is predicted to turn severe with a probability of 0.53, which is close to the decision boundary (0.5). For patient No. 5 (severe), in contrast, the model gives a relatively low probability of turning severe (0.24).
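As a worked check of the recall figures, the counts stated in the text (four severe patients in the test set, two of them missed by the decision tree and random forests) can be plugged into the definition of recall. The helper `recall` is hypothetical, and the counts are taken at face value from the prose rather than from the classification table.

```python
def recall(true_positives, false_negatives):
    """Recall = correctly identified severe patients / all severe patients."""
    return true_positives / (true_positives + false_negatives)

# Gradient boosted trees and neural networks: all 4 severe patients found.
recall_gbt_nn = recall(true_positives=4, false_negatives=0)
# Decision tree and random forests: 2 of the 4 severe patients missed.
recall_dt_rf = recall(true_positives=2, false_negatives=2)
print(recall_gbt_nn, recall_dt_rf)
```

With only four severe patients in the test set, each missed patient changes recall by 0.25, so these figures should be read with the small sample size in mind.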

F. Interpretation (False Negative)
Suppose different models represent different doctors; then the decision tree and random forests make the wrong diagnosis for patient No. 5. The reason human doctors classified the patient as severe is that he actually needed a respirator to survive. To further investigate why the decision tree and random forests make wrong predictions, Local Interpretable Model-agnostic Explanations (LIME) and Shapley Additive Explanation (SHAP) are employed.
LIME: Features in green have a positive contribution to the prediction (increasing the probability of turning severe), and features in red have a negative effect on the prediction (decreasing the probability of turning severe).
SHAP: Features pushing the prediction to be higher (severe) are shown in red, and those pushing the prediction to be lower (normal) are in blue.
1) Wrong Diagnoses: Take the decision tree as an example. In Fig. 5a, the explanation by LIME illustrates that NTproBNP and CRP are two features (in green) that have a positive impact on the probability of turning severe. Even though patient No. 5 is indeed severe, the decision tree gives an overall prediction of normal (false negative). Thus, we would like to investigate features that have a negative impact on the probability of turning severe.
In Fig. 6c, the explanation by SHAP reveals that the patient is diagnosed as normal by the decision tree because the patient has no symptoms. Even though the patient has high NTproBNP and CRP, having no symptoms makes it less likely that he is classified as severe. The record was taken when the patient came to the hospital for the first time. It is likely that the patient developed symptoms later and turned severe.
However, neither gradient boosted trees nor neural networks are deceived by the fact that the patient has no symptoms. Their predictions indicate that the patient is likely to turn severe in the future.
2) Correct Diagnoses: In Fig. 6c and Fig. 6d, gradient boosted trees and neural networks do not prioritise the symptom feature. They put more weight on test results (NTproBNP and CRP). Thus they make correct predictions based on the fact that the patient's test results are serious.
Besides, neural networks notice that the patient is elderly (Age = 63). If we calculate the average age at different severity levels, it is noticeable that elderly people are more likely to deteriorate. Gradient boosted trees and neural networks make correct predictions because they place more trust in test results, while the decision tree relies more on whether or not a patient has symptoms. As a result, gradient boosted trees and neural networks are capable of recognizing patients that are likely to turn severe in the future, while the decision tree makes predictions relying more on patients' current situation.
Medical research is a case-by-case study. Every patient is unique. It is hard to find a single criterion that suits every patient, so it is important to focus on each patient and make a diagnosis accordingly. This is one of the benefits of using interpretable machine learning. It unveils the most significant features for most patients and provides an interpretation for each patient as well.

G. Interpretation (False Positive)
With limited medical resources at the initial outbreak of the pandemic, it is equally important to investigate false positives, so that valuable resources can be distributed to patients in need.
In Table VI, patient 2 is normal, but all of our models diagnose the patient as severe. To further explain the false positive prediction, Table VIII lists anonymized medical records for patient 2 (normal) and patient 5 (severe) for comparison. In Table IX, NTproBNP, CRP, LYM, and LDH are the most common features that are deemed crucial by all models. The three features CRP, LYM, and LDH are listed as the most indicative biomarkers in the COVID-19 guideline. The correlation between NTproBNP and COVID-19 is investigated in a paper from the World Health Organization (WHO) global literature on coronavirus disease, which reveals that elevated NTproBNP is associated with increased mortality in patients with COVID-19 [30].
As a result, the false-positive prediction is consistent with doctors' diagnoses. Patient 2, who is normal, is diagnosed as severe by both doctors and models. One possibility is that even though the patient's test results are not optimistic, he did not require a respirator to survive when he came to the hospital for the first time, so he was classified as normal. In this way, models' predictions can act as a warning. If a patient is diagnosed as severe by models, and the prediction is in accordance with medical knowledge, but the patient feels normal, we can suggest that the patient pay more attention to his health condition.
In conclusion, as illustrated previously in the explanation for patient 5 (false negative), every patient is unique. Some patients are more resistant to viral infection, while some are more vulnerable. Pursuing a perfect model is tough in healthcare, but we can try to understand how different models make predictions using interpretable machine learning to be more responsible with our diagnoses.

H. Evaluating Interpretation
Although we find some indicative symptoms of COVID-19 through model interpretation, they are confirmed as credible only because these interpretations are corroborated by medical research. If we use the interpretation to understand a new virus at the early stage of an outbreak, there will be less evidence to support our interpretation. Thus we use Monotonicity and Faithfulness to evaluate different interpretations using the IBM AIX 360 toolbox [31]. The decision tree only provides a binary prediction (0 or 1) rather than a probability between 0 and 1, so it cannot be evaluated using Monotonicity and Faithfulness.
Faithfulness (ranging from -1 to 1) reveals the correlation between the importance assigned by the interpretability algorithm and the effect of each attribute on the performance of the model. All of our interpretations receive good faithfulness scores, and SHAP receives a higher faithfulness score than LIME on average. The interpretation by SHAP receives better results because the Shapley value is calculated by removing the effect of specific features, which is similar to how faithfulness is computed, so SHAP is more akin to faithfulness.
As for monotonicity, most interpretation methods receive a False even though we do find valuable conclusions from the interpretations. The difference between faithfulness and monotonicity is that faithfulness incrementally removes each of the attributes, while monotonicity incrementally adds each of the attributes. When attributes are added incrementally, the model may initially be unable to make correct predictions with only one or two features, but this does not mean these features are not important. Evaluation metrics for different interpretation methods are still an active research direction, and our results may hopefully stimulate further research on the development of better evaluation metrics for interpreters.

I. Summary
In this section, the interpretation of four different machine learning models reveals that N-Terminal pro-Brain Natriuretic Peptide (NTproBNP), C-Reactive Protein (CRP), lactic dehydrogenase (LDH), and lymphocyte (LYM) are the four most important biomarkers that indicate the severity level of COVID-19 patients. In the next section, we further validate our methods on two datasets to corroborate our proposal.

IV. VALIDATION ON OTHER DATASETS
At the initial outbreak of the pandemic, our research leverages a database consisting of patients with confirmed SARS-CoV-2 laboratory tests between 18th January 2020 and 5th March 2020, in Zhuhai, China, and reveals that an increase in NTproBNP, CRP, and LDH, and a decrease in lymphocyte count, indicate a higher risk of death. However, the dataset has a limited record of 92 patients, which may not be enough to support our proposal. Fortunately, thanks to global cooperation, we have access to larger datasets. In this section, we further validate our methods on two datasets, one with 485 infected patients in Wuhan, China [8], and the other with 5644 confirmed cases from the Hospital Israelita Albert Einstein, at São Paulo, Brazil from Kaggle.

A. Validation on 485 infected patients in China
The medical records of all patients in this dataset were collected between 10th January and 18th February 2020, within a similar date range as our dataset. Yan et al. construct a dedicated, simplified, and clinically operable decision model to rank 75 features in this dataset, and the model demonstrates that three key features, lactic dehydrogenase (LDH), lymphocyte (LYM), and high-sensitivity C-reactive protein (hs-CRP), can help to quickly prioritize patients during the pandemic, which is consistent with our interpretation in Table V.
Findings from the dedicated model are consistent with current medical knowledge. The increase of hs-CRP reflects a persistent state of inflammation [32]. The increase of LDH reflects tissue/cell destruction and is regarded as a common sign of tissue/cell damage, and the decrease of lymphocyte is supported by the results of clinical studies [33].
Fig. 9: A decision rule using three key features and their thresholds in absolute value. Num, the number of patients in a class; T, the number of correctly classified patients; F, the number of misclassified patients [8].
Our methods reveal the same results without the effort of designing a dedicated interpretable model, and can therefore react more promptly to the pandemic. During a pandemic outbreak, a prompt reaction that provides insights on the new virus could save lives and time.

B. Validation on 5644 infected patients in Brazil
Our approach obtains the same result on the dataset with 92 patients from Zhuhai, China, and a medium-size dataset with 485 patients from Wuhan, China. Besides, we further validate our approach on a larger dataset with 5644 patients in Brazil, from Kaggle.
This dataset consists of 111 features including anonymized personal information, laboratory virus tests, urine tests, venous blood gas analysis, arterial blood gases, and blood routine tests, among other features. All data were anonymized following the best international practices and recommendations. The difference between this dataset and ours is that all data are standardized to have a mean of zero and a unit standard deviation, so the original data range that carries clinical meaning is lost. Still, the most important medical indicators can be extracted using interpretation methods. Following the same approach, preprocessing is applied to the dataset to remove irrelevant features such as patients' intention to the ward level, and features that have fewer than 100 patients' records, for instance, urine tests and arterial blood gas tests. In addition, patients that have fewer than 10 records are dropped, because these records do not provide enough information. After preprocessing, we have a full record of 420 patients with 10 features.
After training and interpreting four different models (decision tree, random forests, gradient boosted trees, and neural networks), the most important features are identified and listed in Table XIV. The three most common indicative features are leukocytes, eosinophils, and platelets. According to medical research, patients with increased leukocyte count are more likely to develop critical illness, more likely to be admitted to an ICU, and have a higher rate of death [34]. Du et al. noted that at the time of admission, 81% of the patients had absolute eosinophil counts below the normal range in the medical records of 85 fatal cases of COVID-19 [35]. Wool G.D. and Miller J.L. discovered that COVID-19 is associated with increased numbers of immature platelets, which could be another mechanism for increased clotting events in COVID-19 [36]. In addition, the two datasets collectively reveal that elderly people are more susceptible to the virus. The significant feature NTproBNP in the Chinese dataset is often used to diagnose or rule out heart failure, which is more likely to occur in elderly people. Patients that have abnormally low levels of platelets are also more likely to be older and male [36].
To further validate our interpretation, faithfulness and monotonicity are calculated and listed in tables XV and XVI.Similarly, our interpretations are consistent with medical knowledge and receive a good faithfulness score, but receive a worse score on monotonicity because the calculation procedure of monotonicity is contrary to faithfulness.V. CONCLUSION In this paper, through the interpretation of four different machine learning models, we reveal that N-Terminal pro-Brain Natriuretic Peptide (NTproBNP), C-Reaction Protein (CRP), and lactic dehydrogenase (LDH), lymphocyte (LYM) are the four most important biomarkers that indicate the severity level of COVID-19 patients.Our findings are consistent with medical knowledge and recent research that exploits dedicated models.We further validate our methods on a large open dataset from Kaggle and unveil leukocytes, eosinophils, and platelets as three indicative biomarkers for COVID-19.
The pandemic is a race against time. Using interpretable machine learning, medical practitioners can combine insights from models with their prior medical knowledge to promptly reveal the most significant indicators in early diagnosis and, hopefully, win the race against the pandemic.

Fig. 1: The difference between the usual workflow of machine learning and our approach.

Fig. 2: Partial Dependence Plot: There is a positive correlation between the level of NTproBNP/CRP and the probability of turning severe: as NTproBNP/CRP increases, the average probability (y-axis) of turning severe increases.

Fig. 3: Individual Conditional Expectation: Each colored line represents one patient. As we increase NTproBNP/CRP while keeping other features the same, the probability of turning severe increases for each individual, but each patient has a different starting level because their other physical conditions differ.

Fig. 5: LIME Explanation (False-Negative Patient No. 5): Features in green have a positive contribution to the prediction (increasing the probability of turning severe), and features in red have a negative effect on the prediction (decreasing the probability of turning severe).

Fig. 7: LIME Explanation (False-Positive Patient No. 2): Features in green have a positive contribution to the prediction (increasing the probability of turning severe), and features in red have a negative effect on the prediction (decreasing the probability of turning severe).

Fig. 10: LIME Explanation (Kaggle Patient 0): Features in green have a positive contribution to the prediction (increasing the probability of turning severe), and features in red have a negative effect on the prediction (decreasing the probability of turning severe).

TABLE I: Feature Correlation

TABLE III: The structure of Neural Networks

TABLE IV: Classification Results on our dataset

TABLE V: Five most important features

TABLE VI: Misclassified Patients

TABLE VII: Average Age in different severity levels

TABLE VIII: Record of the false positive Patient 2

We present the test results of both patients to doctors without indicating which patient is severe. All doctors mark patient No. 2 as more severe, which is the same as our models. Doctors' decisions are based on the COVID-19 Diagnosis and Treatment Guide in China. Increased levels of CRP and LDH and a decreased level of LYM are associated with severe COVID-19 infection in the guideline, and patient 2 has higher levels of CRP and LDH and a lower level of LYM than patient 5. As a result, doctors' diagnoses are consistent with models' predictions.

2) Models' Diagnoses: Even though all four models make the same predictions as human doctors, it is important to confirm that models' predictions are in accordance with medical knowledge. Table IX lists the three most important features in the interpretation of LIME and SHAP. More detailed interpretations are illustrated in Fig. 7 and Fig. 8.

TABLE IX: Most important features from LIME, SHAP

TABLE X: Faithfulness Evaluation

TABLE XI: Monotonicity Evaluation

TABLE XII: Patient No. 0 in the Kaggle Dataset

TABLE XIV: Five most important features (Kaggle)

TABLE XVIII: Personal Info

TABLE XIX: Complete Blood Count

TABLE XX: Inflammatory Markers

TABLE XXI: Biochemical Examination

TABLE XXII: Symptoms and Anamneses

TABLE XXIII: Other test results