Symptom Based Explainable Artificial Intelligence Model for Leukemia Detection

Leukemia is not only fatal in nature but also extremely expensive to treat. However, detecting leukemia at an early stage can save the lives and money of affected people, especially children, among whom leukemia is a very common type of cancer. In this paper, we propose an explainable supervised machine learning model that accurately predicts the likelihood of early-stage leukemia based on symptoms only. The proposed model is developed using primary data collected from two major hospitals in Bangladesh. Sixteen features are collected through a survey of leukemia and non-leukemia patients in consultation with a specialist physician. Our explainable supervised model is based on a decision tree classifier, which provides significantly better results than other algorithms and generates explainable rules that are ready to use. We have also employed the Apriori algorithm to generate explainable rules for leukemia prediction. In addition, feature analysis and feature selection are performed on the dataset to show the strength of individual features and to enhance the performance of the classification models. Several classifiers are evaluated on the dataset to show that the proposed model, which is simple yet explainable, performs significantly better than most of the other models that we have used. The decision tree model proposed in our experiments has achieved an accuracy of 97.45%, a Matthews Correlation Coefficient (MCC) of 0.63 and an area under the Receiver Operating Characteristic (ROC) curve of 0.783 on the test set. We have also made the dataset and the source code of the methods used in this work available for future use by researchers at the following link: https://github.com/AkterHossain312/BDLeukemiaWithML.

The National Cancer Institute, USA, estimated that in the USA alone approximately 60,300 new patients were admitted to hospital due to leukemia in 2019, of whom 24,370 died [1]. This has been a major concern in recent years. In spite of major research endeavours undertaken worldwide to tackle leukemia and its different variants, the death rate from leukemia is alarming, with severe consequences especially for children [2].
Leukemia cells are under-developed blood cells that show the abnormal behavior of growing and dividing uncontrollably. Leukemia is the most common type of cancer in children. Based on the cell types, leukemia is broadly categorized into two types: lymphocytic (or lymphoblastic) and non-lymphocytic (or myeloid). Both can occur in chronic or acute form. A body with leukemia cells develops signs and symptoms. However, the disease is often diagnosed at a late stage, which makes treatment more difficult. Early screening of leukemia can make a great difference by reducing the cost and the related fatality rate and by improving the quality of life of patients.
In general, leukemia detection and screening are carried out in hospitals using various sophisticated methods. They employ blood samples [3], complete blood counts [4]-[7], bone marrow based tests [8], [9], etc. Bone marrow is the source or starting point of leukemia, where lymphocytic or myeloid cells start to develop [10]. Imaging of blood cells also helps to detect leukemia, as shown in different studies [11]-[13]. A very popular dataset for image based leukemia detection is the Acute Lymphoblastic Leukemia Image Database (ALL-IDB), which is extensively discussed in the literature. These image based methods often depend on the sophisticated devices and imaging techniques deployed [14]-[16]. In recent times, genomics methods [17], [18] along with clinical data are in use [19]-[22]. Though the combinations of various methods and multi-modal data are effective, they might be available only at a late stage [23]-[25].
To detect and predict leukemia, various machine learning based methods and algorithms have already been used in the literature. With the increase in available data, it is now possible to formulate the problem as a supervised machine learning problem where knowledge based algorithms are applicable. Some of the successfully deployed algorithms are Support Vector Machines (SVM) [4], [7], [12], Random Forest [19], [20], Decision Trees [4], [5], Neural Networks [9], [11], k-Nearest Neighbor (k-NN) [4], [12], Fuzzy Systems [6], ensemble methods [23], [24], etc. Often these machine learning methods do not provide explainable outcomes and act as black-box prediction models only. Explainable artificial intelligence often helps to assist the business logic by extracting knowledge and rules from the domain. Black-box models, on the other hand, are suitable only when a specific task such as prediction or classification is required.
In the context of Bangladesh, which has recently been elevated to developing-country status, not much work has been done in this regard [26]-[29]. There are several reasons for this. Firstly, there is a lack of datasets. Since most of the hospitals have paper based data recording systems, initial screening records are often not stored and maintained properly. Secondly, the immense workload on the physicians and the diagnostic system delays overall digitization and the adoption of decision support systems. Thirdly, the financial condition of a patient often does not support monitoring within a health framework of the kind available in developed countries.
However, in recent years things have started changing due to the digital transformation in the healthcare sector of Bangladesh. In addition, Bangladesh is among the fastest growing nations in the world in terms of smart phone based Internet users. This has led us to envision smart phone based screening applications to detect leukemia based on symptoms only. A system overview of symptom based leukemia detection is shown in Figure 1. Such a system, if implemented, will be able to detect leukemia at an early stage, and the screening system may help to reduce the overall load on the physicians. It is important to note that there have been a few works on symptom based disease detection [30] and particularly on cancer detection [31]. However, to the best of our knowledge, there is no such work on leukemia detection using symptoms only.
In this paper, we present a symptom based leukemia screening method based on explainable AI models. This work is an extended version of our initial work [32]. In this work, we have collected primary data from the pediatric leukemia wards of two top government hospitals located in Dhaka, the capital city of Bangladesh. The dataset is collected following the guidance and policy administered by the physicians and with the consent of the physicians and subjects. It is observed that a decision tree based supervised learning method is able to predict leukemia based on the early symptoms collected from the patients' data and to provide explanatory analysis. Moreover, the explainable model performs significantly better than other complex and sophisticated black-box type models. The noteworthy contributions made in this paper are as follows:
• A primary dataset on leukemia screening based only on symptoms is collected from the pediatric leukemia wards of two top hospitals in Bangladesh.
• Experimental analysis is carried out to show the performances of different machine learning models, including a detailed hyper-parameter study.
• A feature analysis and selection study is completed to identify the suitable features that can further enhance the performances of the classification models.
• Experimental results demonstrate that explainable AI deploying decision tree and Apriori algorithms shows satisfactory results compared to other methods. Moreover, the related confidence and support of the rules are generated.
The rest of the paper is organized as follows: a brief literature review is presented in section II; the details of the materials and methods are given in section III; the experimental results and the discussion are presented in section IV; and the paper concludes with a summary and an outline of future work in section V.

II. RELATED WORK
There have been several studies to predict leukemia where researchers have applied various machine learning techniques. In this section, first we review the existing works in the global context followed by the present works carried out in the context of Bangladesh.

A. GLOBAL ML BASED RESEARCH ON LEUKEMIA DETECTION
In this subsection, we briefly discuss the cancer detection work done in the literature in the global context. Most of the works differ in the source of the samples from which the data is collected, the type of the data and the algorithms that have been applied. We have organized the section in terms of the type of data that is used. A summary of the methods is given in the upper part of Table 1.

1) Blood Sample based Screening Methods
Blood samples are often used to screen for leukemia. Zelig et al. [3] investigated the effectiveness of Fourier Transform Infrared Microscopy (FTIR-MSP) for pre-screening and follow-up of leukemia patients undergoing chemotherapy. They collected blood samples from leukemia patients before and during the treatment, and from healthy subjects who served as the control group. Often the Complete Blood Count (CBC) test taken on blood samples is used to screen for leukemia [4]-[6]. Daqqa et al. [4] achieved 77.30% accuracy using a decision tree on patients' gender, age and health status data, along with blood characteristics from the CBC test. Mahmood et al. [5] obtained a very high accuracy using Classification and Regression Trees (CART) with CBC, Renal Function Test (RFT) and Liver Function Test (LFT) data. Fathi et al. [6] investigated the differences between acute lymphoblastic leukemia and acute myeloid leukemia using CBC test based data from children. Markiewicz et al. [7] used SVM classifiers to recognize the blood cells of myelogenous leukemia.

2) Bone Marrow based Screening Methods
Bone marrow data based screening methods have been explored in [8]-[10]. Hsieh et al. [8] used SVM on bone marrow and peripheral blood data. Ritter et al. [9] developed a supervised machine learning method using a combination of multiple Gaussian mixture models (GMMs). Leinoe et al. [10] worked on predicting bleeding in the early stages of acute myeloid leukemia by flow cytometry analysis of platelet function.

3) Image based Screening Methods
French-American-British (FAB) classification was used by Shafique et al. [11] to find sub-types of acute lymphoblastic leukemia based on ALL-IDB dataset. Das et al. [12] proposed to use optimized Support Vector Neural Network (SVNN) on the same dataset. Fatma et al. [13] also used the same dataset. They applied a color model considering linear contrast. Using neural networks they gained up to 91% accuracy. Rawat et al. [14] proposed to analyze color, morphology and textual features from blood images. A genetic algorithm was applied for feature optimization using the SVM classifier.
Jha et al. [15] also developed a FAB classification-based identification from the Blood Smear images (BSI). The size, texture properties and color of the segmented image extracted by the neural networks were fed to the SVM and Naive Bayes Classifiers.

4) Genomics Data based Methods
In recent years, genomics and transcriptomics data analysis has been playing a very crucial role in cancer related research. Heresta et al. [17] used combinations of transcriptomics data for acute myeloid leukemia prediction. Lee et al. [18] used gene expressions for the targeted treatment of acute myeloid leukemia.

5) Clinical Data based Methods
Pan et al. [19] applied a forward feature selection algorithm to rank the clinical variables and suggested using Random Forest as a classifier. Chen et al. [20] explored different methods for sensing chronic lymphocytic leukemia using ensemble methods. Lin et al. [21] used auto-encoders to extract high-level features and used them to predict acute myeloid leukemia. Fuse et al. [22] demonstrated the effectiveness of decision tree algorithms for relapse of acute leukemia.

6) Other methods
Kashef et al. [23] used paper-based files and analyzed 31 attributes using stacked ensemble classifier with the high area under receiver operating characteristic (auROC) value. Agius et al. [24] addressed chronic lymphocytic leukemia (CLL) and used 28 different machine learning algorithms on data from 4,149 patients. Karimi et al. [25] conducted a study on the spread of leukemia and lymphoma signs and symptoms in childhood in the southern Iranian province of Fars. They analyzed different symptoms that are highly correlated with different types of leukemia. However, they did not apply machine learning for decision making based on the symptoms.

B. RESEARCH WORKS IN BANGLADESH
Hossain et al. [26] presented a pre-analysis of more than 5,000 confirmed hematological cancer cases from 10 specialized hospitals between January 2006 and December 2012. They mainly showed the prevalence of different types of leukemia among various age groups. Hossain et al. [27] counted the sub-types of blood cells in microscopic images and then attempted to detect leukemia based on the object counts. The authors used Faster RCNN models. For this study, they collected approximately 256 images from Dhaka Shishu Hospital and the National Institute of Cancer Research and Hospital (NICRH). Abedi et al. [28] proposed a scalable leukemia prognosis method based on the universally available ALL-IDB dataset. In another work, Zahra et al. [29] investigated the relationship of gene polymorphism in patients with acute lymphoblastic leukemia (ALL) from Bangladesh.

C. SUMMARY
A summary of the literature review is shown in Table 1. It is observed that various types of features have been used for leukemia prediction and screening. It is to be noted that, at an early stage, leukemia detection is possible only from symptoms. Though a good number of methods are used to analyze the symptoms and their correlations with the types of cancers, the symptoms are mostly used in combination with other sophisticated features in the prediction model. Often, these models are black-box machine learning models and cannot provide insights into the decision making process. Moreover, in the context of Bangladesh, not much work has been performed in this regard. Thus, we find a clear research gap and propose a symptom based early screening method for leukemia using explainable AI models.

III. METHODOLOGY
In this section, we present the methodology used in the proposed framework of symptom based leukemia detection as shown in Figure 1. The figure shows that the screening starts from a simple smart phone based questionnaire that is filled out by a patient. The data sent by the phone is then processed by a server and fed into a machine learning model to obtain the desired detection.
The complete machine learning workflow is presented in Figure 2. First, we have identified the parameters in consultation with specialist physicians. Then we have collected data from the patients using a survey form. Missing data values and unnecessary columns are removed as a part of preprocessing and data cleaning. The dataset is then divided into train and test datasets. Machine learning models are trained using the train dataset, and the resulting models are evaluated on the test dataset to validate the results. The details of these steps are presented in the rest of the section.

A. DATA COLLECTION
The data collection step is guided by the specialist physicians from one of the largest medical universities of Bangladesh, namely Bangabandhu Sheikh Mujib Medical University (BSMMU). After consulting with the physicians, 16 features or symptoms of leukemia are identified. The data collection is performed from two leading hospitals of Bangladesh: Dhaka Shishu (Children) Hospital and the pediatric ward of the National Institute of Cancer Research and Hospital (NICRH), Dhaka. The data collection is performed with necessary permission and ethical clearance from the authority of the hospitals and only from the consenting subjects.
In total, 840 subjects have given consent and participated in the data collection. Among them, 131 patients are from NICRH, with 103 leukemia patients and 28 non-leukemia patients. Data on the remaining 709 patients is collected from Dhaka Shishu Hospital; 510 of them are leukemia patients and 199 are non-leukemia patients. A summary of the collected data is shown in Table 2.
Thereafter, we separate the datasets into train and test datasets. We have kept the dataset from the National Institute of Cancer Research and Hospital (NICRH) as a test set and used the dataset from the Dhaka Shishu Hospital as a train set. Table 3 presents the 16 features of the dataset along with the class label collected in our research. Note that all of the features have binary meaning i.e., the presence and absence of a symptom. The binary levels are shown as zero (0) and one (1). The distribution of the features is also given in the last two columns of the table. Binary levels make the identification of the symptoms simpler from the users' point of view.
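The hospital-based split described above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the per-hospital counts are the ones reported in the text, while the dictionary layout and the helper name `summarize` are assumptions introduced here and are not taken from the released source code.

```python
# Per-hospital class counts as reported in the data collection section.
counts = {
    "Dhaka Shishu Hospital": {"leukemia": 510, "non_leukemia": 199},
    "NICRH": {"leukemia": 103, "non_leukemia": 28},
}

def summarize(split_name, hospitals):
    """Aggregate class counts for the hospitals assigned to one split (hypothetical helper)."""
    leukemia = sum(counts[h]["leukemia"] for h in hospitals)
    non_leukemia = sum(counts[h]["non_leukemia"] for h in hospitals)
    total = leukemia + non_leukemia
    return {
        "split": split_name,
        "total": total,
        "leukemia": leukemia,
        "non_leukemia": non_leukemia,
        "positive_rate": leukemia / total,
    }

# Train on one hospital, test on the other, as stated in the text.
train = summarize("train", ["Dhaka Shishu Hospital"])
test = summarize("test", ["NICRH"])

for part in (train, test):
    print(part["split"], part["total"], round(part["positive_rate"], 3))
```

Both splits come out skewed toward the leukemia class (roughly 72% of the train set and 79% of the test set), which is consistent with the later remark that the dataset is imbalanced and that MCC and auROC are informative alongside accuracy.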

B. DESCRIPTION OF THE FEATURES
The features used in this research are telltale symptoms. Shortness of breath indicates whether someone has long term or short term shortness of breath. Long-term shortness of breath increases the risk of leukemia. People with bone pain have an 80% chance of developing leukemia, especially with pain around the spine. Bone pain at night and fever are also observed in people who have frequent infections. Family history denotes if there is any genetic linkage with the disease. People who have frequent infections are more likely to have leukemia. A leukemia patient with rash and itching on her skin may find small red or purple spots on her skin caused by ruptured blood vessels and capillaries under the skin. Among other important features are loss of appetite or nausea leading to weight loss, persistent weakness and fatigue. Swollen lymph nodes in armpits, neck or groin might be one of the early symptoms of leukemia. If the blood vessels under the skin are damaged patients experience bruising. Leukemia cells can grow in the liver and spleen and make them bigger. It can be noticed as fullness or bloating, or full feeling after eating only a small amount. The other symptoms related to this are enlarged liver, oral cavity, vision blurring, jaundice and night sweats. We have also included smoking as a feature if the patient is exposed to smoke.

TABLE 1 (rows recovered from the extracted text)
Reference | Samples | Data type | Data source | Methods
Shafique et al. [11] | 368 | Image | ALL-IDB | Deep Neural Network
Das et al. [12] | 368 | Image | ALL-IDB | GFNB, ELM, KNN, SVM, Naive Bayes, SSDE-based SVNN
Fatma et al. [13] | 368 | Image | ALL-IDB | Neural Network
Rawat et al. [14] | 420 | Image | American society of hematology | FAB Classification
Jha et al. [15] | _ | Image | Dataset-master and Cellavison | FAB, SVM, Naive Bayesian
Mohapatra et al. [16] | 108 | Image | Ispat General Hospital, Rourkela | K-means Clustering
Heresta et al. [17] | 12 ...
Abedi et al. [28] | 300 | Images | ALL-IDB | Logistic Regression
Zahra et al. [29] | 160 | Genotyped data | Local Hospitals | Statistical Analysis

C. FEATURE SELECTION ALGORITHM
We have used the Least Absolute Shrinkage and Selection Operator (LASSO) model [33] for feature selection. The purpose of the LASSO model is to find the important or dominating features and thus to regularize the data models. It uses the regression coefficients to select the features. LASSO has previously been used in the recent literature for symptom based machine learning methods in disease diagnosis [30] and particularly in cancer detection [31]. In our experiments, we have observed the effectiveness of the feature selection. The importance of the selected features is determined using different types of classification algorithms and evaluation metrics.

TABLE 3 (feature distribution)
No. | Feature | Zero | One
1 | shortness_of_breath | 323 | 517
2 | bone_pain | 203 | 637
3 | fever | 278 | 562
4 | family_history | 229 | 611
5 | frequent_infections | 242 | 598
6 | Itchy_skin_or_rash | 261 | 579
7 | loss_of_appetite_or_nausea | 235 | 605
8 | Persistent_weakness | 231 | 609
9 | swollen,painless_lymph | 235 | 605
10 | significant_bruising_bleeding | 161 | 679
11 | enlarged_liver | 162 | 678
12 | oral_cavity | 242 | 598
13 | vision_blurring | 242 | 598
14 | jaundice | 208 | 632
15 | night_sweats | 234 | 606
16 | Smoke | 259 | 581
17 | ...
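To make the selection step concrete, the following is a minimal sketch of coefficient-based LASSO feature selection with scikit-learn. It runs on synthetic binary data rather than the leukemia dataset, and the regularization strength `alpha=0.05`, the five-feature layout and the zero-coefficient threshold are assumptions chosen for illustration; the paper does not state the settings it used.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic binary "symptom" matrix (assumption: 400 subjects, 5 features).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 5)).astype(float)

# The label depends only on features 0 and 1; features 2-4 are pure noise.
y = ((X[:, 0] + X[:, 1]) >= 1).astype(float)

# The L1 penalty shrinks coefficients of uninformative features to exactly zero.
model = Lasso(alpha=0.05)
model.fit(X, y)

# Keep the features whose regression coefficient is non-zero.
selected = [j for j, coef in enumerate(model.coef_) if abs(coef) > 1e-6]
print("coefficients:", np.round(model.coef_, 3))
print("selected feature indices:", selected)
```

Features whose coefficients are driven to zero are the ones that would be dropped, which mirrors the description of excluding features with regression coefficients close to 0.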

D. CLASSIFICATION METHODS
In our experiments on the dataset, we have used seven different machine learning models: Decision Tree (DT) classifier [34], Random Forest [35], k-Nearest Neighbor [36], Adaboost Classifier [37], Logistic Regression Classifier [38], Naive Bayesian Classifier [38] and Artificial Neural Network [38]. In this section, we briefly discuss these classifiers. The details of the parameters studied for each of these classifiers are given in section IV.

Decision Tree (DT) Classifier
Decision tree classifiers create a structured decision flow by making a decision based on the selected feature at each decision node. The features are selected based on their information content. DT classifiers are used to generate rules that are suitable for explainable AI. Often these rules are interpreted by domain experts. We have used the gini index as the feature selection metric for the decision tree. The gini index [34] is defined as follows.
$$ Gini(q) = 1 - \sum_{i} p_i^{2} \tag{1} $$
Here, $p_i$ denotes the probability of an instance being classified in class $i$ among all possible branches created by the attribute $q$.
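As a worked illustration of the gini index, the sketch below computes the impurity of a set of class labels and the weighted impurity after splitting on one binary feature. The function names and the toy labels are hypothetical and are used here only to show the arithmetic.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_gini_after_split(feature_values, labels):
    """Size-weighted average impurity of the branches created by one feature."""
    n = len(labels)
    total = 0.0
    for value in set(feature_values):
        branch = [y for x, y in zip(feature_values, labels) if x == value]
        total += len(branch) / n * gini_impurity(branch)
    return total

labels = [1, 1, 0, 0]
print(gini_impurity(labels))                            # balanced two-class node
print(weighted_gini_after_split([1, 1, 0, 0], labels))  # feature separates the classes
print(weighted_gini_after_split([1, 0, 1, 0], labels))  # feature carries no information
```

A decision node prefers the feature with the lowest weighted impurity, so the second feature layout above would be chosen over the third.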

Random Forest (RF) Classifier
The Random Forest classifier uses a bootstrapped method to sample the feature space and creates an ensemble of decision trees based on the selected features. In our experiments, we have set gini as the attribute selection metric for the decision trees.
The decision of the ensemble is the weighted average of all the predictions made by the decision trees. Random forest is often used successfully for the classification of large datasets. However, random forests are less interpretable than DTs. They are also used in feature importance analysis. A random forest classifier performs the classification based on the predictions made by the constituent tree classifiers, as defined in the following equation [35].
$$ \hat{y} = \frac{\sum_{i=1}^{T} w_i \, y_i}{\sum_{i=1}^{T} w_i} \tag{2} $$
Here, $y_i$ and $w_i$ denote the class label predicted by tree $i$ and the weight assigned to tree $i$, respectively, and $T$ is the number of trees.

k-NN Classifier
The k-Nearest Neighbor classifier is a lazy, instance based classifier that uses a weighted voting mechanism to predict the class label of an instance based on its neighboring instances. The neighborhood and the weights are defined by the specific distance metric selected. k-NN classifiers do not train explicitly; rather, suitable hyper-parameters are selected and the classification is done in real time. However, they might not be well suited to categorical data, and their behavior depends on the specific label encoding method. The prediction made by a k-NN classifier is a weighted average of the classes of the neighbors, defined by the following equation [36].
$$ \hat{y} = \frac{\sum_{i \in N_k(x)} w_i \, y_i}{\sum_{i \in N_k(x)} w_i} \tag{3} $$
Here, $N_k(x)$ is the set of the $k$ nearest neighbors of the instance $x$, $w_i$ is the weight of instance $i$ in the neighborhood, assigned based on the distance metric, and $y_i$ is the class label of that instance.
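The difference between majority voting and distance weighted voting can be seen in a small sketch. The code below is an assumption-laden illustration, not the configuration used in the experiments: it uses the Hamming distance because the features are binary, and the weight `1 / (1 + distance)` is a hypothetical choice made here to avoid division by zero.

```python
def hamming(a, b):
    """Number of positions at which two binary symptom vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k=3, weighted=False):
    """Predict a label by (optionally distance weighted) voting among the k nearest neighbors."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: hamming(pair[0], query))
    votes = {}
    for x, y in ranked[:k]:
        # Assumed weighting scheme: closer neighbors get larger weights.
        w = 1.0 / (1.0 + hamming(x, query)) if weighted else 1.0
        votes[y] = votes.get(y, 0.0) + w
    return max(votes, key=votes.get)

# Toy data: one identical positive neighbor, two more distant negative neighbors.
X = [(1, 1), (0, 1), (0, 0)]
y = [1, 0, 0]

print(knn_predict(X, y, (1, 1), k=3))                 # majority vote
print(knn_predict(X, y, (1, 1), k=3, weighted=True))  # distance weighted vote
```

In this toy case the two negative neighbors outvote the exact match under majority voting, while the distance weighted vote favors the exact match.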

Adaboost Algorithm
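Adaboost is an ensemble algorithm that adaptively improves the performance of the classifiers by dynamically changing the weights of the wrongly classified instances over the iterations. The final classifier provides a weighted prediction of all the single weak classifier predictions that are learned in the iterations. The classification rule of the Adaboost ensemble is given as follows [37].
$$ H(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t \, h_t(x) \right) \tag{4} $$
Here, $h_t$ is the weak classifier learned in iteration $t$, $\alpha_t$ is its weight and $T$ is the number of iterations.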
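The re-weighting idea can be sketched as a single boosting round followed by the weighted vote. This follows the standard discrete AdaBoost formulation with labels in {-1, +1}; the function names and the toy predictions are assumptions for illustration, and the experiments themselves rely on the scikit-learn implementation.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One boosting round: classifier weight and updated instance weights.

    Assumes labels in {-1, +1} and a weighted error strictly between 0 and 1.
    """
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified instances (t * p == -1) are scaled up, correct ones down.
    new_w = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(new_w)
    return alpha, [w / norm for w in new_w]

def adaboost_predict(alphas, weak_outputs):
    """Sign of the alpha-weighted sum of weak classifier outputs."""
    score = sum(a * h for a, h in zip(alphas, weak_outputs))
    return 1 if score >= 0 else -1

weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]  # the weak classifier gets the last instance wrong

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 4), [round(w, 4) for w in new_weights])
print(adaboost_predict([0.9, 0.3, 0.4], [1, -1, -1]))
```

After the update, the single misclassified instance carries half of the total weight, so the next weak classifier is pushed to get it right.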

Logistic Regression Classifier
The logistic regression classifier is a linear classifier that finds a linear boundary to divide the instances. It often uses regularization parameters and a sigmoid function along with the learned weights or parameters. The weights of the logistic regression model are learned using a gradient based optimization algorithm. The predicted label of logistic regression is given below [38].
$$ \hat{y} = \sigma(w^{\top} x + b) = \frac{1}{1 + e^{-(w^{\top} x + b)}} \tag{5} $$
Here, $x$ is the feature vector, and $w$ and $b$ are the learned weights and bias.

Naive Bayesian Classifier
The Naive Bayesian classifier uses a simple form of Bayesian network in which the class label is the parent variable and all the features are directly connected to it. This structure assumes conditional independence among the feature variables given the class label. The classification rule is shown in the following equation [38].
$$ \hat{y} = \arg\max_{c} \; P(c) \prod_{j=1}^{n} P(x_j \mid c) \tag{6} $$
Here, $c$ ranges over the class labels and $x_j$ is the value of the $j$-th of the $n$ features.

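For binary symptom features, the classification rule can be sketched as a small Bernoulli Naive Bayes. The code below is a hand-written illustration under stated assumptions: Laplace smoothing with a constant of 1 is a choice made here, the toy data is hypothetical, and the log of the product is used for numerical stability.

```python
import math

def fit_bernoulli_nb(X, y, smoothing=1.0):
    """Estimate class priors and P(feature_j = 1 | class) with Laplace smoothing."""
    model = {}
    n = len(y)
    n_features = len(X[0])
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / n
        probs = [
            (sum(r[j] for r in rows) + smoothing) / (len(rows) + 2.0 * smoothing)
            for j in range(n_features)
        ]
        model[c] = (prior, probs)
    return model

def predict_nb(model, x):
    """Return the class with the highest log-posterior under feature independence."""
    best, best_score = None, -math.inf
    for c, (prior, probs) in model.items():
        score = math.log(prior)
        for xj, pj in zip(x, probs):
            score += math.log(pj if xj == 1 else 1.0 - pj)
        if score > best_score:
            best, best_score = c, score
    return best

# Toy data: the first symptom is associated with the positive class.
X = [(1, 1), (1, 0), (0, 0), (0, 1)]
y = [1, 1, 0, 0]
nb = fit_bernoulli_nb(X, y)
print(predict_nb(nb, (1, 1)), predict_nb(nb, (0, 0)))
```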
Artificial Neural Network
Artificial neural networks are models created to imitate brain functions. They can be thought of as layers of inter-connected nodes that process the input features through hidden layers and eventually generate a prediction at the output layer. Each node in the layers is a logistic unit. The output of the artificial neural network is given by the sigmoid function defined as follows [38].
$$ \sigma(z) = \frac{1}{1 + e^{-z}} \tag{7} $$
Here, $z$ is the input to the activation in the last layer.
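The forward pass of such a network can be sketched in a few lines. The layer sizes, weights and helper names below are hypothetical and serve only to show how logistic units are chained; a logistic regression classifier corresponds to the special case without a hidden layer.

```python
import math

def sigmoid(z):
    """Logistic activation, mapping any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One layer of logistic units: each row of weights feeds one unit."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input -> one hidden layer -> single sigmoid output."""
    hidden = dense(x, hidden_w, hidden_b)
    return dense(hidden, [out_w], [out_b])[0]

# Assumed toy network: 3 binary inputs, 2 hidden units, 1 output unit.
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
out_w = [1.2, -0.7]
out_b = 0.05

probability = forward([1, 0, 1], hidden_w, hidden_b, out_w, out_b)
print(round(probability, 4))
```

The output can be read as the predicted probability of the positive class, to which a decision threshold is then applied.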

E. PERFORMANCE EVALUATION
We have used separate train and test sets to evaluate the performances of the algorithms and methods employed in this paper. Cross-validation is applied to the train set. Here, the dataset is first divided into k non-overlapping sets and, in each iteration, k − 1 sets are used for training and the remaining set is used for validation. This is repeated for k iterations. We have used different values of k to show the robustness of the training of the classifiers. We have used a set of metrics that are suitable for measuring binary classification performance. These are accuracy, precision, recall, Matthews Correlation Coefficient (MCC) and area under the Receiver Operating Characteristic curve (auROC). The first four metrics are derived from the confusion matrix. In the confusion matrix, true positives (TP) are the positive instances that are correctly classified and true negatives (TN) are the negative instances that are correctly classified. On the other hand, false positives (FP) and false negatives (FN) are the instances that are wrongly classified by the classifier as positive and negative, respectively, while their real class is the opposite. Based on these, the metrics are defined in the following equations:
$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{8} $$
$$ Precision = \frac{TP}{TP + FP} \tag{9} $$
$$ Recall = \frac{TP}{TP + FN} \tag{10} $$
$$ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{11} $$
The remaining metric, auROC, is computed without selecting any decision threshold for the binary classifiers. Note that auROC is often more informative on imbalanced datasets. The ROC curve plots the true positive rate against the false positive rate. This metric takes values in the range [0, 1], where 0.5 corresponds to a random classifier and 1 represents the best classifier.
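The confusion-matrix metrics can be computed directly from the four counts. In the sketch below, the function name is hypothetical and the example counts are invented for illustration (they only reuse the test-set size of 103 positive and 28 negative subjects); they are not results from the paper.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "mcc": mcc}

# Hypothetical confusion matrix: 103 positives and 28 negatives in total.
example = binary_metrics(tp=90, tn=20, fp=8, fn=13)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

On a skewed class distribution like this one, accuracy and precision can look high while MCC stays noticeably lower, which is why MCC is reported alongside accuracy.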

IV. RESULTS AND DISCUSSION
All the modules we have used in our experiments are based on Python 3.6 and the scikit-learn library [39]. We have used Kaggle notebooks to run the experiments. We have executed all the experiments 10 times and reported the average values only.

A. PERFORMANCE OF THE TRAINING SET
We have applied all the classifiers (DT, RF, k-NN, Adaboost, Naive Bayes, Neural Network and Logistic Regression) on the training set, where they have been validated using k-fold cross validation (k = 3, 5, 10). Table 4 shows the metric values corresponding to each value of k. Please note that these experiments are conducted only on the training set to show the robustness of the methods and to tune or select the hyper-parameters. After this set of experiments, the models with the selected parameters are applied on the test set.
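The cross-validation protocol with k = 3, 5 and 10 can be sketched with scikit-learn as follows. The data here is synthetic (300 subjects with 16 binary features and a label derived from three of them), and the use of stratified, shuffled folds with fixed seeds is an assumption made for reproducibility of the sketch; the paper does not specify these details.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training set: 16 binary features.
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(300, 16))
# Assumed rule: positive when at least two of the first three symptoms are present.
y = ((X[:, 0] + X[:, 1] + X[:, 2]) >= 2).astype(int)

results = {}
for k in (3, 5, 10):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    clf = DecisionTreeClassifier(criterion="gini", random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    results[k] = float(scores.mean())

for k, mean_accuracy in results.items():
    print(f"{k}-fold mean accuracy: {mean_accuracy:.4f}")
```

Reporting the mean score for several values of k, as done in Table 4, indicates whether a classifier's performance is stable with respect to how the training data is partitioned.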
From Table 4, we can notice that the DT algorithm gives the highest accuracy values of 97.14%, 97.54% and 97.74% for 3-, 5- and 10-fold, respectively. The Naive Bayes model gives lower accuracy values than the other models: 85.65%, 85.55% and 85.69% for 3-, 5- and 10-fold, respectively. With Random Forest, we get the highest accuracy of 97.74% for 10-fold. With the k-NN model, the highest value is 87.07% for 10-fold. Moving on to Adaboost, the accuracy value is 92.22% for 10-fold. Finally, Logistic Regression gives 89.37% for 10-fold.
From all these results, it is clear that the DT model gives the highest accuracy value, which is 97.74% for 10-fold. It is also to be noted that the other metrics, such as precision, recall and MCC, are also high for the DT classifier. Moreover, auROC also gives very satisfactory values for the classifiers. Finally, MCC and recall are low for classifiers such as Logistic Regression, Naive Bayes and k-NN in all experiments.

B. TEST SET PERFORMANCE
After performing cross-validation and learning models on the training data, we have applied the model to classify the instances from the test set. Table 5 shows comparison between all classification algorithms in terms of their performance in the test set. We have also drawn spider plots based on each of the metrics used in our experiments as shown in Figure 3.
From the table and the figure, we note that using k-NN, the results are not very satisfactory, with 68.70% accuracy.

C. FEATURE SELECTION
We have used the LASSO model to find the feature importance. Figure 4 shows the feature importance for our dataset. We can see from the bar chart that some features have regression coefficient values very close to 0. Those features are Itchy_skin_or_rash, night_sweats, shortness_of_breath, oral_cavity, Smoke and vision_blurring. We exclude those features and select the features with a higher correlation with the class label. We have also applied extra tree classifier based feature ranking, and the ranking is shown in Figure 5.
We can see the similarities between the feature rankings found by the two methods. We have reported the results before and after the feature selection together in Table 6 for different classifiers. Please note that performances have improved after the feature selection for different algorithms. However, the accuracies might not reflect that. For example, the accuracy of Decision Tree is 89.31%, which was 97.45% earlier. In the case of k-NN and Neural Network, the accuracy has greatly improved. Accuracy is unchanged for Random Forest, slightly degraded for Adaboost and Logistic Regression, and significantly degraded for Naive Bayes. Note that the dataset is an imbalanced one. The improved performances are reflected in the MCC and auROC scores for each of these classifiers. We have performed the Wilcoxon signed-rank test to check the statistical significance, obtaining a p-value of 0.0277. The changes are more pronounced in terms of recall, MCC and ROC. Note that for Decision Tree, recall, MCC and ROC have significantly improved from 73.33%, 0.630 and 0.783 to 88.89%, 0.765 and 0.946, respectively. We also observe very similar results for the Random Forest and Adaboost algorithms. For the other algorithms, the values of recall, MCC and ROC are either improved or degraded insignificantly. Overall, we can conclude that feature selection has improved different performance parameters.
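For readers unfamiliar with the test, the following is a minimal sketch of the two-sided Wilcoxon signed-rank test using the normal approximation, without tie or continuity corrections. The paired scores are hypothetical and are not taken from Table 6. As a side observation, six non-zero paired differences that all share the same sign give p ≈ 0.0277 under this approximation; whether the reported value was obtained from such a configuration is not stated in the text.

```python
import math

def wilcoxon_signed_rank(before, after):
    """Two-sided Wilcoxon signed-rank test (normal approximation, zeros dropped)."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w, p_value

# Hypothetical scores (in percent) for six classifiers before and after feature selection.
before = [60, 62, 55, 70, 48, 66]
after = [65, 72, 57, 78, 60, 69]
statistic, p_value = wilcoxon_signed_rank(before, after)
print(statistic, round(p_value, 4))
```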

D. COMPARATIVE ANALYSIS
In this section, we show the comparative analysis of the different classification algorithms used in this paper based on the hyper-parameter space and overfitting.

1) Decision Tree algorithm
The max_depth hyper-parameter is used to restrict the size of the decision tree and thus reduce overfitting. The graph of accuracy against max_depth is shown in Figure 6 for the train and test sets, where max_depth is varied from 3 to 23. We can see from the graphs that, for the train set, the highest accuracy is obtained at a max_depth of 8, while the accuracy does not vary much for the test set. The corresponding performance on the test set for each value of max_depth can also be seen in the figure.
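A sweep of this kind can be sketched with scikit-learn as shown below. The data is synthetic, with a 10% label-noise rate and a four-symptom rule that are assumptions made here so that overfitting becomes visible; the numbers it prints are not comparable to Figure 6.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)

def make_data(n):
    """Synthetic binary symptom data with noisy labels (illustrative only)."""
    X = rng.integers(0, 2, size=(n, 16))
    signal = X[:, 0] + X[:, 1] + X[:, 2] + X[:, 3]
    y = (signal >= 2).astype(int)
    flip = rng.random(n) < 0.1  # assumed 10% label noise
    return X, np.where(flip, 1 - y, y)

X_train, y_train = make_data(700)
X_test, y_test = make_data(130)

# Record train and test accuracy for each max_depth from 3 to 23.
curve = []
for depth in range(3, 24):
    clf = DecisionTreeClassifier(criterion="gini", max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    curve.append((depth, clf.score(X_train, y_train), clf.score(X_test, y_test)))

for depth, train_acc, test_acc in curve:
    print(f"max_depth={depth:2d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

A widening gap between the train and test columns as the depth grows is the overfitting signal that restricting max_depth is meant to control.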

2) Random Forest algorithm
In Figure 7, we present the plot of accuracy vs n_estimators for the Random Forest classifier on the train and test sets. We have varied the value of n_estimators from 20 to 1000 to see the change in accuracy, and we have plotted the corresponding accuracy for both the train and test sets. We can observe that the maximum accuracy is obtained when the value of n_estimators is 150 for the train set.

3) Adaboost algorithm
Similar experiments are performed using the Adaboost algorithm and the results are plotted in Figure 8. Figure 8(a) shows that the accuracy is higher when the value of n_estimators is 100 for the train set.

4) k-NN algorithm
In Figure 9, we have plotted two graphs of accuracy against n_neighbors (denoted by k). We have varied the value of n_neighbors from 3 to 23 and reported the corresponding accuracy. From Figure 9(a) we note that the highest accuracy for the train set is achieved at n_neighbors = 3. We have also applied distance-weighted k-NN on the test set to see its performance. However, varying the hyper-parameter k over the same values as for the majority-voting k-NN, we could not get higher accuracy.

5) Artificial Neural Network algorithm
Figure 10 shows the training accuracy vs validation accuracy and the training loss vs validation loss for the Artificial Neural Network algorithm. From Figure 10(a) we can see that the model can probably be trained further, as the accuracy on both datasets is still increasing over the last few epochs. We can further see that the model has not yet over-learned the training dataset, as it shows comparable skill on both datasets. On the other hand, from Figure 10(b), we can see that the model has better performance. If these parallel plots begin to diverge consistently, it could be a sign that training should be stopped at an earlier epoch.
Please note that the graphs in Figures 6-10 show the performance of the different algorithms under different hyper-parameter values. It is interesting to see the fluctuations of a few models for the settings used. However, for each of the classifiers we also note a region of stable performance in the parameter landscape. We have selected the parameters for optimization according to the model behavior. The largest fluctuations are shown by Random Forest, which is explained by the random selection of features in each of the estimators. However, note that once the number of estimators is increased further there is no change in performance, which is due to the relatively small number of features in our dataset.
The parameters that we have used to get the highest accuracy for the different algorithms are listed in Table 7. Note that we have not performed any parameter tuning for the Logistic Regression algorithm, as the regularization parameters are the only parameters to be tuned; we have used the settings suggested in the literature. Also note that the Naive Bayes classifier is a parameter-free algorithm. In the table, we show the default settings for these two algorithms. We have reported ROC curves on the test set for our two best performing algorithms, Decision Tree and Random Forest, as shown in Figure 11.
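The hyper-parameter sweeps behind Figures 6-10 all follow the same pattern: fix a grid of values, fit the model for each value, and record the accuracy. A minimal sketch of this pattern is given below for k-NN with both majority voting and distance-weighted voting. It is an illustration only: the data are synthetic, the labelling rule is invented, the Hamming distance and the 1/(1 + d) weighting are our assumptions, and the function names are hypothetical rather than taken from the released source code.

```python
import random
from collections import defaultdict

def hamming(a, b):
    """Number of positions at which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, query, k, weighted=False):
    """Predict a label for `query` from its k nearest training rows."""
    neighbours = sorted(train, key=lambda row: hamming(row[0], query))[:k]
    votes = defaultdict(float)
    for features, label in neighbours:
        d = hamming(features, query)
        # Assumed weighting: closer neighbours count more, 1 / (1 + d).
        votes[label] += 1.0 / (d + 1) if weighted else 1.0
    return max(votes, key=votes.get)

def accuracy(train, test, k, weighted=False):
    hits = sum(knn_predict(train, x, k, weighted) == y for x, y in test)
    return hits / len(test)

def make_row(rng):
    # Synthetic stand-in for a 16-symptom record; the label rule is invented.
    x = tuple(rng.randint(0, 1) for _ in range(16))
    y = 1 if x[0] + x[1] + x[2] >= 2 else 0
    return x, y

rng = random.Random(0)
data = [make_row(rng) for _ in range(200)]
train, test = data[:150], data[150:]

results = {}
for k in range(3, 24, 4):  # k = 3, 7, 11, 15, 19, 23
    results[k] = (accuracy(train, test, k),
                  accuracy(train, test, k, weighted=True))
    print(k, results[k])
```

Plotting the two accuracies in `results` against k would give curves of the kind shown in Figure 9; the same loop applies to max_depth or n_estimators once the model is swapped.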

E. EXPLAINABILITY OF MODELS
Often in machine learning and AI, the results of the algorithms and systems are not explainable due to the black-box nature of the underlying mathematical models. In our experiments, we have shown the effectiveness of the Decision Tree model, which gives significantly better results than the other algorithms on most of the evaluation metrics. We have already seen the important features that are revealed by the feature selection process and by the ranking done with the LASSO algorithm. In this section we further extend the explainability analysis by visualizing the models and automatically generating rules. The rules generated by the models often confirm existing knowledge and report new information. There are two types of rules: classification rules and association rules. In this section, we generate both types of rules and make a comparative analysis of them.

1) Classification Rules
First, we visualize two decision trees in Figure 12 and Figure 13 with max-depth values set to 4 and 8, respectively. Due to the large size of the figure, the decision tree with max-depth 8 is shown only partially in Figure 13. These figures give us a clear picture of the importance of the features selected at different levels by the Decision Tree. Note that the selection depends on the attribute selection parameter (information gain, Gini, etc.). Each node of the decision trees makes a binary decision by comparing the value of the attribute associated with it to a threshold value of 0.5. Thus the branches actually denote the presence or absence of the particular attribute. From the trees, it is a very simple procedure to generate the classification rules: a path from the root to a leaf node denotes one classification rule. In the following we show three classification rules generated from the decision tree (shown in Figure 12) with max-depth equal to 4:
• Rule 1: frequent_infections ∧ swollen, painless_lymph ∧ Persistent_weakness ∧ enlarged_liver ⇒ Leukemia
• Rule 2: ¬frequent_infections ∧ ¬swollen, painless_lymph ∧ ¬jaundice ⇒ ¬Leukemia
• Rule 3: frequent_infections ∧ oral_cavity ⇒ ¬Leukemia
Four further classification rules are derived in the same way from the decision tree (shown in Figure 13) with max-depth equal to 8.
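The root-to-leaf procedure described above for generating classification rules can be sketched as follows. The tree used here is hypothetical: it borrows a few symptom names from the text but is not the tree of Figure 12 or Figure 13, and the helper names `node` and `extract_rules` are our own. Each internal node is assumed to test whether a binary attribute is below or above the threshold 0.5, i.e. whether the symptom is absent or present.

```python
def node(feature, absent, present):
    """Internal node: go to `absent` if feature <= 0.5, else to `present`."""
    return {"feature": feature, "absent": absent, "present": present}

# Hypothetical depth-2 tree; leaves are class labels given as strings.
tree = node(
    "frequent_infections",
    absent=node("jaundice", absent="not Leukemia", present="Leukemia"),
    present=node("enlarged_liver", absent="not Leukemia", present="Leukemia"),
)

def extract_rules(subtree, path=()):
    """Yield (conditions, label) for every root-to-leaf path."""
    if isinstance(subtree, str):  # reached a leaf
        yield list(path), subtree
        return
    feature = subtree["feature"]
    # The low branch (value <= 0.5) means the symptom is absent.
    yield from extract_rules(subtree["absent"], path + ("NOT " + feature,))
    # The high branch (value > 0.5) means the symptom is present.
    yield from extract_rules(subtree["present"], path + (feature,))

rules = list(extract_rules(tree))
for conditions, label in rules:
    print(" AND ".join(conditions), "=>", label)
```

Every leaf contributes exactly one rule, so a tree with four leaves yields four rules; deeper trees such as the one with max-depth 8 produce longer premises in the same way.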

F. ASSOCIATION RULES
In explainable AI, association rules are often generated from the frequent itemsets that are present in the dataset. The quality of the rules is often measured in terms of confidence, lift and support as defined below.
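In the standard formulation, which we assume is the one intended here, the three measures for a rule X ⇒ Y over a set of instances T are:

```latex
\begin{align}
\mathrm{support}(X) &= \frac{\left|\{\, t \in T : X \subseteq t \,\}\right|}{|T|} \\
\mathrm{confidence}(X \Rightarrow Y) &= \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)} \\
\mathrm{lift}(X \Rightarrow Y) &= \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)\,\mathrm{support}(Y)}
\end{align}
```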
Here X and Y are two non-empty itemsets that are subsets of a frequent itemset (selected using a threshold) and that serve as the premise and the conclusion of a rule. T denotes the set of all instances in the dataset. Support indicates how popular an itemset is, and is measured by the proportion of the transactions in which the itemset appears. Confidence indicates how likely the conclusion itemset is to be present when the premise itemset is present. Here, the items are the symptoms in the dataset we have considered. Table 8 shows the rules generated by the Apriori algorithm without including the class labels as an item. These rules show the co-occurrences among the items, or symptoms, that are important for leukemia screening. It is interesting to see the effect on the association rules when the class labels are included in the dataset. These association rules are shown in Table 9. Please note the similarities between the two sets of rules. This time, however, a few of the rules have leukemia in the conclusion or premise, which encourages us to use them in a way similar to the classification rules. We can also see that symptoms selected by the decision tree, such as 'frequent_infection', 'loss_of_appetite_or_nausea' or 'bone_pain', are selected as important features by all three types of analysis: feature ranking, decision tree based rule generation and association analysis by the Apriori algorithm.
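As an illustration of how support, confidence and lift are computed for such rules, the sketch below evaluates them on a handful of invented transactions. The records are not from our dataset, and the itemset enumeration is a brute-force stand-in for the candidate pruning that the Apriori algorithm performs; the function names are hypothetical.

```python
from itertools import combinations

# Invented records: each set lists the items (symptoms, class label) present.
transactions = [
    {"frequent_infection", "bone_pain", "leukemia"},
    {"frequent_infection", "bone_pain", "leukemia"},
    {"frequent_infection", "loss_of_appetite_or_nausea"},
    {"bone_pain"},
    {"frequent_infection", "bone_pain",
     "loss_of_appetite_or_nausea", "leukemia"},
    {"loss_of_appetite_or_nausea"},
]

def support(itemset):
    """Proportion of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(premise, conclusion):
    """Return (support, confidence, lift) of the rule premise => conclusion."""
    both = support(premise | conclusion)
    confidence = both / support(premise)
    lift = confidence / support(conclusion)
    return both, confidence, lift

def frequent_itemsets(min_support, max_size=2):
    """Brute-force enumeration of itemsets meeting the support threshold."""
    items = sorted(set().union(*transactions))
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            if support(frozenset(combo)) >= min_support:
                found.append(frozenset(combo))
    return found

premise = frozenset({"frequent_infection", "bone_pain"})
conclusion = frozenset({"leukemia"})
s, c, l = rule_metrics(premise, conclusion)
print("support=%.2f confidence=%.2f lift=%.2f" % (s, c, l))
print(len(frequent_itemsets(0.5)), "itemsets with support >= 0.5")
```

A lift above 1 indicates that the premise and the conclusion co-occur more often than would be expected if they were independent, which is what makes a rule with the class label in its conclusion usable in the same way as a classification rule.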

G. AVAILABILITY OF METHOD
To ensure that our method is reproducible and usable by other researchers in the community, we have made all the necessary source code and the dataset freely available. They can be accessed here: https://github.com/AkterHossain312/BDLeukemiaWithML.

V. CONCLUSION
In this paper, we have presented an explainable machine learning model for leukemia detection. We have used a primary dataset collected from two government hospitals in Bangladesh. In a developing country like Bangladesh, data collection is a great challenge. However, after the data collection we have applied various ML models and shown a comparative analysis. We have also performed explainable analysis to generate classification and association rules that are interpretable and usable in leukemia screening.
In our experiments we have seen that a simple and explainable model like the decision tree classifier gives the best results when compared to the other, more sophisticated methods. We have also shown that similar symptoms are selected by all three types of explainable analysis: feature ranking, decision tree based rule generation and association analysis by the Apriori algorithm. The symptoms that were selected most often are 'frequent_infection', 'loss_of_appetite_or_nausea', 'bone_pain', etc.
We strongly believe that this is the first work of its kind carried out in the context of Bangladesh. This study will benefit the early detection and screening of leukemia, both in research and in practice.
One direction for future work is to enhance the dataset by incorporating more samples from the relevant hospital wards. This will also call for a study to validate the results obtained from this pilot dataset and the symptom based prediction model. We strongly believe that this pilot model will help in building an enhanced dataset, which in turn will help strengthen the model after further analysis and model building.