Making Deep Learning-Based Predictions for Credit Scoring Explainable

Credit scoring has become an important risk management tool for money lending institutions. Over the years, statistical and classical machine learning models have been the most researched risk management tools in the credit scoring literature, and recently the focus has turned to deep learning models. This transition is driven by the superior performance that deep learning models have shown in different domains. Despite deep learning models' superior performance, there is still a need to explain how these models make their predictions. The non-transparent nature of deep learning models has created a bottleneck for their use in credit scoring. Explanations of decisions are important for lending institutions since it is a requirement for automated decisions generated by non-transparent models to be explained. Another issue in using deep learning models, specifically 2D Convolutional Neural Networks (CNNs), in credit scoring is the need to have the data in image format. We propose an explainable deep learning model for credit scoring which can harness the performance benefits offered by deep learning and yet comply with the legislative requirements for automated decision-making processes. The proposed method converts tabular datasets into images, thus allowing the application of 2D CNNs in credit scoring. Each pixel of the image corresponds to a feature bin of the tabular dataset. The predictions from the 2D CNNs were explained using state-of-the-art explanation methods. Furthermore, the explanations were evaluated using a sanity check methodology and the performances of the explanation methods were compared quantitatively. The proposed explainable deep learning model outperforms the other credit scoring methods on publicly available credit scoring datasets.


I. INTRODUCTION
Credit scoring, proposed by Durand [1], has become an important risk management tool for money lending institutions. Over the years, statistical and classical machine learning models have been the most researched risk management tools in the credit scoring literature, and recently the focus has turned to deep learning models [2]. This transition is due to the better performance shown by deep learning models in different domains. Despite this better performance, there is still a need to explain how deep learning models make their predictions. According to Liu et al. [3], an explanation is defined as ''the ability to provide a visual or textual presentation of connections between input features and output predictions''. Similarly, Doshi-Velez and Kim [4] defined an explanation as ''the ability to explain or to provide the meaning in understandable terms to a human''. Note that in this study we treat explanation and interpretability as interchangeable terms, although in the literature some argue that these terms differ [5]. Rothman [6] notes that an explanation creates comprehensibility and clarity, whereas interpretability conveys the meaning. This study does not intend to pursue the philosophical debate on the differences between explanation and interpretability. The aim of the current study is to explain predictions made by a deep learning model in a credit scoring setting.
Explaining a prediction is particularly important in credit scoring since it is now a requirement for lending institutions under the Basel Accord [7] to explain to an applicant why her/his loan application was denied. Regulations such as the European Union General Data Protection Regulation (GDPR) [8] have made it compulsory for machine learning models to explain their predictions under the notion of a ''right to explanation''. Explanations foster trust in model predictions and assure that no discrimination occurs during the application process when assessing the creditworthiness of loan applicants. Explanations play an important role in areas where decisions are critical, for example in healthcare and in banking, to mention a few. This requirement of the ''right to explanation'' is advancing the research field coined Explainable Artificial Intelligence (XAI) [6]. XAI is not a new concept and has a rich history of more than 50 years [9]. Breiman, known for developing Classification and Regression Trees, was amongst the early adopters of interpretable systems and is quoted in one of his papers stating that ''On interpretability, trees are an A+'' [10].
Samek and Müller [11] highlighted, at a high level, the different techniques (old and emerging) that are used for XAI. The need for XAI, the purposes of different users, the evaluation methods and future directions were also discussed in [11]. However, Rudin [12] provides a contrasting view on XAI and highlights the reasons why interpretable models should be used instead of explained black-box models for high-stakes decisions such as criminal justice, healthcare and credit scoring. The argument was based on the premise that if a second model is used to explain a black-box model (i.e. a post-hoc explanation), there is a high chance that the explanation is not ''faithful to what the original model computes'' [12]. The recommendation was that, instead of using post-hoc explanations, inherently interpretable models (i.e. ante-hoc explanations) that perform similarly to some black-box models should be used for high-stakes decisions. It was argued further that there is no truth in the statement ''there is a trade-off between accuracy and interpretability''; in other words, it is a myth that non-interpretable models are more accurate than inherently interpretable models. The decision whether to use inherently interpretable models should be based on the circumstantial needs. In their review article, Chari et al. [13] synthesised a taxonomy of explanations from the literature. This taxonomy consists of nine different types of explanations, such as case-based, contextual, contrastive, scientific, statistical, trace-based, everyday and simulation-based explanations. The aim was to help produce explanations that are aligned with circumstantial needs. Despite these different views from the literature, our study was motivated by [14]. However, our study differs from [14] in that we systematically discretize data into optimal categories using weights of evidence, and we use both categorical and continuous features, as opposed to [14].
Our contributions in this paper are as follows. 1) This is the first study to convert tabular data into images using a novel method, and the first, after such a conversion, to employ different explanation methods to explain the decision for each predicted credit sample. 2) We empirically showed that the explanations from different explanation methods can be used as optimal features, and we observed a substantial boost in the classifiers' performance. In this work, we showed for the first time that the explanations are good features that can be applied directly to the classifiers, which then perform better than the same classifiers operating on the raw data.
The remainder of this paper is organised as follows. Section II reviews previous research on deep learning models in credit scoring, on explanations of convolutional neural networks in other domains, and on the conversion of tabular data into images. Section III discusses the proposed framework of this study. Section IV discusses the conversion of tabular datasets into images, the training of 2D convolutional neural networks and the techniques used for explaining individual predictions. The experiments are described in Section V. The results are presented in Section VI and the conclusion is in Section VII.

II. RELATED WORK
In recent years, researchers have shown an interest in the application of deep learning models in credit scoring. For instance, Zhu et al. [14] used a hybrid method combining a relief algorithm (which performs feature selection) with a Convolutional Neural Network (CNN) to perform credit scoring, and the hybrid relief-CNN model showed better performance than Logistic Regression and Random Forest (RF) models. Tomczak and Zieba [15] used a Restricted Boltzmann Machine (RBM) in credit scoring; a comprehensible scoring table derived from the RBM showed better performance than classical machine learning models. Li et al. [16] conducted a study of deep learning with clustering and merging, and the results showed high prediction accuracies. A Deep Multi-Layer Perceptron (DMLP) and a Deep CNN (DCNN) were used by Neagoe et al. [17] to assess the creditworthiness of applicants; the DCNN significantly outperformed the DMLP. Hamori et al. [18] compared a Deep Neural Network (DNN) with RF, bagging and boosting methods on credit datasets, and the results showed that the ensemble methods performed better than the DNN.
A deep learning with RF feature importance approach was proposed by Ha and Nguyen [19] for credit scoring. The empirical results showed that the proposed approach performed better than baseline methods such as Decision Trees, k-Nearest Neighbour, Naïve Bayes, Multi-Layer Perceptron and Random Forest, with the exception of the Support Vector Machine, on the German credit dataset. The proposed approach outperformed all baseline methods on the Australian credit dataset. In their empirical study, Sirignano et al. [20] developed a cohort of neural network ensemble models to assist in predicting the probability of default in credit risk using the German and Australian credit datasets. The results showed that the proposed neural network ensemble models provide similar or better performance compared to literature results as well as to baseline single classifiers. Tripathi et al. [21] proposed a novel algebraic activation function to improve the performance of the Extreme Learning Machine (ELM) model in credit scoring. The ELM consists of an input layer, a single hidden layer and an output layer. The study also used the Bat algorithm, an evolutionary approach, to initialise the weights and biases of the ELM. The results showed a significant improvement on the German and Australian credit datasets. Edla et al. [22] combined binary particle swarm optimization and a gravitational search algorithm (BPSOGSA) with a multi-layer ensemble classifier using five heterogeneous classifiers. The purpose of BPSOGSA was to select predictive features. The results showed that the proposed hybrid model outperforms the Random Forest model and other majority-voting ensemble techniques on the German, Australian and Japanese credit datasets. Similarly, Tripathi et al. [23] proposed a hybrid credit scoring model using a dimension reduction approach (i.e. neighbourhood rough set) and a multi-layer ensemble classifier.
The experimental results showed that the proposed model can achieve satisfactory performance in credit scoring.
Regulated institutions are not willing to adopt models that cannot explain predictions, and this is hampering the use of machine learning, specifically deep learning models. However, Ariza-Garzón et al. [24] conducted a study in credit scoring whose focus was to make non-transparent machine learning models explainable. In the study, classical machine learning models were compared to a logistic regression model for Peer-to-Peer (P2P) lending, and predictions were explained using SHAP values. The results showed that the classical machine learning models not only perform better in terms of classification but also in terms of explanations. Moreover, the classical machine learning models were shown to reflect dispersion, non-linearity and structural breaks in the relationships between each independent variable and the response variable [24]. The credit scoring literature has not extensively researched explanations of deep learning predictions.
There has been interest in credit scoring and other fields in converting tabular datasets into images in order to apply 2D CNN models. For example, Zhu et al. [14] combined a relief algorithm with a CNN model. The relief algorithm was used to select predictive features and the CNN was used for classification. The study converted credit scoring tabular data into images by bucketing features and mapping them into image pixels; however, the study considered only numeric features. The results showed that the relief-CNN hybrid model performed better than benchmark models in credit scoring such as the random forest and logistic regression. Sharma et al. [25] proposed a novel approach called DeepInsight to convert non-image data into images and apply CNN models. The types of data used in the study were Ribonucleic Acid (RNA) sequence data, vowels data, text data and artificial data. The results showed better performance from the DeepInsight approach compared to state-of-the-art machine learning models such as AdaBoost and Random Forest. Yang et al. [26] first converted multivariate time-series data into two-dimensional colored images, then combined the colored images into a single image, and lastly applied a CNN to the single image for classification. The study compared three transformation methods for converting time-series data into images: the Gramian Angular Summation Field (GASF), the Gramian Angular Difference Field (GADF), and the Markov Transition Field (MTF). The study focused on assessing the effect of the different transformation methods, the sequences of combining images, and the complexity of the CNN architectures on classification accuracy. The results showed that the selection of the transformation method and the sequence of combination do not significantly impact the prediction outcome.
Further, the results also showed that the proposed framework performed better than the other classification models in terms of the accuracy metric. Buturović and Miljković [27] proposed a novel approach called TAbular Convolution (TAC) for converting tabular datasets into images for the application of 2D CNN models. The study used gene expression data obtained from blood samples of patients with bacterial or viral infections. The results showed similar performance between the TAC approach and state-of-the-art non-CNN machine learning classifiers in terms of accuracy when classifying gene expression data. Singh et al. [28] converted a sensor dataset obtained from floor-surface pressure mapping into image data. Thereafter, a pre-trained CNN was applied to the image data, and the results showed that the proposed method performed significantly better (by 10%) than traditional machine learning methods. Note that none of these methods used the conversion method proposed in this current study. Furthermore, to the best of our knowledge, there is no other method in the literature that transforms tabular data into images in the way we propose and applies explanation methods to predictions from a 2D CNN.
Several studies have also focused on making CNNs interpretable in other domains. For instance, Tamajka et al. [29] used the activations of a CNN as descriptors. Firstly, the data was split into training and test sets, a shallow CNN was trained, and the activations from the first pooling layer were stored as a database of activations from known observations. Secondly, the test dataset was fed through a similar CNN and its activations were extracted. Thirdly, to identify the class of each test-set observation, its activation was compared to the activations of the training-set samples and the class was assigned using a majority vote. This is similar to a k-Nearest Neighbour approach, and the authors used both cosine and Euclidean distance measures. The study used the MNIST dataset, and the results showed that the proposed interpretable model achieved almost the same results as the original trained CNN. Simonyan et al. [30] trained a DCNN on the ILSVRC-2013 dataset, which consists of 1.2 million images, and considered two visualisation techniques based on computing the gradient of the class score with respect to the input image. The finding was the establishment of a connection between gradient-based CNN visualisation methods and deconvolutional networks. Liu et al. [31] applied a CNN to the MNIST dataset with the aim of interpreting the inner workings of the Fully-Connected (FC) layer. Firstly, the activations of the hidden layer of the FC layer were clustered to form factors. A factor is simply a label/class of a cluster. Secondly, identifications (IDs) were assigned to each cluster. Thirdly, the IDs were combined with the original data to form meta-level data. Lastly, a decision tree was trained on the meta-level data. To interpret the hidden layer of the FC layer, the activations of the hypothesis class vs. the true class were plotted on the x-y plane using all rows/depths of the decision tree. If there was an overlap between the activations, this implied that there was no separation between the hypothesis class and the true class; the process was repeated until there was only one class or little overlap between the activations. The finding was that the proposed process was faithful to the original CNN model, i.e. accuracy was not compromised by the interpretable CNN.

III. PROPOSED FRAMEWORK
The proposed framework serves as a guideline for using 2D CNNs on tabular datasets. Firstly, the framework proposes that the feature values from tabular datasets be mapped into different bins that are used to calculate weights of evidence, and that images then be created from the bins. After creating the images, the image dataset is split into training and test sets. Secondly, a 2D CNN is trained on the training set and its performance is assessed on the test set. Thirdly, the predictions from the 2D CNN are explained using state-of-the-art methods such as Grad-CAM, SHAP values, Saliency Map and LIME. Fourthly, the explanations are validated using a sanity check methodology [32]. Lastly, all explanation techniques are compared quantitatively in order to identify the best performing explanation technique.

IV. METHODOLOGY
This section discusses the methodology undertaken in this study and its components in detail.

A. WEIGHT OF EVIDENCE
The Weight of Evidence (WOE) is used as a transformation technique for features in credit scoring [33]. The first step in calculating the WOE is to create bins for each feature. Thereafter, the WOE is calculated for each bin. We denote the weight of evidence for bin b of feature f as

WOE_b^(f) = ln( Dist.Goods_b / Dist.Bads_b ),

where Dist.Goods_b and Dist.Bads_b denote the proportions of good and bad applicants, respectively, that fall into bin b. The WOE computations are performed under the following constraints [33]. Each bin is created in such a way that it contains at least 5% of the data; this ensures that no bins are empty. No bin has a count of 0 for either goods or bads; this means that even if the data is not balanced, each bin will contain both goods and bads, and no bin will contain only one class. The bins should have a monotonically decreasing or increasing relationship with the response/target variable; this ensures that there are no reversals in the relationship between the features and the response variable. To crystallise this, suppose a feature Age has a relationship with a default-risk target variable. In general, younger people tend to be at a higher risk of defaulting on their loans than older people, which implies that risk decreases monotonically with increasing age. Note that in this study we do not directly use WOE values for converting tabular datasets into images, but rather the bins that are used to calculate WOE. The WOE values are only used as inputs to the information value calculation, which is discussed in the following section. Table 1 shows an example WOE calculation. The Age feature is bucketed into different bins. For each bin, the distributions of goods and bads are calculated. Each bin has at least 5% of the data and no bin contains 0 counts of either bads or goods. The goods are non-defaults and the bads are defaults. The WOEs show a decreasing trend. Later, we show how the bins used for calculating WOE are used to convert our datasets into images. The bins for WOE calculations are optimal in terms of class distribution, and no bin contains single-class information.
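To make the calculation concrete, the per-bin WOE can be sketched in a few lines of Python (our illustration, not the authors' code; the Age bin counts are hypothetical):

```python
import math

def woe(goods_in_bin, bads_in_bin, total_goods, total_bads):
    """Weight of Evidence of one bin: ln(Dist.Goods / Dist.Bads)."""
    dist_goods = goods_in_bin / total_goods  # share of all goods falling in this bin
    dist_bads = bads_in_bin / total_bads     # share of all bads falling in this bin
    return math.log(dist_goods / dist_bads)

# Hypothetical Age bins: (label, goods, bads), with 100 goods / 50 bads overall.
age_bins = [("18-25", 10, 15), ("26-35", 25, 15), ("36-50", 30, 12), ("51+", 35, 8)]
woes = [woe(g, b, 100, 50) for _, g, b in age_bins]

# With this sign convention, riskier (younger) bins get negative WOE and
# safer (older) bins get positive WOE, changing monotonically with age.
assert woes[0] < 0 < woes[-1]
assert all(a < b for a, b in zip(woes, woes[1:]))  # monotonic across bins
```

Each bin here holds well over 5% of the data and contains both classes, satisfying the constraints listed above.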

B. INFORMATION VALUE
The Information Value (IV) is used to select predictive features and is calculated for each feature f as

IV(f) = Σ_{b=1}^{B(f)} ( Dist.Goods_b − Dist.Bads_b ) × WOE_b^(f),

where B(f) represents the number of bins for feature f. The following thresholds apply as a general rule of thumb when using IV [33]: R1) < 0.02: unpredictive; R2) 0.02 to 0.1: weak predictor; R3) 0.1 to 0.3: medium predictor; R4) 0.3 to 0.5: strong predictor; and R5) ≥ 0.5: suspicious or too good to be true. Before converting our datasets into images, we first use IV to select predictive features, i.e., a feature f is selected if IV(f) ≥ 0.1, so that unpredictive and weak features are removed.
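A minimal sketch of the IV calculation and the rule-of-thumb classification above (our illustration; the bin distributions are hypothetical):

```python
import math

def information_value(bins):
    """IV of a feature: sum over bins of (Dist.Goods - Dist.Bads) * WOE."""
    return sum((dg - db) * math.log(dg / db) for dg, db in bins)

def iv_strength(iv):
    """Classify a feature by the rule-of-thumb thresholds R1-R5."""
    if iv < 0.02:
        return "unpredictive"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    if iv < 0.5:
        return "strong"
    return "suspicious"

# Hypothetical feature with per-bin (Dist.Goods, Dist.Bads); each column sums to 1.
bins = [(0.10, 0.30), (0.25, 0.30), (0.30, 0.24), (0.35, 0.16)]
iv = information_value(bins)
assert iv_strength(iv) == "strong"   # IV is roughly 0.39, i.e. in [0.3, 0.5)
```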

C. CONVERSION OF TABULAR DATA INTO IMAGES
Once the data is transformed using the bins for WOE calculation, we then create a one-hot-encoding transformation for each binned feature. For each applicant i ∈ {1, 2, ..., N}, where N is the number of applicants, a feature image

x^(i) = [ E(f_1^(i)), E(f_2^(i)), ..., E(f_D^(i)) ]

is formed, where D is the number of features. The feature image is nothing but a sparse binary matrix of size B × D representing the one-hot-encoding of each feature, where E : f_j → {0, 1}^B is a function that performs one-hot-encoding of a feature f_j and B is defined as the maximum number of distinct bins over all features, i.e. B = max_j B(f_j). It is worth noting that for some features B(f_j) < B, and in such cases the function E(f_j) appends B − B(f_j) zeros to the one-hot-encoding representation of f_j. It is also worth noting that features become nominal with one-hot-encoding, hence some features in Table 2 do not follow a logical order. For example, the feature Duration of Credit in Table 2(b) has bin [13, 24] as the first bin and [4, 12] as the last bin. For this feature, one would expect the bins to be arranged in increasing order; however, this is not always the case with one-hot-encoding.
Each entry of x^(i) is then mapped into a pixel, where ones correspond to white pixels (white pixels symbolise the presence of a bin) and zeros to black pixels (black pixels symbolise the absence of a bin). Thus, each white pixel of an image represents a defined bin for the WOE calculations.
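The conversion described above can be sketched as follows (our illustration; the bin indices are hypothetical). Each applicant row becomes a B × D binary matrix with exactly one white pixel per feature column:

```python
def row_to_image(bin_indices, num_feature_bins):
    """Map one applicant's binned features to a B x D binary 'image'.

    bin_indices[j]      : index of the bin that feature j falls into
    num_feature_bins[j] : number of distinct bins B(f_j) of feature j
    """
    B = max(num_feature_bins)            # image height: max bins over all features
    D = len(bin_indices)                 # image width: number of features
    image = [[0] * D for _ in range(B)]  # black pixels everywhere ...
    for j, b in enumerate(bin_indices):
        image[b][j] = 1                  # ... except one white pixel per feature
    return image                         # features with B(f_j) < B are zero-padded

# Hypothetical applicant with 3 features binned into 4, 2 and 3 bins respectively.
img = row_to_image([2, 0, 1], [4, 2, 3])
assert len(img) == 4 and len(img[0]) == 3      # sparse binary matrix of size B x D
assert sum(sum(row) for row in img) == 3       # exactly one white pixel per column
assert img[2][0] == img[0][1] == img[1][2] == 1
```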

D. CONVOLUTIONAL NEURAL NETWORKS
Convolutional Neural Networks (CNNs) [34] are widely known for solving tasks in image recognition, speech recognition and time series problems. The CNN consists of convolutional layers, pooling layers and dense layers (please see Figure 1). In what follows we briefly discuss layers of a CNN. For more details on CNNs please refer to our previous paper [2].

1) INPUT
In image classification tasks, the input is a tensor of shape (height, width, channels). The channel dimension represents the color encoding of an image, where 1 denotes a gray-scale image and 3 denotes a color (RGB) image.

2) CONVOLUTIONAL LAYER
Local patterns in image classification, such as edges and textures, are learned in the convolutional layers [35]. Each convolutional layer consists of feature/activation maps that are responsible for learning different local patterns in an image. The feature maps are formed by sliding a learned feature detector, also known as a filter, over different parts of an image. Thereafter, a non-linear function such as the Rectified Linear Unit (ReLU) is applied to the feature maps.

3) POOLING LAYER
A pooling layer is responsible for downsizing feature maps by performing either max pooling or average pooling. Pooling also provides spatial invariance, ensuring that features are detected irrespective of where they are located in an image.
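A minimal sketch of max pooling with a 2×2 window and stride 2 (our illustration, not the authors' code):

```python
def max_pool(fmap, size=2):
    """Max pooling with window and stride of `size`, halving each spatial dim."""
    rows, cols = len(fmap) // size, len(fmap[0]) // size
    return [[max(fmap[r * size + dr][c * size + dc]
                 for dr in range(size) for dc in range(size))
             for c in range(cols)]
            for r in range(rows)]

# A 4x4 feature map pools down to 2x2, keeping the max of each window.
fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 2, 7, 8]]
assert max_pool(fmap) == [[4, 2], [2, 8]]
```

Average pooling would replace `max(...)` with the mean of the same window.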

4) FLATTENING
Flattening transforms all pooled feature maps into a single vector, which acts as the input to a dense/fully connected layer.

5) CALCULATION IN EACH LAYER OF CNN
Normally, a deep learning neural network learns a set of parameters such as the weights during training. The parameters are learned in the convolutional layers and dense layers.
To calculate the number of learnable parameters in a convolutional layer, we multiply the shapes of the filters, i.e. the width w, the height h, the number of filters p in the previous layer and the number of filters c in the current layer:

n_conv = (w × h × p + 1) × c,

where the 1 accounts for the bias term. In the convolutional layers, a feature map (i.e. a matrix of numbers) is produced by striding a filter along an input image (in the input layer) or along another feature map (in a convolutional layer). Each entry of the feature map is calculated as

A(r, c) = (f * t)(r, c) = Σ_m Σ_n f(r + m, c + n) t(m, n),

where r and c represent rows and columns, respectively, of the resulting feature map, f denotes an input image, t represents a kernel or filter, and * denotes the convolution operator. The feature map A^(l−1) is then multiplied by a weight matrix W^(l) and added to a bias term b^(l) to form

Z^(l) = W^(l) A^(l−1) + b^(l),

and a ReLU function σ is applied on Z^(l) to propagate the feature maps into subsequent convolutional layers,

A^(l) = σ( Z^(l) ),

where l denotes the l-th convolutional layer. In the pooling layer, either max pooling or mean pooling is applied to produce down-sized feature maps, each entry of the pooled feature map being the maximum (or mean) over its pooling window. In the dense layers, the number of learnable parameters is

n_dense = (p + 1) × c,

where c denotes the number of neurons in the current layer, p denotes the number of neurons in the previous layer and the 1 accounts for the bias term. For dense (fully connected) layers, each neuron in the l-th layer receives signals from the neurons of the previous layer l − 1. The signals are multiplied by the corresponding weights and summed together with a bias term, and thereafter a transfer function is applied to the summed product to form an activation. Suppose w_ji^(l) is the weight connecting neuron i to neuron j in the l-th layer. The activation a_j^(1) of each neuron in the first hidden layer of the fully connected layers is

a_j^(1) = σ( Σ_{i=1}^{d} w_ji^(1) x_i + b_j^(1) ),

and the activation in the subsequent hidden layers is

a_j^(l) = σ( Σ_{i=1}^{n} w_ji^(l) a_i^(l−1) + b_j^(l) ),

where d denotes the number of input features, n denotes the number of hidden neurons in hidden layer l − 1 and σ is a transfer function (e.g. the sigmoid or ReLU function). In the output layer of the fully connected layers, the binary cross-entropy loss is

L = − Σ_i [ y_i log p(y_i) + (1 − y_i) log q ],

where p(y_i) is the predicted output, y_i is the actual target value and q = 1 − p(y_i). The binary cross-entropy function is used to propagate errors backwards to update the weights of the CNN during training.
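The two parameter-count formulas and the binary cross-entropy above can be checked with a short sketch (ours, not the authors' code; the layer sizes are illustrative, and the 320 for 3×3 filters on a gray-scale input matches what frameworks such as Keras report):

```python
import math

def conv_params(w, h, p, c):
    """Learnable parameters of a conv layer: (w*h*p + 1) * c; the +1 is the bias."""
    return (w * h * p + 1) * c

def dense_params(p, c):
    """Learnable parameters of a dense layer: (p + 1) * c; the +1 is the bias."""
    return (p + 1) * c

def binary_cross_entropy(y_true, p_pred):
    """Mean of -[y*log(p) + (1-y)*log(q)] with q = 1 - p."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred)) / len(y_true)

assert conv_params(3, 3, 1, 32) == 320   # 3x3 filters, 1 input channel, 32 filters
assert dense_params(128, 10) == 1290     # 128 inputs, 10 output neurons
assert abs(binary_cross_entropy([1, 0], [0.9, 0.1]) - (-math.log(0.9))) < 1e-12
```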

E. EXPLANATIONS

1) SALIENCY MAP
A Saliency Map [9] provides a visual explanation by highlighting important regions of an image that result in a prediction of a specific class. Specifically in this study, the image pixels correspond to bins that were created for WOE calculations. There are several variants of Saliency Map explanations, e.g., gradient explanation, gradient class activation map (Grad-CAM) explanation and layer-wise relevance propagation (LRP) explanation to mention a few. However, in this study we focused on gradient explanation and Grad-CAM.
For an image x belonging to a class c, the Saliency Map M is computed as the derivative of the prediction ĝ(x) with respect to the input image x,

M = ∂ĝ(x) / ∂x,

where ĝ is the class score function. Each pixel of M corresponds to the importance of the corresponding pixel of the input image [32], which is normally a class-specific score. The gradient measures how much a change in each input dimension would change the prediction ĝ(x) in a small neighborhood around the input [32].
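A minimal sketch of the idea (ours, not the authors' code): we approximate the gradient numerically for a toy linear score function, whose saliency map should equal its weight vector:

```python
def numerical_saliency(g, x, eps=1e-6):
    """Approximate M = dg(x)/dx pixel by pixel with central differences."""
    saliency = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        saliency.append((g(hi) - g(lo)) / (2 * eps))
    return saliency

# Toy class-score function: a linear model; its true gradient is its weight vector.
weights = [0.5, -2.0, 0.0, 1.5]
g = lambda x: sum(w * v for w, v in zip(weights, x))
M = numerical_saliency(g, [1.0, 1.0, 1.0, 1.0])
assert all(abs(m - w) < 1e-4 for m, w in zip(M, weights))
```

In practice the gradient is obtained by back-propagation through the trained CNN rather than by finite differences; this sketch only illustrates what the map measures.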

2) GRAD-CAM
Gradient-weighted Class Activation Mapping (Grad-CAM) [36] is determined using the last convolutional layer of a CNN model. Firstly, the gradient of the class score with respect to the feature map activations of the last convolutional layer, ∂y^c / ∂A^k, is determined via back-propagation, where y^c is the score for class c and A^k is the k-th feature map activation. Secondly, the neuron importance weights are calculated by global average pooling of these gradients,

α_k^c = (1/Z) Σ_i Σ_j ∂y^c / ∂A_ij^k,

where Z is the number of pixels in the feature map. Lastly, each importance weight is multiplied by the corresponding feature map activation and the products are summed (followed by a ReLU) to form a saliency map,

L^c_Grad-CAM = ReLU( Σ_k α_k^c A^k ),

where L^c_Grad-CAM ∈ R^{u×v} and u × v is the size of the feature map. The saliency map produced by Grad-CAM highlights the important regions in an image that correspond to the predicted class.
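Grad-CAM's three steps can be sketched with NumPy as follows (our illustration; the activation and gradient tensors are random stand-ins for a trained network's):

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM sketch. feature_maps and gradients have shape (u, v, k),
    with gradients holding dy^c / dA^k. Returns a u x v saliency map."""
    alphas = gradients.mean(axis=(0, 1))        # neuron importance weights (GAP)
    cam = (feature_maps * alphas).sum(axis=-1)  # weighted sum over the k maps
    return np.maximum(cam, 0)                   # ReLU: keep positive influence only

rng = np.random.default_rng(0)
A = rng.random((4, 4, 8))        # hypothetical last-conv-layer activations
G = rng.random((4, 4, 8)) - 0.5  # hypothetical gradients of the class score
cam = grad_cam(A, G)
assert cam.shape == (4, 4) and (cam >= 0).all()
```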

3) LIME
Local Interpretable Model-Agnostic Explanations (LIME) [37] is used to explain predictions of black-box models. LIME fits an interpretable model, for example a decision tree or a linear model, around an instance of interest to explain the instance's class prediction [37]. The idea behind explaining predictions is to foster trust in predictions if any actions need to be taken. For example, if a credit analyst depends on a model for granting/rejecting a loan application, he/she will need an explanation of the model's prediction in order to trust the prediction and subsequently make a decision. We use LIME to create explanations as follows.

a: CREATE PERTURBATIONS OF AN IMAGE OF INTEREST:
Let x be the original representation of an image. Image x is segmented into M super-pixels. A super-pixel is a group of pixels in an image. Let K be the number of perturbed images. A perturbation of image x is created by randomly turning some of the super-pixels on and off. This results in X̂ ∈ {0, 1}^{K×M}, where M = B × D (i.e. B is the maximum number of bins and D is the number of features) and each row of X̂ is an interpretable representation of one perturbed image. Here, a 1 represents that a super-pixel is on and a 0 represents that a super-pixel is off.

b: PREDICT CLASSES OF THE PERTURBED IMAGES:
A non-linear machine learning model g (in our case the model we have trained on the images) is used to predict the class of each perturbed image x^(i). The prediction for x^(i) is ĝ(x^(i)).

c: COMPUTE DISTANCES BETWEEN THE ORIGINAL IMAGE AND EACH OF THE PERTURBED IMAGES:
The distance between each perturbed image and the image being explained is computed using the cosine distance,

d( x, x^(i) ) = 1 − ( x · x^(i) ) / ( ‖x‖ ‖x^(i)‖ ),

where x is the original image with all super-pixels enabled and x^(i) ∈ {0, 1}^M, i = 1, ..., K, are the perturbed images.

d: COMPUTE THE WEIGHT FOR EACH PERTURBED IMAGE:
This is achieved by mapping each of the distances calculated above through a kernel function that takes values between zero and one,

π_x( x^(i) ) = exp( − d( x, x^(i) )² / σ² ),

where σ is the width of the kernel.

e: FIT A SPARSE LINEAR MODEL:
The interpretable representations X̂ of the perturbed images, the predictions ĝ(x^(i)) and the weights π_x(x^(i)) are then used to fit a sparse linear model,

ĝ( x^(i) ) ≈ β_0 + Σ_{j=1}^{M} β_j x̂_j^(i), ∀i ∈ {1, 2, ..., K},

where M represents the number of super-pixels. Each coefficient β_j in the sparse linear model corresponds to one super-pixel in the segmented image. The coefficients represent how important each super-pixel is for the prediction of the class of interest. Thus, the top super-pixels (in terms of coefficient magnitude) are used to explain the prediction made by the model g.
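Steps a-e can be sketched end to end as follows (our illustration; the black-box model is a toy linear stand-in for the trained CNN, and the super-pixel count and kernel width σ = 0.25 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
M, K = 10, 200                 # super-pixels and number of perturbed images
sigma = 0.25                   # kernel width

w_true = rng.random(M)                    # toy stand-in for the trained CNN:
black_box = lambda z: z @ w_true          # a fixed linear scoring function

x = np.ones(M)                                          # original: all super-pixels on
X_hat = rng.integers(0, 2, size=(K, M)).astype(float)   # a: random perturbations
preds = black_box(X_hat)                                # b: black-box predictions

# c: cosine distance of each perturbation to the original image
cos = (X_hat @ x) / (np.linalg.norm(X_hat, axis=1) * np.linalg.norm(x) + 1e-12)
dist = 1.0 - cos
weights = np.exp(-(dist ** 2) / sigma ** 2)             # d: exponential kernel

# e: weighted least-squares fit of the interpretable linear surrogate
sw = np.sqrt(weights)
design = np.hstack([np.ones((K, 1)), X_hat])
beta, *_ = np.linalg.lstsq(sw[:, None] * design, sw * preds, rcond=None)
coeffs = beta[1:]                          # one coefficient per super-pixel
top = np.argsort(-np.abs(coeffs))[:3]      # most influential super-pixels
assert coeffs.shape == (M,) and len(top) == 3
```

Because the toy black box is itself linear, the surrogate recovers its weights exactly; with a real CNN the fit is only locally faithful around x.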

4) SHAP VALUES
The SHAP values [38] use ideas from coalitional game theory, where there are players, a game and a payout. The aim of game theory is to distribute the payout among the players based on their contributions. In machine learning, the players are the feature values of a tabular dataset, the game is the prediction task, and the payout is the prediction minus the average prediction over all dataset records. Hence, the SHAP value for any record is the average marginal contribution of a feature across all possible feature coalitions. Suppose we have a linear model and we want to explain the prediction for a record f^(i).
The prediction for f^(i) is

ĝ( f^(i) ) = β_0 + Σ_d β_d f_d^(i),

where the f_d^(i) are the feature values. Let φ_j be the contribution of feature j to the prediction ĝ(f^(i)). The contribution φ_j is given as

φ_j = β_j f_j^(i) − β_j E(F_j),

where E(F_j) is the expected value of feature j. If all feature contributions are summed for one record, the result is

Σ_j φ_j = ĝ( f^(i) ) − E( ĝ(F) ),

which is the difference between the predicted value for record f^(i) and the average predicted value over the dataset. For machine learning models, we determine M coalitions/combinations of feature values for a record f^(i). There are 2^d possible coalitions of feature values, which makes the computations expensive; as a result, we choose M such that M ≪ 2^d. In order to assess the contribution of each feature when explaining the prediction of a record, we make predictions for all M coalitions. Since we are using a subset of coalitions, the contribution of each feature is an estimate given as

φ̂_j = (1/M) Σ_{m=1}^{M} [ ĝ( f_{+j}^m ) − ĝ( f_{−j}^m ) ],

where ĝ(f_{+j}^m) is the prediction for record f, but with a random number of feature values replaced by feature values from a random data point z in the dataset F, with the exception of the respective value of feature j. The prediction ĝ(f_{−j}^m) is identical to ĝ(f_{+j}^m), but the respective value of feature j is also taken from z. In general, the exact Shapley value is

φ_j = Σ_{S ⊆ {1,...,d}\{j}} [ |S|! (d − |S| − 1)! / d! ] [ val( S ∪ {j} ) − val( S ) ],

where S is a subset of the feature values of the record f that needs to be explained and val(S) is the prediction made using only the feature values in S. For images, a player is a group of features; for example, pixels can be grouped together to form super-pixels. Hence, each super-pixel represents a player.
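The Monte Carlo estimate φ̂_j can be sketched as follows (our illustration; the model, dataset and coalition count are toy stand-ins):

```python
import numpy as np

def shap_estimate(g, f, data, j, M=2000, seed=0):
    """Monte-Carlo estimate of phi_j: average of g(f_plus_j) - g(f_minus_j)
    over M random coalitions, with non-coalition values drawn from a random
    background row z."""
    rng = np.random.default_rng(seed)
    d = len(f)
    total = 0.0
    for _ in range(M):
        z = data[rng.integers(len(data))]                # random background record
        mask = rng.integers(0, 2, size=d).astype(bool)   # random coalition of features
        f_plus = np.where(mask, f, z)
        f_plus[j] = f[j]                                 # feature j taken from f ...
        f_minus = f_plus.copy()
        f_minus[j] = z[j]                                # ... vs. taken from z
        total += g(f_plus) - g(f_minus)
    return total / M

# Toy linear model: the exact contribution is beta_j * (f_j - E[F_j]).
beta = np.array([1.0, -2.0, 0.5])
g = lambda v: v @ beta
data = np.array([[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]])  # E[F_j] = 1 for every feature
f = np.array([3.0, 1.0, 1.0])
phi_0 = shap_estimate(g, f, data, j=0)
assert abs(phi_0 - beta[0] * (f[0] - 1.0)) < 0.2     # close to 1 * (3 - 1) = 2
```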

V. EXPERIMENTS
A. DATA
Three real-world credit datasets were used in this study, i.e. German, Australian [39] and Home Equity (HMEQ) [40]. The number of samples, the number of features and the types of features for each dataset are shown in Table 3. All three are publicly available credit scoring datasets from the UCI repository and Kaggle. The first two are the most used datasets in the credit scoring literature [2]. The German credit dataset has 20 features, of which 7 are numerical and 13 are categorical. These features include status of existing checking account, duration in months, credit history and purpose, to mention a few. The Australian credit dataset has 14 features in total, of which 6 are numerical and 8 are categorical; the original feature names have been anonymized for confidentiality. The HMEQ dataset has 13 features, of which 11 are numerical and 2 are categorical. The features include amount of the loan request, amount due on the existing mortgage, value of the current property, years at present job and number of credit lines, among others. The response variable for all three datasets is binary, i.e. applicants are classified either as ''defaults'' or ''non-defaults''. Table 2 shows the WOE bins for each feature in the German, Australian and HMEQ datasets. Note that unpredictive and weak features were removed according to the information value. Figure 2 shows an example where an applicant's information is converted into an image; in Figure 2 the columns represent features and the rows represent the bins used for the WOE calculations.
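The record-to-image conversion can be sketched as follows. This is a simplified illustration assuming each feature's WOE bins are given by ascending numeric edges; the actual bins used in this study come from the WOE procedure summarized in Table 2, and all names here are illustrative.

```python
import numpy as np

def record_to_image(record, bin_edges, n_rows):
    """Convert one tabular record into a binary image.

    record    : list of numeric feature values
    bin_edges : per-feature list of ascending bin edges (as used for WOE)
    n_rows    : image height = maximum number of bins over all features;
                features with fewer bins are zero-padded.

    Columns index features, rows index bins; the pixel of the bin that
    a feature value falls into is set to 1.
    """
    n_cols = len(record)
    img = np.zeros((n_rows, n_cols), dtype=np.float32)
    for col, (value, edges) in enumerate(zip(record, bin_edges)):
        row = int(np.digitize(value, edges))   # index of the WOE bin for this value
        img[min(row, n_rows - 1), col] = 1.0
    return img
```

Pixels in rows beyond a feature's last bin remain zero, corresponding to the padding mentioned in the explanations section.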

B. MODEL
Each of the datasets was split into a 70% training set and a 30% test set, following common practice in the literature. The samples in each split were shuffled and thereafter stratified random sampling was performed for each split. The training set was used to determine the optimal parameters of the CNN model, and the test set was used to assess its performance. Table 4, Table 5 and Table 6 show the different architectures of the CNN models for the Australian, German and HMEQ datasets, respectively. For the Australian dataset, the CNN model was trained for 100 epochs with a batch size of 54. For both the German and HMEQ datasets, the CNN model was trained for 100 epochs with a batch size of 64. The 2D CNN model was then compared to a 1D CNN model, trained for 1000 epochs with a batch size of 32 on each dataset. The purpose of the 1D CNN model is to assess performance when the data is not converted into images; for the 1D CNN model we used the raw tabular dataset. Table 7 shows two different architectures for the 1D CNN model. Please note that 2D CNNs are not applicable to raw tabular datasets, hence we opted for a 1D CNN model.
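To make the role of the 2D convolution concrete, the sketch below shows the core operation a convolutional layer applies to a bin-image: sliding a small kernel over the image and summing elementwise products. The actual architectures used in this study are those in Table 4, Table 5 and Table 6; this is only an illustrative single-channel ''valid'' convolution.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """'Valid' 2-D cross-correlation of a single-channel image with one
    kernel -- the basic operation a 2D CNN layer applies to a bin-image."""
    H, W = img.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # elementwise product of the kernel with the image patch, summed
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out
```

In a trained network, each filter learns to respond to particular joint patterns of active feature bins; stacking such layers with nonlinearities yields the architectures compared in the tables.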

C. PREDICTION PERFORMANCE MEASURES
For each of the datasets, we calculated ''Accuracy'', which measures the proportion of correctly classified examples and is defined as
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
where $TP$, $TN$, $FP$ and $FN$ are the numbers of true positives, true negatives, false positives and false negatives, respectively. We also calculated other performance metrics, which are defined next. The ''Area Under the Curve'' (AUC) [15] measures the predictive power of a model as the area under the ROC curve, which plots the true positive rate against the false positive rate over all classification thresholds. The higher the AUC, the better the predictive power of the model. The ''Brier Score'', named after Glenn Brier [41], calculates the mean squared error between the predicted probabilities and their respective true class values:
$$\text{BS} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2,$$
where $\hat{y}_i$ is the predicted probability for record $i$, $y_i \in \{0, 1\}$ is the true class and $N$ is the number of records. The Brier Score takes values between 0 and 1; the lower the Brier Score, the better calibrated the predictions are. The ''H-measure'' proposed by Hand [42], on the other hand, assesses the cost of misclassification and uses a severity ratio (SR) given as
$$SR = \frac{c_0}{c_1},$$
where $c_0 > 0$ is the cost of misclassifying a class 0 record as class 1 and $c_1 > 0$ is the cost of misclassifying a class 1 record as class 0. The higher the H-measure, the lower the misclassification cost.
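The accuracy, Brier Score and AUC defined above can be computed directly; the sketch below is a plain NumPy illustration (the AUC uses the equivalent rank formulation: the probability that a randomly chosen positive is scored above a randomly chosen negative). The H-measure is omitted here since it requires the cost distribution machinery of [42].

```python
import numpy as np

def accuracy(y_true, y_prob, threshold=0.5):
    """Proportion of records whose thresholded probability matches the label."""
    return float(np.mean((np.asarray(y_prob) >= threshold) == np.asarray(y_true)))

def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and {0,1} labels."""
    y_true, y_prob = np.asarray(y_true, float), np.asarray(y_prob, float)
    return float(np.mean((y_prob - y_true) ** 2))

def auc(y_true, y_prob):
    """AUC as the probability that a random positive outranks a random
    negative (ties count one half) -- equivalent to the area under the ROC."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    pos, neg = y_prob[y_true == 1], y_prob[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return float(wins / (len(pos) * len(neg)))
```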

D. EXPLANATIONS
After training the CNN models and assessing their performances, explanation methods, namely LIME, SHAP values, Grad-CAM and Saliency Map, were used to explain individual predictions. For LIME, we set the kernel width $\sigma^2 = 0.25$ and the number of perturbed images $K = 150$. This study focused on local explanations (i.e. explaining individual predictions) as opposed to global explanations (i.e. explaining holistically how the model makes predictions).

VI. RESULTS
A. MODEL PERFORMANCE
We created three CNN architectures (please refer to Table 4, Table 5 and Table 6) to perform contrast experiments on the design of the architectures. For each architecture we used different activation functions, different numbers of layers and different numbers of filters in each layer. Table 8, Table 9 and Table 10 show confusion matrices, AUC, Brier Score and H-measure for each of the CNN architectures on all three datasets. Among the three tables, Table 8, which corresponds to CNN Architecture #1, shows better performances on almost all three datasets. Hence, our proposed method is based on CNN Architecture #1. The results show that the proposed method achieves similar performances on the training and test sets, hence it achieves good generalisation. We compared the 2D CNN model with the 1D CNN model to assess the impact of converting data into images; Table 11 and Table 12 show the results of the 1D CNN model. Comparing the results in Table 8 with those in Table 11 and Table 12, the 2D CNN model performs better than the 1D CNN model on the German and Australian datasets across all performance metrics. On the HMEQ dataset, however, the 2D CNN and 1D CNN perform similarly in terms of accuracy and Brier Score, but the 2D CNN shows better results on the AUC and the H-measure. These results support the conversion of tabular datasets into images in order to leverage the high performance of 2D CNNs. The results found in the literature, shown in Table 13, were compared with the results of the proposed method; the proposed method shows in most cases better performances than the methods from the literature considered in this paper. The results in Table 13 and Table 8 support the efficacy of the proposed 2D CNN model in accurately assessing credit applicants. However, we also want to make sure that we have meaningful and good explanations for the proposed model's decisions.
In the following, we qualitatively and quantitatively evaluate the explanations of the proposed method generated by four different explanation methods (i.e. LIME, Saliency Map, Grad-CAM and SHAP values).

B. EXPLANATIONS
For each of the datasets, we randomly selected an image representing the corresponding record in the tabular dataset. Thereafter, we predicted the class for each of the randomly selected images. We then used Grad-CAM, LIME, SHAP values and Saliency Map to explain each predicted class (i.e. defaults = 1 and non-defaults = 0). Figure 3, Figure 4 and Figure 5 show explanations for each of the predictions (see Table 2 for further details on features). The white blocks/pixels in LIME are super-pixels that are important in making a prediction of a class. For the Saliency Map, the color-bar shows which pixels are important in making a prediction of a class. For SHAP values, the red pixels contribute more to a class prediction and the blue pixels contribute less. To explain in a form understandable to a human, we use Table 2, where each bin corresponds to a pixel.
For example, LIME in Figure 5 explains the reason why a customer is flagged as a defaulted customer. The reason is that the amount due on the existing mortgage is at least $88,210, the value of the current property is at least $106,450, the number of credit default lines is at most 1, the age of the oldest credit line is between 130 and 166 months and the debt-to-income ratio is at least 31%. We can interpret this explanation as follows: the customer has defaulted once, he/she still owes a large amount on his/her mortgage, almost a third of his/her income goes toward debt and he/she has one credit line that is older than 11 years. The Saliency Map in Figure 5 explains that the oldest credit line for the same customer is between 130 and 166 months, the amount due on the existing mortgage is less than $42,000 and the value of the current property is less than $72,000. SHAP values in Figure 5 explain the following: the amount due on the existing mortgage is at least $88,210, the customer does office work, the value of the existing property is at least $106,450 and the oldest credit line is between 279 and 1168 months. Grad-CAM in Figure 5 explains that the customer requested a loan of at least $13,000 and the amount due on the existing property is at least $88,210.
It is worth mentioning that for the Australian credit dataset all feature names and values were anonymized to protect the confidentiality of the data; hence, we cannot really interpret Figure 3. Next, we explain Figure 4, where the correct class is a defaulted customer. Starting with LIME, the defaulted customer is 34 years or older, owns real estate, his/her account is in arrears and his/her account balance is less than 200 Deutsche Marks. The same defaulted customer is explained by the Saliency Map as 34 years or older and unemployed. For SHAP values, the customer is 34 years or older, the account is in arrears, he/she owns real estate and the account balance is less than 200 Deutsche Marks. For Grad-CAM in Figure 4, the customer has a balance that is less than 200 Deutsche Marks and he/she has been working for at least 4 years but less than 7 years. This illustrates how one would explain the predictions using the corresponding bins in Table 2. Please note that we padded the images to compensate for empty bins; if an explanation highlights a pixel that falls on padding, we ignore the importance of that pixel because it carries no meaning.

C. SANITY CHECK FOR EXPLANATIONS
Once an explanation is determined for a prediction, the next step is to assess its validity. To do this, we followed a process suggested in [32]. Firstly, a model is trained on the original labels and an explanation is provided for a prediction of image x. Secondly, another model with the exact same architecture as the previous model is trained on a copy of the original data, but with randomly permuted labels. The reason for randomizing the labels is to break the relationship that exists between the input features and the labels [32]. An explanation is then provided for the prediction of a randomly permuted label for image x. Thirdly, the explanations for image x in both scenarios are compared. If the explanation for image x did not change after randomly permuting the labels, then the explanation on the original labels did not explain anything about the relationship between the inputs and the prediction. Since the visual comparison of explanations could be misleading, a quantitative measure such as the Spearman rank correlation is used to ascertain whether the explanations are the same or differ. The Spearman rank correlation is calculated as
$$\rho = 1 - \frac{6 \sum m^2}{l(l^2 - 1)},$$
where $l$ is the number of categories that need to be ranked and $m = \text{Rank}(x)_o - \text{Rank}(x)_p$ is the difference between the rankings of the explanations on the original and the permuted labels, respectively. Table 14 shows correlations for each explanation technique for Figure 3, Figure 4 and Figure 5 on each dataset. For the Australian data, LIME shows a weaker correlation. For the German data, SHAP values and Grad-CAM show weaker correlations. For HMEQ, LIME shows a weaker correlation. According to [32], weaker correlations mean that the explanations do pass the sanity check.
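The Spearman rank correlation used for the sanity check can be computed as follows; this is a minimal sketch assuming no ties among the importance scores.

```python
import numpy as np

def spearman_rank_corr(a, b):
    """Spearman rank correlation between two importance vectors, e.g. the
    explanation on the original labels vs. the one on permuted labels.
    Assumes no ties (the simple sum-of-squared-rank-differences formula)."""
    def rank(v):
        order = np.argsort(v)
        r = np.empty(len(v))
        r[order] = np.arange(1, len(v) + 1)   # ranks 1..l
        return r
    a, b = np.asarray(a, float), np.asarray(b, float)
    l = len(a)
    m = rank(a) - rank(b)                     # rank differences
    return 1.0 - 6.0 * np.sum(m ** 2) / (l * (l ** 2 - 1))
```

A correlation near 1 means the explanation survived label randomization (failing the sanity check), while a weak or negative correlation means it depends on the learned input-label relationship.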

D. QUANTITATIVE COMPARISON OF EXPLANATION METHODS
This section compares all four explanation methods quantitatively. Suppose $L_R$ is an xgboost model trained on the raw dataset and $L_E$ is another xgboost model trained on the explanations of the predictions from our model. The performance of $L_R$ on the raw test set is denoted $P_R$ and the performance on the explanations test set is denoted $P_E$. We hypothesise that for good explanations, $P_E > P_R$. Note that the performance metric we use is the accuracy, i.e. the proportion of correctly classified records. We tested our hypothesis on all three datasets and found that $P_{E_i} > P_R$ for all $i \in$ {Grad-CAM, Saliency Map, LIME, SHAP values}. Thereafter, we compared the performances of the explanation methods used in this study; the best performing explanation method has $P_{E_i} > P_{E_j}$ for all $j \neq i$. To conduct the comparisons, firstly we split each raw dataset into a 70% training set and a 30% test set. To avoid confusion, for each dataset we performed only a single data split, in which we randomly shuffled the records and performed stratified sampling. Secondly, we predicted the class of each record in each dataset (both training and test sets) using our CNN model. Thirdly, we explained each of the predictions using Grad-CAM, Saliency Map, LIME and SHAP values. Fourthly, we trained $L_R$ on the 70% raw dataset and assessed its performance $P_R$ on the 30% raw test set. Thereafter, we trained $L_E$ on the explanations of the predictions from the 70% raw dataset and assessed the performance $P_E$ on the explanations of the predictions from the 30% raw test set. Table 15 shows the performances of the explanation methods together with the performances on the raw datasets when using the xgboost model. Note that the performances were based on the 30% test set. On all three datasets, the SHAP values improve accuracy to 99% and 100%, which are state-of-the-art results for all three datasets.
This shows that the SHAP values may be used for explanations as well as for feature selection to improve model performance.
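The comparison protocol above can be sketched as follows. Since xgboost follows the scikit-learn estimator API, we use scikit-learn's GradientBoostingClassifier as a stand-in for the xgboost model; the function name and the synthetic usage are illustrative only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def compare_raw_vs_explanations(X_raw, X_expl, y, seed=0):
    """Train one model on raw features (L_R) and one on explanation
    vectors (L_E) under the same shuffled, stratified 70/30 split,
    and return the two test accuracies (P_R, P_E)."""
    idx = np.arange(len(y))
    tr, te = train_test_split(idx, test_size=0.3, stratify=y,
                              shuffle=True, random_state=seed)
    scores = []
    for X in (X_raw, X_expl):
        model = GradientBoostingClassifier(random_state=seed)
        model.fit(X[tr], y[tr])
        scores.append(model.score(X[te], y[te]))   # accuracy on the 30% test set
    return tuple(scores)  # (P_R, P_E)
```

Under the hypothesis in the text, a good explanation method yields P_E > P_R; ranking the P_E values across methods gives the comparison reported in Table 15.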

VII. CONCLUSION AND FUTURE WORK
In this paper, tabular datasets were converted into images using the bins that were used to calculate weights of evidence. Each pixel of a feature image corresponds to a feature bin. The purpose of using weights of evidence was to create meaningful bins that are monotonic with respect to the response variable. For instance, in credit scoring, a feature such as age should relate monotonically to the default risk response variable: younger applicants are riskier than older applicants in terms of defaulting on their loans, so the risk of defaulting decreases monotonically with increasing age. The other reason for using the weight-of-evidence bins in this study is that both continuous and categorical features can be accommodated. The images were then used to train 2D convolutional neural networks. The performances of the trained convolutional neural networks were compared with results from the literature, and we found that the trained convolutional neural networks performed better. However, the aim of our study was not only model performance but rather explaining predictions. The predictions of the models were therefore explained using Grad-CAM, LIME, SHAP values and Saliency Map. Each of these explanation methods highlights important regions/pixels in an image that correspond to the output/prediction class. To explain in an understandable form, all important pixels were linked to their respective bins that were used to calculate weights of evidence. The explanation techniques were validated using a sanity check methodology. Furthermore, the performances of the explanation methods were compared quantitatively using an xgboost model; the explanation method that performed best across all three datasets was SHAP values.
Future research should focus on validating and evaluating the performances of the explanations by using domain experts in the field of credit scoring, such as credit risk analysts or managers. Future work should also compare the performances of methods that convert tabular data into images. Further, the proposed framework can be extended to other deep learning architectures such as ResNet, Deep Graph Neural Networks and others.