Model-agnostic Counterfactual Explanations in Credit Scoring

The past decade has shown a surge in the use and application of machine learning and deep learning models across different domains. One such domain is credit scoring, where applicants are scored to assess their creditworthiness for loan applications. During the scoring process, it is crucial to ensure that no bias or discrimination is introduced. Despite the proliferation of machine learning and deep learning models (referred to as black-box models in the literature) in credit scoring, there is still a need to explain how each prediction is made by these black-box models. Most machine learning and deep learning models are prone to unintended bias and discrimination that may occur in the datasets. To avoid model bias and discrimination, it is imperative to explain each prediction during the scoring process. Our study proposes a novel optimisation formulation that generates a sparse counterfactual via a custom genetic algorithm to explain a black-box model's prediction. This study uses publicly available credit scoring datasets. Furthermore, we validated the generated counterfactual explanations by comparing them to counterfactual explanations from credit scoring experts. The proposed explanation technique not only explains rejected applications; it can also be used to explain approved loans.


I. INTRODUCTION
The past decade has shown a surge in the use and application of machine learning and deep learning models across different domains. One such domain is credit scoring, where applicants are scored to assess their creditworthiness for loan applications. During the scoring process, it is crucial to ensure that no bias or discrimination is introduced. Despite the proliferation of machine learning and deep learning models in credit scoring, there is still a need to explain how each prediction is made by the models. Most machine learning and deep learning models act as black-box models; hence, these models are likely to be susceptible to unintended bias and discrimination that may occur in the datasets. To avoid unintended model bias and discrimination, it is important to explain each prediction during the scoring process. The Basel Accord [1] requires explanations for denied loan applications in order to assure that transparency is maintained in automated decisions in the financial sector. Emerging regulations such as the European Union General Data Protection Regulation (GDPR) [2] stipulate that, for automated decisions, a "right to explanation" has to be maintained. Prediction explanations are crucial for high-stakes decisions such as the presence or absence of a disease in the healthcare sector, rejection or acceptance of a loan application in the finance sector, and denial or approval of parole in the criminal justice sector.
A great amount of work in the literature addresses explainable artificial intelligence (XAI), an emerging branch of artificial intelligence. The main purpose of XAI is to make deep learning and machine learning models interpretable. A black-box model can be explained using either a post-hoc, an ante-hoc or an instance-based explanation. A post-hoc explanation uses another model, such as a linear regression or decision tree, to explain the behaviour of a black-box model (e.g. Local Interpretable Model-Agnostic Explanations (LIME) [3]); an ante-hoc explanation is an inherently interpretable model (e.g. Bayesian Rule Lists [4]); and an instance-based explanation uses an instance to explain the behaviour of a black-box model (e.g. Visual Counterfactual Explanations (ViCE) [5]). Rudin [6] posited that inherently interpretable models (i.e. ante-hoc explanations) should be used for high-stakes decisions in lieu of explainable machine learning (referring to post-hoc explanations), arguing that explainable machine learning generates explanations that are not faithful to what the black-box model computes. Hence, our study focuses on a counterfactual explanation technique, which falls under instance-based explanation.
Wachter et al. [7] highlighted three key benefits of using counterfactual explanations: (1) to inform and assist applicants in understanding why a certain decision was made, (2) to give grounds to contest unfair decisions, and (3) to understand what could be changed to attain a desired outcome in the future. A classical example of a counterfactual explanation is a loan application [8, 9]: "Imagine you applied for a credit at a bank. Unfortunately, the bank rejects your application. Now, you would like to know why. In particular, you would like to know what would have to be different so that your application would have been accepted. A possible explanation might be that you would have been accepted if you would earn 500$ more per month and if you would not have a second credit card." Despite the fact that counterfactuals seem to produce intuitive explanation systems, some problems remain. Most counterfactual explanation methods generate more than one counterfactual explanation; this problem is referred to as the Rashomon effect [9]. The generated explanations might have contradicting "paths" to how a certain output was reached, and it becomes more difficult and unclear for the user or applicant if there is more than one explanation to select from. Verma et al. [10] conducted a review of counterfactual explanations in machine learning applications. Among other things, the study highlighted metrics for evaluating generated counterfactuals, one of which is the counterfactual generation time. This metric can be taken as an average over the generation of a counterfactual for a group of records, or over the generation of multiple counterfactuals for a single input record [10].
Our study tries to address these issues and our contributions are as follows: 1) a novel formulation of the optimisation problem that leads to a single sparse counterfactual explanation using a custom genetic algorithm; 2) a better generation time for counterfactuals that is accounted by the normalisation of continuous features and the selection of predictive features; and 3) validating counterfactual explanations using credit scoring experts.
Throughout this paper, the terms counterfactual and counterfactual explanation are used interchangeably; likewise, the words sample, instance and record are used interchangeably.
The remainder of this paper is organised as follows. Section II discusses related work on counterfactual explanations. Section III introduces our proposed methodology for generating counterfactual explanations. The experiment setup is described in Section IV and the results are given in Section V. Finally, in Section VI, we summarise the paper and discuss our future work.

II. RELATED WORK
In recent years, research has shown an increase in the number of studies focused on explaining machine learning predictions using counterfactuals. Grath et al. [11] proposed two weighted approaches to produce counterfactuals for credit scoring data, where one approach derives weights from feature importance and the other depends on nearest neighbours. Empirical results showed that weights produced from feature importance yield more compact counterfactuals. Furthermore, the study produced positive counterfactuals for accepted loans to assist individuals when they make future financial decisions.
In most cases, counterfactuals are derived using a metaheuristic approach (i.e. an optimisation approach that is not problem/task dependent). Guidotti et al. [12] proposed a method that learns a local and interpretable classifier on a synthetic neighbourhood of a record of interest (i.e. an instance whose prediction needs to be explained). The synthetic neighbourhood is generated by a genetic algorithm. The proposed method produces a local explanation that consists of logical rules explaining the decision for the instance of interest, together with a set of counterfactuals which suggest changes to the instance of interest that would produce a desirable outcome. The results showed that the proposed method outperforms previous methods with regard to the quality of the produced explanations and faithfulness to the black-box model. Sharma et al. [13] proposed a unified and model-agnostic approach to address non-transparency (amongst other issues) of black-box models by using counterfactuals generated via a genetic algorithm. The proposed approach achieved robustness, transparency, interpretability and fairness of black-box models; the study intends to improve the speed of genetic algorithms in future work. Sharma et al. [13] is the closest study to our approach.
Once the counterfactuals are generated, the next thing to consider is their feasibility. A change in a small number of features makes counterfactuals feasible. Van Looveren et al. [14] proposed a framework for generating sparse and in-distribution counterfactuals. A sparse counterfactual refers to a counterfactual that requires minimal feature changes to belong to the desired class.
Poyiadzi et al. [15] argued that current methods that generate counterfactuals for explanation do not consider the feasibility of the generated counterfactuals in the real world. This is attributable to counterfactuals that do not represent the underlying data distribution. The study proposed a method that generates feasible and actionable counterfactual explanations based on the shortest path distance determined by density-weighted metrics. The proposed approach generates counterfactuals that are logical and consistent with the underlying data distribution, which makes the counterfactuals feasible and actionable. Mothilal et al. [16] posited that counterfactuals should be feasible and diversified. The study proposed a framework for generating and assessing the diversity of counterfactuals based on the determinant of a kernel matrix. The proposed framework generates more diverse counterfactuals than previous methods.
Efficiency and speed are key factors when generating counterfactuals. It is better to use fewer computing resources and reduce computation time; resource-intensive methods tend to result in high computation times. Artelt and Hammer [17] investigated how to efficiently compute counterfactuals for prototype-based classifiers. The study discovered that, in most cases, the sets of linear or convex quadratic programs that generate counterfactuals can be solved efficiently. Van Looveren and Klaise [18] proposed the use of class prototypes to speed up the generation of counterfactuals. The class prototypes are attained either by employing an encoder or by using class-specific k-d trees.
Not only speed and efficiency are key when generating counterfactuals, but also the safety of the black-box models. Counterfactuals can help preserve the safety of black-box models. Sokol and Flach [19] showed that when making AI systems explainable, there is a chance of compromising the safety and security of the system, as well as the possibility of data leakage. This poses a challenge to explainable AI systems, and the study suggests that the security of AI systems is not compromised when counterfactuals are used in lieu of other explainable AI techniques.
Most of the methods mentioned here generate more than one counterfactual. The problem with multiple counterfactuals is that they might have contradicting paths because of diversification. Another issue is the time it takes to generate counterfactuals, which can be impacted by the number of features in a dataset and by a large within-feature variance. In our study, we attempt to address these issues.

III. METHODOLOGY
This section outlines the methodology undertaken in this study. The key design decisions were to collect the datasets, select predictive features, calculate correlations between the target variable and the features, normalise each continuous feature of the datasets, formulate an optimisation problem that generates the counterfactuals, measure the time it takes to generate the counterfactuals, and compare the generated counterfactuals with experts' opinions. Each of these steps helped to obtain counterfactuals that are sparse, generated quickly and robustly tested.
Figure 1 shows the flowchart of the methodology. Firstly, we focus on feature selection, where we select predictive features using a random forest model. Secondly, we calculate Spearman's correlation coefficient between the target variable and the features. The aim of calculating the correlations is to create sparse counterfactuals: we only change features that are better correlated with the target variable when we generate the counterfactuals. The feature selection together with sparsity ties back to the feasibility of the counterfactuals, which requires a minimal number of features to be changed. Thereafter, the dataset is split into categorical and continuous features, and the continuous features are normalised. The purpose of normalising continuous features is to assure that the features have the same scale; this allows counterfactuals to be generated much more quickly because there is no large variation in the range of values within each feature. We proceed by merging the categorical and continuous features. The dataset is then split using K-fold cross validation, and a classifier is trained and its performance assessed. The purpose of the K-fold cross validation is to assure that model robustness is maintained during training. The final step is to generate a counterfactual that explains the predicted outcome for the data point of interest.

A. DATA PREPROCESSING
We selected predictive features via a random forest. The random forest is an ensemble of decision trees. At each node of a decision tree, a split is determined by the measure of impurity of each feature, using either the entropy or the Gini index:

$\text{Entropy} = -\sum_{j} p_j \log_2 p_j, \qquad \text{Gini} = 1 - \sum_{j} p_j^2,$

where $p_j$ represents the proportion of samples in class $j$. The importance or predictive power of each feature is derived from this impurity calculation. The more "pure" the splits a feature produces, the more predictive it is.
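The impurity-based ranking described above can be sketched with scikit-learn, whose random forest exposes Gini-based feature importances averaged over all trees. This is an illustrative sketch on synthetic data, not the paper's actual pipeline; the dataset, forest settings and the choice of keeping four features are assumptions.

```python
# Sketch of impurity-based feature selection with a random forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a credit scoring dataset (placeholder).
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Gini-based importances, averaged over all trees; they sum to 1.
importances = forest.feature_importances_
ranked = np.argsort(importances)[::-1]
selected = ranked[:4]  # keep the four most predictive features (assumed cut-off)
print("selected feature indices:", selected)
```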

B. COUNTERFACTUAL SPARSITY
To assure that the generated counterfactuals are sparse, we used Spearman's rank correlation coefficient. Spearman's correlation is based on the rankings of a feature as opposed to the raw feature values. The benefit of using rankings is that both continuous and categorical features can be used to calculate the correlation. Hence, sparsity is determined by the correlation between the target variable (which is categorical in nature) and the predictors (which are either categorical or continuous). The Spearman's rank correlation is given as

$\rho = 1 - \frac{6 \sum_{i=1}^{n} z_i^2}{n(n^2 - 1)},$

where $z_i$ is the difference between the two ranks of each record and $n$ is the number of records. The aim of calculating the correlation is to focus only on features that are correlated with the target variable when the counterfactuals are generated.
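As a minimal sketch of this sparsity filter, Spearman's rank correlation can be computed with `scipy.stats.spearmanr` between each predictor and the binary target; features whose absolute correlation falls below a threshold are excluded from counterfactual search. The synthetic data and the 0.1 threshold are assumptions for illustration, not the paper's actual settings.

```python
# Spearman correlation between each predictor and the binary target,
# used to decide which features a counterfactual may change.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)        # binary target (default / non-default)
X = rng.normal(size=(200, 3))
X[:, 0] += 2 * y                        # make feature 0 track the target

corrs = []
for i in range(X.shape[1]):
    rho, _ = spearmanr(X[:, i], y)      # rank-based, works for ordinal data too
    corrs.append(abs(rho))

mutable = [i for i, c in enumerate(corrs) if c > 0.1]  # features the GA may alter
print(corrs, mutable)
```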

C. NORMALISATION OF CONTINUOUS FEATURES
Let $x_k \in \mathbb{R}^d$ denote a feature vector for record $k$, where $d$ represents the number of features in a dataset. The labeled dataset is represented by $X = \{(x_k, y_k)\}_{k=1}^{n}$, where $y_k \in \{0, 1\}$, $x_k = \{f_i\}_{i=1}^{d}$ and $n$ denotes the number of records. Let $f_i$ represent each feature in $X$, $\forall i \in \{1, 2, \dots, d\}$. Continuous feature values were converted using a normalisation technique such that

$f_i' = \frac{f_i - \min(f_i)}{\max(f_i) - \min(f_i)},$

where $\min(f_i)$ and $\max(f_i)$ are the minimum and maximum values of feature $i$, respectively.
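The min-max normalisation above can be written as a small helper, assuming the continuous features are held in a (records × features) array:

```python
# Min-max normalisation of continuous features, mapping each column to [0, 1].
import numpy as np

def min_max_normalise(F):
    """Normalise each column of a (records x features) array to [0, 1]."""
    F = np.asarray(F, dtype=float)
    lo, hi = F.min(axis=0), F.max(axis=0)
    return (F - lo) / (hi - lo)

F = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 300.0]])
print(min_max_normalise(F))
# first column becomes 0.0, 0.5, 1.0; second becomes 0.0, 1.0, 0.5
```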

D. GENETIC ALGORITHMS
The genetic algorithm is a heuristic search approach motivated by natural evolution [20]. It searches for values that minimise or maximise a given function; hence, genetic algorithms are used for optimisation problems. The genetic algorithm process involves selecting individuals (i.e. parents) from an initial population for reproduction, based on some defined fitness function. Individuals that are unfit for reproduction are omitted. The parents produce offspring that inherit the characteristics of the parents, and the offspring are included in the next generation of the population. The process repeats itself until it converges to a solution. The genetic algorithm process involves five phases: initial population, fitness function, selection, cross-over and mutation. In the context of credit scoring, a population individual is a feature vector, and this feature vector is regarded as a counterfactual. To select the best counterfactuals, a fitness function is used. The fitness function is the black-box model that will be explained by the counterfactual, and its output is binary (i.e. default or non-default). The quality of each counterfactual is determined by the output of the fitness function being non-default, and by the distance between the counterfactual and the feature vector of interest (i.e. the feature vector whose prediction we want to explain) being less than some epsilon value. Selected counterfactuals go to a mating pool; these counterfactuals are known as the parents. Every two parents produce two offspring (i.e. two counterfactuals). The mating process is the cross-over phase. The cross-over is based on a cross-over rate, defined as the probability of two parents crossing over at a single point [20]. Mating high-quality parents generates better-quality offspring with similar traits to the parents, which prevents bad population individuals from generating more bad individuals. Note, however, that the offspring will have similar drawbacks to their parents. To overcome these drawbacks, some changes are made to the offspring in order to generate new offspring; these changes are known as the mutation phase. Mutation randomly changes values in a counterfactual based on a mutation rate, defined as the probability that determines the number of counterfactuals that should be mutated in a single population [21]. The main purpose of mutation is to preserve diversity in a population, which prevents early convergence to a solution.
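The five phases can be sketched as a minimal genetic-algorithm loop. This is illustrative only: the fitness below is a toy function to be maximised, not the paper's black-box credit model, and the rates and sizes are assumptions.

```python
# Minimal genetic algorithm: population, fitness, selection, cross-over, mutation.
import random

random.seed(0)
D = 5                        # genes per individual (features per vector)
POP, GENS = 30, 50
CROSS_RATE, MUT_RATE = 0.7, 0.05

def fitness(ind):            # toy objective: maximise the sum of genes
    return sum(ind)

# Initial population of random individuals with genes in [0, 1].
pop = [[random.random() for _ in range(D)] for _ in range(POP)]

for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:POP // 2]                  # selection: keep the fittest half
    children = []
    while len(children) < POP - len(parents):
        p1, p2 = random.sample(parents, 2)
        if random.random() < CROSS_RATE:      # single-point cross-over
            cut = random.randrange(1, D)
            child = p1[:cut] + p2[cut:]
        else:
            child = p1[:]
        if random.random() < MUT_RATE:        # mutation: perturb one random gene
            child[random.randrange(D)] = random.random()
        children.append(child)
    pop = parents + children

best = max(pop, key=fitness)
```

In the paper's setting the individuals would be candidate counterfactuals and the fitness would combine the black-box model's output with the distance constraint.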

E. FORMULATION OF THE OPTIMISATION PROBLEM
This section defines our formulation of the optimisation problem, which is solved via a custom genetic algorithm in order to generate counterfactuals. Let $c^* \in \mathbb{R}^d$ denote a counterfactual for record $x \in \mathbb{R}^d$ and let $g$ denote a black-box model. The black-box model is given as $g : \mathbb{R}^d \to \{0, 1\}$, and the aim is to find $c^* \in \mathbb{R}^d$ such that

$g(c^*) = 0,$

subject to

$\text{dist}(c^*, x) < \epsilon \quad \text{and} \quad c^* \in C.$

The main idea behind the above optimisation problem is to find optimal values $c^* \in C$ that result in the non-default class $g(c^*) = 0$, ensuring that the generated counterfactual $c^*$ is as close as possible to the instance of interest $x$ and that $c^*$ comes from the same data distribution as $x$. According to Wachter et al. [7], closeness should be measured by the $L_1$ norm distance divided by the mean absolute deviation (MAD); they showed that this measure is better than the plain $L_1$ or $L_2$ norm distances. The first part of our distance is for continuous features and the second part is for categorical features. Since we treat categorical values as discrete, we use the mean deviation, which is suitable for discrete values. The choice of distance in our study is

$\text{dist}(c^*, x) = \sum_{i=1}^{d^*} \frac{|c_i^* - x_i|}{MAD_i} + \sum_{i=d^*+1}^{d} \frac{|c_i^* - x_i|}{MD_i},$

where $MAD_i$ and $MD_i$ denote the mean absolute deviation and the mean deviation for feature $i$, respectively, and $d^*$ denotes the number of continuous features. The mean absolute deviation for feature $i$ is

$MAD_i = \frac{1}{n} \sum_{k=1}^{n} |f_k(i) - \bar{f}(i)|,$

where $n$ denotes the number of records and $\bar{f}(i)$ denotes the mean of feature $i$. The mean deviation for feature $i$ is

$MD_i = \frac{1}{n} \sum_{l=1}^{m} o_l \, |f_l(i) - \tilde{f}(i)|,$

where $m$ is the number of categories, $o_l$ is the frequency of category $l$, $f_l(i)$ its value, and $\tilde{f}(i)$ is the median of feature $i$.
The threshold for the chosen distance is defined by $\epsilon$, which is provided by the user. The fitness function in our optimisation problem is $g(c)$. To explain the predictions of applicants that are approved for loans, the aim is to find optimal values of $c^*$ such that $g(c^*) = 1$, subject to the same constraints as in the formulation above. The major differences between our study and that of Sharma et al. [13] are 1) the formulation of the optimisation problem, 2) the distance calculation for categorical features, and 3) the generation of sparse counterfactuals. Sharma et al. [13] used a fitness function that combines a distance over the $n_{con}$ continuous features with a matching-based distance over the $n_{cat}$ categorical features, where $c^*_{cat}$ and $x_{cat}$ are the categorical attributes of the counterfactual $c^*$ and the record $x$, respectively. The simple matching coefficient is used in Sharma et al. [13] to deal with categorical features and is given as

$\text{SimpleMat} = \frac{\text{number of matching attributes}}{\text{total number of attributes}},$

which takes values between 0 and 1.
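The MAD/MD-weighted L1 distance of this section can be sketched as follows. The mean-deviation computation for categorical features reflects our reading of the (partly garbled) description, weighting category frequencies around the feature median, and should be treated as an assumption; the data and encoding are placeholders.

```python
# Sketch of the counterfactual distance: L1 terms scaled by MAD for
# continuous features and by the mean deviation (MD) for categorical ones.
import numpy as np

def mad(col):
    """Mean absolute deviation of a continuous feature about its mean."""
    return np.mean(np.abs(col - col.mean()))

def mean_deviation(col):
    """Mean deviation of a discrete (integer-coded) feature about its median.
    Assumption: category frequencies weight distances to the median."""
    cats, counts = np.unique(col, return_counts=True)
    med = np.median(col)
    return np.sum(counts * np.abs(cats - med)) / len(col)

def dist(c, x, X, n_cont):
    """MAD/MD-weighted L1 distance between counterfactual c and record x.
    The first n_cont columns of X are continuous, the rest categorical."""
    total = 0.0
    for i in range(len(x)):
        scale = mad(X[:, i]) if i < n_cont else mean_deviation(X[:, i])
        total += abs(c[i] - x[i]) / scale
    return total

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=100),            # continuous feature
                     rng.integers(0, 3, size=100)])   # categorical, coded 0-2
x = X[0]
c = x.copy()
c[0] += 0.5            # a sparse counterfactual: only one feature changed
print(dist(c, x, X, n_cont=1))
```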

IV. EXPERIMENTS

A. DATA
Publicly available credit scoring datasets were considered in this study: German [22] and Home Equity (HMEQ) [23]. The German dataset can be accessed in the UCI repository, and the HMEQ dataset can be accessed on Kaggle. The German credit dataset has 20 features, 7 of which are numerical and the other 13 categorical. The German credit dataset has features such as status of existing checking account, duration in month, credit history and purpose, to mention a few. The HMEQ dataset has 13 features, 11 of which are numerical and the other 2 categorical. Its features include the amount of the loan request, amount due on the existing mortgage, value of the current property, years at present job and number of credit lines, to mention a few. The target variable for each of these credit scoring datasets is binary: applicants are classified either as "default" (i.e. bad applicants) or "non-default" (i.e. good applicants).

B. DATA SPLIT, MODEL TRAINING AND MODEL PERFORMANCE
For robustness of the classifier $g$, K-fold cross validation [24] was used. K-fold cross validation splits the data into $K$ folds $\{D_k\}_{k=1}^{K}$. Each fold $D_k$ is used in turn as a test set, and the remaining folds $\{D_j\}_{j \neq k}$ are used for model training. The overall performance metric for the model is

$g_p = \frac{1}{K} \sum_{k=1}^{K} g_p^k,$

where $g_p$ represents the average model performance metric (e.g. accuracy or Area Under the Curve (AUC)) and $g_p^k$ is the model performance metric on each test fold $D_k$. Note that the remaining folds $\{D_j\}_{j \neq k}$ were used in turn for training; we did not build $K$ distinct models.
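The K-fold evaluation above can be sketched with scikit-learn: each fold serves once as the test set, and the overall metric g_p is the average over the K test folds. The dataset and the choice of accuracy as the metric are placeholders.

```python
# K-fold cross validation: average the per-fold test metric to get g_p.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    clf = RandomForestClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

g_p = np.mean(fold_scores)   # overall metric: mean of the K per-fold metrics
print(f"per-fold accuracy: {fold_scores}, mean: {g_p:.3f}")
```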

C. EXPERIMENT SETUP
For the genetic algorithm, we ran four experiments on each dataset to select optimal parameters based on the execution times of the optimisation problem, the resulting objective function, and the distance between the record that needs to be explained and the generated counterfactual. Table 1 shows the different parameter values in our genetic algorithm experiments, together with the times (measured in seconds) to execute the optimisation problem, the resulting objective functions and the distances. For the German dataset, we chose ϵ = 4, mutation rate = 0.01, cross-over rate = 0.40 and population size = 30, since they give the least execution time, the desired objective function output (i.e. the non-default output), and the minimal distance. For the HMEQ dataset, we chose ϵ = 7, mutation rate = 0.04, cross-over rate = 0.70 and population size = 100, since they give the desired objective function output (i.e. the non-default output) and the least distance. The models used in our study were a random forest and an artificial neural network (this was to assure that our approach is model-agnostic). The choice of these models was based on our previous literature review study [25]. The parameters for the random forest model were set as follows: the maximum depth of each decision tree was set to 5 and the number of decision trees was set to 100. For the artificial neural network, we used 3 hidden layers: the first hidden layer had 50 neurons, the second 30 neurons and the third 5 neurons, with the output layer having 2 neurons. The parameters for both the neural network and the random forest were obtained using a grid-search approach. To train the models, we used K-fold cross validation with K = 5.
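For illustration, the two black-box models could be configured as described above using scikit-learn: a random forest with 100 trees of maximum depth 5, and a neural network with hidden layers of 50, 30 and 5 neurons. The paper does not state its implementation library, so treat this as an assumption; the toy dataset is also a placeholder.

```python
# The two black-box model configurations described in the text (assumed
# scikit-learn implementations; binary output handled by the classifiers).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
nn = MLPClassifier(hidden_layer_sizes=(50, 30, 5), max_iter=500, random_state=0)

rf.fit(X, y)
nn.fit(X, y)
rf_pred, nn_pred = rf.predict(X), nn.predict(X)
```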

D. EXPERIMENTS FOR CREDIT SCORING EXPERTS
Three experts from financial institutions were willing to partake in the experiments of this study. All of these experts have more than 7 years of working experience in the financial sector as quantitative analysts in the credit scoring space, and all have a background in developing credit scorecards from scratch.
The experts were given a data dictionary with a detailed description of each feature. The experts were asked to identify features that they deemed influential for the prediction of the record of interest, based on their domain expertise. This was to allow the credit experts to form a subjective opinion on the features they thought were key contributors to the credit scoring prediction. The credit risk experts then created their own counterfactuals and assessed the predictions of their counterfactuals using an online user interface. The experts were also asked to normalise all continuous features before using the online platform. The online user interface can be found at this link (https://xolani-explanation-research.herokuapp.com/). We created this user interface from scratch, and the scoring models used in the user interface for predictions are the ones that we explain using the counterfactuals in this study.

V. RESULTS
This section presents the results obtained from our experiments. It starts by looking at the features that are correlated with the target variable; thereafter, the model performances of the random forest classifier on the German credit dataset and the artificial neural network classifier on the HMEQ credit dataset are assessed. Lastly, explanations from our approach, from the approach of Sharma et al. [13], and from the experts are examined.

A. COUNTERFACTUAL SPARSITY
We assessed the correlations between the target variable and the predictors. The aim was to focus only on predictors that are better correlated with the target variable when generating our counterfactuals. Figure 2 and Figure 3 show correlation matrices for the German and HMEQ credit datasets, respectively. In Figure 2, the predictors that are better correlated with the target variable are Account Balance and Payment Status of Previous Credit. Please note that by "better correlated" we mean that, compared to the rest of the predictors, the selected predictors have a higher correlation with the target variable. In Figure 3, the predictors that are better correlated with the target variable are DEBTINC, DEROG and DELINQ.

B. MODEL PERFORMANCE
Table 2 shows classifier performance results reported in the literature, which we compared to the performance of the models used in our study. A detailed comparison of model performances in credit scoring can be found in our previous study [25]. In Table 2, the results do not differ significantly from each other, which supports the efficacy of our model choice. The main purpose of our study is not to compare model performances but to explain how a model made a certain decision and what actions are required to effect a desired outcome for the loan applicant.

TABLE 2. Classifier performance results reported in the literature for the German and HMEQ datasets (a dash denotes a value not reported).

Source  Year  Model                 German              HMEQ
                                    Accuracy  AUC      Accuracy  AUC
[26]    2000  Logistic Regression   0.76      -        -         -
[27]    2016  Neural Network        0.75      -        -         -
[28]    2017  XGBoost               0.77      -        -         -
[29]    2018

1) Prediction explanations from our approach and from the Sharma et al. [13] approach

Please note that in this study we also implemented, from scratch, the approach used in [13]. The explanations are given in Figure 4 and Figure 5 for the German and HMEQ credit datasets, respectively. Note that all continuous values in the figures were normalised; however, when providing explanations in text, the normalised values were denormalised to make sense of the explanations. Using our approach on the German credit data, the applicant would qualify for the loan if the Payment Status of Previous Credit decreases by 3 and the Account Balance decreases by 1. The explanation produced by the approach of [13] is shown in Figure 4.

TABLE 3. Comparison of our approach with the approach in [13] using different metrics. Legend: time (execution time for the optimisation problem, in seconds), func (objective function), dist (distance between the record that needs to be explained and the counterfactual). For the objective function, 0 denotes the non-default class and 1 denotes the default class.

Approach      German                         HMEQ
[13]          time=879, func=0, dist=5       time=722, func=0, dist=16
Our approach  time=386, func=0, dist=3.75    time=192, func=0, dist=6

The difference between our approach and that of [13] lies in the time it takes to generate a counterfactual and in the distance between the record of interest and the generated counterfactual. Since our approach uses a binary fitness function, convergence is much quicker than with the approach presented in [13]; in addition, our approach generates sparse counterfactuals, which results in small distances between the counterfactual and the record of interest. This is illustrated in Table 3. For the counterfactuals generated by the experts, in some cases the features that needed to be changed overlapped with the features changed by our approach. The experts select features that are most influential in determining the output of the model; our approach, on the other hand, selects features that are most likely to change the outcome of a prediction. Our approach can play a key role in explaining black-box models in credit scoring, and credit scoring experts can leverage the explanations from our approach, resulting in a human-machine partnership.

D. MEAN OPINION SCORE (MOS)
The set of features used in the counterfactual explanations generated by our approach was compared to the set of features chosen by the credit scoring experts. Let $L$ denote the number of credit scoring experts and $I$ denote the number of features used to generate the counterfactuals with our approach. We define $a_l^i$, which determines whether there is an overlap between the features from our approach and the $i$-th feature from the $l$-th expert opinion:

$a_l^i = \begin{cases} 1 & \text{if } e_l^i = m_i, \\ 0 & \text{otherwise}, \end{cases}$

where $e_l^i$ is the $l$-th expert's opinion for feature $i$ and $m_i$ is the feature selected by our approach. Hence, the mean opinion score is given as

$MOS = \frac{1}{L \cdot I} \sum_{l=1}^{L} \sum_{i=1}^{I} a_l^i.$

The MOS ranges between 0 and 1. Values close to 1 mean that the credit experts agree more with the features that need to be changed to generate a counterfactual, while values close to 0 mean that they agree less with the suggested changes. For the German credit dataset, MOS = 0.17, indicating that the credit experts agree 17% with the features that needed to be changed in order to generate a counterfactual. For the HMEQ credit dataset, MOS = 0.083, indicating 8.3% agreement. The low MOS values are due to the fact that our approach produces sparse counterfactuals, which results in few features needing to change.

FIGURE 5. Explanation of an HMEQ credit dataset record using our approach, [13] and experts. The red bars represent a decrease in a feature value and the green bars represent an increase in a feature value.
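Our reading of the MOS computation can be sketched as follows: count, over all (expert, feature) pairs, how often an expert's suggested feature matches one of the features selected by our approach, and divide by L·I. The expert feature lists below are invented for illustration.

```python
# Mean opinion score: fraction of (expert, feature) pairs where an expert's
# suggested feature matches a feature selected by our approach.
ours = {"Account Balance", "Payment Status of Previous Credit"}
experts = [                                   # one feature list per expert (invented)
    ["Account Balance", "Duration in Month"],
    ["Purpose", "Credit Amount"],
    ["Payment Status of Previous Credit", "Age"],
]

L = len(experts)                              # number of experts
I = len(ours)                                 # features used by our approach
overlaps = sum(1 for exp in experts for f in exp if f in ours)
mos = overlaps / (L * I)
print(mos)   # 2 matches out of 3*2 = 6 pairs -> 0.333...
```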

E. COMPARISON WITH STATE-OF-THE-ART COUNTERFACTUAL EXPLANATION METHODS
Table 4 shows comparisons with state-of-the-art counterfactual explanation methods. Factors such as the black-box nature of the models, the applicability of the approach to different classes of models (i.e. model-agnosticism), the involvement of domain experts, the sparseness of the counterfactuals, and quantitative assessments of the resulting counterfactuals were used for comparison. We observe that most methods focus on black-box models and the model-agnostic behaviour of the explanations, with the exception of [7, 16, 30]. Our approach addresses all the suggested factors; factors such as expert domain knowledge, sparseness of counterfactual explanations, and the ability to quantitatively assess the generated counterfactuals assure the robustness of the generated counterfactual explanations. Please note that the list of sources in Table 4 is not exhaustive.

VI. CONCLUSION AND FUTURE WORK
In this paper, we selected predictive features to expedite the generation of counterfactuals. The correlations between the selected features and the target variable were assessed for the purpose of generating sparse counterfactuals. All continuous features were then normalised to deal with high within-feature variability. A novel optimisation problem was formulated to generate counterfactuals via a custom genetic algorithm. In general, counterfactual explanations provide insights into the actions that need to be taken to obtain a desired outcome; in credit scoring, the desired outcome is an approved loan application. The explanations from the generated counterfactuals were compared to explanations provided by credit scoring experts. The results showed an overlap between some of the features that needed to be changed according to the proposed approach and according to the credit risk experts' opinions.
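To make the pipeline concrete, the sketch below illustrates the general idea of searching for a sparse counterfactual with a genetic algorithm that only queries a black-box model's predictions. The toy scorer, the fitness weights and the single-feature mutation scheme are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch: evolve a sparse counterfactual for a rejected applicant
# by minimising distance + sparsity, while requiring the black-box model to
# output the desired (approved) class. All numbers are illustrative.
import random

def black_box(x):
    # Toy stand-in for a trained scorer: approve (1) when the feature sum is high.
    return 1 if sum(x) >= 2.0 else 0

def fitness(cand, x, lam=0.5):
    dist = sum(abs(a - b) for a, b in zip(cand, x))      # proximity (L1 distance)
    changed = sum(1 for a, b in zip(cand, x) if a != b)  # sparsity (number of changes)
    valid = black_box(cand) == 1                         # only the model's output is queried
    return dist + lam * changed + (0.0 if valid else 10.0)  # lower is better

def genetic_counterfactual(x, pop_size=50, gens=100, seed=0):
    rng = random.Random(seed)

    def mutate(cand):
        cand = list(cand)
        i = rng.randrange(len(cand))            # perturb a single feature,
        cand[i] += rng.uniform(-1.5, 1.5)       # keeping changes sparse
        return cand

    # Initialise the population around the rejected applicant x.
    population = [[xi + rng.uniform(-1.5, 1.5) for xi in x] for _ in range(pop_size)]
    for _ in range(gens):
        population.sort(key=lambda c: fitness(c, x))
        parents = population[: pop_size // 2]   # elitist selection keeps the best half
        children = [mutate(rng.choice(parents)) for _ in range(pop_size - pop_size // 2)]
        population = parents + children
    return min(population, key=lambda c: fitness(c, x))

rejected = [0.2, 0.3, 0.4]          # black_box(rejected) == 0, i.e. a denied application
cf = genetic_counterfactual(rejected)
print(cf, black_box(cf))            # the returned candidate should flip the decision to 1
```

The design choice mirrors the narrative above: validity (reaching the desired class) dominates the fitness, while the distance and sparsity terms steer the search towards small, few-feature changes.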
In practice, credit scoring experts can leverage the explanations generated by our approach to explain the predictions of black-box models. Further, our results showed that, compared to state-of-the-art counterfactual explanation methods, our approach takes into consideration important factors such as sparseness and quantitative assessment of the generated counterfactuals. An immediate avenue for future work is the creation of a user interface (UI) for our approach. Future work should also extend this work to explain the overall workings of black-box models instead of individual instances.

Dastile, Celik and Vandierendonck: Model-agnostic Counterfactual Explanations in Credit Scoring

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3177783, IEEE Access

… output (i.e. the non-default output), and the minimal distance.

FIGURE 2. Spearman's rank correlation coefficient for the German credit dataset.

2) Explanations from credit scoring experts
The credit experts generated their own counterfactual explanations based on their domain knowledge. Please refer to Figure 4 and Figure 5 to see the explanations described in the text below. Expert number 1 on the German credit dataset stated that the applicant would qualify for the loan if the Payment Status of Previous Credit decreases by 4, Duration of Credit (month) increases by 10 and Age (years) increases by 7. Expert number 2 on the German credit dataset suggested that the applicant would qualify for the loan if the Duration of …

FIGURE 4. Explanation on a German credit dataset record using our approach, [13] and experts. The red bars represent a decrease in a feature value and the green bars represent an increase in a feature value.

TABLE 2. Literature results relating to the performance of credit scoring models.
Using the approach by [13] on German credit data, the applicant would qualify for the loan if the Account Balance decreases by 1, Credit Amount increases by 15630, Duration of Credit (month) increases by 43, Purpose increases by 5 and Age (years) decreases by 6. Using our approach on HMEQ credit data, the applicant would qualify for the loan if DEBTINC decreases by 32, DELINQ increases by 1 and DEROG increases by 3. Using the approach by [13] on HMEQ credit data, the applicant would qualify for the loan if LOAN increases by 39072, MORTDUE increases by 325939, VALUE increases …

TABLE 4. Comparison of our approach with state-of-the-art counterfactual methods.