A Novel GSCI-Based Ensemble Approach for Credit Scoring

Credit scoring is an efficient tool for financial institutions to implement credit risk management. In recent years, many novel machine learning models have been developed for credit scoring. Among the existing machine learning models, the heterogeneous ensemble model receives much attention because of its superior performance. This paper presents a new heterogeneous ensemble model based on the generalized Shapley value and the Choquet integral. To do this, the model first uses the fuzzy measure to express the interactive characteristics between any two coalitions of base learners. Based on an accuracy and diversity objective function, a linear programming model for determining the fuzzy measure is built. To retain as much of the original information as possible from the training stage, the normal fuzzy number is employed to express the base learners' predicted values. Then, the generalized Shapley Choquet integral (GSCI) aggregation operator is defined to calculate the comprehensive predicted value of the ensemble model. Based on the defined aggregation operator and linear programming model, a GSCI approach is proposed for ensemble credit scoring. To illustrate the efficiency and feasibility of the GSCI approach, an experiment with thirteen machine learning models over four public credit scoring datasets and three real-world P2P lending datasets with large volumes of samples is conducted. Furthermore, robustness tests and comparative analyses are conducted to demonstrate the adaptability and performance of the GSCI-based ensemble model.


I. INTRODUCTION
Credit risk is the main risk for financial institutions, and the effectiveness of credit risk management is a critical issue for the survival and development of financial institutions. Numerous credit scoring models have been developed for granting credit, and they evaluate credit risk by assigning applicants into 'good credit' or 'bad credit' classes. The discriminative ability of the credit scoring model is important for financial institutions. Even a slight improvement in predictive precision could result in a significant boost in profits or avoid great potential losses [1]-[3]. As a result, many machine learning techniques have been developed recently and have gained much attention, and they can be further split into single models, such as Neural Networks (NN) [21], Support Vector Machines (SVM) [22]-[24], Decision Trees (DT) [25], and Naive Bayes (NB) [26], and ensemble models, such as AdaBoost [27] and Random Forests (RF) [28], [29]. The advantages of machine learning techniques include the following: 1) they usually outperform traditional statistical techniques in terms of predictive accuracy [30]; 2) they are superior to statistical techniques in dealing with nonlinear pattern classification problems because they do not require a stable distribution assumption [1], [2], [31]; and 3) they also provide an effective way of dealing with big data in large, sparse, and complex high-dimensional samples [30], [32], which makes them very suitable for developing credit scoring models in the online finance area. Some online finance companies, such as Zest Finance and Ant Financial, have already utilized machine learning techniques for credit scoring. As black-box methods, the disadvantages of machine learning techniques are mainly due to their lack of interpretability [30], [33].
Ensemble machine learning models, which pool several base learners' predicted values together and combine them to make the final decision, may generate complementary strengths and have been experimentally and analytically proven to be more accurate and stable than single models [2], [3], [6]- [9], [34], [35]. Despite the performance of ensemble models being superior to many statistical techniques and single machine learning models, most of the existing ensemble models fail to consider the interaction and corporation among base learners [36], [37]. The weights of the base learners and the aggregation operators in those existing models are based on the assumption that the base learners are independent and make separate decisions. However, as noted in some studies [4], [36], [37], base learners, namely, artificial intelligence agents, can be seen as decision makers in a decision group, and Tan [38] stated that agents working in a cooperative manner can significantly outperform those working independently. On the other hand, although both the accuracy and diversity of base learners are critical factors for constructing ensemble models, most existing models consider only one aspect and ignore the tradeoff between them [3].
To fill the abovementioned research gap, this paper proposes a new approach based on the generalized Shapley value and the Choquet integral. The main contributions of this paper include the following. 1) It originally models the interaction of base learners by defining a fuzzy measure [39] on the set of base learners. 2) It considers the importance of both the accuracy and diversity of base learners simultaneously when constructing the ensemble model by building a linear programming model based on the accuracy and diversity objective function. 3) To fully utilize the information generated in the training stage [37], [40], the normal fuzzy number [41] is introduced to represent base learner predicted values. 4) The predicted values of all base learners are aggregated by the GSCI aggregator to obtain the final comprehensive predicted value of the ensemble model, which can globally reflect the interaction of the coalitions of base learners.
The rest of this paper is organized as follows. Section 2 reviews the related work on the ensemble classification methodologies and their applications to credit scoring. Section 3 briefly introduces some basic concepts related to the fuzzy measure and the related aggregation operator. Section 4 first defines the accuracy, diversity and predicted values of base learners, and then constructs a model for aggregating the predicted values of base learners using the GSCI approach when the fuzzy measure weight information is unknown. Section 5 describes the experimental setup. In Section 6, comparisons are made among thirteen machine learning models over seven credit datasets with six evaluation metrics, and conclusions are offered.

II. LITERATURE REVIEW
In recent years, much of the credit scoring literature has focused on ensemble models because of their superior performance. Several advancements have been made in this area, and it has gradually become a research topic of interest [1], [2], [34].
In what follows, we first briefly review some basic concepts related to the ensemble models and then focus on the strengths and shortcomings in the existing ensemble models for credit scoring.

A. ENSEMBLE MODELS
Ensemble models that combine the outputs of base learners together to get the final conclusion have been proven to be more stable and accurate than single models. The rationale for the superiority of the ensemble model mainly lies in that different base learners may view the same pattern differently, thereby complementing the predicted information of each other [42]. Generally, building an ensemble model involves two main steps: the base learner generation and the combination of predicted values [7], [42].
According to the first step, ensemble models can be split into homogeneous and heterogeneous ensembles [35], [42]. Homogeneous ensembles generate base learners using the same type of algorithm but train them using different sample sets or feature sets in order to produce diversity. The common methods include bagging [43], boosting [27] and the Random Subspace Method (RSM) [44]. In bagging, the base learners are generated by training on datasets chosen by bootstrap sampling. In boosting, a sequence of base learners is produced using modified training datasets in which misclassified samples have higher probabilities of being selected in the next round of training, and the correctly classified samples have less chance of being used. Different from bagging and boosting, which generate base learners according to sample disturbances, the RSM constructs different base learners using feature disturbances on the training datasets. On the other hand, heterogeneous ensembles, which employ different algorithms for generating base learners, have often resulted in better performance than homogeneous ensembles [7], [34]. Some researchers believe that the reason is that different algorithms view the same pattern differently, which can enhance the diversity of the base learners and make it easier to produce complementary advantages [7].
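To make the sample-disturbance versus feature-disturbance distinction concrete, the following sketch contrasts a bagging-style bootstrap of the rows with an RSM-style random column subset; the array is an illustrative toy, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features (toy data)

# Bagging-style sample disturbance: bootstrap the rows with replacement
boot_idx = rng.integers(0, len(X), size=len(X))
X_bag = X[boot_idx]

# RSM-style feature disturbance: train on a random subset of the columns
feat_idx = rng.choice(X.shape[1], size=1, replace=False)
X_rsm = X[:, feat_idx]

print(X_bag.shape, X_rsm.shape)  # -> (10, 2) (10, 1)
```

Each base learner would then be trained on its own `X_bag` (bagging/boosting) or `X_rsm` (RSM) view of the training data.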
As the second step, after all the generated base learners provide their predictions, they are pooled together and combined using certain rules or methods. Canuto et al. [45] divided the combination methods into two categories, namely, fusion-based and selection-based methods. Fusion-based methods aggregate the outputs of all the base learners using fixed weights or some functions in order to get the final prediction. Majority Voting (MV) and Weighted Averaging (WA) are the most popular ones due to their simplicity and better results. Stacking [46] is another kind of fusion-based method that combines the outputs of base learners using a meta-learner. Based on the local accuracy of base learners calculated using the nearest neighbor rule [7], selection-based methods aggregate the predictions of base learners by dynamically determining their weights.
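As an illustration of the two most common fusion rules, the sketch below contrasts majority voting with weighted averaging at a 0.5 threshold; the votes and learner weights are hypothetical.

```python
import numpy as np

# Binary votes of 5 base learners (rows) for 4 samples (columns)
votes = np.array([[1, 0, 1, 1],
                  [1, 1, 0, 1],
                  [0, 1, 1, 0],
                  [1, 0, 0, 0],
                  [1, 1, 1, 0]])
weights = np.array([0.3, 0.25, 0.2, 0.15, 0.1])  # hypothetical learner weights

mv = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)  # majority voting
wa = (weights @ votes > 0.5).astype(int)                   # weighted averaging

print(mv.tolist(), wa.tolist())  # -> [1, 1, 1, 0] [1, 1, 1, 1]
```

Note how the last sample flips between the two rules: only a minority of learners vote 'good', but they carry enough weight for WA to accept it.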

B. RELATED WORK
According to existing studies, there are three main ways to improve the predictive accuracy of ensemble models: enhancing base learner accuracy, enhancing base learner diversity, and combining the outcomes of base learners in a proper way [36], [45].
To improve the predictive accuracy, Xia et al. [7] proposed a heterogeneous ensemble model that integrates the bagging algorithm into a stacking method. Wang et al. [31] introduced a tree-based ensemble model using both the bagging and subspace methods. He et al. [6] used a particle swarm optimization algorithm to optimize the parameters of base learners. Abellán & Castellano [1] compared different ensemble models and found that the models choosing the base learners with a high degree of diversity perform well. To generate higher diversity, Xiao et al. [3] stated that the base learners generated by random sampling may not guarantee diversity and presented a supervised clustering approach to ensure the diversity of base learners. Ding et al. [5] proposed a variable weighting clustering approach to improve the diversity of base learners. Sun et al. [47] constructed a decision tree ensemble model with differentiated sampling rates to increase the diversity of base learners. For combination methods, MV, WA and HWA are the most widely used due to their simplicity and good performance [7], [36]. In addition, some other novel combination methods have been developed. Zhang et al. [48] developed an innovative fusion rule to blend the prediction results from two types of machine learning models. He et al. [6] developed a three-stage ensemble model using the stacking method to fuse the predicted values of base learners. Lessmann et al. [33] stated that selection-based combination methods attract much attention due to their promising results. Zhou et al. [49] proposed a correlation-based static selection strategy that aims to choose a group of base learners with the minimum correlations.
Despite recent efforts on improving the performance of the ensemble model, all the studies mentioned above fail to reflect the interaction among base learners in the combination stage. In addition, most of the studies only consider either accuracy or diversity or consider them separately when constructing ensemble models, thus ignoring their counterbalanced relationship in the performance of the model. Moreover, the outputs of base learners are usually in the form of binary numbers, which may lose some original information when training the base learners. To cope with the problems mentioned above, Ala'raj & Abbod [4], [36] stated that the base learners in the ensemble model interacting in a cooperative manner can improve the predictive accuracy compared with those using traditional aggregation operators and presented a new combination approach based on a consensus system [50] to solve the conflicts among base learners. Peng et al. [51] proposed a fusion approach to provide a compatible ranking of classification algorithms when different MCDM techniques yield conflicting results. Yu et al. [30] utilized the triangular fuzzy number to reflect the opinions of base learners and to make full use of all the original information that was extracted from the training datasets. Meng & Chen [52] stated that the fuzzy measure is an effective tool to describe the interactions between elements in a group and has been widely applied in many fields since it was proposed by Sugeno in 1974. Based on the fuzzy measure, Meng et al. further defined a series of aggregation operators that can reflect the interactions between any two elements or attributes [52]- [56].
Inspired by the previous work of Ala'raj & Abbod [4], [36], Peng et al. [51], and Meng & Chen [52], this paper proposes an ensemble model based on the fuzzy measure and the defined GSCI aggregation operator that can globally reflect the interactions between any coalitions of base learners. To consider both the accuracy and diversity of base learners simultaneously, a linear programming model for determining the fuzzy measure is built using an accuracy and diversity objective function. Similar to Yu et al. [30], this paper applies fuzzy numbers to represent the outputs of base learners, but it chooses the normal fuzzy number [41] as a substitute for the triangular fuzzy number, which is considered to be more suitable for describing natural phenomena [57].

III. BASIC CONCEPTS
Before establishing the ensemble model, this section first introduces the fuzzy measure proposed by Sugeno [58] and some aggregation operators with respect to it, to cope with the situation that decision makers are interactive when facing a group decision making problem.
Definition 1 [58]: Let X = {x_1, x_2, . . . , x_n} be a set of base learners, and let P(X) be the power set of X, i.e., the set of all subsets of X. A fuzzy measure µ on the set X is a set function µ : P(X) → [0, 1] satisfying the following conditions:
(1) µ(∅) = 0 and µ(X) = 1;
(2) µ(A) ≤ µ(B) whenever A ⊆ B ⊆ X,
where µ(A) represents the importance of the coalition A of base learners. Thus, in addition to the classical weights of the single base learners taken separately, the weights of any coalition of base learners are also defined. In general, one needs to define the 2^n coefficients corresponding to the 2^n subsets of X.
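A fuzzy measure on a small set of base learners can be stored as a table over the 2^n subsets and checked against the boundary and monotonicity conditions; the two-learner measure below is illustrative, with µ(c1) + µ(c2) < µ({c1, c2}) encoding a positive (superadditive) interaction.

```python
from itertools import chain, combinations

def powerset(xs):
    xs = list(xs)
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

def is_fuzzy_measure(mu, X):
    """Check the boundary and monotonicity conditions of a fuzzy measure."""
    if mu[frozenset()] != 0.0 or mu[frozenset(X)] != 1.0:
        return False
    subsets = [frozenset(s) for s in powerset(X)]
    return all(mu[A] <= mu[B] for A in subsets for B in subsets if A <= B)

# Illustrative measure on two base learners
X = ["c1", "c2"]
mu = {frozenset(): 0.0, frozenset(["c1"]): 0.4,
      frozenset(["c2"]): 0.5, frozenset(["c1", "c2"]): 1.0}
print(is_fuzzy_measure(mu, X))  # -> True
```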
A fuzzy measure is said to be additive if µ(A ∪ B) = µ(A) + µ(B) whenever A ∩ B = ∅, and super additive (resp. sub additive) if µ(A ∪ B) ≥ µ(A) + µ(B) (resp. µ(A ∪ B) ≤ µ(A) + µ(B)) whenever A ∩ B = ∅. In the following, we introduce the aggregation operators with respect to the fuzzy measure that consider the interdependence among base learners.
Definition 2 [39]: Let µ be a fuzzy measure on X. The Choquet integral of a positive function f with respect to µ is defined as

C_µ(f) = \sum_{i=1}^{n} [f(x_{(i)}) - f(x_{(i-1)})] µ(A_{(i)}),  (1)

where (·) indicates that the indices have been permuted such that f(x_{(1)}) ≤ f(x_{(2)}) ≤ . . . ≤ f(x_{(n)}), A_{(i)} = {x_{(i)}, . . . , x_{(n)}}, and f(x_{(0)}) = 0. Note that when the fuzzy measure µ is additive, the Choquet integral degenerates into the OWA operator [59].
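A minimal sketch of the discrete Choquet integral in its permutation-and-coalition form, evaluated on an illustrative two-learner measure:

```python
def choquet(f, mu, X):
    """Discrete Choquet integral of f with respect to fuzzy measure mu.

    f maps each element of X to a value; mu maps frozensets of X to [0, 1]."""
    order = sorted(X, key=lambda x: f[x])   # f(x_(1)) <= ... <= f(x_(n))
    total, prev = 0.0, 0.0
    for i, x in enumerate(order):
        A_i = frozenset(order[i:])          # coalition A_(i) = {x_(i), ..., x_(n)}
        total += (f[x] - prev) * mu[A_i]
        prev = f[x]
    return total

# Illustrative two-learner measure and scores
mu = {frozenset(): 0.0, frozenset(["c1"]): 0.4,
      frozenset(["c2"]): 0.5, frozenset(["c1", "c2"]): 1.0}
f = {"c1": 0.2, "c2": 0.8}
print(round(choquet(f, mu, ["c1", "c2"]), 10))  # -> 0.5
```

Here the result 0.2 · µ({c1, c2}) + 0.6 · µ({c2}) = 0.5 lies below the plain average of 0.2 and 0.8 because the measure discounts c2 acting without c1.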
Meng et al. [53] stated that the Choquet integral only reflects the interactions between the two adjacent coalitions A_{(i)} and A_{(i+1)}. To globally reflect the interactions between the ordered coalitions, we first introduce the following generalized Shapley function [60] derived from the Shapley value in game theory [61].
Definition 3: Let µ be a fuzzy measure on set X. The generalized Shapley function is expressed by

φ^{Sh}_{S}(µ) = \sum_{T ⊆ X \setminus S} \frac{(x - s - t)! \, t!}{(x - s + 1)!} [µ(S ∪ T) - µ(T)], ∀ S ⊆ X,  (2)

where x, s, and t are the cardinalities of the coalitions X, S, and T, respectively. By combining the generalized Shapley function and the Choquet integral, Meng et al. [51] defined an aggregation operator as follows.
Definition 4: Let µ be a fuzzy measure on set X. The Generalized Shapley Choquet Integral (GSCI) of a positive function f with respect to µ is defined as

GSCI_µ(f) = \sum_{i=1}^{n} [f(x_{(i)}) - f(x_{(i-1)})] φ^{Sh}_{A_{(i)}}(µ),  (3)

where the permutation (·), the coalitions A_{(i)}, and f(x_{(0)}) = 0 are as in Definition 2, and φ^{Sh} is the generalized Shapley function given in (2).
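The generalized Shapley function and the GSCI operator can be sketched directly from their sum forms (Definitions 3 and 4); the two-learner measure and scores below are illustrative, not taken from the paper.

```python
from itertools import chain, combinations
from math import factorial

def subsets(xs):
    xs = list(xs)
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

def gen_shapley(S, mu, X):
    """Generalized Shapley value of coalition S w.r.t. fuzzy measure mu on X."""
    S = frozenset(S)
    x, s = len(X), len(S)
    total = 0.0
    for T in map(frozenset, subsets(set(X) - S)):
        t = len(T)
        w = factorial(x - s - t) * factorial(t) / factorial(x - s + 1)
        total += w * (mu[S | T] - mu[T])
    return total

def gsci(f, mu, X):
    """GSCI aggregation: a Choquet-style sum with generalized Shapley weights."""
    order = sorted(X, key=lambda x: f[x])
    total, prev = 0.0, 0.0
    for i, x in enumerate(order):
        total += (f[x] - prev) * gen_shapley(order[i:], mu, X)
        prev = f[x]
    return total

# Illustrative two-learner measure and scores
X = ["c1", "c2"]
mu = {frozenset(): 0.0, frozenset(["c1"]): 0.4,
      frozenset(["c2"]): 0.5, frozenset(["c1", "c2"]): 1.0}
f = {"c1": 0.2, "c2": 0.8}
print(round(gsci(f, mu, X), 10))  # -> 0.53
```

Compared with the plain Choquet integral, each coalition's weight µ(A_(i)) is replaced by its generalized Shapley value, so every subcoalition of X contributes to the weight of every ordered coalition.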

IV. THE GSCI-BASED ENSEMBLE MODEL
This section offers a new GSCI-based ensemble credit scoring model that considers the situation where the base learners in the ensemble model are interactive. Given an ensemble model, without loss of generality, assume that there are n base learners C = {c_1, c_2, . . . , c_n} and m testing samples X = {x_1, x_2, . . . , x_m}, and let Y = {y_1, y_2, . . . , y_m} be the true labels of the testing samples.
To measure the accuracy and diversity of the base learners, we introduce the following measures.
Definition 5: Let p_ij be the predicted value obtained from base learner c_i for the testing sample x_j, where x_j ∈ X. The accuracy of base learner c_i for the testing sample set X is defined as

acc_i = \frac{1}{m} \sum_{j=1}^{m} I(p_{ij} = y_j),  (4)

where I(·) is the indicator function that equals 1 if its argument holds and 0 otherwise.
Definition 6: To measure the difference between any two base learners, the contingency table of two base learners c_i and c_j with the predicted values of the testing samples X is given in Table 1. In Table 1, a represents the number of samples that both c_i and c_j predict as 'good', b the number of samples that c_i predicts as 'good' and c_j predicts as 'bad', c the number of samples that c_i predicts as 'bad' and c_j predicts as 'good', and d the number of samples that both c_i and c_j predict as 'bad'. Note that a + b + c + d = m.
The pairwise diversity between two base learners c_i and c_j is defined as

div_{ij} = \frac{b + c}{a + b + c + d},  (5)

and the diversity of base learner c_i is defined as

div_i = \frac{1}{n - 1} \sum_{k ≠ i} div_{ik}.  (6)

To fully use the original information generated in the training stage, we introduce the normal fuzzy number defined by Yang [41]. Compared to other fuzzy numbers, such as the triangular fuzzy number and the trapezoidal fuzzy number, the normal fuzzy number is more suitable for expressing the opinions of base learners since a large number of natural and social phenomena follow the normal distribution [57].
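Assuming the pairwise diversity is the disagreement rate (b + c)/m from the Table 1 counts, the accuracy and diversity measures can be sketched as follows; the labels and predictions are illustrative.

```python
import numpy as np

def accuracy(pred, y):
    """Fraction of testing samples predicted correctly (acc_i)."""
    return float(np.mean(np.asarray(pred) == np.asarray(y)))

def pairwise_diversity(pred_i, pred_j):
    """Disagreement rate (b + c) / m from the Table 1 counts (assumed form)."""
    return float(np.mean(np.asarray(pred_i) != np.asarray(pred_j)))

y      = np.array([1, 1, 0, 0, 1, 0])   # illustrative true labels
pred_1 = np.array([1, 1, 0, 1, 1, 0])   # predictions of base learner c_1
pred_2 = np.array([1, 0, 0, 0, 1, 1])   # predictions of base learner c_2

print(round(accuracy(pred_1, y), 4))        # -> 0.8333
print(pairwise_diversity(pred_1, pred_2))   # -> 0.5
```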
Definition 7 [41]: In the real number field, Ã = (µ, σ^2) is a normal fuzzy number when its membership function is expressed as

Ã(x) = e^{-\frac{(x - µ)^2}{σ^2}}.  (7)

Definition 8 [62]: The mean of the normal fuzzy number Ã = (µ, σ^2) is defined as

E(Ã) = µ.  (8)

Definition 9 [62]: The variance of the normal fuzzy number Ã = (µ, σ^2) is defined as

D(Ã) = \frac{σ^2}{2}.  (9)

Definition 10: Let p^1_{ij}, p^2_{ij}, . . . , p^k_{ij} be the k predictions of base learner c_i for the testing sample x_j, and let E_{ij} and D_{ij} be the mean and variance of the k predictions, respectively. According to Definitions 7, 8, and 9, the normal fuzzy number of base learner c_i for the testing sample x_j can be defined as

P̃_{ij} = (µ_{ij}, σ^2_{ij}),  (10)

where µ_{ij} = E_{ij} and σ^2_{ij} = 2D_{ij}. The predicted value of base learner c_i for the testing sample x_j is defined as

s(P̃_{ij}) = µ_{ij} + a σ_{ij},  (11)

where a represents the base learner's attitude toward risk. If a > 0, then the base learner is a risk-seeking decision maker. If a = 0, the base learner is a risk-neutral decision maker. If a < 0, then the base learner is a risk-averse decision maker.
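A sketch of Definition 10, assuming the risk-adjusted score takes the form µ_ij + a·σ_ij (this form and the example predictions are illustrative assumptions, not reproduced from the paper):

```python
import statistics

def predicted_value(preds, a=0.0):
    """Score of the normal fuzzy number built from k predictions.

    mu = mean of the k predictions, sigma^2 = 2 * their variance; the
    risk-attitude score mu + a * sigma is an assumed illustrative form."""
    mu = statistics.mean(preds)
    sigma = (2 * statistics.pvariance(preds)) ** 0.5
    return mu + a * sigma

preds = [0.7, 0.8, 0.75, 0.85, 0.9]   # hypothetical k = 5 predictions
print(predicted_value(preds, a=0.0))  # risk neutral -> the mean, 0.8
```

A risk-seeking attitude (a > 0) pushes the score above the mean, and a risk-averse one (a < 0) pulls it below, in line with the interpretation of a above.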

A. A MODEL FOR THE OPTIMAL FUZZY MEASURE ON BASE LEARNERS
To reflect the interaction among base learners, we introduce the fuzzy measure on C = {c 1 , c 2 , . . . ,c n } to evaluate the importance of any coalition of base learners. The optimal fuzzy measure is derived by establishing the following linear programming model. Because accuracy and diversity are both critical factors for the performance of the ensemble model, we apply the Choquet integral of the accuracy and diversity with respect to the fuzzy measure as the objective function.
In model (12), w_i is the weight range of base learner c_i, obtained as the weighted average of the accuracy acc_i and the diversity div_i of base learner c_i.

B. AN APPROACH FOR THE COMPREHENSIVE PREDICTED VALUE BASED ON THE GSCI OPERATOR
After deriving the optimal fuzzy measure on the set of base learners C, we apply the GSCI aggregation operator to aggregate the opinions of the base learners and calculate the comprehensive predicted value of the ensemble model, which globally reflects the interaction among the base learners. The comprehensive predicted value for the testing sample x_j is calculated by

p_j = \sum_{i=1}^{n} [s(P̃_{(i)j}) - s(P̃_{(i-1)j})] φ^{Sh}_{A_{(i)}}(µ),  (16)

where φ^{Sh}_{A_{(i)}}(µ) is the generalized Shapley function for the fuzzy measure µ on the set C; s(P̃_{(i)j}) is the predicted value of base learner c_{(i)} for testing sample x_j, with j = 1, 2, . . . , m; (·) is a permutation of s(P̃_{ij}), i = 1, 2, . . . , n, such that s(P̃_{(1)j}) ≤ s(P̃_{(2)j}) ≤ . . . ≤ s(P̃_{(n)j}); A_{(i)} = {c_{(i)}, . . . , c_{(n)}}; and s(P̃_{(0)j}) = 0.
From (3), we know that the Shapley value φ^{Sh} is actually the expected marginal contribution of base learner c_{(i)} over all coalitions; it therefore also reflects the weight of base learner c_{(i)}. The Shapley value not only considers the importance of the base learners but also reflects the influence of the other base learners on them. It is worth pointing out that if there is no interaction among the base learners, then the Shapley values are equal to their own importance; the Shapley values can thus be seen as an extension of additive weights. Consequently, the final prediction of the GSCI is obtained by combining the base learners' predicted values weighted by their Shapley values.

C. ALGORITHM OF THE GSCI-BASED ENSEMBLE MODEL
Based on the optimal fuzzy measure on set C and the GSCI aggregation operator, we propose the algorithm for the new ensemble model as follows.
Step 1: When an input dataset is received, randomly take 80% of the dataset to be the training set and 20% to be the testing set.
Step 2: Randomly take 90% of the training set as the sub-training set and 10% as the sub-testing set.
(i) Train all the base learners c_i (i = 1, 2, . . . , n) using the sub-training set.
(ii) Calculate the prediction p_ij (i = 1, 2, . . . , n and j = 1, 2, . . . , m_s) of each base learner on the sub-testing set, where n is the number of base learners and m_s is the number of testing samples in the sub-testing set.
(iii) Use (4) and (6) to calculate the accuracy acc_i and the diversity div_i, respectively, of each base learner on the sub-testing set. Repeat this step k times to calculate the average accuracy acc_i and diversity div_i of each base learner.
(iv) Obtain the k predictions of base learner c_i for the testing set, namely, p^1_{ij}, p^2_{ij}, . . . , p^k_{ij}, where i = 1, 2, . . . , n and j = m_s+1, m_s+2, . . . , m.
Step 3: Based on the average accuracy and diversity of each base learner calculated in step 2, use (12) to derive the optimal fuzzy measure on set C.
Step 4: Use (11) and the k predictions of each base learner calculated in Step 2 to derive the predicted value s(p ij ) for base learner c i for the testing sample x j , where i=1, 2, . . . , n and j = m s+1 , m s+2 , . . . , m.
Step 5: Use (16) to calculate the comprehensive value p_j for sample x_j in the testing set, where j = m_s+1, m_s+2, . . . , m. The final predicted value of the ensemble for the testing sample x_j is 'good' if p_j ≥ θ and 'bad' otherwise, where θ is the threshold in the interval (0, 1).
The process of the above algorithm is shown in Fig. 1.

V. EXPERIMENTAL SETUP
A. CREDIT DATASET AND DATA PREPROCESSING
In this experiment, seven datasets are used to test the proposed model and validate its performance on datasets with different imbalance ratios. Four representative datasets, namely, the Australian, Japanese, German, and Default of Credit Card Clients datasets (PublicData), were obtained from the UCI machine learning repository [63].
In addition to the experiments over the four public datasets, we further examine the robustness of the GSCI ensemble model over three real-world P2P lending datasets to ensure its suitability for real-world problems. The RRDai Data is from a Chinese internet finance enterprise named RenRenDai and contains loan data for 2017. These data are publicly available from https://www.renrendai.com and were collected with a web crawler. The ProsperLoan Data is from Prosper, America's first marketplace lending platform, and contains loan data from July 2009 to March 2014; it can be downloaded from Amazon's AWS Datasets. The LendingClubLoan Data is from a P2P lending company named LendingClub that matches borrowers with investors online. This dataset is published on Kaggle (https://www.kaggle.com/wendykan/lending-clubloan-data) and contains complete loan data from 2007 to 2015, including the current loan status and latest payment information.
Data pre-processing is a crucial step to prepare the data for training before constructing the ensemble model. First, features whose correlations are over 98% or whose missing rates are over 50% are removed. For the retained features, missing values are filled according to the feature type: missing values of numeric attributes are filled with the mean, and those of nominal attributes are filled with the string ''unknown''. Numeric features are standardized by removing the mean and scaling to unit variance. Moreover, with the one-attribute-per-value approach, a nominal attribute with N values is transformed into N binary attributes [64]. Finally, all the values in the datasets are normalized using the following formula. Given a value x of any attribute, the normalization is performed by

x' = \frac{x - min(x)}{max(x) - min(x)},

where min(x) is the minimum value of the attribute and max(x) is the maximum value of the attribute.
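The numeric preprocessing steps (mean imputation and min-max normalization) can be sketched as follows; the sample values are illustrative.

```python
import numpy as np

def min_max(x):
    """Scale an attribute to [0, 1] using (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def fill_numeric(x):
    """Impute missing (NaN) numeric values with the attribute mean."""
    x = np.asarray(x, dtype=float)
    return np.where(np.isnan(x), np.nanmean(x), x)

print(min_max([10, 15, 20]).tolist())             # -> [0.0, 0.5, 1.0]
print(fill_numeric([1.0, np.nan, 3.0]).tolist())  # -> [1.0, 2.0, 3.0]
```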
Since most samples in the RRDai Data are outstanding loans, we judge whether a user defaults according to whether the monthly repayment amount has been paid off. The RRDai Data contains the related attributes, including the loan amount P, annualized yield rate apr, repayment period T, remaining repayment period T_r, and actual outstanding amount L_u. Besides, there are two ways of repayment, the average capital plus interest method and the debt servicing method, which we calculate as follows.
For the average capital plus interest method, the monthly repayment of the loan is

L_e = P \cdot \frac{(apr/12)(1 + apr/12)^T}{(1 + apr/12)^T - 1},

and the theoretical outstanding amount is L_c = L_e \cdot T_r. For the debt servicing method, the gross interest is

R_t = P \cdot apr \cdot \frac{T}{12},

and the theoretical outstanding amount is L_c = P + R_t \cdot (T_r / T).
If the current actual outstanding amount L u in the sample is higher than the theoretical outstanding amount L c , L c < L u , it will be deemed as a default. Otherwise, it is considered non-default.
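Under the assumption that the annuity (average capital plus interest) method uses the standard monthly-payment formula and the debt servicing method charges simple interest over a term measured in months, the default rule can be sketched as:

```python
def theoretical_outstanding(P, apr, T, T_r, method):
    """Theoretical outstanding amount L_c (assumed forms; T, T_r in months)."""
    if method == "annuity":              # average capital plus interest
        r = apr / 12                     # monthly rate
        L_e = P * r * (1 + r) ** T / ((1 + r) ** T - 1)  # monthly repayment
        return L_e * T_r
    else:                                # debt servicing method
        R_t = P * apr * T / 12           # gross interest over the whole term
        return P + R_t * (T_r / T)

def is_default(L_u, L_c):
    """Deemed a default when the actual outstanding amount exceeds L_c."""
    return L_u > L_c

# Illustrative loan: 1200 at 12% APR over 12 months, nothing repaid yet
print(theoretical_outstanding(1200, 0.12, 12, 12, "debt"))  # -> 1344.0
```

For the same illustrative loan, the annuity method yields a slightly lower theoretical outstanding amount, since part of each payment retires principal.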
In the ProsperLoan Data, we reclassified the original 12 categories of loan status into 2 categories as the label: ''Completed'' and ''Current'' were viewed as positives, and the rest as negatives. In the same way, we divided the LendingClubLoan Data into two categories: ''Fully Paid'' and ''Current'' were viewed as positives, and the rest as negatives.
Input variables of these datasets were categorized into the following subsets [2]: (i) applicant assessment (grade, sub grade, etc.); (ii) loan characteristics (loan purpose and amount); (iii) applicant characteristics (annual income, housing situation, etc.); (iv) credit history (credit history length, delinquency, etc.); and (v) applicant indebtedness (loan amount to annual income, annual instalment to income, etc.). A summary of all the datasets is given in Table 2.

B. BASE LEARNERS
The selection of the base learners of an ensemble has a strong effect on the ensemble system and is a critical step in developing it. Theoretically, any classifier, whether single or ensemble, can be embedded as a base learner when building an ensemble model. In recent studies, some novel heterogeneous ensemble methods have employed ensembles as base learners to seek a good balance between performance and efficiency [4], [7], [33], [36]. We employ five state-of-the-art homogeneous ensembles as base learners both to guarantee the performance and generalization of our proposed GSCI model and to control the ensemble size. Compared with individual learning models, such as Decision Trees (DT), Support Vector Machines (SVM), Neural Networks (NN) and naïve Bayes (NB), the five selected base learners of the GSCI model, namely, Bagging (Bag), Random Forest (RF), AdaBoost (Ada), Gradient Boosting (GB), and Extra-trees (ET), have higher ACC scores and lower ACC standard deviations. The ACC scores are calculated by the 5-fold cross validation method on the training sets of the seven credit datasets.

1) BAGGING [43]
The Bagging (Bag) classifier is a parallel ensemble model that fits the base learners on random subsets of the original dataset and then combines their predicted values by voting to form a final prediction. It can reduce the variance of the base learners by introducing randomization into its construction procedure. Here, we use decision trees as the base learners of the ensemble model.
2) RANDOM FOREST [28]
The random forest (RF) classifier is an extension of bagging that fits a number of decision trees using various subsamples of the dataset and uses averaging to improve the predictive accuracy and control overfitting. On the basis of bagging, it randomly selects features during the base learner training stage to enhance the stability of the model.

3) EXTRA-TREES [65]
The extra-trees (ET) classifier is an ensemble model with the same structure as the random forest, but the base learners are replaced by randomized decision trees that use random splits to separate the samples of a node into two groups and fit the randomized decision trees using samples drawn from the entire training set. It can further enhance the diversity of the base learners.

4) ADABOOST [27]
The AdaBoost (Ada) classifier is a serial ensemble model that first fits a base learner using the original dataset and then fits additional copies of the base learners using the same dataset, but where the weights of the incorrectly classified samples are adjusted such that the subsequent base learners focus more on the difficult cases. In this way, it can reduce the deviation of base learners. Here, we still use the decision trees as the base learners of the ensemble model.

5) GRADIENT BOOSTING [66]
Similar to Ada and other boosting methods, the Gradient Boosting (GB) classifier builds an additive model in a forward stage-wise fashion: it fits the first base learner using the original dataset and then fits additional copies of the base learners by allowing for the optimization of an arbitrary differentiable loss function.

C. EVALUATION METRICS AND BENCHMARKS
To get a reliable and robust conclusion, we employ six popular evaluation metrics to measure the predictive performance of the machine learning models, namely, the accuracy (ACC), F1 measure, area under curve (AUC), Brier score (BS), sensitivity (SE) and specificity (SP) [6], [31], [37], [67]. The ACC assesses the correctness of the binary classifications and is defined as the proportion of correctly classified testing samples in the testing dataset. The F1 measure is a comprehensive evaluation metric based on the precision and recall; its optimal value is 1 and its worst value is 0. The AUC performs a global assessment by computing the area under the Receiver Operating Characteristic (ROC) curve according to the predicted scores, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR). The BS measures the accuracy of the probability predictions, and a lower Brier score reflects better predictions. Finally, SE and SP measure the accuracy on the observed good and observed bad testing samples, respectively.
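The threshold-based metrics (ACC, SE, SP) and the Brier score can be sketched from a vector of predicted scores; the labels, scores, and threshold below are illustrative.

```python
import numpy as np

def scoring_metrics(y, score, threshold=0.5):
    """ACC, sensitivity, specificity, and Brier score from predicted scores."""
    y = np.asarray(y)
    score = np.asarray(score, dtype=float)
    pred = (score >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    tn = np.sum((pred == 0) & (y == 0))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    acc = float((tp + tn) / len(y))
    se = float(tp / (tp + fn))              # accuracy on observed good samples
    sp = float(tn / (tn + fp))              # accuracy on observed bad samples
    bs = float(np.mean((score - y) ** 2))   # Brier score: lower is better
    return acc, se, sp, bs

y = [1, 1, 0, 0]                    # illustrative true labels
score = [0.9, 0.4, 0.2, 0.6]        # illustrative predicted scores
acc, se, sp, bs = scoring_metrics(y, score)
print(acc, se, sp, round(bs, 4))    # -> 0.5 0.5 0.5 0.1925
```

The AUC and F1 are omitted here for brevity; they follow from the same confusion-matrix counts and score rankings.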
To test the performance of the proposed GSCI ensemble model, a comparison is carried out between the GSCI ensemble model and twelve other classifiers, including four common single classifiers (DT, SVM, NN, and NB), the five homogeneous ensemble base learners (Bag, RF, Ada, GB, and ET), and three other heterogeneous ensemble classifiers that use the same base learners as the GSCI ensemble model but take different combination methods, namely, MV, WA, and HWA.

D. STATISTICAL TESTS OF SIGNIFICANCE
For a complete performance evaluation, it is usually appropriate to implement hypothesis testing to verify that the experimental differences in performance are statistically significant. In this study, we use the Friedman test [1] to compare the ranking performance of all thirteen classifiers over the seven datasets. The Friedman test is a nonparametric test that ranks the classifiers for each dataset separately: the best-performing classifier is ranked first, the second best is ranked second, and so on. The Friedman statistic is distributed according to the Chi-square distribution with k-1 degrees of freedom and is defined as

\chi_F^2 = \frac{12n}{k(k+1)}\left[\sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4}\right], \quad R_j = \frac{1}{n}\sum_{i=1}^{n} r_{ij}, \tag{21}

where k is the number of machine learning models to be compared, n is the number of datasets, and r_{ij} is the ranking of model j on dataset i. Under the null hypothesis of the Friedman test, all classifiers in the group perform identically, and all differences are only random fluctuations. If the null hypothesis is rejected, a post hoc test can be applied in order to find the particular pairwise comparisons that produce significant differences. The Nemenyi test [7] is one of the most common choices. In this test, the performances of two classifiers are significantly different if their average ranks differ by at least the Critical Difference (CD)

CD = q_\alpha \sqrt{\frac{k(k+1)}{6n}}, \tag{22}

where q_\alpha is the critical value of the Tukey (Studentized range) distribution, k is the number of machine learning models to be compared, and n is the number of datasets.
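The two formulas can be checked numerically as below; the rank matrix is a small hypothetical example (the study itself uses n = 7 datasets and k = 13 models), and the q_alpha value is taken from Demšar's published Nemenyi table for k = 3:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical ranks: n datasets (rows) by k models (columns).
ranks = np.array([[1.0, 2.0, 3.0],
                  [1.0, 3.0, 2.0],
                  [2.0, 1.0, 3.0],
                  [1.0, 2.0, 3.0]])
n, k = ranks.shape

# Friedman statistic (21): chi2_F = 12n/(k(k+1)) * [sum_j R_j^2 - k(k+1)^2/4],
# with R_j the average rank of model j over the n datasets.
mean_ranks = ranks.mean(axis=0)
chi2_f = 12.0 * n / (k * (k + 1)) * (np.sum(mean_ranks ** 2)
                                     - k * (k + 1) ** 2 / 4.0)
p_value = chi2.sf(chi2_f, df=k - 1)  # chi-square with k-1 degrees of freedom

# Nemenyi critical difference (22): CD = q_alpha * sqrt(k(k+1)/(6n)).
q_alpha = 2.343  # Studentized-range-based critical value for k = 3, alpha = 0.05
cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))
print(chi2_f, p_value, cd)
```

Two models are then declared significantly different whenever their average ranks differ by more than the computed CD.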

A. CLASSIFICATION RESULTS
Based on the proposed GSCI-based ensemble model and the previous experimental setup, we compare the predictive performance of the GSCI-based ensemble model against the twelve other machine learning models; the final computational results are shown below. Starting with the dataset from Australia, the GSCI exhibits the best performance on all the evaluation measures except SP. Regarding the ACC, the GSCI achieves 91.16%, beating all other machine learning models and improving on the best single classifier (NN) and the best ensemble classifier (Bag) by 0.22% and 1.20%, respectively. The ACC of the single classifier NN ranks second at 90.94%, which is superior to all other single classifiers, all five homogeneous ensemble learners, and all heterogeneous ensemble classifiers except the GSCI ensemble. The ACC of the bagging ensemble classifier is third at 89.86% and surpasses the three ensemble classifiers using the traditional combination methods of MV, WA, and HWA. When considering the other measures, F1, AUC, Brier score, SE and SP, the study yields similar results as with the ACC.
To show the above results more intuitively, Fig. 2 is offered as follows: The results for the dataset from Germany are consistent with the findings for the dataset from Australia. As shown in Table 4 for Germany, the GSCI again obtains the best results for the ACC, F1, AUC and BS measures, reaching 77.75%, 58.22%, 70.42% and 22.25%, respectively, and ranks second for SE and SP, reaching 67.12% and 74.87%, respectively. Similarly, the second-best classifier is NN, with measures reaching 64.75% and 74.25%, respectively. The GB is third, and its six measures are somewhat lower than those of NN. In both the datasets from Australia and Germany, none of the five homogeneous ensemble classifiers overtook the best single classifier, and all three heterogeneous ensembles with the traditional combination methods perform worse than the best ensemble models. This shows that ensemble classifiers may not always be superior to single classifiers, and that ensemble classifiers with traditional combination methods may fail to improve performance.
To show the associated results more intuitively, Fig. 3 is offered as follows: When observing the dataset from Japan in Table 5, the WA ranks at the top on the average of the six evaluation measures, reaching 89.64%, 88.26%, 89.83%, 10.36%, 87.8% and 86.92%, respectively. The proposed GSCI is second, and its results are slightly lower than those of the WA except for the SE measure, which is higher. Unlike the datasets from Australia and Germany, the HWA heterogeneous ensemble classifier is third, with ACC, F1, AUC, BS, SE and SP measures of 88.91%, 87.32%, 88.96%, 11.09%, 88.14% and 86.12%, respectively. On this dataset, all four heterogeneous ensemble classifiers outperform all the single and homogeneous ensemble classifiers, which reflects the effectiveness of combining the homogeneous ensemble classifiers to obtain a comprehensive predicted value.
To show the associated results more intuitively, Fig. 4 is offered as follows: For the dataset from Taiwan China in Table 6, the GSCI again obtains the highest ACC, F1, AUC, BS and SE measures, reaching 82.72%, 47.29%, 65.98%, 17.28% and 60.75%, respectively, and ranks third for SP with 79.38%. The second and third places by the ACC measure go to MV and HWA, reaching 82.65% and 82.58%, respectively. The results for this dataset again show the superiority of the heterogeneous ensemble classifiers, especially the HWA and GSCI ensembles, which consider the importance of both accuracy and diversity in the combination process.
To show the associated results more intuitively, Fig. 5 is offered as follows: The results for the RRDai Data are shown in Table 7. The GSCI ensemble model obtains the best ACC and BS scores, at 93.35% and 6.65%, respectively. The ET ranks first for the F1 at 73.69%, the DT ranks first for the AUC at 84.91%, the NB ranks first for the SE at 72.1%, and the RF ranks first for the SP at 76.47%. Although the GSCI fails to rank first on these measures, it is not far from the best scores. Moreover, the overall performance of the GSCI across all six evaluation measures is the best, which is consistent with the previous conclusion that the GSCI performs well in general. The results for the RRDai Data show that the ensemble classifiers can also perform well on real-world datasets, where they perform better than most single classifiers.
To show the associated results more intuitively, Fig. 6 is offered as follows: Table 8 shows the performance of the 13 models on the ProsperLoan Data, where the GSCI obtains the highest ACC, BS and SE scores, reaching 89.97%, 10.03% and 95%, respectively. The WA and the Bag rank first for the F1 measure at 94.71%, the DT ranks first for the AUC at 56.83%, and the ET ranks first for the SP at 84.65%. In general, the ProsperLoan Data and RRDai Data lead to similar conclusions. On the other hand, the results show that the rankings of the various classifiers on the ProsperLoan Data are relatively dispersed, and no classifier achieves consistently good performance across the various evaluation measures. Nevertheless, the GSCI ensemble model still ranks first overall.
To show the associated results more intuitively, Fig. 7 is offered as follows: The results (in Table 9) for the LendingClubLoan Data demonstrate the robustness of the GSCI model by evaluating it on a complex dataset with multiplex variables. The GSCI achieves the highest AUC score, at 93.78%, and shows strong performance on the other five evaluation measures.
To show the associated results more intuitively, Fig. 8 is offered as follows: According to the comparisons of the thirteen classifiers over seven datasets with six evaluation metrics, we can draw the following conclusions.
(1) The proposed GSCI ensemble model performs well in general, which reflects the effectiveness of the GSCI approach. The experimental results of the GSCI ensemble model on all six evaluation measures are the best for the datasets from Australia, Germany, Taiwan China, RRDai and ProsperLoan, and the second best for the datasets from Japan and LendingClubLoan.
(2) The performances of individual classifiers are unstable across different datasets. Taking the RF as an example, it underperformed the NN on the datasets from Australia, Germany, Japan, ProsperLoan and LendingClubLoan, and beat it only on the datasets from Taiwan China and RRDai. The comprehensive performance of the RF is third on the dataset from LendingClubLoan but third from the bottom on the dataset from Australia.
(3) The overall performances of the GSCI and HWA ensemble classifiers are relatively good, which reflects to some extent that it is effective to consider the accuracy and diversity of the base learners simultaneously when combining them. Among all nine ensemble classifiers, only the GSCI and HWA ensemble classifiers take both the accuracy and diversity of the base learners into consideration in the combination stage.

B. SIGNIFICANCE TESTS
To determine whether there are statistically significant differences among the thirteen machine learning models, we employ the Friedman ranking test. According to (21), the Friedman test statistic is 23.01, which is greater than the critical value at the 95% confidence level, so the null hypothesis that there is no difference between the classifiers is rejected. Table 10 shows the Friedman rank test results on the averages of the six evaluation metrics.
The proposed GSCI ensemble model performs the best among the thirteen models, with an average rank of 2.17. It is clearly better than the second-ranked ensemble, WA, with an average rank of 5.31. This result again demonstrates the effectiveness and robustness of the proposed GSCI ensemble model compared to the other single and ensemble models. Subsequently, since the null hypothesis of the Friedman test is rejected, the post hoc Nemenyi test is performed to discover significant pairwise comparisons. With reference to (22), the critical difference (CD) value at the 10% significance level is 6.078. Fig. 9 shows the CD diagram with the results of the Nemenyi test. The horizontal axis represents the average ranking of each machine learning model over the datasets, and models whose difference in average ranks is lower than the CD value are connected by a black bar. From this Nemenyi diagram, we can see clearly that ten methods form a top group that is not significantly different: GSCI, WA, MV, NN, GB, HWA, RF, Ada, Bag and SVM. The proposed GSCI ensemble model is significantly better than ET, DT and NB at the 10% significance level.

C. ROBUSTNESS TESTS
In order to demonstrate the robustness of the proposed GSCI ensemble model, we compare it with the other single and ensemble models under multiple working conditions by changing the sample size or the variables of the dataset.

1) SAMPLE SIZE TESTS
To test the robustness of the GSCI model on datasets with different sample sizes, experiments are conducted on the ProsperLoan Data over different time periods. We divide the original dataset into five time periods of different lengths, as shown in Table 11. The sample size and the ratio of positive to negative samples differ significantly across the time periods. The experiment again adopts the thirteen models employed above and tests them on the datasets of the different time periods. Table 12 shows the ACC scores and ranks of all thirteen models in the sample size tests. It shows that, in general, the ACC results of all the models are correlated with the sample size and the proportion of positive and negative samples: as the sample size increases, the models achieve better performance, and the models also perform better when the negative sample ratio is relatively high. It also shows that the GSCI model's ACC rank stays at fourth among all the models when the sample size and the proportion of positive and negative samples are changed, which verifies the robustness of the GSCI for different sample sizes and sample ratios.

2) VARIABLES TESTS
To test the robustness of the GSCI model on data sets with different numbers of variables, we run experiments on the ProsperLoan Data with different variables.
The ProsperLoan Data has a total of 43 variables, from which the time variable ListingCreationDate (the loan creation date) and the target variable LoanStatus (whether the loan is expired or not) are excluded. We first analyze the correlation between the remaining 41 variables and the target variable and sort them accordingly. The results are shown in Fig. 10.
In order to eliminate the interference of variable correlation as much as possible, variable sampling is carried out with different fixed step sizes M and offsets F. The step size M means that a variable is sampled each time the sampling cursor moves M positions, and the offset F is the starting position of the sampling cursor.
Then, the sampling set of variables is

S(M, F) = { n | n mod M = F, n = 0, . . . , 40 },

where n is the index of the variables.
Notice that F has to be less than M.
In the experiment, we take five step sizes, M = {1, 2, 3, 4, 5}. For each step size, the offset F takes every value from 0 to M - 1, so we have 15 different sampling schemes in total, such as (M = 1, F = 0), (M = 2, F = 0), (M = 2, F = 1), and so on. The results of the variables tests, shown in Table 13, demonstrate the robustness of the GSCI model on datasets with different numbers of variables in terms of the ACC scores. The proposed GSCI ensemble model performs the best among the thirteen models in most experiments, and the fluctuation of its ACC results is relatively small. Therefore, the GSCI ensemble model shows strong robustness to different variables.
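The step/offset variable sampling described above can be sketched as follows; the function name `sample_variables` is our own illustrative choice:

```python
# With step size M and offset F (F < M), keep every variable whose index n
# satisfies n mod M == F, for n = 0, ..., 40.
def sample_variables(num_vars, step, offset):
    assert offset < step, "offset F must be less than step size M"
    return [n for n in range(num_vars) if n % step == offset]

# Enumerate all (M, F) pairs for M in {1, ..., 5}: 1 + 2 + 3 + 4 + 5 = 15 schemes.
schemes = [(m, f) for m in range(1, 6) for f in range(m)]

print(len(schemes))
print(sample_variables(41, 3, 1))  # indices 1, 4, 7, ..., 40
```

Note that (M = 1, F = 0) keeps all 41 variables, while larger step sizes keep progressively smaller, non-overlapping subsets, which is what makes the scheme useful for probing robustness to the variable set.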
To show the associated results more intuitively, Fig. 11 is offered. As shown in Fig. 11, the vertical bar represents the span between the best and worst ACC scores of each classifier in the experiments with different variables, while the box represents an interval that contains most of that classifier's scores. The shorter the vertical bar and the higher the box, the better the classifier's adaptability and performance. Fig. 11 shows that the GSCI ensemble model achieves good results in both performance and adaptability on datasets with different variables. This result again demonstrates the effectiveness and robustness of the proposed GSCI ensemble model compared to the other single and ensemble models.

VII. COMPARATIVE ANALYSIS
This section presents a comprehensive comparison with results obtained from prior works on three public credit scoring datasets. Table 14 shows the ACC scores of the proposed GSCI model and other ensemble models employed in prior works. This study focuses only on the combination step for constructing the ensemble model and involves no feature selection step; equally important, however, feature selection is another key issue for improving the performance of machine learning models. It can be seen from Table 14 that, within the ensemble group, the proposed GSCI model obtains the highest ACC score for the Japan dataset, the second highest for the Australia dataset, and a medium-level score for the German dataset. It can be concluded that the GSCI model performs well in general within the group of ensemble models. Furthermore, the ACC scores of the GSCI model for the three datasets are only at the medium level within the group of ensembles combined with feature selection. This indicates that adding a feature selection process when building the ensemble model may further increase performance, which is one of the future research directions.

VIII. CONCLUSION
In recent years, with the development of machine learning techniques for credit scoring, ensemble models have received much attention due to their higher performance. However, Ala'raj & Abbod [2], [3] showed that ensemble approaches using traditional combination methods may perform worse than the best base classifiers. Lessmann et al. [26] also found that progress in developing advanced methods might have stalled. In this paper, we propose a new ensemble model based on the GSCI approach in order to further improve the predictive power of the classifier. To achieve this, the model first introduces the fuzzy measure to address the fact that the base learners of the ensemble model may interact. Then, in order to consider the accuracy and diversity of the base learners simultaneously, a linear programming model for determining the fuzzy measure is built with an accuracy- and diversity-based objective function. Next, to retain the original information as much as possible in the training stage, the normal fuzzy number is employed to express the base learner predicted values. Finally, to obtain the comprehensive predicted value of the ensemble model, the GSCI aggregation operator is defined to combine the opinions of the base learners.
The experimental results over the four public datasets and three real-world P2P datasets demonstrate the superiority and stability of the GSCI ensemble model, which significantly outperforms the single models and the traditional ensemble models. This work can be extended in several directions. First, based on the fuzzy measure framework for combining the base learners of an ensemble model, we can further explore the factors that influence the performance of the GSCI ensemble model, such as the type and number of base learners, the aggregation operator for the fuzzy measure, and the objective function of the linear programming model for determining the optimal fuzzy measure.
Second, we can further validate the effectiveness and generalizability of the proposed GSCI approach using more datasets and evaluation metrics.
Third, in order to further improve the predictive ability, researchers can integrate feature extraction and feature selection methods into the GSCI ensemble model.

WENZHI CAO received the Ph.D. degree from the School of Computer Science and Technology, Huazhong University of Science and Technology, China, in 2013. He is currently doing his postdoctoral work at the School of Business, Central South University, China. He is currently an Associate Professor with the Hunan University of Commerce. He has contributed over 20 journal articles to professional journals. His research interests include virtualization, cloud computing, big data analysis, and artificial intelligence.