A Novel Multi-Stage Ensemble Model With a Hybrid Genetic Algorithm for Credit Scoring on Imbalanced Data

Credit scoring models are the cornerstone of the modern financial industry. After years of development, artificial intelligence and machine learning have led to the transformation of traditional credit scoring models based on statistics. In this study, a novel multi-stage ensemble model with a hybrid genetic algorithm is proposed to achieve accurate and stable credit prediction. To alleviate the adverse effects of imbalanced data in credit scoring models, the Instance Hardness Threshold method is extended using a majority voting strategy to deal with data imbalance. To eliminate redundant and irrelevant features in the dataset and select well-performing base classifiers, a new hybrid genetic algorithm is proposed to obtain the optimal feature subset and base classifier subset. To aggregate the predictive power of the base classifiers, a stacking approach is adopted to integrate the optimal base classifiers into the ensemble model. The proposed model is tested on three standard imbalanced credit scoring datasets, compared with similar state-of-the-art approaches, and evaluated using four well-known evaluation indicators. The experimental results prove the effectiveness of the proposed model and demonstrate its superiority.


NOMENCLATURE
Clf_i	The ith classifier.
IHP-Clf_i	The IHP identified by Clf_i.
n	The number of classifiers.
G	The maximum number of generations.
Q	The number of individuals.
Mr_g	The mutation rate in the gth generation.

I. INTRODUCTION
The ability to accurately assess the creditworthiness of customers who apply for loans, and to perform the corresponding risk management, is key to the development of the modern financial industry. With economic development, traditional statistics-based credit scoring models have been gradually overwhelmed by exponentially growing credit big data and have lost their effectiveness. An intuitive example is that traditional statistics-based credit scoring models require assumptions regarding the statistical distribution of the data, which are often not applicable to big data with complex distributions [54]. Recently, advancements in artificial intelligence, such as ensemble learning-based methods [32], evolutionary algorithm-based methods [49], and clustering technique-based methods [28], have been applied in the credit scoring field. In our previous study, artificial intelligence and machine learning technologies outperformed statistical approaches in constructing a credit scoring model [60]. Credit scoring data are usually imbalanced, which means that the numbers of positive and negative samples in the data are inconsistent. In imbalanced credit scoring data, positive samples refer to defaulting customers, and negative samples refer to non-defaulting customers. Generally, the number of negative samples is much larger than the number of positive samples. The rationale behind this phenomenon is that, in most real-world cases, the number of customers who pay their bills on time is much larger than the number of customers who default. However, both statistics-based and machine learning-based credit scoring models find it challenging to make accurate predictions when imbalanced data are input directly. Therefore, enhancing the predictive ability of credit scoring models on imbalanced data is the first motivation of this study.
Credit scoring data are also high-dimensional [47]. The feature relations of high-dimensional data are often complex, which makes it difficult to predict the probability of default. Furthermore, redundant or irrelevant features often lead to model overfitting, which affects model performance. Hence, developing an effective feature selection approach is a prerequisite to lower data processing costs, a better understanding of data, and better-performing credit scoring models.
In addition, an appropriate ensemble strategy that integrates multiple base classifiers into an ensemble model has proven to be an effective approach for solving several data mining tasks [26]. Therefore, various ensemble models based on random forest (RF) [7], extreme gradient boosting (XGBoost) [13], gradient boosting decision tree (GBDT) [19], linear discriminant analysis (LDA) [18], etc., have been utilized for credit scoring. However, it cannot be taken for granted that an ensemble model composed of multiple random weak classifiers will be a well-performing strong classifier. In particular, multiple poorly-performing or correlated base classifiers in an ensemble model may result in adverse ensemble effects. To overcome these limitations, well-performing and uncorrelated base classifiers must be selected to construct ensemble models. Unfortunately, classifier selection is as complex as feature selection. Therefore, developing an effective classifier selection approach for selecting and composing well-performing base classifiers is another problem worth exploring.
The main contributions of this study can be summarized as follows:
1) A novel multi-stage ensemble model with a hybrid genetic algorithm (HGA) is proposed in this study.
2) The Instance Hardness Threshold (IHT) [53] method is extended via a majority voting strategy to alleviate the adverse effects of imbalanced data in credit scoring.
3) The basic genetic algorithm [25] is extended through a promising initial population and self-adaptive mutation to optimize feature selection and classifier selection.
4) The proposed model is validated on three standard imbalanced credit scoring datasets, indicating its superior performance.
The remainder of the paper is organized as follows: Section 2 reviews related work. Section 3 elaborates on the details of the proposed model. Section 4 presents the experimental design. Section 5 describes the experimental results and provides a comparative analysis. Section 6 presents the conclusions and future work.

II. RELATED WORK
In this study, the proposed model mainly includes three parts: learning from imbalanced data, feature selection and classifier selection, and classifier ensemble. As important subfields of machine learning and credit scoring, these three aspects have drawn much attention from various scholars. In this section, prior studies on the aforementioned subfields are reviewed. The related works all make significant contributions in each subfield, but their limitations are identified to differentiate the proposed study.

A. LEARNING FROM IMBALANCED DATA
A dataset can be considered imbalanced when the number of positive samples is inconsistent with that of negative samples. In credit scoring data, the number of positive samples is usually much lower than the number of negative samples. In such a situation, classifiers tend to misclassify the minority positive samples [55]. Hence, credit scoring models have difficulty making accurate predictions when imbalanced data are input directly. Given the importance of this issue, numerous sampling approaches have been proposed to address imbalanced data before training the models.
There are three categories of sampling approaches: oversampling, undersampling, and hybrid-sampling. The oversampling approach can be used to sample imbalanced data by generating new positive samples or replicating some positive samples, such as random oversampling and the synthetic minority oversampling technique (SMOTE) [11]. Almhaithawi et al. [3] applied four common classifiers with SMOTE for fraud detection and concluded that SMOTE could improve the performance of most classifiers. The undersampling approach can be used to sample imbalanced data by eliminating some negative samples or generating new negative samples to replace the original negative samples, such as BalanceCascade [39] and cluster centroids [38]. He et al. [23] extended the BalanceCascade approach to generate adjustable balanced subsets based on the imbalance ratios of training data and obtained a better predictive performance than using the original BalanceCascade approach. The hybrid-sampling approach can be used to sample imbalanced data by combining oversampling and undersampling approaches, such as SMOTE-Tomek links [59]. Sun et al. [54] proposed a hybrid-sampling approach named DTE-SBD, and proved its effectiveness for imbalanced enterprise credit evaluation.
However, the aforementioned approaches only handle the problem of inconsistent sample sizes in imbalanced data. Moreover, traditional oversampling methods tend to inject irrelevant data, which leads to model overfitting. Traditional undersampling methods tend to eliminate too much useful data, leading to information loss. Meanwhile, none of these studies analyzed why imbalanced data affect classifiers. He and Garcia [24] reviewed the advancements of research in imbalanced data and concluded that the class overlap was one of the main reasons that degraded the classifier performance in imbalanced learning. Smith et al. [53] identified and analyzed samples that were frequently misclassified by learning algorithms and found that class overlap was a principal contributor to misclassification. To resolve class overlap, Smith et al. [53] proposed an undersampling approach named Instance Hardness Threshold (IHT), which identified the sample points that were hard to classify, i.e., Instance Hardness Points (IHP), using classifiers, and eliminated these samples from the training data to alleviate the adverse effect of class overlap. Garcia et al. [21] also studied the IHT and proved its effectiveness.
Although the IHT method helps resolve class overlap, it heavily depends on the performance of a single classifier for identifying IHP [34]. Employing a poorly-performing base classifier for identifying IHP tends to eliminate a significant number of negative samples, leading to information loss. Even if a well-performing classifier is employed to identify IHP, some useless negative samples cannot be eliminated, and the sampled data obtained are not widely applicable to all classifiers in the model. Therefore, to further develop the IHT into an effective approach for credit scoring, in this study, it is extended by the majority voting strategy, which not only improves the data quality after sampling but also improves the applicability of the sampled data.

B. FEATURE SELECTION AND CLASSIFIER SELECTION
In credit scoring, redundant or irrelevant features in training data often lead to model overfitting, which affects model performance. Furthermore, poorly-performing or correlated base classifiers in an ensemble model may affect ensemble performance. These problems highlight the importance of effectively selecting optimal features and base classifiers. Feature selection and classifier selection in credit scoring can be considered an NP-hard problem [47]. Thus, several optimization approaches have been proposed to solve these problems, such as grid search and random search [5]. However, with the increase in data and model complexity in credit scoring, the search space of feature selection and classifier selection grows sharply, considerably increasing the cost of the aforementioned approaches.
To address the data and model complexity, Hutter et al. [30] proposed a sequential model-based algorithm configuration (SMAC) procedure for solving general optimization problems by training an RF in the search space. This allows the optimization of both numerical and categorical parameters on a set of instances with less overhead. However, despite its success, the SMAC method is restricted to problems with moderate dimensions [57]. Hence, finding the optimal feature subset and classifier subset in a high-dimensional search space becomes challenging.
To overcome the aforementioned limitations, many scholars have adopted the genetic algorithm (GA) [25], which is motivated by the Darwinian theory of biological evolution and has been widely employed in the optimization field to reduce the overhead of the optimization process and find the optimal solution. Similar to the biological evolution process, the GA mainly solves a problem through four strategies: population, selection, crossover, and mutation. It has been proven that, for the same time cost, the GA generally outperforms grid search and random search [37]. Ali et al. [2] proposed LDA-GA-SVM, which uses a GA to optimize the parameters of the support vector machine (SVM). Chen et al. [12] used a GA to tune the complexity and weights of a learning vector quantization model for optimal or near-optimal cost-sensitive bankruptcy prediction. Oreski and Oreski [44] proposed a hybrid GA with neural networks and used it to select the optimal feature subset in credit risk assessment. Zhang et al. [61] proposed a multi-population GA that enhances the selection, crossover, and mutation steps through multi-population interaction, and applied this method to feature selection and classifier selection for credit scoring. The aforementioned studies have demonstrated the applicability and superiority of the GA in machine learning models. However, in the above applications of the GA, the population is randomly initialized, which requires more iterations to determine the direction of evolution for the populations and, hence, increases the time cost. Therefore, in this study, a new HGA is proposed that generates a promising initial population for the GA and enhances the mutation step through self-adaptation, to achieve better performance in feature selection and classifier selection.

C. CLASSIFIER ENSEMBLE
The ensemble model has been proven to be an effective approach for improving the performance of credit scoring models [56]. It is designed to enhance model performance by training multiple single base classifiers and integrating their decisions using an ensemble strategy. Currently, voting, bagging [6], boosting [51], and stacking [58] are the mainstream ensemble strategies for credit scoring. In particular, owing to its high robustness and excellent performance, the stacking approach has been extensively employed in credit scoring models [61]. Fedorova et al. [17] applied different stacking-based combinations of machine learning algorithms to construct a well-performing ensemble model for the bankruptcy prediction of Russian manufacturing companies. Wang et al. [56] compared three ensemble strategies in credit scoring and concluded that stacking could significantly improve model performance. Ali et al. [1] proposed a stacked support vector machine and demonstrated its superior performance over other state-of-the-art machine learning ensemble models. The aforementioned works have demonstrated the superiority of the stacking strategy in constructing ensemble models. However, the effect of the stacking strategy also depends on the performance of the base classifiers [61]. Hence, this study uses a stacking strategy to construct a multi-stage ensemble model by integrating the decisions of the base classifiers selected from the candidate classifiers, which are trained on sampled data with selected features.

III. PROPOSED MODEL
In this study, a novel multi-stage ensemble model with an HGA is proposed. As shown in Figure 1, the proposed model consists of four stages: data sampling, feature selection, classifier selection, and classifier ensemble. In the data sampling stage, the traditional IHT method is extended through a majority voting strategy so that it can handle imbalanced training data. In the feature selection stage, the proposed HGA is used to select optimal feature subsets for each base classifier, and then, majority voting is employed to integrate all subsets into the aggregated optimal feature subset. In the classifier selection stage, the proposed HGA is used to select the optimal base classifier subset and further develop it into an ensemble model with a stacking strategy in the classifier ensemble stage. The details of the above stages and approaches are presented in the following subsections.

A. VOTING-BASED IHT (VIHT) METHOD
To handle imbalanced training data, the IHT method [53] is employed in this study to identify the sample points that are hard to classify, i.e., IHPs, using a classifier. For example, the four scatter plots at the bottom right corner of Figure 2 illustrate the IHPs identified by different base classifiers. The axes of the scatter plots show the values of the two features obtained through dimensionality reduction with principal component analysis [41]. The pink, blue, and red points represent the IHP, negative, and positive samples, respectively. It can be seen that the IHPs identified using different base classifiers differ significantly in quantity and distribution. Thus, credit scoring data sampled through IHT using one particular classifier are not applicable to all base classifiers. Therefore, in this study, the IHT method is extended by combining multiple classifiers with a majority voting strategy to alleviate the aforementioned limitations. The different IHPs obtained through IHT using multiple base classifiers are integrated by the majority voting strategy into an aggregated IHP set, which is eventually eliminated, thus enhancing the applicability of the sampled data to diverse base classifiers.
The framework of the proposed voting-based IHT (VIHT) method is illustrated in Figure 2. The scatter plot at the top-left corner illustrates the raw training data to be processed by the VIHT method. In the grids, −1, 1, and 0 represent the negative sample points, positive sample points, and aggregated IHPs, respectively. First, multiple base classifiers Clf_i (i = 1, 2, . . . , n) are used to identify the different IHPs, i.e., IHP-Clf_i (i = 1, 2, . . . , n), through IHT. Then, the number of times each sample point is identified as an IHP is recorded. Next, these different IHPs are integrated into the aggregated IHPs using the majority voting strategy: if a sample point is identified as an IHP by more than half of the base classifiers, it is considered an aggregated IHP. Finally, the sampled data are obtained by eliminating the aggregated IHPs. Hence, the output of the VIHT method is sampled data that are applicable to most classifiers.
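The voting step above can be sketched with scikit-learn alone. This is a minimal illustration, not the authors' implementation: the hardness threshold of 0.7, the three example base classifiers, and the decision to drop only majority-class (negative) points are assumptions made for the sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def viht_sample(X, y, classifiers, threshold=0.7):
    """A sample is an aggregated IHP only if more than half of the
    classifiers find it hard to classify (hardness = 1 - P(true class))."""
    votes = np.zeros(len(y), dtype=int)
    for clf in classifiers:
        proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")
        hardness = 1.0 - proba[np.arange(len(y)), y]
        votes += (hardness > threshold).astype(int)   # this classifier's IHPs
    # Drop only hard majority-class (negative) points; keep all positives.
    keep = (votes <= len(classifiers) // 2) | (y == 1)
    return X[keep], y[keep]

X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=0)
clfs = [LogisticRegression(max_iter=1000),
        DecisionTreeClassifier(random_state=0),
        GaussianNB()]
X_s, y_s = viht_sample(X, y, clfs)
```

A point flagged as hard by only one of the three classifiers survives the vote, which is what distinguishes this sketch from running plain IHT with a single estimator.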

B. HYBRID GENETIC ALGORITHM (HGA) METHOD
To optimize feature selection and classifier selection for ensemble modeling, a new HGA with a promising initial population and self-adaptive mutation is proposed in this study ( Figure 3).
Step 1: Parameter initialization: The number of individuals in the initial population, crossover rate, initial mutation rate, maximum number of generations, and gene number of each individual are initialized.
Step 2: Encoding: A binary encoding scheme is employed to encode a candidate feature subset or candidate base classifier subset, where 0 indicates that the feature or classifier corresponding to the current gene is not selected and 1 represents that it is selected.
Step 3: Population initialization: The initial population plays an important role in steering the evolution of the GA in a promising direction. Inspired by the SMAC procedure [30] for solving general optimization problems by training an RF in the search space, this study incorporates the procedure into the GA to generate a promising initial population in the HGA, as shown in Figure 4.
a) Generate Q individuals randomly, each representing a candidate feature subset or candidate base classifier subset, and employ balance accuracy [8] to evaluate the practical classification performance (i.e., fitness) of each individual through 5-fold cross validation.
b) Train an RF using the individuals of the candidate feature or candidate base classifier subsets as predictors and the practical classification performance of the individuals as the target label.
c) Randomly generate Q1 (Q1 ≫ Q) new individuals, and use the trained RF to predict the classification performance of the Q1 generated individuals.
d) Select the Q individuals with the top-ranked predicted classification performance as the initial population of the GA. Using the predicted classification performance, instead of the practical classification performance, to select the initial population greatly reduces the algorithmic overhead of the GA.
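Steps a)–d) can be sketched as follows; the toy fitness function is a hypothetical stand-in for the 5-fold cross-validated balance accuracy used in the paper, and Q = 50, Q1 = 200 follow the parameter setting reported later.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def warm_start_population(fitness_fn, n_genes, q=50, q1=200):
    seed = rng.integers(0, 2, size=(q, n_genes))           # a) random individuals
    scores = np.array([fitness_fn(ind) for ind in seed])   #    practical fitness
    surrogate = RandomForestRegressor(random_state=0)      # b) RF on (genes -> fitness)
    surrogate.fit(seed, scores)
    candidates = rng.integers(0, 2, size=(q1, n_genes))    # c) Q1 >> Q new individuals
    predicted = surrogate.predict(candidates)              #    cheap predicted fitness
    top = np.argsort(predicted)[::-1][:q]                  # d) keep the top-Q
    return candidates[top]

# Hypothetical fitness: favors small feature subsets (illustration only).
population = warm_start_population(lambda ind: -float(ind.sum()), n_genes=20)
```

The Q1 candidates are scored by the surrogate RF rather than by actual cross validation, which is where the overhead saving described in step d) comes from.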
Step 4: Crossover: A two-point crossover approach is used to evolve the individuals in the parent population according to a certain crossover rate to obtain the offspring population.
Step 5: Mutation: A fixed mutation rate tends to cause the GA to fall into a local optimum or makes it difficult for the GA to converge. Hence, a self-adaptive mutation is proposed to overcome these problems in the HGA. The single-point mutation approach is used to randomly mutate a gene of each individual in the offspring population according to a certain mutation rate, and the mutation rate is then updated using Equation (1), where Mr_0 and Mr_g represent the initial mutation rate and the mutation rate in the gth generation, respectively, g ∈ {1, 2, . . . , G}, and G indicates the maximum number of generations. As the equation shows, in the early stage of evolution, the mutation rate increases with the iterations to enhance the diversity of the population; in the late stage of evolution, it decreases with the iterations to speed up convergence.
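Since Equation (1) is not reproduced in this text, the sketch below uses a hypothetical rise-then-fall schedule that only mimics the qualitative behavior described above; the sine-based formula is an assumption, not the paper's equation.

```python
import math
import random

def mutation_rate(g, G, mr0=0.2):
    # Hypothetical stand-in for Equation (1): starts at mr0, peaks mid-run,
    # and falls back toward mr0 as g approaches G.
    return mr0 * (1.0 + math.sin(math.pi * g / G))

def mutate(individual, rate, rng=random.Random(0)):
    # Single-point mutation: flip one randomly chosen gene with probability `rate`.
    child = list(individual)
    if rng.random() < rate:
        i = rng.randrange(len(child))
        child[i] ^= 1
    return child
```

Any schedule that boosts the rate early (diversity) and shrinks it late (convergence) would fit the description; the paper's actual update rule should be taken from Equation (1).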
Step 6: Evaluation: Calculate the practical classification performance (i.e., fitness) of each individual in both the parent and offspring populations through 5-fold cross validation.
Step 7: Selection: Select the Q individuals with the top-ranked practical classification performance from both the parent and offspring populations, and use them as the new population for further evolution.
Step 8: Let g = g + 1. Repeat Steps 4-7 until the termination condition is satisfied; the optimal individual is then decoded to obtain the optimal feature/classifier subset.

C. FEATURE SELECTION
The proposed HGA is used to select the optimal feature subset for each base classifier to eliminate redundant or irrelevant features and increase the applicability of the selected features to various base classifiers. Because the different classifiers have different optimal feature subsets, to enhance the applicability of the selected feature subset, the majority voting method in VIHT is also employed for feature selection to output the aggregated optimal feature subset. For example, if a feature is selected into the optimal feature subsets by more than half of the classifiers, this feature will be added to the aggregated optimal feature subset, otherwise it will not be added. In the feature selection procedure, an individual in the HGA represents a candidate feature subset, a population in the generation consists of multiple individuals, and the optimal individual represents the optimal feature subset that is obtained through genetic evolution. Furthermore, the practical classification performance (i.e., fitness) of each candidate feature subset for each base classifier is evaluated using the balance accuracy through 5-fold cross validation.
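The majority vote over the per-classifier feature subsets reduces to a column-wise count; a minimal sketch, with binary masks assumed as the HGA outputs:

```python
import numpy as np

def aggregate_feature_subsets(subsets):
    """Keep a feature iff more than half of the base classifiers selected it.
    `subsets` is an (n_classifiers x n_features) binary matrix."""
    subsets = np.asarray(subsets)
    votes = subsets.sum(axis=0)
    return (votes > subsets.shape[0] / 2).astype(int)

# Three classifiers voting over four features: feature 1 gets only one vote.
mask = aggregate_feature_subsets([[1, 0, 1, 1],
                                  [1, 1, 0, 1],
                                  [0, 0, 1, 1]])
```

The same `votes > n/2` rule is what the VIHT method applies to sample points, so both stages share one aggregation primitive.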

D. CLASSIFIER SELECTION
Considering that the poorly-performing or correlated base classifiers in an ensemble model may affect the ensemble performance, the proposed HGA is used to select the optimal base classifier subset, which is further integrated into an effective ensemble model through a stacking strategy. In the classifier selection procedure, an individual in the HGA represents a candidate base classifier subset, a population in the generation consists of multiple individuals, and the optimal individual represents the optimal base classifier subset that is obtained through genetic evolution. The practical classification performance (i.e., fitness) of each candidate base classifier subset corresponding to a candidate ensemble model is evaluated using balance accuracy through 5-fold cross validation.

E. CLASSIFIER ENSEMBLE
After the optimal base classifier subset is obtained, the stacking strategy is employed to integrate the trained selected base classifiers. The stacking strategy consists of two stages. First, the base classifiers are trained on the training data through cross validation, and their predictions are combined into a new feature matrix. Second, the new feature matrix is used to train a meta-classifier that outputs the final decision. Owing to the superiority of the kernel ridge classifier [46], it is employed as the meta-classifier to integrate the decisions of the multiple base classifiers in this study.
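A minimal stacking sketch with scikit-learn (the paper uses the ''mlxtend'' stacking implementation; here `StackingClassifier` with a plain `RidgeClassifier` stands in for the kernel ridge meta-classifier, and the three base classifiers are arbitrary examples):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Stage 1: cross-validated base-classifier predictions form a new feature
# matrix; Stage 2: the meta-classifier is trained on it for the final decision.
stack = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(random_state=0)),
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=RidgeClassifier(),
    cv=5,
)
stack.fit(X_tr, y_tr)
preds = stack.predict(X_te)
```

The `cv=5` argument is what keeps the meta-features out-of-fold, mirroring the cross-validated first stage described above.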

IV. EXPERIMENTAL DESIGN

A. DATASET PREPROCESSING AND PARAMETER SETTING
In this study, three standard credit scoring datasets from the UC Irvine (UCI) machine learning repository, i.e., the German [4], Polish 1, and Polish 2 [62] datasets, were used to test the effectiveness of the proposed model. The details of these datasets are presented in Table 1.
As shown in Table 1, the German dataset contains 1000 samples, 300 of which are positive and 700 negative; the feature dimensions of each dataset, including the target label, are also listed in Table 1. Three basic data preprocessing approaches, namely standardization, normalization, and one-hot encoding [43], were used to preprocess the datasets before training the models. The numerical features were standardized and normalized by removing the mean and scaling to unit variance, to handle the different orders of magnitude of the numerical features. To ensure the effectiveness and comparability of the experiment, the optimal parameters of the HGA were preset through trial runs. In the HGA, the maximum number of generations G was set to 100, the number of individuals Q in each population was set to 50, and the number of randomly generated individuals Q1 was set to 200. A greater number of generations G leads to more computation time but also better performance. The crossover rate and initial mutation rate were set to 0.8 and 0.2, respectively. Higher crossover and mutation rates make algorithmic convergence difficult but also produce a larger search space.
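The three preprocessing steps can be sketched with scikit-learn; the toy data, the column indices, and the standardize-then-normalize ordering are assumptions for illustration, not details taken from the paper.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Toy data: columns 0-1 numerical, column 2 categorical (three categories).
X = np.array([[1.0, 200.0, 0.0],
              [2.0, 300.0, 1.0],
              [3.0, 100.0, 2.0]])

preprocess = ColumnTransformer([
    ("num", Pipeline([("std", StandardScaler()),      # remove mean, unit variance
                      ("minmax", MinMaxScaler())]),   # then scale to [0, 1]
     [0, 1]),
    ("cat", OneHotEncoder(), [2]),                    # one-hot encode categoricals
])
X_t = preprocess.fit_transform(X)   # 2 scaled columns + 3 one-hot columns
```

Wrapping the steps in a `ColumnTransformer` guarantees the same transformation is replayed on validation and test data without leakage.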

B. EVALUATION INDICATORS
In this study, four comprehensive evaluation indicators were employed to evaluate the performance of the proposed model, namely, balance accuracy (BACC) [8], F-score [36], G-mean [35], and Recall [9]. These comprehensive evaluation indicators are all determined by the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), and a higher indicator value represents better performance of the evaluated models. These indicators are widely employed in imbalanced learning [24], and their advantages are detailed as follows.
BACC can better reflect the performance of classification models than accuracy in imbalanced learning [55]. Hence, BACC was adopted to evaluate model performance in this study. Its calculation rule is shown in Equation (2), where the true positive rate (TPR) and true negative rate (TNR) are defined in Equations (3) and (4), respectively:

BACC = (TPR + TNR) / 2    (2)
TPR = TP / (TP + FN)    (3)
TNR = TN / (TN + FP)    (4)

Recall evaluates the ability of models to identify positive samples, which is critical for a credit scoring model. Its calculation rule is identical to the TPR in Equation (3): Recall = TP / (TP + FN).
G-mean is another comprehensive evaluation indicator for imbalanced learning models; it shows whether the balance between classes is reasonable. The calculation rule of G-mean is shown in Equation (5), where Sensitivity is the TPR in Equation (3) and Specificity is the TNR in Equation (4):

G-mean = sqrt(Sensitivity × Specificity)    (5)

F-score is the harmonic mean of Precision and Recall and reflects the tradeoff between them. The calculation rule of F-score is shown in Equation (6), where Precision is defined in Equation (7):

F-score = (2 × Precision × Recall) / (Precision + Recall)    (6)
Precision = TP / (TP + FP)    (7)
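On a toy confusion matrix, the four indicators reduce to a few lines of scikit-learn; the labels below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, confusion_matrix,
                             f1_score, recall_score)

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)              # TPR, identical to Recall
specificity = tn / (tn + fp)              # TNR
bacc = (sensitivity + specificity) / 2    # balance accuracy
g_mean = (sensitivity * specificity) ** 0.5

# The manual BACC matches scikit-learn's built-in implementation.
assert abs(bacc - balanced_accuracy_score(y_true, y_pred)) < 1e-12
```

On this example the model catches 2 of 3 positives but 6 of 7 negatives, so BACC sits between the two per-class recalls while plain accuracy (8/10) would overstate performance.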

C. EXPERIMENTAL SETTING
The raw dataset was divided as follows: 20% of the total data were used as the test data, and the remaining 80% were further divided into 80% for training and 20% for validation. The data preprocessing approaches (e.g., standardization, normalization, and one-hot encoding) were imported from the Python module ''sklearn''. In the proposed VIHT method, the basic IHT method was imported from the Python module ''imblearn''. In the ensemble stage, the stacking approach was imported from the Python module ''mlxtend''. The classifiers LDA, RF, GBDT, SVC, DT, LR, ET, Bagging, and ridge regression were imported from the Python module ''sklearn''. The classifiers XGBoost and LightGBM were imported from the Python modules ''xgboost'' and ''lightgbm'', respectively. For a fair comparison, default parameters were adopted in all imported modules.
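The nested 80/20 split can be reproduced as follows; the stratification and random seed are assumptions, since the paper does not specify them.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)

# 20% of the total data held out as test data ...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# ... and the remaining 80% split again into 80% training / 20% validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=0)
```

For 1000 samples this yields 640 training, 160 validation, and 200 test samples, i.e., an effective 64/16/20 split of the raw data.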

V. EXPERIMENTAL ANALYSIS
To enhance the diversity of the base classifiers, three base classifiers were reproduced from each type of base classifier with different appropriate parameters obtained through trial runs. Four comprehensive evaluation indicators were used to evaluate model performance, namely BACC, F-score, G-mean, and Recall. To enhance the robustness of the experiments and avoid single-result bias, each experiment was performed 10 times, and the average values were used as the evaluation results. All experiments were performed using Python version 3.7.5 on a PC with a 3.8-GHz Intel Core i7-10700K, 32 GB RAM, and the Windows 10 operating system.

A. BASELINE RESULTS
To verify the performance of the proposed model, all base classifiers were tested on three datasets, and the results were indicated as the baseline results. As shown in Table 2, 10 types of base classifiers were tested on the German, Polish 1, and Polish 2 datasets.

B. PERFORMANCE EVALUATION OF THE VIHT METHOD
To prove the effectiveness of the proposed VIHT method on real datasets, its performance was evaluated on the three datasets, as shown in Table 3. The bolded evaluation indicator values represent better performance of the base classifiers after VIHT was applied than the baseline results. As Table 3 shows, compared with the baseline results, all base classifiers are significantly enhanced by the VIHT method in all or most evaluation indicators on all datasets. To further verify the effectiveness of the VIHT method, two traditional sampling approaches, namely, IHT and SMOTE, were tested under the same conditions for comparison, with the results outlined in Table 4. The bolded values indicate better performance after IHT or SMOTE was applied than after VIHT was applied. As Table 4 shows, most base classifiers perform worse after IHT or SMOTE is applied than after VIHT is applied in most evaluation indicators on all datasets. The outperformance of VIHT is owing to the following reasons: 1) VIHT eliminates the IHPs effectively to alleviate the class overlap problem.
2) VIHT extends the traditional sampling approaches by integrating the decisions of various classifiers to improve the applicability of sampled data to diverse base classifiers.
3) VIHT only eliminates samples considered hard to learn by multiple classifiers, instead of adding extra data to the sampled data or eliminating too much information. Hence, it can alleviate model overfitting.

C. PERFORMANCE EVALUATION OF HGA-BASED FEATURE SELECTION METHOD
To prove the effectiveness of the proposed HGA-based feature selection method, its performance was evaluated on three sampled datasets.
The feature correlations before and after the HGA-based feature selection method was performed on the three sampled datasets are shown in the heatmaps in Figure 5. The axes represent the feature indices; a bluer region in a heatmap represents a higher correlation between the corresponding feature pair, and a redder region represents the contrary.
As the figures show, the correlations between different features on all sampled datasets are significantly reduced after the HGA-based feature selection is performed, indicating that the method effectively removes irrelevant or redundant features. Its evaluation performance on the three sampled datasets is shown in Table 5, where the bolded evaluation indicator values represent that the base classifiers performed better after the HGA-based feature selection was applied. As Table 5 shows, all base classifiers are significantly enhanced by the HGA-based feature selection method in all or most evaluation indicators on all sampled datasets, because the method effectively eliminates irrelevant and redundant features.

D. PERFORMANCE EVALUATION OF THE CLASSIFIER SELECTION AND ENSEMBLE
To prove the effectiveness of the proposed HGA-based classifier selection method, its performance was evaluated on the three sampled datasets after feature selection. The correlations between the classifier predictions before and after HGA-based classifier selection was performed on these datasets are shown as heatmaps in Figure 6. The axes represent the classifier indices; bluer regions in a heatmap represent higher prediction correlations between the corresponding classifier pairs, and redder regions represent the contrary. As the figures show, the prediction correlations between different classifiers on every sampled dataset are significantly reduced after HGA-based classifier selection is performed, indicating that the method effectively removes classifiers with highly correlated predictions. Further, the selected optimal base classifiers were incorporated into the proposed ensemble model through a stacking strategy. The proposed ensemble model was then compared with the general ensemble model combined from all base classifiers without classifier selection (abbreviated as ''the general ensemble model''). Its evaluation performance on the three sampled datasets after feature selection is shown in Table 6, where the bolded evaluation-indicator values represent cases in which the proposed ensemble model performed better than the general ensemble model. As Table 6 shows, the ensemble model is significantly enhanced by the HGA-based classifier selection method in all evaluation indicators on all sampled datasets, because the method effectively eliminates poorly performing and highly correlated classifiers.
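A minimal sketch of the stacking stage, assuming a subset of base classifiers has already been chosen; the two base classifiers and the logistic-regression meta-learner below are hypothetical placeholders, whereas in the paper the subset is selected by the HGA:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Hypothetical base-classifier subset retained after selection.
base = [("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("nb", GaussianNB())]
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)  # out-of-fold predictions feed the meta-learner
stack.fit(X_tr, y_tr)
print("test accuracy: %.3f" % stack.score(X_te, y_te))
```

The `cv` argument ensures the meta-learner is trained on out-of-fold base-classifier predictions, which reduces the risk of the ensemble overfitting to its own base models.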

E. STATISTICAL RESULTS
To demonstrate the reliability of the experimental results, a statistical test should be performed. The well-known analysis of variance (ANOVA) and its non-parametric counterpart, the Friedman test, can be used to test the effectiveness of all models under the different methods. Friedman [20] experimentally compared ANOVA with his proposed test on 56 independent problems and showed that the two methods mostly agree. More recently, Demšar [15] argued that, because typical machine learning data may violate the assumptions of parametric tests, non-parametric tests (the Friedman test) should be preferred over parametric tests (ANOVA). Hence, the Friedman test was adopted in this study; when the null hypothesis was rejected, the Nemenyi test [42] was applied. In the Friedman test, 52 classification models were ranked based on the different evaluation indicators. These models comprised 10 base classifiers before the VIHT method was applied, 10 base classifiers after the IHT method was applied, 10 after the SMOTE method, 10 after the VIHT method, 10 after the HGA-based feature selection method, the general ensemble model without classifier selection, and the proposed ensemble model. The score of each method was then calculated by averaging the rankings of the models that employ that method. Finally, the statistical significance of the Friedman test was obtained from the scores of all methods. Table 7 shows the scores of each method on the different datasets and evaluation indicators, together with the results of the Friedman test. As the table shows, all of the proposed methods obtain higher scores than the comparison methods in all or most evaluation indicators on all datasets, and the Friedman statistic on most evaluation indicators is larger than the critical value (i.e., 2.996).
According to Demšar [15], the null hypothesis that all methods have the same performance is therefore rejected. Subsequently, the Nemenyi test was used as the post hoc test, the results of which are shown in Figure 7, where the critical distance (CD) indicates the mean ranking-score difference required for significance. The further a classifier sits to the left on the coordinate axis, the better its performance, and vice versa. The figure shows that the proposed methods perform better than the corresponding comparison methods on most evaluation indicators, and that the proposed ensemble model always performs best.
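The test protocol above can be sketched as follows. The scores are made-up placeholders rather than the paper's Table 7 values, and the studentized-range quantile `q_alpha` is the tabulated value for three methods at the 0.05 level (Demšar, 2006):

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Rows: datasets; columns: three hypothetical methods' indicator scores.
scores = np.array([[0.71, 0.74, 0.78],
                   [0.65, 0.69, 0.73],
                   [0.80, 0.79, 0.85],
                   [0.68, 0.70, 0.75]])
stat, p = friedmanchisquare(*scores.T)
print("Friedman chi-square = %.3f, p = %.4f" % (stat, p))

# Nemenyi critical distance: CD = q_alpha * sqrt(k*(k+1) / (6*N)),
# with k methods compared over N datasets.
k, N = scores.shape[1], scores.shape[0]
q_alpha = 2.343  # tabulated value for k = 3, alpha = 0.05
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))
print("critical distance = %.3f" % cd)
```

Two methods whose mean ranks differ by more than the CD are judged significantly different, which is what the CD diagrams in Figure 7 visualize.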

F. PERFORMANCE COMPARISON BETWEEN THE PROPOSED ENSEMBLE MODEL AND THE BENCHMARK ENSEMBLE MODELS
The performance comparison between the proposed ensemble model and the benchmark ensemble models proposed by Seiffert et al. [52], Sun et al. [54], and Liu et al. [40] is presented in Table 8, where the bolded values represent the best performance in each comparison. The source code for the ensemble models of Seiffert et al. [52] and Liu et al. [40] is public; thus, for a fair comparison, they were tested with the same experimental settings as the proposed ensemble model, including the datasets, running times, and preprocessing approaches. The source code for the ensemble model of Sun et al. [54] is not public; hence, this model was rigorously reproduced using the published modeling scheme and methodology and then tested under the same experimental settings as the proposed ensemble model.
To observe the performance of each ensemble model more intuitively, all results are presented as histograms in Figure 8, where the indices denote the dataset and the evaluation indicator, respectively. For example, the histogram corresponding to the indices German and BACC shows the performance of the four compared models on the German dataset under the BACC indicator. As both Table 8 and Figure 8 show, the proposed ensemble model achieves the best performance in all or most evaluation indicators on all datasets.

G. COMPLEXITY AND APPLICATION ANALYSIS
To verify the practicality of the proposed model, the German dataset was used to assess the computational complexity of the various methods employed in the proposed model. Furthermore, the model's prediction time was tested. All methods were implemented in Python 3.7.5 on a PC with a 3.8-GHz Intel Core i7-10700K CPU, 32 GB of RAM, and the Windows 10 operating system. The running times of each model and method are listed in Table 9.
As observed in Table 9, during imbalanced data processing, VIHT requires only 1.639 s, whereas the HGA-based feature selection and classifier selection require 161.046 s to select the optimal feature subsets and 846.887 s to select the optimal classifier subsets, respectively, with 1000 training samples. Furthermore, the proposed model predicts 1000 test samples in only 0.324 s. Notably, the training process of the proposed model can be performed offline. In addition, the trained model occupies less than 1 MB of memory, which means that it can be easily deployed. Hence, the proposed model is feasible for practical applications.
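Such a prediction-time measurement can be reproduced for any fitted model with a simple wall-clock timer; the random-forest classifier below is a stand-in for the proposed ensemble, not the paper's model:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X[:1000], y[:1000])          # training can happen offline

start = time.perf_counter()
model.predict(X[1000:2000])            # predict 1000 test samples
elapsed = time.perf_counter() - start
print("prediction time for 1000 samples: %.3f s" % elapsed)
```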

VI. CONCLUSION AND FUTURE WORK
Credit scoring is currently a promising research field in data mining. Herein, to address imbalanced data in credit scoring, a novel multi-stage ensemble model with a hybrid genetic algorithm was proposed. First, the VIHT approach was proposed to address the imbalanced data. Next, a novel HGA approach was proposed and subsequently applied to select the optimal feature and classifier subsets. Finally, a stacking method was applied to reach the final prediction. Four performance indicators, i.e., BACC, F-score, G-mean, and Recall, were used to evaluate the performance of the proposed model. The results demonstrated the superior performance of the proposed model compared with other benchmark credit scoring models.
In future studies, the financial industry's demand for classification models of reduced complexity that obtain prediction results rapidly will be given serious consideration. The proposed ensemble model requires considerable memory and computational resources in the classifier selection and ensemble stages. Therefore, more effective strategies for classifier selection and ensemble will be explored to further reduce the complexity and enhance the scalability of the model. In addition, the proposed ensemble model involves only classic machine learning classifiers rather than deep learning classifiers, which limits the diversity of the base classifiers. In recent years, some scholars have demonstrated the effectiveness of clustering techniques for feature learning in ensemble models [10], [31]. The clustering techniques proposed by Hu et al. [27] and Hu et al. [29] have superior clustering performance and can be integrated into the ensemble model to enhance model performance in our future work.
W. ZHANG has published more than 60 articles in international journals and more than 20 papers in international conference proceedings in the recent ten years, covering supply chain management, digital library, bibliometrics, concurrent engineering, distributed manufacturing, credit scoring, business analytics, data mining, multi-agent technology, and the semantic web.
XIN WU is currently a full-time Associate Professor with China Academy of Financial Research, Zhejiang University of Finance and Economics, China. He has published more than 20 articles in international journals, covering data mining, financial technology, and sustainable development.
YANAN LIU received the Ph.D. degree in computer science and technology from Zhejiang University, China, in 2010. She is currently a full-time Associate Professor with the School of Information Management and Artificial Intelligence, Zhejiang University of Finance and Economics, China. She has published more than 20 articles in international journals, covering data mining, artificial intelligence, and credit scoring.
ZEQIAN HU is currently pursuing the B.S. degree with the School of Information Management and Artificial Intelligence, Zhejiang University of Finance and Economics, China. His current research interests include machine learning and data science.
VOLUME 9, 2021