An Online Transfer Learning Framework with Extreme Learning Machine for Automated Credit Scoring

Automated Credit Scoring (ACS) is the process of predicting a user's creditworthiness from historical data. It involves analysing and predicting the association between the data and particular credit values based on similar data. Recently, ACS has been treated as a machine learning problem, and numerous models have been developed to address it. In this paper, we address ACS as a set of interrelated machine learning problems, namely, feature irregularities due to empty features in many records, class imbalance due to non-uniform statistical distributions of the records between classes, and concept drift due to statistical characteristics of certain classes and features changing over time. Considering the limited volume of credit scoring data, we propose to address these challenges using the Transfer Learning with Lag (TLL) algorithm, based on embedded shallow neural networks that enable knowledge transfer when the number of active features changes. Knowledge transfer relies on lags whose length adapts according to feedback on performance change. Furthermore, the framework proposes classifier aggregation and a chunk balancing mechanism for handling class imbalance. An evaluation was conducted using the Lending Club, German, Default, and PPDai datasets. The results show the superiority of the proposed algorithm over the benchmarks in terms of the majority of classification metrics, for both the time series and the overall results. TLL offered improvements of 58.6% and 28.2% over FA-OSELM and OSELM, respectively, on the Lending Club dataset.


I. INTRODUCTION
Managing credit risk and supporting credit application decision-making has become a demanding application of artificial intelligence and machine learning. It involves estimating the probability of default for a lending institution's clients and satisfying the minimum-loss principle required for business sustainability. Transitioning from manual processing of borrowing applications, based on loan officers or expert-based credit scoring, to automated credit scoring helps create a more reliable system that avoids credit and opportunity losses. Limiting user intervention is the general goal of automated systems. The financial sector provides numerous examples of financial services with an automated credit-scoring decision-support system, e.g., internet banking firms in South Korea [1], based on a tablet banking system acting as a smart branch to enable various business functions concerning financial services.
These services prompt the user to scan an ID using a mobile device camera, after which the user can access the majority of the bank's services, eliminating the need to visit a branch for financial consultation or product services. According to [2], credit scoring is far from being a process implemented only by financial institutions. Other types of firms, such as mobile phone companies, insurance companies, or government departments, use similar approaches before agreeing to provide their services. However, there is a concern about model choice and about the indicators used to determine the best model and its dynamics: how to introduce them to provide a figure for future risk.

Automated credit scoring performance has been assessed using various approaches in the literature. Some studies framed it as a binary classification problem. Others incorporated data mining and machine learning techniques such as discriminant analysis [3], neural networks [4], support vector machines [5], decision trees [6, 7], logistic regression [8], fuzzy logic [9], genetic algorithms [10, 11], Bayesian networks [12], hybrid methods [13, 7, 14, 15], ensemble methods [16], and feature selection [17, 18, 5]. The authors of [19] indicated that previous studies used a binary classification system, which is insufficient to predict the score correctly.

We highlight three additional issues, namely, feature irregularities, class imbalance, and concept drift. Applicants might not provide complete information, leading to empty features in records and causing feature irregularities. Non-uniform statistical distributions of records cause class imbalance because most records belong to one class. Varying statistical characteristics of certain classes cause concept drift, as financial and economic features are highly dynamic over time. To the best of our knowledge, the credit scoring literature does not jointly address these issues within a single framework. This study aims to bridge this gap using an online extreme learning machine with integrated knowledge transfer. This integration is accomplished in a single framework where knowledge transfer is combined with lag awareness, which mitigates concept drift by slowing the model's movement towards overly fast knowledge updates. In addition, incorporating transfer learning enables learning to proceed despite missing feature values.

The remainder of the article is organised as follows: Section II discusses the literature review, Section III presents the methodology and the proposed model, Section IV presents the experimental evaluation and results, and Section V comprises concluding remarks and recommendations for future work.

II. RELATED WORK

III. METHODOLOGY
This section describes the methodology devised for accomplishing a credit scoring framework based on stream data while handling missing values (non-active features), data imbalance, and concept drift. The problem is formulated, followed by a general framework overview. Chunk balancing is described, followed by preprocessing and normalisation. Subsequently, Gamma generation and transfer learning are described. Lastly, ensemble learning, concept drift, and the lag update are presented.

A. Problem Formulation
Consider a sequential dataset $D = \{(X_t, Y_t)\}$, where $X_t$ denotes a chunk arriving at time $t$ and $Y_t$ denotes the chunk labelling information.
$Y_t$ might contain the ground truth of $X_t$ when class information is available; otherwise, it contains the character $\phi$, indicating that the label information of $X_t$ is not available.
$X_t \in \mathbb{R}^{n_t \times m}$ and $Y_t \in \mathbb{N}^{n_t}$, where $\mathbb{R}$ denotes the set of real numbers, $\mathbb{N}$ denotes the set of natural numbers, and $n_t$ denotes the number of rows in chunk $t$. The objective is to predict the class of the samples in $X_t$ with a minimal percentage of false predictions.
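To make the formulation concrete, the following minimal Python sketch shows one possible representation of the chunk stream; the `Chunk` structure and the `stream` generator are illustrative names rather than part of the original formulation.

```python
import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class Chunk:
    """One chunk X_t of the stream; y is None when Y_t carries no label information."""
    X: np.ndarray             # shape (n_t, m); missing features may be marked with np.nan
    y: Optional[np.ndarray]   # shape (n_t,) integer class labels, or None
    t: int                    # arrival time step

def stream(chunks):
    """Yield chunks in arrival order; the learner sees each chunk only once."""
    for chunk in sorted(chunks, key=lambda c: c.t):
        yield chunk
```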

B. Transfer Learning with Lag (TLL) Framework
The Transfer Learning with Lag (TLL) framework is depicted in Figure 1. The arriving chunk first passes through a chunk balancing process. This process is based on a buffer that stores recent samples and uses them to balance chunks so that similar class percentages are maintained among the labelled samples; it reduces the bias caused by imbalance in the labelled sample distribution. Subsequently, a pre-processing stage is applied for normalisation. Next, the Gamma generation stage is invoked; it codes active and inactive features using 1 and 0, respectively. The Gamma information is used by the transfer learning block (TL) to build the classifiers for the next time step from the previous classifiers and the memory.
Furthermore, the memory is also updated during transfer learning to ensure that information is saved for future learning. After obtaining the classifiers for the next time step, an aggregator is used for prediction. Predictions are compared with the available class information to detect concept drift, and the drift decision is used to adjust the lag value. Subsequent sections present the internal block operations in detail. As depicted in Algorithm 1, the framework uses random weights to initialise the learners. Next, it uses the boosting data, i.e., labelled data that provides initial knowledge to the system, to update the learners. The Gamma generation process is then applied to the boosting data to generate the active feature indices, which are saved for subsequent chunks.
The transfer step uses the previous and current Gamma vectors to determine which weights need to be saved in memory or restored from it. The transfer learning process outputs the current-moment learners, which are used to predict the current sample labels through aggregation, an ensemble rule that combines the current-moment learners for prediction. The aggregation results are provided to the concept drift detector, which compares the deviation from the ground truth and decides whether drift has occurred. In a concept drift scenario, a lag update is performed and the free-memory process is invoked; this process clears the memory of outdated weights and knowledge.
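The following Python sketch paraphrases the structure of Algorithm 1 as described above; every stage is passed in as a callable, so only the order of operations is fixed here, and all names are illustrative rather than the authors' implementation.

```python
def tll_main_loop(chunks, learners, memory, lag, lag0, *,
                  balance, preprocess, generate_gamma, transfer,
                  aggregate, drift_detected, update_lag, free_memory):
    """Structural sketch of the per-chunk loop (initialisation with random
    weights and boosting data is assumed to have happened beforehand)."""
    prev_gamma = None
    for chunk in chunks:
        chunk = balance(chunk)                    # chunk balancing from the buffer
        X = preprocess(chunk.X)                   # encoding, de-correlation, normalisation
        gamma = generate_gamma(X)                 # 1 = active feature, 0 = inactive
        learners, memory = transfer(learners, memory, prev_gamma, gamma)
        y_hat = aggregate(learners, X)            # ensemble prediction for the current chunk
        if chunk.y is not None and drift_detected(y_hat, chunk.y):
            lag = update_lag(lag, lag0)           # adapt the lag on drift
            memory = free_memory(memory, lag, chunk.t)
        prev_gamma = gamma
    return learners, memory, lag
```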

C. Chunk Balancing
Balancing is the first step of the framework. Its role is to provide the learners with training data that is balanced from the label perspective. It executes when the balancing period or condition is met: it calculates the ratio of each class against the entire sample count and complements the minority classes with data from the buffer until they approach the majority ratio.
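A minimal sketch of this step is given below, assuming a buffer of recent labelled samples (`buffer_X`, `buffer_y`) and padding every minority class towards the majority-class count; the exact balancing rule in the framework may differ.

```python
import numpy as np

def balance_chunk(X, y, buffer_X, buffer_y):
    """Pad minority classes with buffered samples so that every labelled class
    approaches the majority-class count (illustrative balancing rule)."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()                           # majority-class count
    X_parts, y_parts = [X], [y]
    for c, n in zip(classes, counts):
        deficit = target - n
        if deficit > 0:
            pool = np.where(buffer_y == c)[0]       # buffered samples of class c
            take = pool[:deficit]                   # reuse as many as are available
            X_parts.append(buffer_X[take])
            y_parts.append(buffer_y[take])
    return np.vstack(X_parts), np.concatenate(y_parts)
```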

D. Pre-processing and Normalisation
Data pre-processing comprises several steps. It starts by converting the data to numerical values. Every categorical feature is encoded using a binary scheme: multi-category features are split into binary features under the assumption of a non-deterministic relation, yielding one binary feature fewer than the number of categories. In addition, we eliminate statistical redundancy between features by calculating the correlation matrix and removing features with a correlation value above 0.95. This compacts the data and provides a more discriminative representation. Subsequently, Equation 1 is used to normalise the data.
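A sketch of this stage is shown below; min-max scaling is assumed for Equation 1, which is not reproduced in the text, and the helper names are illustrative.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Encode categories, drop highly correlated features, then normalise."""
    # Encode every categorical feature with k-1 binary columns
    df = pd.get_dummies(df, drop_first=True).astype(float)
    # Remove one feature of each pair whose absolute correlation exceeds 0.95
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
    df = df.drop(columns=to_drop)
    # Min-max normalisation to [0, 1] (assumed form of Equation 1)
    rng = (df.max() - df.min()).replace(0, 1.0)     # guard against constant columns
    return (df - df.min()) / rng
```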

E. Gamma Generation
Assume that the dataset is composed of a set of chunks ordered with respect to time, as shown in Equation 2:

$D = \{X_1, X_2, \ldots, X_K\}$,  (2)

where $K$ denotes the number of chunks. We consider that all records within a chunk share the same active features, i.e., the active features do not change between the records of a chunk. In the case of absent or missing features, we avoid using dummy values to indicate an absent feature because they might affect prediction accuracy. Instead, we build an indicator vector for the missing features: the active feature vector $\Gamma$, of size $m$, where $m$ is the number of features in the data. Each component of this vector is binary, 0 or 1, and an active feature is indicated by a value of 1, as depicted in Equation 3:

$\Gamma_i = \begin{cases} 1 & \text{if feature } i \text{ is active} \\ 0 & \text{otherwise,} \end{cases}$  (3)

where $\Gamma(x)$ denotes the set of active features of vector $x$.
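A direct sketch of the Gamma generation step, assuming missing feature values are marked as NaN in the chunk:

```python
import numpy as np

def generate_gamma(X: np.ndarray) -> np.ndarray:
    """Return the active-feature indicator vector of length m:
    1 where the feature is observed in every record of the chunk, 0 where it is missing."""
    return (~np.isnan(X).any(axis=0)).astype(int)
```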

F. Transfer Learning
Transfer learning is used to create new learners for predicting the current chunk labels. Hence, the learner input must match the active features of the current chunk. In addition, transfer learning is responsible for two tasks: (1) restoring old knowledge from memory by inserting the weights connected to the newly active features into the learners (lines 5-8 of the pseudocode); and (2) maintaining the memory by storing the weights connected to the non-active features from the previous learners (line 11 of the pseudocode).
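The sketch below illustrates the two tasks on the input-weight matrix of a single learner; the memory is modelled as a dictionary from feature index to a stored weight row, which is an assumption about the data layout rather than the authors' exact structure.

```python
import numpy as np

def transfer(W_in, memory, gamma_prev, gamma_curr):
    """W_in: input-weight matrix of shape (m, n_hidden) of the previous learner."""
    # (1) Restore: features that become active again get their saved weights back
    for j in np.where((gamma_curr == 1) & (gamma_prev == 0))[0]:
        if j in memory:
            W_in[j, :] = memory.pop(j)
    # (2) Maintain: features that become inactive have their weights stored
    for j in np.where((gamma_curr == 0) & (gamma_prev == 1))[0]:
        memory[j] = W_in[j, :].copy()
    return W_in, memory
```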

G. Ensemble Learning (Aggregation)
Ensemble learning is responsible for aggregating the base learners. It is based on the majority decision when the number of classifiers is odd, with ties resolved by the best-performing classifier when the number of classifiers is even. Algorithm 5 depicts the pseudocode.
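The rule can be sketched as follows for binary labels; ties, which can only occur with an even number of learners, are resolved by the learner with the highest recorded accuracy (variable names are illustrative).

```python
import numpy as np

def aggregate(predictions, accuracies):
    """predictions: array of shape (n_learners, n_samples) with 0/1 votes;
    accuracies: per-learner accuracy used only to break ties."""
    predictions = np.asarray(predictions)
    votes = predictions.mean(axis=0)              # fraction of learners voting 1
    out = (votes > 0.5).astype(int)               # strict majority
    tie = votes == 0.5                            # possible only for an even count
    best = int(np.argmax(accuracies))             # best-performing learner
    out[tie] = predictions[best, tie]
    return out
```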

H. Concept Drift Detection and Lag Update
Concept drift is detected using one of two methods. The first is the accuracy drop (AD) method. It calculates the accuracy on labelled samples and triggers concept drift when a decline occurs over time, as tested mathematically using Equation 5, where $Acc_t$ denotes the accuracy calculated at moment $t$. We designate TLL with AD as TLL-AD.
ADWIN is the second method; it uses a moving window (buffer) of old samples with a fixed maximum length to determine concept drift. The method iteratively drops samples from the tail of the window until a smaller window with no concept drift is obtained. A window without concept drift is one in which no partition into two sub-windows shows a statistically significant difference. For efficient computation, we use ADWIN2, which applies logarithmic partitioning when checking for concept drift inside the window. We designate TLL with ADWIN2 as TLL-ADWIN2. The lag update changes the lag from one time moment to another based on the concept drift decision, using Equation 6, where $L_t$ denotes the lag at moment $t$ and $L_0$ denotes the initial lag.
Lastly, the free memory process removes older weights from memory based on the lag provided by the update.
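The two decision rules can be sketched as below; the accuracy-drop test is an assumed form of Equation 5 consistent with the ∆Acc parameter used in the experiments, and the lag update is only one plausible reading of Equation 6, which is not reproduced in the text.

```python
def accuracy_drop_drift(acc_prev, acc_t, delta_acc=0.5):
    """TLL-AD rule (assumed form of Equation 5): flag drift when the accuracy
    on labelled samples drops by more than delta_acc between chunks."""
    return (acc_prev - acc_t) > delta_acc

def update_lag(lag_t, lag_0, drift):
    """Illustrative lag update: on drift, fall back to the initial lag so that
    outdated knowledge is discarded sooner; otherwise keep the current lag."""
    return lag_0 if drift else lag_t
```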

I. Memory Freeing
Memory freeing cleans the memory of the weights related to outdated classifiers, as identified by the lag. Hence, at every moment $t$, the free-memory process removes all weights related to classifiers trained at moment $t - L_t$ or earlier. This process ensures that older weights are not restored from memory and that expired knowledge is not considered.
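A sketch of the freeing step, assuming each memory entry also records the moment at which its weights were trained:

```python
def free_memory(memory, lag, t):
    """Drop stored weights trained at moment t - lag or earlier.
    memory: dict mapping feature index -> (weights, time_trained)."""
    return {j: (w, ts) for j, (w, ts) in memory.items() if ts > t - lag}
```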

IV. EXPERIMENTAL EVALUATION AND RESULTS
For evaluation, we compared Transfer Learning with Lag (TLL-OSELM) against two benchmarks, namely, Feature Adaptive OSELM (FA-OSELM) and OSELM. Results were generated based on the parameters depicted in Table 2: the sigmoid activation function, 100 hidden neurons, 50 records per chunk, 50 time units as the initial lag, and 0.5 for ∆Acc. One observation is that OSELM outperformed FA-OSELM in terms of accuracy for all datasets except the German dataset. This indicates that the transfer learning in FA-OSELM is not adequate, considering that it transfers knowledge from one neural network to the next without considering a lag to handle the evolving nature of the data. TLL provided better results, which can be interpreted through its integration of three functionalities, i.e., handling data imbalance using the window technique, ensemble learning, and handling concept drift using knowledge transfer based on lag and memory. We elaborate on the predicted time series using Figures 9-12. The plots indicate that accuracy oscillates for all four approaches, which is caused by dynamic changes in the data characteristics that degrade learner performance. We also observe that in some time intervals FA-OSELM outperformed OSELM, while the opposite occurred in others. Nevertheless, the plots indicate that the TLL methods are generally superior to FA-OSELM and OSELM.

OVERALL PRECISION
Precision indicates the percentage of true positives among all positive predictions, as depicted in Figures 13-16. This metric is essential for indicating the learner's ability to avoid bias towards negative samples, which constitute the majority class. We see that TLL accomplished the best precision for the Lending Club dataset, close to 100%. However, the difference between the approaches in terms of precision is more pronounced for the other datasets. For example, Figure 15 indicates that TLL-ADWIN2 obtained a precision of about 86%, compared with slightly lower values for FA-OSELM and OSELM. As with accuracy, we elaborate on performance further by presenting the detailed precision time series in Figures 17-20. As seen, all approaches exhibit volatility because of the dynamic characteristics of real data. However, it is evident from the German dataset that TLL-ADWIN2 maintained better performance levels.
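In standard terms, Precision = TP / (TP + FP), where TP and FP denote the numbers of true positive and false positive predictions, respectively.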

OVERALL RECALL
Recall provides the percentage of correctly predicted positive records out of all actual positive records, as depicted in the corresponding figures.
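In standard terms, Recall = TP / (TP + FN), where FN denotes the number of false negative predictions.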

V. SUMMARY AND CONCLUSION
This article handles the credit scoring problem as a batch learning problem. We considered three specific problems: feature irregularities due to empty features in many records, class imbalance due to non-uniform statistical distributions of records among classes, and concept drift due to varying statistical characteristics for specific classes based on certain features with respect to time. The article proposed transfer learning to handle evolving features and changes concerning active/disabled features across batches. It also incorporated lag to remove outdated knowledge and focus on new knowledge based on adaptive lag and accuracy-change feedback. Furthermore, the framework proposes a chunk balancing mechanism and classifier aggregation for handling class imbalance. Additionally, window-based chunk balancing was incorporated to augment imbalance handling. The evaluation was conducted based on the Lending Club, German, Default, and PPDai datasets. The results show the superiority of the proposed algorithm over the benchmarks in terms of the majority of classification metrics concerning both time series and overall results. The highest improvement percentage was 53% over OSELM and 65% over FA-OSELM. Future work should incorporate feature selection to handle dynamic changes concerning relevant features and high dimensional data. In addition, the developed framework should be evaluated on other machine learning fields that share the same issues concerning the credit scoring problem.