The Naïve Associative Classifier With Epsilon Disambiguation

This paper presents the Naïve Associative Classifier with Epsilon disambiguation (NACε), an extension of the Naïve Associative Classifier that, by including a procedure to disambiguate classes in regions where the Bayes risk is high, has a positive effect on the performance of associative-approach classifiers on several datasets belonging to the financial environment, particularly in terms of credit risk. The experiments conducted to test the NACε were based on 12 datasets composed of financial information and associated with five stages of the credit process: promotion, evaluation, granting, monitoring and recovery. Due to the severe imbalance present in most datasets, the performance of the proposed algorithm was measured using the area under the ROC curve. Likewise, 5 × 2 stratified cross validation was performed, and finally a pair of statistical tests was applied to compare the results. After applying the NACε to the datasets, a successful disambiguation of classes was observed. In the real world, this could help financial institutions evaluate credit applications more effectively and thus contribute to mitigating the monetary losses derived from poor-quality information.


I. INTRODUCTION
Financial risk comprises different types of risk: market risk, credit risk, liquidity risk and operational risk, and it has been studied since banking institutions were created. Starting in the 1950s, the analysis and management of financial risks have become increasingly relevant in the global economy, although, as shown by the financial crises observed since then, these tasks have not fulfilled their mission; as a consequence, the analysis of financial risks is a current topic of research [1]-[23].
Concerns about credit risk have increased, making it a salient issue among financial institutions. These institutions seek to maximize potential benefits efficiently, which turns out to be a complicated task given the conditions of competition [24].
Credit default events arise when a borrower is unable to partially or fully cover its contractual obligations in terms of the total debt or the due date. Credit risk is measured in terms of the potential loss derived from any event of that kind, and is a value expressed in monetary units.
The associate editor coordinating the review of this manuscript and approving it for publication was Zheng Xiao .
Considering that potential loss interpretation may be subjective for anyone external to the evaluated transaction, frequently the risk level is reported in terms of rating or credit scoring [25].
Financial risk forecasting can be interpreted as a pattern classification problem [1]. Considering that the associative approach to intelligent pattern classification has acquired great popularity because of its effectiveness and efficiency, this document is based on one of the most recent associative classifiers: the Naïve Associative Classifier (NAC) [26].
Recently created, the NAC exhibits characteristics that represent advantages over other state-of-the-art classifiers.
It is simple, transparent, transportable and precise, and it is capable of handling numerical, categorical and mixed data, in addition to missing values. It has been applied to financial datasets, including credit allocation, marketing, bankruptcy and banknote authentication [26].
Despite the above, the NAC has a notable disadvantage: under certain conditions, its results exhibit ambiguity in the classes assigned to the objects to be classified. Therefore, to address this disadvantage, in this document we present the Naïve Associative Classifier with Epsilon Disambiguation (NACε), which includes a successful class disambiguation process, thus improving the performance of the NAC. This proposal has important repercussions on the performance of associative-approach classifiers on different datasets belonging to the financial environment.
The paper is structured as follows: Section II reviews some of the most relevant related works; Section III details the proposed model; Section IV explains the experimental design; Section V discusses the obtained results; and the paper ends with conclusions in Section VI.

II. RELATED WORKS
Through IC (Intelligent Computing) algorithms, financial risk patterns can be analyzed in the context of four basic tasks: classification, regression, retrieval and clustering [27]-[32]. This document emphasizes the task of intelligent classification of patterns in financial data.
IC algorithms, mainly intelligent pattern classifiers [33], are theoretically supported by several conceptual bases. Bayesian classifiers build mainly on the Bayes theorem, such as the proposals of Berrar [34] and Ala'raj and Abbod [35]. Classifiers based on the k nearest neighbors (k-NN) include the emblematic 1-NN algorithm, proposed by Cover and Hart [31], which was used in the experimental part of this document. Decision trees, such as C4.5, are widely used for their simplicity and effectiveness. Additionally, there are models that imitate nature, implemented to classify patterns or to optimize pattern-classifying algorithms through metaheuristics, among which neural classifiers stand out. There are also algorithms based on logical functions or on the optimization of analytic functions, such as the support vector machines of Chen et al. [36]. In 1968, one of the first attempts to apply IC to financial issues was the development of the Z-Score, proposed by Altman [37].
A recurring topic in financial risk management is bankruptcy, because investors (financial institutions, firms or even common people) need to minimize the possibility of not receiving the expected dividends or even not recovering their capital. The estimation of a firm's bankruptcy may be addressed as a pattern classification problem, as described by Chen et al. [1], who propose a bankruptcy prediction model using fuzzy k-NN.
Credit risk has been the subject of several scientific publications. A computational model to improve corporate credit rating systems, oriented to help financial experts in decision making, is presented by Petropoulos et al. [2], where two important aspects of financial data are considered: their heavy-tailed distributions and their time series nature, for which a system based on Markov models is proposed; a new scheme for modeling financial time series is also proposed.
Another approach, based on social media, and proposed by Yang and Zhou, is shown in [3], which analyzes whether the different opinions shared by users about different financial topics are relevant or not for the prediction of credit risk. Opinions were analyzed from two social networks for financial investors in China, in addition to various articles published by financial analysts. The authors showed that opinions published online can strongly predict the credit risk of companies, while the analysis of the opinions of financial analysts obtained lower performance.
Regarding the classification of clients, a model to help predict desirable clients (those who will fulfill their payment obligations, labeled as good) and undesirable clients (those who are likely not to fulfill their payment obligations because they are bankrupt, labeled as bad) is proposed by Gahlaut et al. [5]. Data mining techniques are used to select the best attributes that can support the credit rating, in order to identify good clients.
Credit granting is the core business of retail banking and other financial institutions, and therefore they should take advantage of any resource that gives them competitive advantages and focus on creating long-term relationships with their customers instead of just promoting products individually. Ladyzynski et al. [38] proposed a time-series-based approach to predict the willingness of individuals to contract personal credit. For this purpose, historical banking transaction information of clients is used to detect significant patterns.
Referring to marketing, data mining techniques are proposed by Moro et al. [6] to predict the outcome of a telephone call in which a banking service is offered. They used a dataset that contains information from a telemarketing campaign carried out by a Portuguese bank from 2008 to 2013. Another similar proposal is presented in [7] by the same authors, but applying artificial neural networks.
A conceptual foundation of IC that is relevant for this document is the one on which associative models are based. The first associative model recorded in the literature is the Lernmatrix, created in 1961 by Karl Steinbuch; since then, several associative models with applications in various fields have been developed, such as the one by Yáñez-Márquez et al. [39]. In the Computational Research Center of the National Polytechnic Institute (México), the Alpha-Beta branch of associative models was created in 2002 [39]; one of the most recent models, the NAC classifier by Villuendas-Rey et al. [26], is the foundation for the proposed model.
Despite the numerous techniques for credit risk analysis, there is still a need for transportable and transparent models able to deal with both mixed and incomplete datasets [26]. In addition, the NAC model (which fulfills the above requirements) has difficulty classifying instances whose overall class similarities are too close. To overcome these limitations, we introduce the Naïve Associative Classifier with Epsilon disambiguation (NACε).

III. PROPOSED MODEL
The Naïve Associative Classifier with Epsilon disambiguation (NACε) is based on the Naïve Associative Classifier (NAC) [26]. This proposal aims to solve NAC's shortcomings by including a class disambiguation process in zones where the Bayes risk is high.

A. TRAINING PHASE
The proposed model handles mixed and incomplete data, and during its training phase it is assumed that:
• There is a training set T, whose elements are instances described by a set of attributes A = {A_1, ..., A_m}.
• Each instance in T belongs to one and only one class in the set of classes K = {K_1, ..., K_p}.
• The set A is associated with a set of weights w = {w_1, ..., w_m}.
• No attribute has only missing values, and no instance has all of its attribute values missing.

The training phase of the NACε comprises two steps:
1) Storage of the training set T.
2) Computation and storage of the standard deviation σ_ij for each numeric attribute A_i in each class K_j (Fig. 1).
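As an illustration, the two training steps can be sketched in Python. This is a hedged sketch, not the authors' implementation: the dictionary-based instance representation, the use of `None` for missing values and the choice of the population standard deviation are all assumptions.

```python
import math

def nac_eps_train(T, numeric_attrs, classes):
    # Step 1: store the training set T (kept as-is).
    # Step 2: compute sigma_ij for each numeric attribute A_i in each class K_j,
    # skipping instances with a missing value (None) in that attribute,
    # as the classification phase description specifies.
    sigma = {}
    for j in classes:
        Tj = [y for y in T if y["class"] == j]
        for i in numeric_attrs:
            vals = [y[i] for y in Tj if y[i] is not None]
            mu = sum(vals) / len(vals)
            # Population standard deviation (an assumption of this sketch).
            sigma[(i, j)] = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals))
    return T, sigma
```

The stored set and the σ dictionary are all the model state the classification phase needs.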

B. CLASSIFICATION PHASE
Let o be an instance with unknown class. The classification of such instance (Fig. 2) takes two steps.
Step 1 - Obtaining the average similarities. First, the average similarity of o against each available class is computed. Let T_j be the set of instances belonging to class K_j. The average similarity between an instance o and class K_j is given by:

$s_j(o) = \frac{1}{|T_j|} \sum_{y \in T_j} s_t(o, y)$

where s_t is a similarity function between two patterns o and y.
Let w_i be the weight of attribute A_i and let m be the total number of attributes. The total similarity function is given by:

$s_t(o, y) = \sum_{i=1}^{m} w_i \, s(o_i, y_i)$

MMIDSO (Modified Mixed and Incomplete Data Similarity Operator), an extension of NAC's operator (MIDSO), separates categorical and numerical attributes:

$s(o_i, y_i) = \begin{cases} 1 & \text{if } A_i \text{ is categorical and } o_i = y_i \\ 1 & \text{if } A_i \text{ is numerical and } |o_i - y_i| \le \sigma_{ij} \\ 0 & \text{otherwise} \end{cases}$

It should be noted that the standard deviation σ_ij of a numerical attribute A_i is computed without considering instances containing a missing value in attribute i.
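The per-attribute operator and the weighted total similarity can be sketched in Python as follows. This is a hedged sketch under stated assumptions: instances are dictionaries, missing values are `None` and contribute zero similarity, and the weighted sum is normalized by the total weight; none of these representational choices come from the source.

```python
def mmidso(oi, yi, sigma_ij, is_numeric):
    # Per-attribute similarity: categorical values must match exactly;
    # numerical values must lie within one class-wise standard deviation.
    # A missing value (None) on either side yields 0 (an assumption).
    if oi is None or yi is None:
        return 0.0
    if is_numeric:
        return 1.0 if abs(oi - yi) <= sigma_ij else 0.0
    return 1.0 if oi == yi else 0.0

def total_similarity(o, y, w, sigma, j, numeric_attrs):
    # Weighted sum of per-attribute similarities, normalized by the total weight.
    total = sum(
        w[i] * mmidso(o.get(i), y.get(i), sigma.get((i, j), 0.0), i in numeric_attrs)
        for i in w
    )
    return total / sum(w.values())
```

With equal weights this reduces to the fraction of attributes on which the two instances agree.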
Step 2 - Disambiguation. Once s_j(o) has been obtained for each j, the second step decides which class best suits the test instance o. To do so, the average similarities are sorted in descending order and the two largest are selected.

Let st_1 and st_2 be the first and second sorted similarities, and let ε ∈ [0, 1] be a user-defined disambiguation value. To determine whether there is a significant difference between st_1 and st_2, the ratio between them is considered: if that ratio is higher than ε, no ambiguity in the class assignment is considered; otherwise, ambiguity is considered. Formally, the disambiguation function, denoted by des and with parameters st_1, st_2 and ε, is defined as follows:

$des(st_1, st_2, \varepsilon) = \begin{cases} 0 & \text{if } st_2 / st_1 > \varepsilon \\ 1 & \text{otherwise} \end{cases}$

where des = 0 indicates no ambiguity. If there is no ambiguity, the class of greatest similarity is returned. Otherwise, the ambiguity is eliminated through one of the two variants proposed here.

Variant 1. Based on the total similarity, determine which instance of the training set is the most similar to o and return its class.

Variant 2. Apply a user-defined dissimilarity function, determine which object in the training set is closest to o, and return its class.
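Putting Step 1, Step 2 and Variant 1 together, a minimal classification sketch might look as follows. It is a sketch under stated assumptions, not the authors' pseudocode: the direction of the ratio test follows the literal description above, and the average and total similarity functions are passed in as callbacks.

```python
def des(st1, st2, eps):
    # Returns True when there is NO ambiguity. The direction of the ratio
    # test mirrors the textual description and is an assumption of this sketch.
    return st1 > 0 and st2 / st1 > eps

def nac_eps_classify(o, T, classes, eps, avg_sim, total_sim):
    # Step 1: average similarity of o against each class, sorted descending.
    sims = sorted(((avg_sim(o, k), k) for k in classes), reverse=True)
    (st1, best), (st2, _) = sims[0], sims[1]
    # Step 2: if the two best classes are distinguishable, return the best one.
    if des(st1, st2, eps):
        return best
    # Variant 1: fall back to the class of the single most similar
    # training instance under the total similarity.
    return max(T, key=lambda y: total_sim(o, y))["class"]
```

Variant 2 would differ only in the fallback line, taking a `min` over a user-supplied dissimilarity instead.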

C. COMPLETE PSEUDOCODE OF THE ALGORITHM
To better understand the proposed algorithm and its variants, we present their pseudocode (Fig. 3 and Fig. 4).

D. COMPLEXITY OF THE ALGORITHM
The efficiency of NACε can be measured in terms of space complexity and time complexity. Space complexity is determined by the storage of the training set as well as the storage of the standard deviation of each class in each numeric attribute. Hence, the space complexity is bounded by O(n · m + p · m), where n is the number of instances, m the number of attributes and p the number of classes. In turn, the classification phase comprises two steps: obtaining the average similarities and disambiguation; the complexity of classification is measured from these steps. The step that obtains the average similarities compares the test instance against the instances of the training set, and its complexity is given by O(n · s), where s is the complexity of computing the similarity between attributes. However, the computational complexity of that similarity is constant, O(1), because it is a simple operation, and therefore the complexity of obtaining the average similarities is bounded by O(n).
The second step, disambiguation, includes finding the two highest similarities, bounded by O(n); the computation of the disambiguation function, bounded by O(1); and, eventually, the application of a disambiguation variant, also bounded by O(n). The total time complexity of the classification phase is therefore bounded by O(n).

IV. EXPERIMENTAL DESIGN
Throughout this section, the validation process is described and the obtained results are discussed.

A. DATASETS
Several datasets containing financial data were used; almost all of them are available in the machine learning repository of the University of California at Irvine. The information contained in the datasets is related to three phases of the credit process: granting, promotion (specifically fund raising) and recovery.
Accordingly, Australian-credit, Japanese-credit, Default-of-credit-card, German-credit and Iranian-credit [40] include information that is useful when authorizing or rejecting the granting of credit. The first two are variations of Credit-approval, a dataset consisting of information collected to evaluate credit card applications.
The Bank datasets, donated by Moro [6], contain information related to a telephone campaign of a Portuguese bank seeking to raise financial resources via long-term deposits; two classes describe whether the offered product was contracted or not.
Although the recovery stage is strongly related to payment collection, it also includes prospecting of clients by analyzing the financial health of every credit applicant or current borrower. Given the importance of the foregoing, loss prediction has been studied from various approaches, such as financial ratio analysis; this approach is exemplified by Polish-companies-bankruptcy-data. Another default forecasting approach takes a qualitative point of view and is based on the comparison of categorical parameters assigned by ''experts'' to hard-to-quantify variables. The Qualitative-bankruptcy dataset, containing 250 instances described by six attributes, is an example of it.
Finally, we explored the Banknote-authentication dataset. This is not a financial risk problem; however, the characteristics of this dataset are suitable for testing the model proposed in this document. Table 1 describes the characteristics of each dataset and highlights the imbalance ratio of most of them; an imbalance ratio greater than 9 means that most algorithms will not perform well when classifying patterns belonging to the minority class.
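For reference, the imbalance ratio mentioned above is commonly computed as the size of the majority class divided by the size of the minority class; a minimal helper, assuming a flat list of class labels:

```python
from collections import Counter

def imbalance_ratio(labels):
    # IR = majority-class count / minority-class count.
    # Values above 9 flag the severe imbalance discussed in the text.
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())
```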

B. PERFORMANCE MEASURES
In order to deal with the problem of imbalanced classes, frequently observed in datasets related to the financial environment, we use the area under the ROC curve (AUC) to evaluate the performance of NACε, since AUC is independent of the class distribution of the dataset [41], [42]. AUC is directly based on the confusion matrix, which is described in Figure 5. In addition, AUC is based on averaging sensitivity (TPR) and specificity (TNR).
The confusion matrix describes the possible cases when classifying a two-class problem. True Positives (TP) are instances labeled by the classifier as positive that are indeed positive. False Positives (FP) occur when the classifier labels an instance as positive, but it is, in fact, negative.
A False Negative (FN) is a positive instance classified as negative, and True Negatives (TN) occur when an instance is classified as negative and it is indeed negative. It has been proved [42] that AUC can be calculated as the average of the True Positive Rate (TPR) and the True Negative Rate (TNR). Hence, the AUC is given by:

$AUC = \frac{TPR + TNR}{2} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$

Regarding data partitioning, 5 × 2 stratified cross validation (5 × 2 scv) was chosen because it has been recommended for dealing with imbalanced data [43], [44]. When using 5 × 2 scv, the dataset is stratified and divided into two subsets of the same size: one for training and the other for testing. This process is repeated five times, yielding ten partitions in total. Finally, the performance of each classifier is reported as the average over the five repetitions.
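The AUC computation described above reduces to a few lines; a minimal sketch based on the four confusion-matrix counts:

```python
def auc_balanced(tp, fn, tn, fp):
    # AUC as the average of sensitivity (TPR) and specificity (TNR),
    # following the balanced-accuracy form used in the text.
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2
```

Because TPR and TNR are each normalized within their own class, this measure is unaffected by the class distribution, which is what makes it suitable for the imbalanced datasets used here.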

C. STATISTICAL TESTS
To compare the results of classification algorithms over several data sets, the nonparametric Friedman test for more than two related samples is considered appropriate [45]-[48]. If the number of samples used in the experiments is small, statistical tests lack the power to determine the existence of statistically significant differences between the performance of the algorithms. The number of samples required to obtain satisfactory results is given by the expression samples ≥ 2 × (algorithms to compare) [48].
In this research only 12 data sets were used, which means that at most six algorithms can be compared through the Friedman test without losing statistical power. Applying the Friedman test implies creating a block for each analyzed subject, in such a way that each block contains one observation from each of the different contrasts or treatments; in matrix terms, blocks correspond to rows and treatments to columns. The null hypothesis establishes that the performances obtained by the different treatments are equivalent, while the alternative hypothesis proposes that there is a difference between these performances, which would imply differences in central tendency.
If the Friedman test determines the existence of significant differences between the performance of the algorithms, a post hoc test is highly recommended to find between which of the compared algorithms those differences exist. Among the several post hoc tests suggested [48], we chose the Holm test [49]. The Holm method is designed to reduce Type I errors (a Type I error occurs when the null hypothesis is rejected although it is true) when analyzing phenomena that involve several hypotheses, and consists of adjusting the rejection criterion for each one of them.
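For illustration, both procedures can be sketched with the standard library only. This is a hedged sketch: tie handling in the ranking is omitted, and the toy layout (rows as datasets, columns as algorithms) is an assumption rather than the authors' experimental harness.

```python
def friedman_statistic(scores):
    # Rows = datasets (blocks), columns = algorithms (treatments).
    # Higher score is better; best in each row gets rank 1.
    # Ties are ignored for simplicity (an assumption of this sketch).
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    # Classic Friedman chi-square formula over the rank sums R_j.
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

def holm_adjust(pvals):
    # Holm step-down adjustment: sort p-values ascending, scale the i-th
    # smallest by (m - i), and enforce monotonicity of the adjusted values.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, running = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running = max(running, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running
    return adjusted
```

An adjusted p-value below the significance level then rejects the corresponding pairwise null hypothesis, as done in the comparisons reported below.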

V. RESULTS AND DISCUSSION
The results obtained by applying the proposed model to the different datasets are shown below. First, the performance of the algorithm is illustrated for different values of ε for each of the two variants of the model; in the second variant, the HEOM dissimilarity function, proposed by Wilson and Martinez [50], was used. Subsequently, the obtained results are compared with the NAC [26] and Nearest Neighbor (NN) [31] classifiers, because the proposed model constitutes a generalization of them. Finally, the results are compared with other state-of-the-art classifiers. In each case, the corresponding statistical tests were applied.

A. ANALYSIS OF THE IMPACT OF THE ε VALUE
We tested the performance of the first variant of the proposed model for several values of the disambiguation parameter, with AUC as the performance measure. After applying the Friedman test to the results obtained on the compared data sets with different ε values, the resulting p-value was p = 0.998057, which means that the null hypothesis is not rejected; therefore, in terms of the performance of the first variant, there are no significant differences when choosing the ε parameter.
The performance of the second variant of the proposed model was also tested for different values of ε, again with AUC as the performance measure. The Friedman test was then applied to establish the existence, if any, of significant differences.
After applying the Friedman test to the results obtained with different ε values on the compared data sets, the resulting p-value was p = 0.05484, which is very close to the defined significance level (α = 0.05); because of that, even though the null hypothesis is not rejected, the results cannot be considered conclusive. This is why we applied the Holm test, comparing the best-ranked algorithm (ε = 0.2) with the rest. Table 2 shows these results. The Holm test rejects hypotheses whose unadjusted probability value is ≤ 0.007143.
Thus, the mentioned test rejects the null hypothesis of mean equality (H_0) when comparing the second variant of the proposed model with ε = 0.2 against the algorithms with ε = 0.8 and ε = 0.9, respectively. The conducted experiments suggest that it is preferable to use ε values below 0.8 in the second variant; however, it may be convenient to perform more experiments to establish whether there are significant differences between the first and last values in the obtained ranking.

B. COMPARISON OF THE PROPOSED MODEL WITH OTHER CLASSIFIERS
This research also analyzes the performance of the proposed model with respect to the NAC and Nearest Neighbor (NN) models. We additionally compared the proposed model with other well-known classifiers, such as Naïve Bayes (NB), the Multilayer Perceptron (MLP) and Support Vector Machines (SVM). In addition, we included a Deep Neural Network (DL) architecture, following a reviewer's suggestion. These results are shown in Table 3. In each case, the best disambiguation value for each data set was chosen for the proposed model. The state-of-the-art classifiers were executed using the Weka software [51], with default parameters for all of them.
Once again, the Friedman test was applied to find whether there were significant differences in the performance of the compared algorithms; these results are shown in Table 4, which also exhibits the average rank obtained. The resulting p-value was p = 0.011548, which is lower than the defined significance level (α = 0.05), so the null hypothesis of mean equality is rejected.
Thus, the Holm test was applied to compare the best-ranked algorithm (NACε) against the rest. It can be observed in Table 5 that hypotheses with an unadjusted probability value ≤ 0.0125 are rejected. Therefore, the null hypotheses of mean equality when comparing the proposed model against the Nearest Neighbor, Support Vector Machines and Naïve Bayes classifiers are rejected.
According to the AUC, the conducted experiments show that the proposed model performs better than the NAC: it outperformed the NAC classifier in four datasets and obtained similar results in the remaining eight. However, the Holm test did not find significant differences between them.
The model does not show significant differences in AUC performance with respect to the Multilayer Perceptron and Deep Learning. However, the proposed model has a lower computational cost than MLP and DL, and maintains the desired NAC characteristics of transparency and transportability, which are extremely valuable in financial environments.

C. CONTRIBUTIONS TO THE CREDIT PROCESS
Throughout the second decade of the 21st century, the process of analyzing credit applications has evolved. The traditional way of authorizing or rejecting the granting of a loan has been transformed by the irruption of technology and by the plethora of available data: the expert judgment that credit analysts acquired by repeating the same task hundreds or perhaps thousands of times has fallen into obsolescence, making way for automated credit scoring systems. Ideally, based on the information of each applicant and several parameters, an automated credit scoring system can decide whether granting a loan implies a risk level greater than the one the lender is willing to take and, consequently, suggest a rejection.
A system with these characteristics implicitly oversees compliance with the regulatory policies for financial risk management, but it also optimizes the evaluation stage by allowing the simultaneous management of several applications and increasing business opportunities. However, there is still much to do.
Frequently, the information used to train scoring models lacks precision and certainty, since it belongs to datasets whose content is considered sensitive because they describe the equity of firms (or the wealth of persons) and their consumption patterns. Moreover, the collection of the information that applicants provide when asking for a loan remains a task exposed to human error, which can arise for many reasons: the tiredness of the persons in charge of data capture, or the inadequate design of a questionnaire containing questions whose subjective responses can be misunderstood or misinterpreted by an analyst. These imperfections can be mitigated through the application of the NACε, because it has been designed to handle incomplete information and can therefore reduce the cases in which analysts misinterpret data or make wrong decisions.

VI. CONCLUSION
Despite its simplicity, the proposed model shows the ability to solve financial data classification problems and obtains competitive results. Its computational cost is low, and the interpretability and transportability requirements needed in financial scenarios are fully covered, since the model can directly handle mixed or incomplete data.
The proposed model successfully disambiguates classes, which eases the handling of data sets with a high degree of imbalance, missing values or heterogeneity in the features that describe them. In addition, the statistical tests carried out support the claim that the proposed model is more accurate than the Nearest Neighbor, Support Vector Machines and Naïve Bayes classifiers for financial data.
Although applying the NACε to classification tasks in disciplines unrelated to financial risk management is a natural next step, the extent of the latter offers, on its own, many research alternatives.
Financial risk is inherent to economic activity. Unlike other disciplines, economic science is directly influenced by human conduct. This fact adds a chaotic factor to the studied phenomena and complicates classification and regression tasks. The random behavior of economic and financial variables is highly unpredictable, which makes it imperative to always consider the factors that modify trends and historical behavior. Incorporating those factors and their dynamics into intelligent computing algorithms poses great challenges and new phenomena to study.
As future work, we want to explore the impact of feature weighting in the proposed algorithm.