Addressing the Overlapping Data Problem in Classification Using the One-vs-One Decomposition Strategy

Learning well-performing classifiers from data with easily separable classes is usually not a difficult task for most algorithms. However, problems affecting classifier performance may arise when samples from different classes share similar characteristics or overlap, since the boundaries between classes may not be clearly defined. To address this problem, most existing works in the literature propose either adapting well-known algorithms to reduce the negative impact of overlapping or modifying the original data by introducing/removing features that shrink the overlapping region. However, these approaches may present some drawbacks: the changes made to a specific algorithm may not carry over to other methods, and modifying the original data can produce variable results depending on the data characteristics and the technique applied afterwards. An unexplored and interesting research line for dealing with the overlapping phenomenon consists of decomposing the problem into several binary subproblems to reduce its complexity, diminishing the negative effects of overlapping. Based on this novel idea in the field of overlapping data, this paper proposes using the One-vs-One (OVO) strategy to alleviate the presence of overlapping, without modifying existing algorithms or data conformations as suggested by previous works. To test the suitability of the OVO approach with overlapping data, and given the lack of proposals in the specialized literature, this research also introduces a novel scheme to artificially induce overlapping in real-world datasets, which enables us to simulate different types and levels of overlapping among the classes. The results obtained show that the methods using OVO achieve better performance on data with overlapped classes than those dealing with all classes at the same time.


I. INTRODUCTION
In a classification problem, a series of input attributes must be linked to a discrete output class [18], [44]. This relationship is established by learning classifiers, which are models built from a set of labeled samples of the problem. Obtaining well-performing classifiers is usually not a problem when classes are easily separable. However, in real-world data, samples from different classes may share similar attribute values [33]. In these cases, the boundaries of the classes may not be clearly defined and may be too complex to be correctly learned. This problem is commonly referred to as overlapping data [16], [40]. Overlapping samples cause uncertainty when determining the decision boundaries and thus negatively affect classification performance [16].
The existing proposals in the specialized literature to overcome this problem are based on two main strategies: 1) Adaptation of classification algorithms. Some works adapt well-known methods to mitigate the effects produced by overlapping in classification. For example, Fu et al. [11] and Czarnecki and Tabor [7] propose adapting Support Vector Machines (SVM) [3] to deal with overlapping data, whereas Xiong et al. [40] focus on modifications of the Naïve Bayes [21] algorithm. 2) Data preprocessing. These works alter the original data aiming to reduce the impact of overlapping on classifier performance [23]. The original data is modified either by separating the overlapping classes through the introduction of complementary features or by merging overlapping classes to form meta-classes. Even though both approaches can improve classifier performance in specific scenarios, they present some drawbacks. The former is based on modifying an existing method, which may sometimes be hard to perform. Moreover, since the improvement comes from the adaptation of the method, it is not directly applicable to other algorithms [11]. On the other hand, the latter involves the usage of preprocessing techniques, which are time-consuming and usually designed to deal with data having particular characteristics [23]. Hence, to avoid these shortcomings, other approaches to reduce the impact of overlapping need to be studied, which involve neither algorithm modifications nor assumptions about data characteristics.
When working with multi-class problems, the usage of binary decomposition strategies [24] has not yet been considered as an alternative for dealing with overlapping data. However, it may be an interesting alternative to the aforementioned approaches. Decomposition strategies divide the original problem into several two-class subproblems as a way to reduce the original complexity [17], [46]. Among these strategies, the One-vs-One (OVO) decomposition [45], [46], which divides the original problem into as many subproblems as possible pairs of classes, is one of the most widely used schemes in the literature [13], [28]. This research analyzes the suitability of OVO for dealing with overlapping data. Since only two classes are considered in each subproblem, OVO is able to increase the separability between them, reducing the impact of overlapping and thus improving the final classification performance.
However, in order to properly evaluate the benefit of using OVO to deal with overlapping, we face a bigger problem: the lack of evaluation frameworks for the overlapping data problem. For this reason, in this paper we also provide a new and systematic way to introduce overlapping into real-world datasets so that methods dealing with overlapping can be properly evaluated. This new framework is introduced in Section IV and allows us to fairly evaluate the difference between using OVO and not using it. With this framework, one can control the amount of overlapping in the data and exactly determine which samples belong to the borderline, overlapping and non-overlapping regions. This rigorous identification of the different types of samples in the dataset implies a completely novel way of understanding and evaluating the overlapping data problem in the literature. This way, we are able to perform a thorough analysis and extract conclusions on classifier performance in each region (see Sections VI-VII).
The suitability of the OVO decomposition with overlapping data is analyzed in an extensive empirical study considering well-known learning algorithms, such as C4.5 [27], Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [6], k-Nearest Neighbors (k-NN) [5] and SVM [3]. We analyze the differences between applying standard and OVO-based classifiers over a total of 1394 datasets with different degrees of overlapping. The different regions are considered to analyze the effect of overlapping on the classifiers' performance. The robustness of the methods in terms of the Equalized Loss of Accuracy (ELA) metric [30] is also studied. In total, more than 2,091,000 results are analyzed and serve as a solid basis to establish a comparison between the OVO and non-OVO versions of the classifiers. The main lessons learned in this research, including interesting findings related to the experimentation performed and its analysis, are summarized in Section VIII. A web-page with the datasets and the results obtained for each classification algorithm is available at https://joseasaezm.github.io/overlapping/.
The rest of this work is organized as follows. Section II introduces decomposition strategies and the OVO model as a possible solution for overlapping data. Section III presents related works on overlapping data in classification. Section IV describes the proposed scheme for introducing overlapping in real-world data. Then, Section V presents the experimental framework. Section VI analyzes the results obtained when overlapping data affects training and test sets, whereas Section VII focuses on results with overlapping only in training sets. Section VIII summarizes the main findings of our empirical study. Finally, Section IX presents the concluding remarks.

II. BINARY DECOMPOSITION STRATEGIES FOR DATASETS WITH MULTIPLE CLASSES
Multi-class data [1], [42] are frequent in real-world tasks, being a generalization of data with only two classes (binary problems). Multi-class classification data have been traditionally addressed following two different approaches [24]: 1) Algorithm level approaches. They adapt methods that learn from binary data to deal with more classes [12]. 2) Data decomposition approaches. They decompose multi-class problems into binary subproblems, reducing the complexity of the original problem [13].
Modifying existing methods to deal with multi-class data may be a complex task in some cases, e.g., when working with SVM [3]. Data decomposition can be used in such scenarios, since any binary classification algorithm can be employed as a base learner without adapting its learning procedure. In this section, we first introduce decomposition strategies and their advantages [13] (Section II-A) and then focus on the OVO decomposition (Section II-B).

A. DECOMPOSITION OF MULTI-CLASS PROBLEMS
Using binary decomposition in multi-class problems usually carries certain benefits [13], [28]. First, it enables algorithms designed to deal with binary data to address multi-class problems [24]. Another advantage, which this research exploits when dealing with overlapping data, is that the separation of the different classes becomes less complex using decomposition [12]. Thus, decomposition allows classes in certain classification problems to be more easily separable when considered in pairs [2], [17], [45].
On the other hand, decomposition strategies lead to the formation of ensembles of classifiers, which are considered as one of the most powerful techniques in contemporary machine learning [38].
Binary decomposition is based on two main phases [13]:
1) Problem division [24]. The data are split into binary subproblems that are then treated by binary classifiers [12]. Two main decomposition strategies exist [24]:
• One-vs-One (OVO) [45], [46] splits a problem with C classes into C(C − 1)/2 subproblems, training a different classifier for each pair of classes.
• One-vs-All (OVA) [17], [31] splits a problem with C classes into C subproblems, training a different classifier to distinguish each class from the others.
2) Output combination [13]. To classify new samples, they are presented to all the classifiers and their outputs are combined to obtain the final result. Among the combination methods found in the literature, Weighted Voting [20], probability estimates [39] and majority voting [13] should be highlighted.
This research focuses on OVO due to its proven advantages with respect to OVA [28], such as the creation of simpler borders between classes, the increase in classification performance and the shorter training times when working with smaller subproblems. Finally, OVA may also create imbalanced datasets, which is known to be a major problem in machine learning [13].
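As a quick illustration of the division phase, the following sketch enumerates the binary subproblems that OVO and OVA create for a given set of classes. The helper names are illustrative, not from any library or from this paper.

```python
# Sketch: enumerating the binary subproblems produced by OVO and OVA
# for a problem with C classes (illustrative helper names).
from itertools import combinations

def ovo_subproblems(classes):
    """One classifier per unordered pair of classes: C(C - 1)/2 subproblems."""
    return list(combinations(classes, 2))

def ova_subproblems(classes):
    """One classifier per class (that class vs. the rest): C subproblems."""
    return [(c, "rest") for c in classes]

classes = ["a", "b", "c", "d"]           # C = 4
print(ovo_subproblems(classes))          # 4*3/2 = 6 pairs
print(len(ova_subproblems(classes)))     # 4 binary problems
```

Note that OVO trains more (but smaller) classifiers than OVA, each one seeing only the samples of its two classes.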

B. ONE-VS-ONE BINARY DECOMPOSITION
OVO splits a dataset with C classes into C(C − 1)/2 binary problems. Each binary problem consists of those training samples involving the pair of classes (c_i, c_j) with i < j. Then, a classifier is built for each of these binary problems.
New samples are classified by being submitted to all the classifiers. The classifier distinguishing between c_i and c_j computes a confidence r_ij ∈ [0, 1] in favor of c_i (r_ji is computed as 1 − r_ij). These confidences are stored in a score matrix:

R = ( −     r_12   · · ·   r_1C
      r_21   −     · · ·   r_2C
      ⋮             ⋱      ⋮
      r_C1   r_C2  · · ·    −  )

Finally, combination methods [13] are employed to compute the class label of new samples from the score matrix. Among them, majority voting is used in this work. It is one of the most used and simplest approaches, based on predicting the class with the largest number of votes from the classifiers. This approach has been shown to provide behavior similar to more complex strategies [13]. Using the majority voting scheme, the final class label can be computed as

class = arg max_{i=1,...,C} Σ_{1≤j≠i≤C} s_ij,

where s_ij = 1 if r_ij > r_ji and s_ij = 0 otherwise.
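The majority voting aggregation described above can be sketched as follows. `majority_vote` is an illustrative helper; ties are broken in favor of the lowest class index, which is an assumption not specified in the text.

```python
# Minimal sketch of OVO aggregation by majority voting.
# r[i][j] holds the confidence r_ij in favor of class i from the (i, j)
# classifier; r_ji = 1 - r_ij. Diagonal entries are unused.

def majority_vote(r):
    """Return the index of the class with the most pairwise wins (s_ij = 1)."""
    C = len(r)
    votes = [0] * C
    for i in range(C):
        for j in range(C):
            if i != j and r[i][j] > r[j][i]:   # s_ij = 1
                votes[i] += 1
    # max() returns the first maximum, i.e., ties go to the lowest index
    return max(range(C), key=lambda i: votes[i])

# Score matrix for C = 3: class 0 beats both class 1 and class 2.
r = [[0.0, 0.8, 0.6],
     [0.2, 0.0, 0.7],
     [0.4, 0.3, 0.0]]
print(majority_vote(r))  # -> 0
```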

III. CLASS OVERLAPPING AFFECTING CLASSIFICATION PROBLEMS: RELATED WORKS
Real-world data usually involve overlapping among samples of different classes [4], [40]. This implies that some samples of a class c_i have characteristics similar to those of a different class c_j. The area of the domain in which these specific samples are found is called the overlapping region [22]. All the samples belonging to this region are characterized by having non-zero probability densities for each class. Some works have shown that many classification errors occur in the boundaries of the classes, which may be altered by the presence of overlapping samples [33], [40]. This may increase the chances of incorrect predictions [16]. Given the loss of accuracy to which these types of samples can lead, methods that can alleviate class overlapping are of special interest [11], [40].
As it was mentioned, some proposals adapt classification algorithms [7], [11], [40] or modify the original data including additional features [23] to mitigate the impact of overlapping. Other works propose the usage of soft decision strategies assigning multiple class labels to the samples of the overlapping region, which can be then analyzed [32], [34].
A large part of the literature studying the overlapping problem also focuses on imbalanced data [19], [35]. Although learning difficulties in class imbalance have traditionally been related to bias towards the majority class, some works show that they are more linked to other factors related to data characteristics, such as overlapping [4], [33]. For example, Prati et al. [25] developed a study using a set of artificial datasets showing that the degree of class overlapping has a strong correlation with class imbalance. In this scenario, the use of over-sampling methods based on SMOTE [10], [41] has been shown to be very effective [4]. The large influence of overlapping on classification performance with respect to the imbalance ratio was also corroborated in the particular case in which the minority class is more represented in the overlapping region than the majority class [14], [15]. Other proposals to deal with overlapping in imbalanced data include removing the samples belonging to the overlapping region [43] or adapting classification methods [26].
An important aspect of the aforementioned works is the way in which they estimate or control the level of overlapping in real-world datasets. Most of the studies do not take this issue into account, which limits their insight into the nature of the problem. Controlling the level of overlapping in datasets enables the possibility of thoroughly analyzing the properties and robustness of the examined methods. Because of this, some works try to quantify the overlapping level of each real-world dataset considered. For example, some of them compute basic statistics for each attribute [11]. Other works consider more complex metrics such as the Fisher's discriminant ratio or the Kullback-Leibler divergence between classes [36].
Many works complement their experiments by creating synthetic datasets [4], [15], [36]. These types of data have the advantage of making the level of overlapping easier to control. However, most of the datasets generated in these works have two dimensions and two classes. The basic idea is to create two clusters of samples, one per class, which are initially separated when no overlapping is considered. Then, an increase in the overlapping level implies that the distance between the cluster centroids is reduced, making them overlap more and more. Clusters with rectangular [14], [15] or circle-like shapes [4], [25] are the most common options in the literature.
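The two-cluster setup described above can be sketched as follows, assuming Gaussian clusters; all function and parameter names are illustrative and not taken from the cited works. The centroid distance shrinks linearly as the desired overlapping level grows.

```python
# Sketch of the synthetic-data setup common in the cited literature:
# two 2-D Gaussian clusters whose centroids move closer as the
# desired overlapping level increases (names/parameters are illustrative).
import random

def two_cluster_dataset(n_per_class, overlap, spread=1.0, base_dist=10.0, seed=0):
    """overlap in [0, 1]: 0 keeps centroids base_dist apart, 1 merges them."""
    rng = random.Random(seed)
    dist = base_dist * (1.0 - overlap)   # reduce the centroid distance
    data = []
    for label, cx in ((0, 0.0), (1, dist)):
        for _ in range(n_per_class):
            x = rng.gauss(cx, spread)
            y = rng.gauss(0.0, spread)
            data.append((x, y, label))
    return data

d = two_cluster_dataset(100, overlap=0.5)
print(len(d))  # 200 samples in two classes, centroids 5.0 apart
```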
The use of each type of data, either real-world or synthetic, offers different advantages:
• Data variety and complexity in real-world datasets. Considering real-world data provides a great variety of choice, since each dataset is different and has different properties, which usually imply a greater complexity and richness of characteristics. This cannot generally be achieved by the synthetic data generators proposed in the literature.
• Overlapping level in synthetic datasets. Synthetic data allow one to control the overlapping level and extract conclusions based on it. Quantifying the level of overlapping is not always easy in real-world data, which only have a specific quantity of overlapping samples, so the effects on classifier performance of varying levels of overlapping cannot be measured.
For these reasons, a systematic way to combine the advantages of both alternatives is required. It would be interesting to have the possibility of introducing, in a supervised manner, different degrees of overlapping in real-world datasets. This fact leads us to our proposal in Section IV, a new scheme for introducing overlapping in real-world data.

IV. A NOVEL SCHEME FOR CREATING OVERLAPPING REGIONS IN REAL-WORLD CLASSIFICATION DATASETS
This section presents a new scheme designed to introduce overlapping in any real-world problem. Section IV-A details the process to create a set of synthetic overlapping samples S for a specific class in the original dataset. Then, Section IV-B describes how the overlapping dataset is built considering S and the original data, giving a mathematical description on how the sets of samples belonging to the overlapping and non-overlapping regions are composed. Finally, Section IV-C presents two possible schemes to introduce overlapping in real-world data depending on which classes are affected.

A. GENERATING THE OVERLAPPING REGION
The overlapping introduction scheme generates an overlapping level of x% affecting one of the classes c of the dataset D. This implies that samples from other classes (different from c) invade the domain corresponding to class c, starting from the class boundary of c and moving towards its core.
Algorithm 1 shows the pseudocode of the procedure to create the set of synthetic samples S from the original data D. Notice that the overlapping introduction consists of adding new samples and is not a mere modification of existing ones. It is done this way to avoid altering the underlying class structures of the original data.
The creation of the synthetic samples forming the overlapping region is based on two main steps. They are described below, referring to the associated lines in Algorithm 1: 1. Estimation of the distance of each sample of the target class to the borderline region (lines 1-7). This first step identifies those samples of the target class that are closest to the class boundaries. The closer a sample is to the class boundaries, the higher its probability of being used to form the overlapping region (created in the second step).
With this aim, for each sample e_i belonging to class c, the average distance d_i to its k_1 closest samples of the other classes is computed (line 4). Likewise, the majority class m_i of these neighbors is computed (line 5). In this work, we fix a value of k_1 = 1 to compute the distance of each sample of class c to the class boundaries; note that higher values of k_1 could be chosen to reduce the negative effect of noisy samples in the dataset. HVDM [37] is used to compute the distance between samples.
The distances and classes computed in this step are used in the next step. Our assumption is that the lower the distance d_i associated with a sample e_i, the closer this sample is to the borderline region. As a result of this step, all the possible triplets {(e_i, d_i, m_i)} are computed (line 6).
2. Generation of the synthetic samples to form the overlapping region (lines 8-14). The number of synthetic samples to be introduced (M) is based on the quantity of samples of the target class c, being x% of it (line 8). Then, M samples of class c are sequentially chosen, sorted in ascending order by d_i (lines 9-10). For each of these samples, a synthetic sample s_i is created in its neighborhood (lines 11-12). To this end, a random neighbor n_i among any of its k_2-nearest neighbors is chosen and the sample s_i is created following an interpolation scheme similar to that used by SMOTE [10], [41], that is, s_i = e_i + r · (n_i − e_i), in which e_i is the sample of the target class c, n_i is the selected neighbor and r is a random number in (0, 1) following a uniform distribution (Figure 1). For nominal attributes, a random value between those of e_i and n_i is chosen.
The value of k_2 (k_2 = 3 in our experiments) used to compute the nearest neighbors determines the size of the area around the sample e_i in which the synthetic sample s_i will be created. The value k_2 = 3 is chosen to introduce some randomness when creating the new synthetic sample. Considering higher values for k_2, such as k_2 = 5 or k_2 = 7, may imply the risk of creating the synthetic samples too far from the area of interest around the decision boundaries in which we want to introduce the overlapping data.

B. BUILDING THE OVERLAPPING DATASET
The overlapping introduction scheme proposed enables one to easily estimate which samples from the overlapping dataset O belong to the overlapping region and therefore to distinguish between overlapping and non-overlapping regions in the new dataset:
1) Overlapping region O_ov. The overlapping region is defined as the union of the synthetic samples S and the set B_e composed of the original samples e_i used to create the synthetic samples (line 9 in Algorithm 1).
2) Non-overlapping region O_nov. It comprises the remaining samples of O, i.e., those not belonging to O_ov.
Figure 2 illustrates the result of applying the proposal to introduce overlapping in the banana dataset [9]. Several levels of overlapping have been introduced into one of its classes (the red one), from 0% (original data) to 100% (maximum overlapping), by increments of 25%. Note that the overlapping levels selected in this example play an illustrative role in the procedure of the proposal, although in real-world data the overlapping levels are not usually so high.
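The two steps of the procedure (boundary-distance estimation and SMOTE-style interpolation) can be sketched as runnable code under some simplifying assumptions: Euclidean distance stands in for HVDM, attributes are numeric only, and the pool from which the k_2 interpolation neighbors are drawn (all samples) is our assumption, as are the function names.

```python
# Runnable sketch of the overlap-introduction procedure of Algorithm 1,
# with simplifications: Euclidean distance instead of HVDM, numeric
# attributes only, k1 = 1 and k2 = 3 as in the experiments.
import math
import random
from collections import Counter

def euclid(a, b):
    return math.dist(a, b)

def introduce_overlap(D, target, x, k1=1, k2=3, seed=0):
    """D: list of (features, label) pairs; x: overlapping level in percent.
    Returns S, the synthetic samples invading the domain of class `target`."""
    rng = random.Random(seed)
    same = [e for e, y in D if y == target]
    other = [(e, y) for e, y in D if y != target]
    scored = []
    for e in same:
        # Step 1: average distance d_i to the k1 nearest other-class samples
        # and the majority label m_i among those neighbors (lines 1-7)
        near = sorted(other, key=lambda p: euclid(e, p[0]))[:k1]
        d = sum(euclid(e, p[0]) for p in near) / k1
        m = Counter(y for _, y in near).most_common(1)[0][0]
        scored.append((e, d, m))
    # Step 2: the x% of class samples closest to the boundary seed the region
    scored.sort(key=lambda t: t[1])
    M = round(len(same) * x / 100)
    points = [e for e, _ in D]
    S = []
    for e, _, m in scored[:M]:
        # pick a random one of the k2 nearest neighbors and interpolate
        # SMOTE-style; drawing neighbors from all samples is an assumption
        neigh = sorted(points, key=lambda s: euclid(e, s))[1:k2 + 1]
        n = rng.choice(neigh)
        r = rng.random()
        S.append((tuple(a + r * (b - a) for a, b in zip(e, n)), m))
    return S
```

The synthetic samples keep the invading label m_i, so they fall inside the domain of the target class while belonging to another class, as the scheme requires.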

C. OVERLAPPING INTRODUCTION SCHEMES
The proposed scheme can be applied to any class and also to any combination of classes in the data. This leads us to define different schemes to introduce overlapping, depending on the classes to which the scheme is applied. In this research, two different overlapping schemes are considered:
1) All-classes Overlapping Scheme (AOS). This scheme individually considers each class of a dataset with C classes as the target class, resulting in C different sets of overlapping samples S_1, . . . , S_C, which are finally merged with the original data D to build the overlapping dataset O (O = D ∪ S_1 ∪ · · · ∪ S_C). This scheme tries to simulate the most complex scenario in real-world problems, in which all the classes are overlapped with their surrounding classes; for example, as a result of a faulty sensor device that affects some attributes in all the samples in the dataset.
2) Majority-class Overlapping Scheme (MOS). In this scheme, overlapping is only introduced into the majority class, which is considered as the target class. The overlapping procedure results in a single set of synthetic samples S_maj, which is then added to the original dataset D to form the overlapping dataset O (O = D ∪ S_maj). This scheme tries to model a situation in which the dataset has a difficult class whose boundaries are not clearly defined. Therefore, the overlapping only affects this class and its surrounding classes.
The procedure detailed in this section is used to generate new datasets with different levels and types of overlapping. All of them are then used to check the behavior of OVO when dealing with this type of data. Moreover, this overlapping generation scheme can be used in future works to analyze any method addressing the class overlapping problem. The design of the experimental framework and how the results obtained are analyzed is described in Section V.
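Given any routine implementing the generation procedure of Section IV-A, the two schemes reduce to simple set unions. In the sketch below, `gen_overlap` is a hypothetical placeholder (it merely selects a fraction of the target-class samples for illustration, not the real synthetic-sample generator).

```python
# Sketch of the two introduction schemes: AOS applies the generator to
# every class, MOS only to the majority class. `gen_overlap` is a
# hypothetical placeholder standing in for the procedure of Section IV-A.
from collections import Counter

def gen_overlap(D, target, x):
    # placeholder: return x% of the samples of `target` (illustrative only)
    same = [s for s in D if s[1] == target]
    return same[: round(len(same) * x / 100)]

def aos(D, x):
    """All-classes scheme: O = D u S_1 u ... u S_C."""
    O = list(D)
    for c in {y for _, y in D}:
        O += gen_overlap(D, c, x)
    return O

def mos(D, x):
    """Majority-class scheme: O = D u S_maj."""
    maj = Counter(y for _, y in D).most_common(1)[0][0]
    return list(D) + gen_overlap(D, maj, x)
```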

V. EXPERIMENTAL FRAMEWORK
First, Section V-A presents the base datasets considered for the experimentation, together with the types of overlapping and levels introduced into them. Then, Section V-B shows the algorithms used and their parameters. Finally, Section V-C explains the methodology for the analysis of the results.

A. REAL-WORLD DATASETS AND OVERLAPPING
The experimentation considers 34 real-world datasets from the UCI repository (http://archive.ics.uci.edu/ml/) [9], in which overlapping is introduced. Table 1 shows these datasets sorted by their number of classes (cl), along with the number of samples (sa) and attributes (at). Motivated by the use of a stratified k-fold cross-validation to estimate classifier performance in the analysis of results, the creation of the k folds of the overlapping data O from the original data D is systematically carried out as follows: 1) A level of overlapping x% is used to create the set of synthetic samples S from the original dataset D following the scheme proposed in Section IV (either AOS or MOS). Overlapping levels from x = 5% to x = 50%, by increments of 5%, are considered.
2) The original data D and the set of synthetic samples S are each partitioned with stratification into k folds, that is, D_1, . . . , D_k and S_1, . . . , S_k, respectively. 3) The k folds of the overlapping dataset are created as O_i = D_i ∪ S_i (i = 1, . . . , k). Note that the folds D_1, . . . , D_k in the original data D and those O_1, . . . , O_k in the overlapping dataset O have the same original samples in each fold i = 1, . . . , k for any overlapping type and level, the synthetic samples of each fold being the only difference among them. In this way, a fairer comparison is established between different levels and types of overlapping over the same dataset, since possible differences in classifier performance due to a different partitioning at each level/type of overlapping are avoided as much as possible.
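The fold construction can be sketched as follows. The stratification is simplified to a per-class round-robin split, and all names are illustrative; the key point is that D and S are partitioned separately and then merged fold by fold.

```python
# Sketch of the fold construction: D and S are stratified into k folds
# separately, then merged per fold so that O_i = D_i u S_i.
from collections import defaultdict

def stratified_folds(data, k):
    """Simplified stratification: round-robin within each class."""
    by_class = defaultdict(list)
    for sample in data:                      # sample = (features, label)
        by_class[sample[1]].append(sample)
    folds = [[] for _ in range(k)]
    for samples in by_class.values():
        for i, s in enumerate(samples):
            folds[i % k].append(s)           # keeps class proportions per fold
    return folds

def overlap_folds(D, S, k):
    Df, Sf = stratified_folds(D, k), stratified_folds(S, k)
    return [Df[i] + Sf[i] for i in range(k)]   # O_i = D_i u S_i
```

Because the original samples of fold i are identical at every overlapping type and level, performance differences across configurations cannot be attributed to a different partitioning.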
Once the k folds of the original dataset D and the overlapping dataset O have been obtained, two different ways of building the final dataset are considered depending on the folds affected by the overlapping (training or test partitions):
1) Overlapping affecting training and test sets (Section VI). The final overlapping dataset is formed by the folds O_i (i = 1, . . . , k). This situation is the most realistic one in real-world data, in which both training and test sets may be affected by overlapping. This scenario allows one to check how classifiers behave in the different regions. Thus, in order to gain a deeper insight into the problem addressed, the analysis of these datasets is conducted in three steps: (i) analyzing performance on all the samples (Section VI-A), (ii) on the samples from the non-overlapping regions (Section VI-B), and (iii) on the samples from the overlapping regions (Section VI-C). In this way, we can check where the contribution of OVO lies in terms of performance improvement.
2) Overlapping affecting only the training sets (Section VII). In this case, each test fold t is taken from the original data D_t (t ∈ {1, . . . , k}), whereas the training set is composed of the remaining folds from the overlapping dataset. Introducing overlapping only into the training partitions while keeping the test partitions overlapping-free allows one to observe how overlapping data affect the training process and how the test results are degraded depending on the type and level of overlapping (see Section VII-A). This scheme has also been used to deal with noisy data in classification [29].
In total, 40 different configurations are applied to the 34 base datasets, resulting in a total of 1360 datasets with different types and levels of overlapping (1394 datasets if those without induced overlapping are also considered).
Note that the overlapping level x = 0% is also studied, corresponding to the original datasets without additional induced overlapping. In detail, all the possible combinations among the following three factors are considered in the experiments: 1) Sets affected by overlapping (2): (i) training and test sets or (ii) only training sets. 2) Overlapping schemes (2): (i) AOS or (ii) MOS.
3) Overlapping levels (10): from x = 5% to x = 50%, by increments of 5%.

B. CLASSIFICATION ALGORITHMS
Table 2 shows the classification algorithms considered for the experimentation along with their parameter setup, which is the one recommended by their authors. For SVM [3], the setup uses a polynomial kernel, C = 1, tol = 0.001 and ε = 1.0E-12. The choice of the learning algorithms has been made on the basis of their good behavior in a large number of real-world problems. They are classic reference methods widely employed in many recent publications in the data mining literature [3], [5] and belong to different classification paradigms: C4.5 and RIPPER are rule-based classifiers, k-NN is a sample-based learner and SVM builds hyperplanes to separate the transformed data in high-dimensional spaces. Note that the experiments performed are not focused on obtaining slightly better results by employing the most powerful algorithms, but on checking whether OVO is able to improve the performance of the methods when data are affected by overlapping.

Two different values of k are used for the k-NN algorithm: k = 3 and k = 5. Notice that the value k = 1 is not considered in the experiments since 1-NN provides exactly the same classification results with or without the OVO decomposition. In this way, we can check how OVO is affected by different values of this important parameter when working with overlapping data.

C. METHODOLOGY OF ANALYSIS
In order to check the suitability of the methods using OVO when dealing with overlapping data, the results of the classification algorithms with and without decomposition are compared with one another. For C4.5, RIPPER, 3-NN and 5-NN this comparison can be performed directly, since these techniques can inherently handle multiple classes. However, SVM is designed to work with binary datasets. For this reason, in the case of SVM, the OVO and OVA strategies are compared, checking which of them behaves better with overlapping data.
The classifier accuracy estimation in each dataset is obtained by running 5 iterations of a stratified 5-fold cross-validation (5x5-fcv). Using 5 partitions, each one has a larger number of samples and thus the effects of overlapping samples become more notable. Furthermore, the 5 iterations of the 5-fcv make the final results as robust as possible. Hence, each overlapping dataset is created 5 times with different seeds, carrying out a total of 25 runs per dataset configuration, which are averaged to obtain the final result for each configuration and dataset. This implies that 34,850 executions are carried out for each classifier (1394 datasets · 25 runs), which are repeated for the OVO and non-OVO versions, reaching a total of 348,500 executions (5 classifiers, with OVO and non-OVO). For the sake of brevity, only averaged results are presented in this manuscript, but it must be stressed that the conclusions reached are based on a proper statistical analysis, which considers all the results (not averaged). Full results can be found on the web-page associated with this research.
The aforementioned analysis of the accuracy of each classifier is complemented by the study of the ELA [30] metric. This metric was proposed in the framework of noisy data as a combination of the performance and robustness of the methods. It represents the robustness of the classifier when the noise level increases, and helps us check whether a good performance is simultaneously combined with a good robustness, that is, whether the classifier is not strongly deteriorated when higher levels of overlapping are considered. This metric is computed as

ELA_x% = (1 − Acc_x%) / Acc_0%,

where Acc_x% is the accuracy with an overlapping level of x% and Acc_0% is the accuracy on the original data D. The ELA results are shown as percentages in this work, i.e., they are multiplied by 100.
In order to properly analyze the results obtained, Wilcoxon's [8] non-parametric test is used. This is a pairwise test aimed at detecting significant differences between two sample means. For each of the 40 configurations studied, the OVO and non-OVO versions are compared using Wilcoxon's test and the associated p-values (p_W) are obtained. The p-value allows one to know whether two algorithms are significantly different. Even though the significance of the differences found is given by the p-value in Wilcoxon's test, a threshold (significance level) is established to focus the analysis on the most interesting results. Thus, a difference is considered significant if the p-value obtained is lower than 0.1, a value that usually indicates important differences between the algorithms compared. Additionally, those cases in which the p-value is lower than 0.05 are also analyzed, since these differences are even more meaningful than those at significance level 0.1.
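A minimal sketch of the metric follows, assuming accuracies in [0, 1]; the exact formula is our reconstruction from the description of ELA in [30], so treat it as an assumption. A robust classifier barely degrades as overlapping grows, yielding a low ELA.

```python
# Sketch of the ELA robustness metric: lower values mean the classifier
# keeps its original accuracy despite the induced overlapping.
# Accuracies are assumed to lie in [0, 1]; results are reported x 100.

def ela(acc_x, acc_0):
    """Equalized Loss of Accuracy at overlapping level x% (reconstructed)."""
    return (1.0 - acc_x) / acc_0

print(round(100 * ela(0.88, 0.90), 2))   # small loss -> low ELA (13.33)
print(round(100 * ela(0.60, 0.90), 2))   # big loss   -> high ELA (44.44)
```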

VI. ANALYSIS OF RESULTS OF OVERLAPPING DATA AFFECTING TRAINING AND TEST SETS
The experiments in this section deal with the scenario in which overlapping is introduced in both training and test sets. The main aim is to gain a full insight into the influence of overlapping on the classification process and the properties of the OVO decomposition mechanism in such a case. A threestage analysis is designed to tackle the different aspects of this problem. Section VI-A analyzes the accuracy of classifiers on all the samples (those present in the original data and those overlapping samples synthetically generated). Section VI-B only focuses on the accuracy on those samples from the non-overlapping regions, whereas Section VI-C analyzes the accuracy on the samples from the overlapping regions. Finally, Section VI-D examines the robustness results of the classifiers considering the ELA measure.

A. OVERLAPPING IN TRAINING AND TEST SETS: ACCURACY ON ALL THE TYPES OF SAMPLES
This section studies the behavior of standard and OVO-based classifiers over all samples, i.e., those of the original data and those synthetically generated. Thus, the classification accuracy is measured in each dataset with respect to the degree of overlapping between classes. Table 3 presents the accuracy of each method in its OVO and non-OVO versions together with the output of Wilcoxon's test.
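As a minimal illustration of the OVO decomposition compared throughout these tables (the helper names are ours; in the study the base learners are C4.5, RIPPER, SVM and k-NN rather than the toy learner used here), one binary model is trained per pair of classes and their predictions are aggregated by majority vote:

```python
from itertools import combinations
from collections import Counter

def ovo_fit_predict(train, test, fit_binary):
    """One-vs-One sketch: train one binary classifier per pair of
    classes and aggregate their votes.  `train` is a list of (x, y)
    pairs and `fit_binary(subset) -> predict(x)` is any binary learner."""
    classes = sorted({y for _, y in train})
    voters = []
    for a, b in combinations(classes, 2):
        # each base classifier only ever sees two classes at once
        subset = [(x, y) for x, y in train if y in (a, b)]
        voters.append(fit_binary(subset))
    preds = []
    for x in test:
        votes = Counter(p(x) for p in voters)
        preds.append(votes.most_common(1)[0][0])
    return preds
```

Note how each base classifier faces at most one overlapping pair of classes, which is the simplification the analysis below attributes the accuracy gains to.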
The results in Table 3 show that, in general, applying the OVO decomposition leads to an improvement in accuracy over the standard single-model approach. A constant trend in the obtained accuracies is also observed, regardless of the overlapping level introduced. This shows a favorable characteristic of OVO-based learning, which can deliver an improved accuracy even in cases of extreme overlapping (50% of the samples). This trend is further backed up by Wilcoxon's test, which shows that the gain from applying OVO is statistically significant for each type of base classifier: for almost all the overlapping levels in C4.5, RIPPER and SVM, and for the highest overlapping levels in 3-NN and 5-NN (above 30% approximately). These results show the general stability of OVO when dealing with overlapping data.
Analyzing the cases with the AOS scheme, one can observe that the decrease in accuracy follows similar trends for both the original and OVO-based methods. However, OVO always delivers an improved performance. AOS introduces artificial samples among all the classes, leading to a complex scenario in which many classes may share similar sample distributions in given parts of the decision space. In such a situation, decomposition leads to a significantly easier-to-solve case, where the classifier needs to deal with only two overlapping classes at once. However, even after such a simplification, the decision boundary is still not easy to estimate. Nevertheless, one must point out a strong potential advantage of this approach: it is significantly easier to perform data cleaning and transformation procedures on the overlapping region between only two classes. This allows one to conclude that OVO is a suitable approach for the cases in which all the classes overlap, maintaining a better accuracy.
In the case of MOS, significantly higher accuracies than with AOS are observed for all the methods, as the number of difficult regions is now reduced. Generally, OVO is able to improve the performance of the methods, except for 3-NN, where OVO is statistically better only at the maximum overlapping level. The good performance of OVO in MOS can be explained by the fact that, after the decomposition, some class pairs will contain overlapping and some others will not. This allows for an improved classification performance, as standard multi-class classifiers can get their decision boundaries biased towards the overlapping cases.
Notice that with OVO one may identify overlapping classes and apply data cleaning/transformation procedures only on the selected cases. This will reduce the complexity of the process in comparison to processing the whole multi-class dataset (in which some classes do not require any cleaning).
Note that the previous analysis is carried out considering both original and synthetic samples. In order to gain a deeper insight into the performance of OVO for overlapping data, the next sections analyze whether the increased performance of OVO can be truly attributed to a greater robustness to overlapping or simply to a better classification of safe (non-overlapping) samples, as is traditionally checked in the literature. Table 4 presents the averaged accuracy results together with the output of Wilcoxon's test when analyzing the non-overlapping (safe) original samples. In this case, the conclusions are drawn for both AOS and MOS simultaneously.
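The split between safe and overlapping samples used in the following subsections can be computed as below (a sketch with names of our own choosing; it assumes a boolean mask marking which test samples belong to the overlapping region is kept when the data is generated):

```python
def region_accuracies(y_true, y_pred, overlap_mask):
    """Split overall accuracy into the non-overlapping (safe) region
    and the overlapping region, given a boolean mask that marks which
    test samples fall in the overlapping region.  Returns accuracies
    on a 0-100 scale, or NaN for an empty region."""
    safe_hits = safe_n = over_hits = over_n = 0
    for t, p, in_overlap in zip(y_true, y_pred, overlap_mask):
        if in_overlap:
            over_n += 1
            over_hits += (t == p)
        else:
            safe_n += 1
            safe_hits += (t == p)
    acc = lambda hits, n: 100.0 * hits / n if n else float("nan")
    return acc(safe_hits, safe_n), acc(over_hits, over_n)
```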

B. OVERLAPPING IN TRAINING AND TEST SETS: ACCURACY IN NON-OVERLAPPING REGIONS
Analyzing the performance on samples from non-overlapping regions, one can observe that the accuracies of the classifiers are highly stable, as they work on safe data. The most notable exception is the behavior of RIPPER with the AOS scheme, which suffers a large drop in performance if the OVO decomposition is not used. In this case, the increase in predictive power brought by OVO can be purely attributed to its well-known ability to reduce the complexity of multi-class classification problems. As one can observe from both the obtained accuracies and Wilcoxon's tests, OVO is able to boost the performance of all the classifiers. Only in certain cases with k-NN are the differences found not to be statistically significant. One must not forget that k-NN is a local classifier, since it only analyzes the neighborhood of a sample for its classification, and such methods do not work as well with OVO as global ones do.
These results make the analysis of the overlapping regions mandatory, since the performance improvement observed so far might be attributed solely to a better behavior on safe samples.

C. OVERLAPPING IN TRAINING AND TEST SETS: ACCURACY IN OVERLAPPING REGIONS
In this section, we focus on analyzing the performance of the classifiers on the samples from the overlapping regions. Concentrating on the AOS scenario, a stable performance is observed for C4.5, RIPPER and SVM, regardless of the amount of overlapping introduced. This means that for both small and high degrees of overlapping these methods return a similar fraction of correctly classified samples. This is a very desirable property, as it proves the high robustness of these algorithms in the cases in which multiple classes overlap. The variation between the results obtained with increasing overlapping levels is always around 1%. For both k-NN classifiers the contrary behavior is observed: their accuracies tend to drop significantly with increasing overlapping levels, showing that these learners are not suitable for such difficult scenarios.

When taking Wilcoxon's test into account, OVO returns a statistically significant improvement over the original approach for C4.5 at almost all the overlapping levels; for RIPPER, 3-NN and 5-NN from 15-20% onwards; and for SVM at the maximum overlapping level. Additionally, as overlapping increases, a faster decrease in accuracy must be pointed out for 3-NN and 5-NN with OVO than for their multi-class counterparts (starting from around 3% higher accuracy for OVO at 5% overlapping, and ending with only about 1-1.5% of gain at 50% overlapping). This backs up our previous claim that k-NN methods are not suitable for learning from overlapping datasets and that they do not work as well with OVO.
In the MOS scenario, only C4.5 is a stable learner, displaying the same characteristics as in the AOS case. 3-NN and 5-NN show the same correlation between the loss of accuracy and the increase in classification difficulty. The same behavior can be observed for SVM: it steadily loses accuracy, but at a slower pace than the NN-based approaches. This is an unexpected result, as it seems intuitive that the case with only some overlapping classes should be simpler than the AOS one. RIPPER displays a slightly higher variance than in the AOS scenario.

D. OVERLAPPING IN TRAINING AND TEST SETS: ANALYSIS OF ROBUSTNESS OF THE CLASSIFIERS
This section analyzes the robustness of OVO to increasing overlapping levels. Table 6 presents the averaged ELA obtained, together with the output of Wilcoxon's test (considering all the samples in the dataset). When considering both the AOS and MOS scenarios, C4.5, RIPPER and SVM with OVO obtain significantly better results than without OVO. The stability of the differences between the standard and OVO versions is worth remarking: OVO always performs better, with almost the same difference for any level of overlapping. The differences between the improvements in accuracy and in ELA should also be noticed: while in accuracy a gain of 2-4% is usually obtained using OVO, with ELA as the metric one observes up to an 8% gain in most of the cases. As ELA is designed to reflect the performance of classifiers on noisy and difficult data, such a large gap proves the usefulness of applying OVO in scenarios where overlapping is to be expected in both training and test sets. However, the situation is slightly different for the 3-NN and 5-NN classifiers. Their OVO versions deliver a worse ELA performance and Wilcoxon's test does not reject the null hypothesis of equivalence; OVO becomes significantly superior to the standard version only for some of the higher degrees of overlapping. This is further proof that minimal-distance-based classifiers display lower robustness to overlapping and should not be used in such scenarios.

VII. ANALYSIS OF RESULTS OF OVERLAPPING DATA ONLY AFFECTING TRAINING SETS
This section assumes a scenario in which overlapping is introduced only in the training sets. This allows us to check how overlapping influences the learning process itself and how the estimated boundaries perform for normally distributed test samples. This way, we can examine the robustness of the training methods themselves and the importance of the quality of the training set. Section VII-A analyzes the accuracy of the classifiers on all the samples, whereas Section VII-B focuses on the ELA results. Table 7 presents the averaged accuracy obtained, together with the output of Wilcoxon's statistical test. In the AOS scenario, higher accuracies than those in Section VI are obtained. At the same time, increasing overlapping levels significantly influence the performance of the classifiers, although once again not as much as in Section VI. This fact shows that overlapping only in the training set does not damage the classifier performance as strongly as the presence of this phenomenon in both sets, i.e., the real difficulty lies in predicting overlapped samples. Additionally, it should be highlighted that in this case SVM is characterized by the smallest loss of accuracy.

A. OVERLAPPING IN TRAINING SETS: ANALYSIS OF ACCURACY ON ALL THE TYPES OF SAMPLES
SVM seems to be more robust to difficult training datasets than other classifiers, whereas this robustness is lost when difficult testing sets are faced (as shown in Section VI). In this case, the OVO approach offers a higher boost of accuracy, showing that by decomposing the multi-class dataset the training difficulties embedded within it may be alleviated. This makes OVO useful for working with uncertain input data. Wilcoxon's test shows that the gains in accuracy when applying OVO are always statistically significant for SVM, C4.5 and RIPPER and at the highest overlapping levels for 3-NN and 5-NN.
The MOS scenario provides similar conclusions in all the cases but one: the effect of increasing overlapping levels. A similar behavior of the classifiers for small overlapping ratios can be observed. However, increasing the overlapping levels produces smaller drops in accuracy when compared to the AOS scenario. Note that the test samples are not affected by overlapping. This shows that OVO can efficiently deal with overlapping happening locally between only certain pairs of classes. Thus, the locally trained classifiers have a simplified task, since some of them will learn from safe cases without the presence of overlapping. Moreover, in this case overlapping samples are less frequent, and therefore classifiers are less influenced by their presence at the same overlapping levels.

B. OVERLAPPING IN TRAINING SETS: ANALYSIS OF ROBUSTNESS OF THE CLASSIFIERS
The analysis of ELA for both AOS and MOS shows the robustness of OVO to different overlapping levels in a similar way as in Section VI. In general, overlapping affecting only the training sets has a strong impact on the classifiers, and OVO allows the generation of a more robust set of base learners. C4.5, RIPPER and SVM combined with OVO return statistically significant improvements in comparison to their standard counterparts. In the case of RIPPER, the usage of OVO at the highest overlapping levels achieves an improvement of almost 15% in ELA. This fact shows the low robustness of RIPPER to difficult training sets, which can be easily improved using OVO. For 3-NN and 5-NN, OVO becomes better for overlapping levels greater than 30% in the most difficult case (AOS), showing greater differences than those observed in Section VI.
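The train-only scenario can be illustrated with a deliberately simplified sketch (the function name, the single numeric feature and the centroid-based perturbation are our assumptions for illustration; the paper's actual overlap-induction scheme is more elaborate). The key point is the asymmetry: only the training pairs are corrupted, while the test set stays clean:

```python
import random

def add_training_overlap(train, rate, rng=None):
    """Hypothetical sketch of the train-only scenario: corrupt a
    fraction `rate` of the training pairs by moving each chosen sample
    halfway toward the centroid of another class, so class regions
    overlap in feature space.  The test set is left untouched."""
    rng = rng or random.Random(0)
    classes = sorted({y for _, y in train})
    centroid = {c: sum(x for x, y in train if y == c) /
                   sum(1 for _, y in train if y == c) for c in classes}
    n_corrupt = int(rate * len(train))
    corrupted = list(train)
    for i in rng.sample(range(len(train)), n_corrupt):
        x, y = corrupted[i]
        other = rng.choice([c for c in classes if c != y])
        # keep the label but push the sample into the shared region
        corrupted[i] = ((x + centroid[other]) / 2.0, y)
    return corrupted
```

Training on `add_training_overlap(train, rate)` while testing on the untouched test partition reproduces, in miniature, the asymmetry studied in this section.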

VIII. LESSONS LEARNED
This section summarizes the main findings on the usage of OVO [45], [46] from the empirical study in the previous sections:
1) On the performance and robustness of OVO when dealing with overlapping data. The methods that use OVO usually achieve higher performance results than their non-OVO counterparts, regardless of the overlapping level. The ELA metric [30] corroborates this conclusion, showing greater differences than accuracy does, as it takes into account the robustness of the method with respect to the case without overlapping. The robustness results are stable with respect to the overlapping level and strong variations in OVO performance are not observed. These facts show the suitability of OVO for overlapping scenarios.
2) On the performance of OVO in the overlapping and non-overlapping regions. The performance of the classifiers in these two regions shows that OVO is able to improve the accuracy on both sets. The overall performance improvement of OVO can be attributed to its well-known benefits in multi-class problems [13], [28]. Moreover, focusing on overlapping samples, OVO alleviates the difficulties by considering classes by pairs, increasing their separability [12], [45].
3) On the sets affected by overlapping (training/test).

Overlapping negatively affects the performance of the classifiers, independently of the sets in which it is present (in training and test, or only in training). However, classifiers have greater difficulty dealing with overlapping when both sets are affected, as correctly classifying the test samples becomes harder. Likewise, overlapping in training affects the learning itself, producing more complex boundaries.
4) On the number of classes affected by overlapping.
Two different overlapping schemes have been studied with respect to the number of classes affected by overlapping: the AOS scheme (all classes are affected) and the MOS scheme (only the majority class and, consequently, its surrounding classes are affected). As could be expected, the AOS scheme has generally been more detrimental to classifier performance due to its higher complexity. In AOS, the usage of OVO allows us to reduce the multi-class problem to only two overlapping classes in each base classifier, whereas in MOS some of the base classifiers are trained with non-overlapping pairs of classes. These facts contribute to the performance increase of OVO.
5) On the synergy between classifiers and OVO to deal with overlapping. The behavior of five different classifiers (C4.5 [27], RIPPER [6], SVM [3], 3-NN and 5-NN [5]) has been studied with and without OVO in the presence of overlapping data. Three of them (C4.5, RIPPER and SVM) benefit greatly from using OVO, providing good performance and robustness results at all levels of overlapping. More specifically, RIPPER obtains the largest improvements when OVO is used. 3-NN and 5-NN only benefit from OVO occasionally, but, in general, their performance is weaker than that of the other methods; they do not get the same advantage from decomposition strategies, mainly due to their local nature [5], and therefore their usage should be avoided with overlapping data.

IX. CONCLUDING REMARKS
In this research the problem of overlapping [16], [40] in the domain of multi-class classification [1], [42] is addressed. We suggest that using OVO [46] can improve the performance of base classifiers on problems with overlapping classes.
In an exhaustive empirical study we have shown that OVO successfully helps to alleviate the influence of overlapping, without needing either to modify existing algorithms one by one or to carry out any prior data preprocessing step. Furthermore, to develop such an extensive study, it was necessary to propose a framework to introduce overlapping into real-world datasets. This framework is thus not only useful for the current study; new developments in the field can also follow this systematic way of creating a variety of classification problems with a measurable quantity of overlapping.
Our framework for introducing overlapping, as well as our empirical study, has considered two ways of introducing overlapping into existing datasets (in the majority class or in all the classes) and the possibility of adding it only in the training set or in both the training and test sets. All these combinations have allowed us to study the behavior of OVO in scenarios that display similar properties to real-world problems [33]. The decomposition performed by OVO helps to increase the separation between classes [13], [28] in these difficult-to-learn problems and is beneficial for creating more regular decision boundaries [45] where overlapping samples are present.
In future works we plan to develop data cleaning methods to reduce the difficulties in overlapping regions and sample weighting solutions to reduce the influence of overlapping samples on the decision boundaries given by the classifiers. We are especially interested in combining these approaches with decomposition strategies, where they can be applied to specific subproblems if needed.