SECTION I

## INTRODUCTION

Class imbalance learning refers to a type of classification problem in which some classes are highly underrepresented compared to other classes. The skewed distribution makes many conventional machine learning algorithms less effective, especially in predicting minority class examples. The learning objective can be generally described as “obtaining a classifier that will provide high accuracy for the minority class without severely jeopardizing the accuracy of the majority class” [2]. A number of solutions have been proposed at the data and algorithm levels to deal with class imbalance. In particular, ensemble techniques, such as BEV [3] and SMOTEBoost [4], have become a popular and important means of doing so. However, the efforts in the literature so far have focused on two-class imbalance problems.

In practice, many problem domains have more than two classes with uneven distributions, such as protein fold classification [5], [6], [7] and weld flaw classification [8]. These multiclass imbalance problems pose new challenges that are not observed in two-class problems. Zhou et al. [9] showed that dealing with multiclass tasks with different misclassification costs of classes is harder than dealing with two-class ones. Further investigations are necessary to explain what problems multiclass can cause for existing class imbalance learning techniques and how it affects the classification performance. Such information would help us to understand the multiclass issue better and can be utilized to develop better solutions.

Most existing imbalance learning techniques are only designed for and tested in two-class scenarios. They have been shown to be less effective or even cause a negative effect in dealing with multiclass tasks [9]. Some methods are not applicable directly. Among limited solutions for multiclass imbalance problems, most attention in the literature was devoted to class decomposition—converting a multiclass problem into a set of two-class subproblems [10]. Given a $c$-class problem $(c > 2)$, a common decomposing scheme is to choose one class labeled as positive and to merge the others labeled as negative for forming a subproblem. Each class becomes the positive class once, and thus, $c$ binary classifiers are produced to give a final decision (known as one-against-all (OAA), one-versus-others) [11]. However, it aggravates imbalanced distributions [7], and combining results from classifiers learned from different subproblems can cause potential classification errors [12], [13]. It is desirable to develop a more effective and efficient method to handle multiclass imbalance problems.
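To make the OAA scheme concrete, the following minimal sketch builds the $c$ binary relabelings and reports the negative-to-positive ratio of each subproblem. The class sizes and the helper names `one_against_all` and `imbalance_ratio` are hypothetical and chosen only for illustration; they are not taken from the cited works.

```python
from collections import Counter


def one_against_all(labels):
    """Build one binary relabeling per class: that class is positive (1), all others negative (0)."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


def imbalance_ratio(binary_labels):
    """Number of negative examples per positive example in one binary subproblem."""
    counts = Counter(binary_labels)
    return counts[0] / counts[1]


# Hypothetical 3-class data: one majority class ("a") and two minority classes.
labels = ["a"] * 100 + ["b"] * 10 + ["c"] * 10
subproblems = one_against_all(labels)
ratios = {c: imbalance_ratio(b) for c, b in subproblems.items()}
print(ratios)
```

In this toy setting, each minority class faces 110 negatives for its 10 positives, compared with a 100-to-10 ratio against the majority class alone in the original data, which illustrates how merging the remaining classes can aggravate the imbalance.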

This paper aims to provide a better understanding of why multiclass makes an imbalanced problem harder and new approaches to tackling the difficulties. We first study the impact of multiclass on the performance of random oversampling and undersampling techniques by discussing “multiminority” and “multimajority” cases in depth. We show that both “multiminority” and “multimajority” negatively affect the overall and minority-class performance. In particular, the “multimajority” case tends to be more harmful. Random oversampling does not help the classification and suffers from overfitting. The effect of random undersampling is weakened as there are more minority classes. When multiple majority classes exist, random undersampling can cause great performance reduction to those majority classes. Neither strategy is satisfactory. Based on the results, we propose to use our recently developed ensemble algorithm AdaBoost.NC [1] to handle multiclass imbalance problems. Our earlier work showed that it has good generalization ability under two-class imbalance scenarios by exploiting ensemble diversity [14]. As a new study of multiclass imbalance problems, the experiments in this paper reveal that AdaBoost.NC combined with oversampling can better recognize minority class examples and can better balance the performance across multiple classes with high G-mean without using any class decomposition schemes.

The rest of this paper is organized as follows. Section II briefly introduces the research progress in learning from multiclass imbalance problems in the literature and describes the AdaBoost.NC algorithm. Section III investigates the impact of class number in the presence of imbalanced data under some artificial settings. Section IV discusses the effectiveness of AdaBoost.NC in comparison with the state-of-the-art class imbalance learning methods on real-world multiclass imbalance tasks. Finally, Section V concludes this paper.

SECTION II

## RELATED WORK

In this section, we first review the related studies concerning multiclass imbalance learning. There is a lack of systematic research on this topic, and existing solutions are very limited. Then, we introduce the AdaBoost.NC algorithm and describe the main conclusions obtained in our earlier studies, including its performance in two-class imbalance problems. This work will be extended to multiclass scenarios in this paper.

### A. Multiclass Imbalance Problems

Most existing solutions for multiclass imbalance problems use class decomposition schemes to handle multiclass and work with two-class imbalance techniques to handle each imbalanced binary subtask. For example, protein fold classification is a typical multiclass imbalance problem. Tan et al. [7] used both OAA [11] and one-against-one (OAO) [15] schemes to break down this problem and then built rule-based learners to improve the coverage of minority class examples. OAA and OAO are two most popular schemes of class decomposition in the literature. Zhao et al. [5] used OAA to handle multiclass and undersampling and SMOTE [16] techniques to overcome the imbalance issue. Liao [8] investigated a variety of oversampling and undersampling techniques used with OAA for a weld flaw classification problem. Chen et al. [6] proposed an algorithm using OAA to deal with multiclass and then applied some advanced sampling methods that decompose each binary problem further so as to rebalance the data. Fernandez [17] integrated OAO and SMOTE in their algorithm. Instead of applying data-level methods, Alejo et al.'s algorithm [18] made the error function of neural networks cost sensitive by incorporating the imbalance rates between classes to emphasize minority classes, after decomposing the problem through OAA. Generally speaking, class decomposition simplifies the problem. However, each individual classifier is trained without full data knowledge. It can cause classification ambiguity or uncovered data regions with respect to each type of decomposition [7], [12], [13].

Different from the previous discussion, a cost-sensitive ensemble algorithm was proposed [19], which addresses multiclass imbalance directly without using class decomposition. Its key focuses are how to find an appropriate cost matrix with multiple classes and how to introduce the costs into the algorithm. A genetic algorithm (GA) was applied to search for the optimum cost setup of each class. Two kinds of fitness were tested, G-mean [20] and F-measure [21], the most frequently used measures for performance evaluation in class imbalance learning. The choice depends on the training objective. The obtained cost vector was then integrated into a cost-sensitive version of AdaBoost.M1 [22], namely, AdaC2 [23], [24], which is able to process multiclass data sets. However, searching for the best cost vector is very time consuming due to the nature of GA. To the best of our knowledge, no existing method can yet deal with multiclass imbalance problems both efficiently and effectively.

We now turn our attention to the assessment metrics in class imbalance learning. Single-class performance measures evaluate how well a classifier performs in one class, particularly the minority class. Recall, precision, and F-measure [21] are widely discussed single-class measures for two-class problems, which are still applicable to multiclass problems. Recall is a measure of completeness; precision is a measure of exactness. F-measure incorporates both to express their tradeoff [2]. For the overall performance, G-mean [20] and AUC [25] are often used in the literature, but they are originally designed for two-class problems. Therefore, they have to be adapted to multiclass scenarios: an extended G-mean [19] is defined as the geometric mean of recall values of all classes; a commonly accepted extension of AUC is called M measure or MAUC [26], the average AUC of all pairs of classes.
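The single-class measures and the extended G-mean described above can be sketched as follows. This is an illustrative computation only: the balanced harmonic-mean form of the F-measure and the convention of returning 0 when a class is never predicted are assumptions, and MAUC is omitted because it requires class-probability outputs.

```python
import math


def per_class_metrics(y_true, y_pred):
    """Recall, precision, and F-measure for every class that appears in y_true."""
    metrics = {}
    for c in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        recall = tp / (tp + fn)
        # Assumption: precision is taken as 0 when the class is never predicted.
        precision = tp / (tp + fp) if tp + fp else 0.0
        # Assumption: F-measure is the balanced harmonic mean of precision and recall.
        total = precision + recall
        f_measure = 2 * precision * recall / total if total else 0.0
        metrics[c] = {"recall": recall, "precision": precision, "f_measure": f_measure}
    return metrics


def extended_g_mean(y_true, y_pred):
    """Geometric mean of the recall values of all classes."""
    recalls = [m["recall"] for m in per_class_metrics(y_true, y_pred).values()]
    return math.prod(recalls) ** (1 / len(recalls))


# Hypothetical predictions on a small 3-class test set.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
scores = per_class_metrics(y_true, y_pred)
g_mean = extended_g_mean(y_true, y_pred)
```

Because the extended G-mean is a product of recalls, a single class with zero recall drives the whole measure to zero, which is why it rewards balanced performance across classes.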

### B. AdaBoost.NC for Two-Class Imbalance Problems

We proposed an ensemble learning algorithm AdaBoost.NC [1] that combines the strength of negative correlation learning [27], [28] and boosting [29]. It emphasizes ensemble diversity explicitly during training and shows very encouraging empirical results in both effectiveness and efficiency in comparison with the conventional AdaBoost and other NCL methods in general cases. We then exploited its good generalization performance to facilitate class imbalance learning [14], based on the finding that ensemble diversity has a positive role in solving this type of problem [30]. Comprehensive experiments were carried out on a set of two-class imbalance problems. The results suggest that AdaBoost.NC combined with random oversampling can improve the prediction accuracy on the minority class without losing the overall performance compared to other existing class imbalance learning methods. It is achieved by providing less overfitting and broader classification boundaries for the minority class. Applying oversampling is simply to maintain a sufficient number of minority class examples and to guarantee that two classes receive equal attention from the algorithm. The algorithm is described in Table I.

AdaBoost.NC penalizes classification errors and encourages ensemble diversity sequentially with the AdaBoost [29] training framework. In step 3 of the algorithm, a penalty term $p_{t}$ is calculated for each training example, in which $amb_{t}$ assesses the disagreement degree of the classification within the ensemble at current iteration $t$. It is defined as

$$amb_{t} = {1\over t}\sum_{i = 1}^{t}\left(\Vert H_{t} = y\Vert - \Vert h_{i} = y\Vert\right)$$

where $H_{t}$ is the class label given by the ensemble composed of the existing $t$ classifiers. The magnitude of $amb_{t}$ indicates a “pure” disagreement. $p_{t}$ is introduced into the weight-updating step (step 5). By doing so, training examples with small $\vert amb_{t}\vert$ will gain more attention. The expression of $\alpha_{t}$ in step 4 is derived by using the inferring method in [24] and [31] to bound the overall training error. The predefined parameter $\lambda$ controls the strength of applying $p_{t}$. The optimal $\lambda$ depends on problem domains and base learners [32]. In general, (0, 4] is deemed to be a conservative range of setting $\lambda$. As $\lambda$ becomes larger, there could be either a further performance improvement or a performance degradation.
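The ambiguity term can be illustrated for a single training example with the short sketch below. It rests on two assumptions that are not stated in this section: $\Vert\cdot\Vert$ is read as an indicator (1 if the statement holds, 0 otherwise), and the ensemble label $H_{t}$ is obtained here by a plain, unweighted majority vote, which is a simplification of the weighted combination used in boosting. The function names are hypothetical.

```python
from collections import Counter


def majority_vote(votes):
    """Unweighted majority vote (assumption); ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


def ambiguity(member_preds, y):
    """amb_t for one example: mean over the t members of ([H_t == y] - [h_i == y])."""
    t = len(member_preds)
    ensemble_correct = int(majority_vote(member_preds) == y)
    return sum(ensemble_correct - int(h == y) for h in member_preds) / t


# Three hypothetical member predictions for one example.
preds = ["a", "a", "b"]
amb_when_ensemble_right = ambiguity(preds, "a")   # one member disagrees with a correct ensemble
amb_when_ensemble_wrong = ambiguity(preds, "b")   # one member is right, ensemble is wrong
amb_when_unanimous = ambiguity(["a", "a", "a"], "a")
```

Under these assumptions, the term is zero when all members agree with the ensemble, positive when the ensemble is correct but some members are not, and negative when the ensemble is wrong but some members are correct.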

SECTION III

## CHALLENGES OF MULTICLASS IMBALANCE LEARNING

Two types of multiclass could occur to an imbalanced data set: one majority and multiple minority classes (multiminority cases), and one minority and multiple majority classes (multimajority cases). A problem with multiple minority and multiple majority classes can be treated as the case when both types happen. Several interesting research questions are raised here: Are there any differences between multiple minority and multiple majority classes? Would these two types of problem pose the same or different challenges to a learning algorithm? Which one would be more difficult to tackle? For such multiclass imbalance problems, which aspects of a problem would be affected the most by the multiclass? Would it be a minority class, a majority class or both?

With these questions in mind, we will give separate discussions for each type under a set of artificial scenarios. For a clear understanding, two kinds of empirical analyses are conducted: 1) Spearman's rank correlation analysis, which shows the relationship between the number of classes and every evaluated performance measure, provides evidence of the classification difficulty brought by “multiminority” and “multimajority.” It will answer the question of whether the difficulties exist. 2) Performance pattern analysis, which presents the performance changing tendencies of all existing classes as more classes are added into training, reveals the different performance behaviors of each class among different training strategies. It will tell us what difficulties are caused for the recognition of each class and what the differences between the two types of multiclass are.

### A. Artificial Data Sets and Experimental Settings

To have a sufficient number of classes for our study, we generate some artificial imbalanced data sets by using the method in [33]. In multiminority cases, the number of minority classes is varied from 1 to 20, and only one majority class exists. Similarly, the number of majority classes is varied from 1 to 20 in multimajority cases, and only one class is generated as the minority. Data points in each class are generated randomly from Gaussian distributions, where the mean and standard deviation of each attribute are random real values in [0, 10].

We consider two different imbalanced contexts here. The first context includes a group of data sets having a relatively small size. Every generated example has two attributes. In training data, each minority class has ten examples, and each majority class has 100 examples. In the second context, we consider larger data sets. We enlarge the feature space to 20 attributes. In the training data of this group, each minority class contains 100 examples, and each majority class contains 1000 examples. The two groups of data are denoted by “10–100” and “100–1000,” respectively. Discussing both “10–100” and “100–1000” is to find out whether data size is a factor of affecting our results in the experiments.
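A “10–100” multiminority training set of the kind described above could be generated along the following lines. This is only a sketch and does not reproduce the method of [33]: it assumes that attributes are independent, that each mean and standard deviation is drawn uniformly from [0, 10], and it uses a fixed seed and hypothetical function names for reproducibility.

```python
import random


def make_class(n_examples, n_attributes, rng):
    """Draw one class: each attribute is Gaussian with its own mean and std, both in [0, 10]."""
    # Assumption: attributes are independent and parameters are drawn uniformly.
    params = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(n_attributes)]
    return [[rng.gauss(mu, sigma) for mu, sigma in params] for _ in range(n_examples)]


def make_multiminority(n_minority, rng, minority_size=10, majority_size=100, n_attributes=2):
    """One majority class (label 0) plus n_minority minority classes (labels 1..n_minority)."""
    X, y = [], []
    sizes = [majority_size] + [minority_size] * n_minority
    for label, size in enumerate(sizes):
        X.extend(make_class(size, n_attributes, rng))
        y.extend([label] * size)
    return X, y


rng = random.Random(0)  # fixed seed, chosen arbitrarily
X, y = make_multiminority(n_minority=3, rng=rng)
```

The “100–1000” group would correspond to `minority_size=100`, `majority_size=1000`, and `n_attributes=20`; a multimajority set can be obtained in the same way by swapping the roles of the two sizes.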

We also generate a set of balanced multiclass data sets with the number of classes increasing from 2 to 21. They are used as the baseline to clearly show the “multiclass” difficulty in both balanced and imbalanced scenarios. The balanced training data have 2 attributes and 100 examples in each class, denoted by “100–100.”

The data generation procedure is randomly repeated 20 times for each setting with the same numbers of minority and majority classes. Every training data set has a corresponding testing set, where each class contains 50 examples.

In the experiments, three ensemble training methods are compared: the conventional AdaBoost that is trained from the original imbalanced data and used as the default baseline method (abbr. OrAda); random oversampling + AdaBoost (abbr. OvAda), where all of the minority classes get their examples replicated randomly until each of them has the same size as the majority class before training starts; and random undersampling + AdaBoost (abbr. UnAda), where all of the majority classes get rid of some examples randomly until each of them has the same size as the minority class before training starts. Every method is run ten times independently on the current training data. Therefore, the result in the following comparisons is based on the average of 200 output values (20 training files ∗ 10 runs). As the most popular techniques in class imbalance learning, oversampling and undersampling are discussed to understand the impact of “multiclass” on the basic strategies of dealing with imbalanced data sets and to examine their robustness to “multiclass.”
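The two resampling strategies can be sketched as follows. The code only illustrates the preprocessing step applied before boosting starts; taking the largest class as the oversampling target and the smallest class as the undersampling target is an assumption that matches the artificial settings, where all majority (or minority) classes have the same size. Function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def group_by_class(X, y):
    """Collect the examples of each class label into its own list."""
    groups = defaultdict(list)
    for x, label in zip(X, y):
        groups[label].append(x)
    return groups


def random_oversample(X, y, rng):
    """Replicate random examples of smaller classes until every class matches the largest one."""
    groups = group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out


def random_undersample(X, y, rng):
    """Keep a random subset of each class so that every class matches the smallest one."""
    groups = group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        X_out.extend(rng.sample(rows, target))
        y_out.extend([label] * target)
    return X_out, y_out


# Hypothetical toy data: class sizes 8, 3, and 1.
X = [[i] for i in range(12)]
y = [0] * 8 + [1] * 3 + [2] * 1
rng = random.Random(0)
X_over, y_over = random_oversample(X, y, rng)
X_under, y_under = random_undersample(X, y, rng)
print(Counter(y_over), Counter(y_under))
```

Note that oversampling adds only exact copies, so no new information enters the minority classes, whereas undersampling discards majority examples outright.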

C4.5 decision tree [34] is chosen as the base learner and is implemented in Weka [35]—an open source data mining tool. The default Weka parameters are used, resulting in a pruned tree without the binary split. Each ensemble contains 51 such trees.

For the performance evaluation, single-class metrics recall, precision, and F-measure are calculated for each class. To assess the overall performance across all classes, the generalized versions of AUC and G-mean, i.e., MAUC [26] and extended G-mean [19], are adopted in the following discussions.

### B. Balanced Cases

In order to find out whether the “multiclass” issue is due to the type of multiclass imbalance or the multiclass itself, we first examine how it affects the classification performance on balanced data. We carry out Spearman's rank correlation analysis between performance measures and the number of classes by adding new classes of data with the equal size. Only the conventional AdaBoost (i.e., OrAda) is applied since resampling is not necessary for balanced data. The three single-class measures are tracked for the first generated class that joins all of the training sessions. Table II presents the correlation coefficients, where each entry ranges in [−1, 1] (in percent). A positive (negative) value indicates a monotone increasing (decreasing) relationship, and a coefficient of zero indicates no tendency between them.
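For readers who wish to reproduce this kind of analysis, Spearman's rank correlation can be computed as the Pearson correlation of the ranks, as in the sketch below. The recall values are made up purely to show the mechanics and are not results from this study.

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Made-up illustration: recall of one class as the number of classes grows.
n_classes = [2, 3, 4, 5, 6]
recall = [0.95, 0.90, 0.88, 0.80, 0.79]
rho_percent = 100 * spearman(n_classes, recall)
```

A strictly decreasing sequence such as the one above yields a coefficient of −100 in percent, regardless of how large the individual drops are, since only the ordering matters.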

TABLE II RANK CORRELATION COEFFICIENTS (IN PERCENT) BETWEEN THE NUMBER OF CLASSES AND FIVE PERFORMANCE MEASURES FOR OrAda ON BALANCED DATA “100–100.” RECALL, PRECISION, AND F-MEASURE ARE CALCULATED FOR THE FIRST GENERATED CLASS

Almost all of the measures have coefficients of −1 in Table II, which means that they present very strong negative correlations with the number of classes. As more classes are added into the training data, the performance of AdaBoost in predicting any single class and its overall ability to discriminate between classes become worse. Multiclass itself increases the data complexity and negatively affects the classification performance regardless of whether data are imbalanced. It may imply that multiclass imbalance problems cannot be simply solved by rebalancing the number of examples among classes. Next, let us examine the imbalanced cases.

### C. Multiminority Cases

The correlation analysis and performance pattern analysis are conducted on the multiminority cases in this section. The number of minority classes is varied from 1 to 20. The impact of multiminority on the performance of oversampling and undersampling techniques is illustrated and analyzed in depth.

#### 1) Correlation Analysis

Five performance measures and three ensemble training methods (i.e., OrAda, OvAda, and UnAda) permit 15 pairwise correlations with respect to the number of minority classes. They show whether multiminority degrades the classification performance of the three ensemble training methods and which performance aspects are affected. The three single-class measures are recorded for the minority class that joins all the training sessions from 1 to 20. Table III summarizes the correlation coefficients for “10–100” and “100–1000” data groups.

TABLE III RANK CORRELATION COEFFICIENTS (IN PERCENT) BETWEEN THE NUMBER OF MINORITY CLASSES AND FIVE PERFORMANCE MEASURES FOR THREE ENSEMBLE METHODS ON “10–100” AND “100–1000” DATA. RECALL, PRECISION, AND F-MEASURE ARE CALCULATED FOR THE MINORITY CLASS

Fig. 1. Single-class performance patterns among classes in multiminority cases of “10–100” ($x$-axis: number of minority classes; $y$-axis: performance output). (a) Recall: OrAda. (b) Recall: OvAda. (c) Recall: UnAda. (d) Precision: OrAda. (e) Precision: OvAda. (f) Precision: UnAda. (g) F-measure: OrAda. (h) F-measure: OvAda. (i) F-measure: UnAda.

All pairs present very strong negative correlations on both groups of small and large data sets. It implies a strong monotone decreasing relationship between the measures and the number of minority classes. All of them are decreasing as more minority classes are added into the training data, regardless of the size of the training data and whether resampling is applied. In other words, multiminority reduces the performance of these ensembles consistently, and data resampling seems not to be helpful. Next, we will give a further investigation into the performance degradation caused by multiminority classes from the level of every single class.

#### 2) Performance Pattern Analysis

We now focus on the “10–100” group of data sets and illustrate the changing tendencies of single-class measures for all classes as the class number increases. In Fig. 1, the presented pattern reveals detailed information about how the classification performance of each class is affected and the differences among ensemble methods and evaluated measures. All of the following pattern plots are scaled in the same range.

According to Fig. 1, every class's performance is decreasing. No evidence shows that any class suffers more performance degradation than the others. The classification gets equally difficult on all classes. For each class, corresponding to one curve in the plot, the measure value drops faster during the first few steps, when the number of minority classes is smaller than approximately 10. As it gets larger, the reduction slows down.

Among the three performance measures, the drop of precision [Fig. 1(d) and (e)] is more severe than that of recall [Fig. 1(a) and (b)] in OrAda and OvAda. Precision is the main cause of the decrease in F-measure. The reason is that multiminority increases the risk of predicting an example into a wrong class. As to recall, it seems that the difficulty of recognizing examples within each class is less affected by multiminority as compared to precision because the proportion of each class of data in the whole data set is hardly changed by adding a small class. In UnAda, each class is reduced to have a small size. Adding minority classes changes the proportion of each class significantly. It explains why UnAda's recall [Fig. 1(c)] presents higher sensitivity to multiminority than the recall produced by OrAda and OvAda [Fig. 1(a) and (b)].

Among the three ensemble methods, OrAda and OvAda have similar performance patterns, where the majority class obtains higher recall and F-measure than the minority classes, but lower precision values. Oversampling does not alleviate the multiclass problem. Although oversampling increases the quantity of minority class examples to make every class have the same size, the class distribution in data space is still imbalanced and is dominated by the majority class. In UnAda, undersampling counteracts the performance differences among classes. During the first few steps, UnAda presents better recall and F-measure on minority classes [Fig. 1(c) and (i)] than OrAda and OvAda [Fig. 1(a), (b), (g), and (h)]. From this point of view, it seems that using undersampling might be a better choice. However, its advantage is weakened as more minority classes join the training. When the class number reaches 20, the three ensemble algorithms have very similar minority-class performance. The reason could be that undersampling explicitly empties some space for recognizing minority classes by removing examples from the majority class region. When there is only one minority class, a classifier is very likely to assign the space to this class. When there are many minority classes, they have to share the same space. Hence, the effect of undersampling is reduced. Undersampling seems to be more sensitive to multiminority. For this reason, it would be better to expand the classification area for each minority class, instead of shrinking the majority class. To achieve this goal, advanced techniques are necessary to improve the classification generalization over minority classes.

### D. Multimajority Cases

We proceed with the same analyses for the multimajority data “10–100” and “100–1000.” The number of majority classes is varied from 1 to 20. The impact of multimajority is studied here.

TABLE IV RANK CORRELATION COEFFICIENTS (IN PERCENT) BETWEEN THE NUMBER OF MAJORITY CLASSES AND FIVE PERFORMANCE MEASURES IN THREE ENSEMBLE METHODS ON “10–100” AND “100–1000” DATA. RECALL, PRECISION, AND F-MEASURE ARE CALCULATED FOR THE MINORITY CLASS

#### 1) Correlation Analysis

Table IV summarizes the correlation coefficients. Single-class performance measures are recorded for the only minority class of each data set. Similar to the multiminority cases, strong negative correlations between five performance measures and the number of majority classes are observed in both groups of small and large data sets, which indicate a monotone decreasing relationship. All three ensemble training methods suffer from performance reduction caused by “multimajority.”

Fig. 2. Single-class performance patterns among classes in multimajority cases of “10–100” ($x$-axis: number of majority classes; $y$-axis: performance output). (a) Recall: OrAda. (b) Recall: OvAda. (c) Recall: UnAda. (d) Precision: OrAda. (e) Precision: OvAda. (f) Precision: UnAda. (g) F-measure: OrAda. (h) F-measure: OvAda. (i) F-measure: UnAda.

#### 2) Performance Pattern Analysis

To gain more insight, we focus on the “10–100” group of data sets and present the changing tendencies of single-class measures for each class along with the increase of the number of majority classes in Fig. 2. All plots are in the same axis scale.

Among the classes in each plot, adding majority classes makes the recognition of examples of each class [i.e., recall presented in Fig. 2(a)–(c)] equally difficult. In OrAda and OvAda, minority-class precision drops faster than that of the majority classes [Fig. 2(d) and (e)] because the large quantity of new majority class examples overwhelms the minority class even more. Compared with majority class examples, minority class examples become more likely to be misclassified than before.

All performance measures present a drastic decrease. Especially in recall plots of OrAda and OvAda [Fig. 2(a) and (b)], more and more majority class examples take the recognition rate of the minority class down to nearly 0. For every existing majority class, adding more majority classes can make it appear to be in minority. Therefore, the recall of majority classes also shows a fast drop.

Among the three ensemble methods, UnAda produces better minority-class F-measure than OrAda and OvAda, but the recall of majority classes is sacrificed greatly. It causes the concern that using undersampling will lose too much data information when multiple majority classes exist and can lead to severe performance reduction in majority classes.

Based on all of the observations in this section, we draw the following conclusions: 1) As no new information is introduced into the minority class to facilitate the classification in OrAda and OvAda, overfitting minority-class regions happens with low recall and high precision values when compared with those measures obtained from the majority classes. Oversampling does not help in either multiminority or multimajority cases. 2) UnAda performs the same under multiminority and multimajority cases due to undersampling. In the multiminority case, UnAda can be sensitive to the class number; in the multimajority case, there is a high risk of sacrificing too much majority-class performance. 3) Between multiminority and multimajority, the multimajority case seems to be more difficult than the multiminority case. OrAda and OvAda present much worse minority-class performance in Fig. 2(g) and (h) compared to Fig. 1(g) and (h). This is because adding majority class examples aggravates the imbalanced situation. 4) Between balanced and imbalanced data, multiclass leads to performance degradation in both scenarios. We believe that learning imbalanced data is much harder than learning balanced data, because of the performance difference between the types of classes shown in the performance pattern analysis and the particular performance requirement for minority classes, neither of which arises in the balanced case. Because of different learning objectives, different treatments should be considered.

SECTION IV

## COMPARATIVE STUDY OF ENSEMBLE ALGORITHMS ON MULTICLASS IMBALANCE PROBLEMS

Armed with a better understanding of multiclass imbalance problems, this section aims to find a simple and effective ensemble learning method without using class decomposition. The presence of multiple minority classes increases data complexity. In addition to more complex data distributions, the presence of multiple majority classes makes a data set even more imbalanced. Balancing the performance among classes appropriately is important. As suggested by the analysis in the previous section, we attempt to improve the generalization of the learning method by focusing on minority classes, instead of shrinking majority classes through undersampling, in order to avoid losing useful data information and to keep the learning method less sensitive to the number of minority classes.

In our previous study [14], we found that the “random oversampling + AdaBoost.NC” tree ensemble is effective in handling two-class imbalance problems. It shows a good recognition rate of the minority class and balances the performance between minority and majority classes well by making use of ensemble diversity. Moreover, its training strategy is flexible and simple without removing any training data. For the aforementioned reasons, we look into this algorithm and extend our study to multiclass cases in this section. The main research question here is whether AdaBoost.NC is still effective in solving multiclass imbalance problems. In order to answer the question and to find out if class decomposition is necessary, AdaBoost.NC is compared with other state-of-the-art methods in cases of using and not using class decomposition, including the conventional AdaBoost, resampling-based AdaBoost, and SMOTEBoost [4]. AdaBoost is discussed as the baseline method, because the AdaBoost.NC algorithm is in the boosting training framework. Resampling techniques and the SMOTEBoost algorithm are chosen for their wide use in the multiclass imbalance learning literature [5], [17]. More ensemble solutions exist for two-class imbalance problems, such as RAMOBoost [36] and JOUS-Boost [37], which will be studied as our next step on how they perform when dealing with multiclass cases and what advantages they might have over the methods that we discuss in this paper. Cost-sensitive algorithms are another class of solutions. They are not considered in our experiments since we do not assume the availability of explicit cost information in our algorithm.

### A. Data Sets and Experimental Settings

In the experiments, we evaluate our candidate methods on 12 classification benchmark problems from the UCI repository [38]. Each data set has more than two classes. At least one of them is significantly smaller than one of the others. The data information with class distributions is summarized in Table V.

TABLE V SUMMARY OF BENCHMARK DATA SETS

Six ensemble models are constructed and compared, including AdaBoost without resampling applied (OrAda, the baseline model), “random oversampling + AdaBoost” (OvAda), “random undersampling + AdaBoost” (UnAda), “random oversampling + AdaBoost.NC” with $\lambda = 2$ (OvNC2), “random oversampling + AdaBoost.NC” with $\lambda = 9$ (OvNC9), and SMOTEBoost [4] (SMB).

With respect to parameter settings, the penalty strength $\lambda$ in AdaBoost.NC is set to 2 and 9 based on our previous findings [32]. $\lambda = 2$ is a relatively conservative setting to show if AdaBoost.NC can make a performance improvement, and $\lambda = 9$ encourages ensemble diversity aggressively, as we have explained in Section II. For a better understanding of how AdaBoost.NC can facilitate multiclass imbalance learning, both values (representing two extremes) are discussed here. Applying random oversampling is necessary for AdaBoost.NC not to ignore the minority class [14]. For SMOTEBoost, the nearest neighbor parameter $k$ is set to 5, the most accepted value in the literature. The amount of new data for a class $c$ is roughly the size difference between the largest class and class $c$, considering that the other models also adjust the between-class ratio to one. For now, we use fixed parameters for the algorithms that we compare with in order to single out the impact of different algorithmic features on the performance. Once we have a better understanding of the algorithms, we can then tune and optimize parameters by using some existing parameter-optimizing methods in our future studies [39].
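The amount of synthetic data per class and the way a single SMOTE point is formed can be sketched as below. This covers only the SMOTE data-generation idea with $k$ nearest neighbors, not the boosting loop of SMOTEBoost; Euclidean distance, the class sizes, and the function names are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def synthetic_counts(y):
    """Number of new examples per class: size gap between the largest class and that class."""
    counts = Counter(y)
    largest = max(counts.values())
    return {c: largest - n for c, n in counts.items()}


def smote_sample(rows, k, rng):
    """One synthetic point between a random row and one of its k nearest same-class neighbors."""
    base = rng.choice(rows)
    # Assumption: Euclidean distance; rows must contain at least two examples.
    others = [r for r in rows if r is not base]
    neighbours = sorted(others, key=lambda r: math.dist(base, r))[:k]
    other = rng.choice(neighbours)
    gap = rng.random()  # position along the segment from base to the chosen neighbor
    return [b + gap * (o - b) for b, o in zip(base, other)]


# Hypothetical class sizes and a tiny minority class in two dimensions.
y = ["a"] * 100 + ["b"] * 10 + ["c"] * 40
needed = synthetic_counts(y)
minority_rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rng = random.Random(0)
new_point = smote_sample(minority_rows, k=5, rng=rng)
```

Unlike random oversampling, each synthetic point lies on a segment between two existing minority examples rather than duplicating one of them.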

We still employ the C4.5 decision tree as the base learner, following the same settings as in the previous section. Each ensemble consists of 51 trees. As some data sets have very small classes, we perform ten runs of five-fold cross-validation (CV) instead of the traditional ten-fold CV, to guarantee that each fold of data contains at least one example from every minority class.
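A repeated five-fold protocol of this kind can be sketched as follows. The text does not state how the folds are formed, so the per-class (stratified) dealing used here is an assumption, as are the function names and the toy data; it simply shows one way to ensure that a class with five examples contributes one example to every fold.

```python
import random


def stratified_folds(labels, n_folds=5, seed=0):
    """Deal the shuffled members of each class round-robin into n_folds folds."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    folds = [[] for _ in range(n_folds)]
    for members in by_class.values():
        rng.shuffle(members)
        for position, idx in enumerate(members):
            folds[position % n_folds].append(idx)
    return folds


def repeated_stratified_cv(labels, n_folds=5, n_runs=10, seed=0):
    """Yield (run, fold, train_indices, test_indices) for n_runs repetitions."""
    for run in range(n_runs):
        folds = stratified_folds(labels, n_folds, seed + run)
        for k in range(n_folds):
            test = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield run, k, train, test


# Toy data: 40 majority examples and a minority class of only 5 examples.
labels = ["maj"] * 40 + ["min"] * 5
splits = list(repeated_stratified_cv(labels))
```

With ten folds, the five-example minority class in this toy data could not appear in every fold, which is the motivation given in the text for using five folds.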

MAUC [26] and extended G-mean [19] are used to evaluate the overall performance as before. Recall and precision are recorded as the single-class performance measures to explain how an overall performance improvement or degradation happens.

### B. Ensemble Algorithms for Multiclass Imbalance Problems

In this section, we first discuss the performance of ensemble algorithms without any class decomposition and then that of the ones using the OAA scheme of class decomposition, the most frequently used scheme in the literature. Based on the observations and analysis, an improved combination strategy for OAA-based ensembles is then proposed. In order to evaluate the significance of the results from the comparative methods, we carry out the Friedman test [40] with the corresponding post-hoc test recommended by Demsar [41]. The Friedman test is a nonparametric statistical method for testing whether all of the algorithms are equivalent over multiple data sets. If the test rejects the null hypothesis, i.e., there are some differences between the algorithms, we then proceed with a post-hoc test to find out which algorithms actually differ. This paper uses an improved Friedman test proposed by Iman and Davenport [42]. The Bonferroni–Dunn test [43] is used as the post-hoc test method. Finally, we show whether class decomposition is necessary for multiclass imbalance learning based on Student's t-test with 95% confidence level.
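As a rough illustration of the testing procedure, the sketch below computes mean ranks, the Friedman statistic $\chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_j R_j^2 - \frac{k(k+1)^2}{4}\right]$, and the Iman–Davenport correction $F_F = \frac{(N-1)\chi_F^2}{N(k-1) - \chi_F^2}$ for $k$ algorithms over $N$ data sets; $F_F$ is compared against the F-distribution with $k-1$ and $(k-1)(N-1)$ degrees of freedom. The function names and the score matrix are made up for illustration and are not results from this paper.

```python
def mean_ranks(scores, higher_is_better=True):
    """scores[d][a] is the score of algorithm a on data set d.

    Returns the mean rank of each algorithm (rank 1 = best); tied scores
    share the average of the ranks they occupy.
    """
    n_alg = len(scores[0])
    totals = [0.0] * n_alg
    for row in scores:
        order = sorted(range(n_alg), key=lambda a: row[a], reverse=higher_is_better)
        pos = 0
        while pos < n_alg:
            end = pos
            while end + 1 < n_alg and row[order[end + 1]] == row[order[pos]]:
                end += 1
            avg_rank = (pos + end) / 2 + 1  # average of 1-based ranks pos+1..end+1
            for a in order[pos:end + 1]:
                totals[a] += avg_rank
            pos = end + 1
    return [t / len(scores) for t in totals]


def friedman_chi2(ranks, n_datasets):
    """Friedman chi-square statistic computed from the mean ranks."""
    k = len(ranks)
    return 12 * n_datasets / (k * (k + 1)) * (
        sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4
    )


def iman_davenport(chi2, n_datasets, k):
    """Iman-Davenport correction of the Friedman statistic."""
    return (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)


# Made-up scores: 4 data sets (rows) by 3 algorithms (columns).
scores = [
    [0.90, 0.80, 0.70],
    [0.85, 0.80, 0.60],
    [0.70, 0.90, 0.50],
    [0.95, 0.90, 0.80],
]
ranks = mean_ranks(scores)
chi2 = friedman_chi2(ranks, len(scores))
f_stat = iman_davenport(chi2, len(scores), len(ranks))
```

Obtaining a p-value from `f_stat` requires the F-distribution, which is left out here to keep the sketch free of external dependencies.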

TABLE VI MEAN RANKS OF THE SIX ENSEMBLE MODELS WITHOUT USING OAA OVER 12 DATA SETS, INCLUDING OrAda, OvAda, UnAda, OvNC2, OvNC9, AND SMB

#### 1) Ensemble Models Without Using OAA

Tables VI and VII present the Friedman and post-hoc test results on MAUC, extended G-mean, recall, and precision for the six ensemble models without using class decomposition. Considering the existence of multiple minority and majority classes, we only discuss recall and precision for the smallest class and the largest class, which should be the most typical ones in the data set. The minority-class recall and precision are denoted by $R_{\min}$ and $P_{\min}$, and the majority-class recall and precision are denoted by $R_{\rm maj}$ and $P_{\rm maj}$. The Friedman test compares the mean ranks of algorithms, which are shown in Table VI. A smaller value indicates a higher rank, i.e., better performance. Table VII gives the Friedman and post-hoc test results by choosing OvNC9 as the “control” method, in which each number is the difference of mean ranks between the “control” method and one of the other methods in the corresponding column. The CD value shown at the bottom is the critical difference value [41]. It is determined by the number of algorithms and data sets, and a critical value $q_{\alpha}$ that is equal to 2.326 in our case. The performance of any two methods is significantly different if their difference of mean ranks is larger than CD.
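The critical difference can be reproduced from the quantities named above using the formula given by Demsar [41], $\mathrm{CD} = q_{\alpha}\sqrt{k(k+1)/(6N)}$, with $k$ algorithms and $N$ data sets. The sketch below is an illustration under that formula; the resulting number is derived from the stated inputs ($q_{\alpha} = 2.326$, $k = 6$, $N = 12$) and is not quoted from Table VII. The function names are assumptions.

```python
import math


def critical_difference(q_alpha, n_algorithms, n_datasets):
    """Critical difference of mean ranks for the Bonferroni-Dunn post-hoc test."""
    return q_alpha * math.sqrt(
        n_algorithms * (n_algorithms + 1) / (6.0 * n_datasets)
    )


def is_significant(rank_control, rank_other, cd):
    """Two methods differ significantly if their mean ranks differ by more than CD."""
    return abs(rank_control - rank_other) > cd


# Six ensemble models compared over 12 data sets, as in this section.
cd = critical_difference(2.326, 6, 12)

# MAUC mean ranks of OvNC9 (4.5) and OvNC2 (3.417), as reported in the text.
ovnc9_vs_ovnc2 = is_significant(4.5, 3.417, cd)
```

Under these assumptions, CD is roughly 1.78, so the MAUC rank gap of about 1.08 between OvNC9 and OvNC2 would not count as significant.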

TABLE VII FRIEDMAN TEST WITH THE CORRESPONDING POST-HOC TEST (BONFERRONI–DUNN) FOR THE SIX ENSEMBLE MODELS WITHOUT USING OAA. THE DIFFERENCE OF MEAN RANKS BETWEEN OvNC9 AND EVERY OTHER METHOD IS CALCULATED. CD IS THE CRITICAL DIFFERENCE [41]. A VALUE LARGER THAN CD INDICATES A SIGNIFICANT DIFFERENCE BETWEEN THE METHODS, HIGHLIGHTED IN BOLDFACE

For the overall performance measures MAUC and G-mean, we make the following observations. AdaBoost.NC does not show much advantage over other methods in terms of MAUC, where the mean ranks of OvNC2 and OvNC9 are 3.417 and 4.5, respectively. OrAda has the highest rank. The Friedman test rejects the null hypothesis for MAUC, which means that there are significant differences among the algorithms. Concretely, OrAda and SMB produce significantly better MAUC than OvNC9, and OvNC9 produces comparable MAUC to the others. However, the picture is different for G-mean. OvNC9 achieves the best G-mean, and OrAda gives the worst. OvNC9 is significantly better than resampling-based AdaBoost and comparable to SMB.

It is interesting to observe that the conventional AdaBoost without using any specific rebalancing technique is good at MAUC and bad at G-mean. It is known that AdaBoost itself cannot handle class imbalance problems very well because it is sensitive to imbalanced distributions [24], [44]. Meanwhile, our experiments in the previous section show that it suffers significantly from multiclass difficulties. This means that the low G-mean of AdaBoost results from its low recognition rates on the minority classes, and the high MAUC is probably attributed to the relatively good discriminability between the majority classes. The other ensemble methods, namely, AdaBoost.NC, SMOTEBoost, and resampling-based AdaBoost, seem to be more effective in improving G-mean than MAUC.

To explain this observation, let us recall the definitions of MAUC and G-mean. G-mean is the geometric mean of recall over all classes. If any single class receives very low recall, it will pull the G-mean value down. It tells us how well a classifier balances the recognition among different classes. A high G-mean guarantees that no class is ignored. MAUC assesses the average ability of separating any pair of classes. A high MAUC implies that a classifier is good at separating most pairs, but it is still possible that some classes are hard to distinguish from the others. G-mean is therefore more sensitive to single-class performance than MAUC. From this point of view, the results suggest that those ensemble solutions for class imbalance learning, especially OvNC9, can better recognize examples from the minority classes but are not good at discriminating between some majority classes. To confirm our explanations, we look into single-class performance next.
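The sensitivity of G-mean to a single poorly recognized class can be seen in a small sketch that follows the definition above (geometric mean of recall over all classes). The function names and the toy predictions are illustrative assumptions, not data from the experiments.

```python
import math


def per_class_recall(y_true, y_pred):
    """Recall of each class: correctly predicted members / all members."""
    recalls = {}
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls[c] = hits / total
    return recalls


def extended_g_mean(y_true, y_pred):
    """Geometric mean of recall over all classes."""
    recalls = per_class_recall(y_true, y_pred)
    return math.prod(recalls.values()) ** (1.0 / len(recalls))


# Toy three-class problem: class 0 is the majority, classes 1 and 2 are small.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]

# Half of class 1 is missed: recalls are 1.0, 0.5, 1.0.
g_partial = extended_g_mean(y_true, [0, 0, 0, 0, 1, 0, 2, 2])

# Class 2 is ignored entirely: its recall is 0, so G-mean collapses to 0.
g_ignored = extended_g_mean(y_true, [0, 0, 0, 0, 1, 1, 0, 0])
```

The second case has six of eight examples classified correctly, yet G-mean is zero, which is why a high G-mean guarantees that no class is ignored.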

Not surprisingly, UnAda performs best in recall but produces the worst precision for the minority class because of the loss of a large amount of majority class data. OvNC9 ranks second in minority-class recall, which is competitive with UnAda and significantly better than the others. OvNC9 produces higher minority-class recall than OvNC2, which implies that a large $\lambda$ can further improve the generalization of AdaBoost.NC on the minority class. More minority class examples can be identified by setting a large $\lambda$. Meanwhile, it is encouraging to see that OvNC9 does not lose too much performance on minority-class precision, where no significant differences are observed. However, it sacrifices some majority-class recall when compared with OrAda and OvAda because of the performance tradeoff between minority and majority classes.

The observations on the smallest and largest classes explain that the good G-mean of OvNC9 results from an improvement in recall of the minority classes that outweighs the recall reduction of the majority classes. Its ineffectiveness in MAUC is likely caused by the relatively poor performance in the majority classes. Based on these results, we conclude that AdaBoost.NC with a large $\lambda$ is helpful in recognizing minority class examples with high recall and is capable of balancing the performance across different classes with high G-mean. From the view of MAUC and majority-class performance, it could lose some learning ability to separate majority classes. In addition, SMOTEBoost presents a quite stable overall performance in terms of both MAUC and G-mean.

TABLE VIII MEAN RANKS OF THE SIX ENSEMBLE MODELS USING OAA OVER 12 DATA SETS, INCLUDING OrAda-d, OvAda-d, UnAda-d, OvNC2-d, OvNC9-d, AND SMB-d

#### 2) Ensemble Models Using OAA

We use exactly the same ensemble methods here, but we let them work with the OAA class decomposition scheme. In this group of models, one builds a set of binary classifiers, which will then be combined for a final decision. We adopt the combining strategy used in [8], which outputs the class whose corresponding binary classifier produces the highest value of belongingness among all. These models are denoted by “-d,” to indicate that OAA is used.
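To make the OAA setup concrete, the sketch below relabels a multiclass problem into one binary subproblem per class and combines the binary outputs by picking the class with the highest belongingness value. The helper names, toy class sizes, and scores are assumptions made for illustration; the binary classifiers themselves are not modeled.

```python
from collections import Counter


def oaa_relabel(labels, positive_class):
    """Binary labels for one OAA subproblem: 1 for the chosen class, 0 otherwise."""
    return [1 if y == positive_class else 0 for y in labels]


def oaa_predict(belongingness):
    """Output the class whose binary classifier gives the highest belongingness.

    belongingness maps each class to the score from its own binary classifier.
    """
    return max(belongingness, key=belongingness.get)


# Toy three-class data set with class sizes 50, 40, and 10.
labels = ["a"] * 50 + ["b"] * 40 + ["c"] * 10

# The subproblem for the smallest class becomes 10 positives versus 90 negatives.
subproblem_c = Counter(oaa_relabel(labels, "c"))

# Hypothetical belongingness values for one input example.
decision = oaa_predict({"a": 0.6, "b": 0.3, "c": 0.2})
```

The 10-versus-90 split also illustrates why merging all other classes into the negative class aggravates the imbalance faced by each binary classifier.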

The comparison results for the ensembles using OAA are summarized in Tables VIII and IX. Table VIII presents the mean ranks of the constructed models on all of the performance measures. Based on the ranks, the Friedman and post-hoc test results are given in Table IX by choosing OvNC9-d as the “control” method. We observe that no single class imbalance learning method actually outperforms the conventional AdaBoost (i.e., OrAda-d) significantly in terms of either MAUC or G-mean. SMB-d appears to be relatively stable, with slightly better MAUC and G-mean than the others. UnAda-d has the lowest rank of MAUC, and OvNC9-d has the lowest rank of G-mean. Generally speaking, the overall performance of all of the methods is quite close to one another. These results are different from what we have observed in the cases without using OAA, where OvNC9 yields the best G-mean. It seems that class imbalance techniques are not very effective when working with the OAA scheme.

TABLE IX FRIEDMAN TEST WITH THE CORRESPONDING POST-HOC TEST (BONFERRONI–DUNN) FOR THE SIX ENSEMBLE MODELS USING OAA. THE DIFFERENCE OF MEAN RANKS BETWEEN OvNC9-d AND EVERY OTHER METHOD IS CALCULATED. CD IS THE CRITICAL DIFFERENCE [41]. A VALUE LARGER THAN CD INDICATES A SIGNIFICANT DIFFERENCE BETWEEN THE METHODS, HIGHLIGHTED IN BOLDFACE

As to the single-class performance in the smallest class, there is no significant difference among the methods in terms of recall because the Friedman test does not reject the null hypothesis. Resampling techniques, SMOTEBoost, and AdaBoost.NC do not show much advantage in identifying minority class examples. In terms of the minority-class precision, OvNC9-d is significantly worse than OrAda-d and OvAda-d according to Table IX. The same holds for the largest class. We conclude that class imbalance learning techniques are ineffective in both minority and majority classes when compared with the conventional AdaBoost in this group of comparisons. Neither type of class is better recognized. When the OAA scheme is applied to handle multiclass problems, these techniques do not bring any performance improvement.

According to our results here, AdaBoost.NC does not show any significant improvement in minority-class and overall performance when working with the class decomposition scheme in multiclass imbalance scenarios, although it showed good classification ability in dealing with two-class imbalance problems [14]. A possible reason for its poor performance could be that the combining step of OAA distorts the individual results. Without using OAA, AdaBoost.NC receives and learns from complete data information of all classes, which allows the algorithm to consider the difference among classes during learning with full knowledge. The OAA scheme, however, makes AdaBoost.NC learn from several decomposed subproblems with partial data knowledge. The relative importance between classes is lost. Even if AdaBoost.NC is good at handling each subproblem, the combination of the subproblem results does not guarantee good performance for the whole problem. Therefore, it may not be a wise idea to integrate class decomposition with class imbalance techniques without considering the class distribution globally. A better combining method for class decomposition schemes is needed.

#### 3) Ensemble Models Using OAA With an Improved Combination Method

To take into account the class information of the whole data set, we improve the combination method of OAA in this section by using a weighted combining rule. Instead of the traditional way of treating the outputs of binary classifiers equally [8], we assign them different weights, determined by the size of each class. For any input example $x$, its belongingness value of class $i$ from the $i$th binary classifier is multiplied by the inverse of the imbalance rate of class $i$. The imbalance rate is defined as the proportion of this class of data within the whole data set. The final decision of OAA is the class receiving the highest belongingness value after being adjusted by the weights.
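The weighted combining rule can be sketched in a few lines: each belongingness value is divided by the proportion of its class in the data set (equivalently, multiplied by the inverse of the imbalance rate), and the class with the largest adjusted value wins. The function names, class sizes, and scores below are illustrative assumptions.

```python
from collections import Counter


def class_proportions(labels):
    """Imbalance rate of each class: its share of the whole data set."""
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in counts}


def weighted_oaa_predict(belongingness, proportions):
    """Pick the class with the highest belongingness after weighting each
    value by the inverse of that class's imbalance rate."""
    adjusted = {c: score / proportions[c] for c, score in belongingness.items()}
    return max(adjusted, key=adjusted.get)


# Toy three-class data set with class sizes 50, 40, and 10.
labels = ["a"] * 50 + ["b"] * 40 + ["c"] * 10
proportions = class_proportions(labels)

# Hypothetical belongingness values from the three binary classifiers.
scores = {"a": 0.6, "b": 0.3, "c": 0.2}
unweighted = max(scores, key=scores.get)
weighted = weighted_oaa_predict(scores, proportions)
```

Here the adjusted values are 1.2, 0.75, and 2.0, so the decision shifts from the majority class to the smallest class. This shows how the weights reintroduce the global class distribution that is lost in the individual subproblems.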

We apply the same ensemble methods as in the previous sections to the 12 UCI data sets. Six ensemble models are constructed. They are denoted by “-dw,” indicating that class decomposition with a weighted combination is used. The Friedman and post-hoc test results are summarized in Tables X and XI.

TABLE X MEAN RANKS OF THE SIX ENSEMBLE MODELS USING OAA WITH THE WEIGHTED COMBINATION OVER 12 DATA SETS, INCLUDING OrAda-dw, OvAda-dw, UnAda-dw, OvNC2-dw, OvNC9-dw, AND SMB-dw

TABLE XI FRIEDMAN TEST WITH THE CORRESPONDING POST-HOC TEST (BONFERRONI–DUNN) FOR THE SIX ENSEMBLE MODELS USING OAA WITH THE WEIGHTED COMBINATION. THE DIFFERENCE OF MEAN RANKS BETWEEN OvNC9-dw AND EVERY OTHER METHOD IS CALCULATED. CD IS THE CRITICAL DIFFERENCE [41]. A VALUE LARGER THAN CD INDICATES A SIGNIFICANT DIFFERENCE BETWEEN THE METHODS, HIGHLIGHTED IN BOLDFACE

TABLE XII MEANS AND STANDARD DEVIATIONS OF MAUC, G-MEAN, AND MINORITY-CLASS RECALL BY AdaBoost.NC WITH $\lambda = 9$ AND SMOTEBoost METHODS WITHOUT USING OAA (i.e., OvNC9 AND SMB) AND USING OAA WITH THE WEIGHTED COMBINATION (i.e., OvNC9-dw AND SMB-dw). RECALL IS COMPUTED FOR THE SMALLEST CLASS OF EACH DATA SET. VALUES IN BOLDFACE INDICATE “SIGNIFICANTLY BETTER” BETWEEN OvNC9/SMB AND OvNC9-dw/SMB-dw

It is encouraging to observe that the ineffectiveness of AdaBoost.NC used with OAA is rectified by the weighted combination method in terms of G-mean and minority-class recall. The following observations are obtained: 1) OvNC9-dw presents significantly better MAUC than UnAda-dw and is competitive with the others. 2) OvNC9-dw produces the best G-mean with the highest mean rank, which is significantly better than the others except for SMB-dw. UnAda-dw gives the worst G-mean. 3) Except for UnAda-dw, OvNC9-dw produces significantly better minority-class recall than the other models without sacrificing minority-class precision. OvNC9-dw is competitive with UnAda-dw in recognizing minority class examples. 4) OvNC9-dw loses some accuracy in finding majority class examples compared to OrAda-dw and OvAda-dw.

In summary, by applying the improved combination method to OAA, AdaBoost.NC with a large $\lambda$ can find more minority class examples with a higher recall and better balance the performance across different classes with a higher G-mean than other methods. SMOTEBoost is a relatively stable algorithm in terms of overall performance, with MAUC and G-mean competitive with those of AdaBoost.NC.

#### 4) Is Class Decomposition Necessary?

The discussions in this section aim to answer the question of whether it is necessary to use class decomposition for handling multiclass imbalance problems. We compare the overall and minority-class performance of OvNC9 and SMB with the performance of OvNC9-dw and SMB-dw. AdaBoost.NC with $\lambda = 9$ and SMOTEBoost are chosen because AdaBoost.NC performs better at G-mean and minority-class recall, and SMOTEBoost presents good and stable MAUC and G-mean. Raw performance outputs from the 12 data sets are shown in Table XII. Values in boldface indicate “significantly better” between OvNC9 (SMB) and OvNC9-dw (SMB-dw) based on Student's t-test with 95% confidence level.

According to the table, no consistent difference is observed between OvNC9 and OvNC9-dw in the three performance measures. In most cases, their measure values are competitive with each other. OvNC9-dw shows slightly better G-mean with more wins. The same happens between SMB and SMB-dw. This suggests that whether to apply OAA does not affect class imbalance learning methods much. Learning from the whole data set directly is sufficient for them to achieve good MAUC and G-mean and to find minority-class examples effectively. Therefore, we conclude that using class decomposition is not necessary to tackle multiclass imbalance problems. Moreover, Table XII further confirms our previous conclusion from the Friedman and post-hoc tests that AdaBoost.NC has better generalization, especially for the minority class. For example, SMOTEBoost produces zero G-mean and zero minority-class recall on data set “Balance,” which means that no examples from the minority class are found and the obtained classifier is barely useful, while AdaBoost.NC changes this situation with much better G-mean and minority-class recall.

SECTION V

## CONCLUSION

This paper has studied the challenges of multiclass imbalance problems and has investigated the generalization ability of ensemble algorithms, including AdaBoost.NC [1], to deal with multiclass imbalance data. Two types of multiclass imbalance problems, i.e., the multiminority and multimajority cases, are studied in depth. For each type, we examine overall and minority-class performance of three ensemble methods based on the correlation analysis and performance pattern analysis. Both types show strong negative correlations with the five performance measures, which are MAUC, G-mean, minority-class recall, minority-class precision, and minority-class F-measure. This implies that the performance decreases as the number of imbalanced classes increases. The results from the performance pattern analysis show that the multimajority case tends to cause more performance degradation than the multiminority case because the imbalance rate gets more severe. Oversampling does not help the classification and causes overfitting to the minority classes with low recall and high precision values. Undersampling is sensitive to the number of minority classes and suffers from performance loss on majority classes. This suggests that a good solution should overcome the overfitting problem of oversampling but not by cutting down the size of majority classes.

Based on the analysis, we investigate a group of ensemble approaches, including AdaBoost.NC, on a set of benchmark data sets with multiple minority and/or majority classes with the aim of tackling multiclass imbalance problems effectively and efficiently. When the ensembles are trained without using class decomposition, AdaBoost.NC working with random oversampling shows better G-mean and minority-class recall than the others, which indicates good generalization for the minority class and the superior ability to balance the performance across different classes.

Our results also show that using class decomposition (the OAA scheme in our experiments) does not provide any advantages in multiclass imbalance learning. For AdaBoost.NC, its G-mean and minority-class recall are even reduced significantly by the use of class decomposition. The reason for this performance degradation seems to be the loss of global information of class distributions in the process of class decomposition. An improved combination method for the OAA scheme is therefore proposed, which assigns different weights to binary classifiers learned from the subproblems after the decomposition. The weight is decided by the proportion of the corresponding class within the data set, which delivers the distribution information of each class. By doing so, the effectiveness of AdaBoost.NC in G-mean and minority-class recall is improved significantly.

In regard to other methods, SMOTEBoost shows a quite stable performance in terms of MAUC and G-mean. Oversampling itself does not bring much benefit to AdaBoost. Undersampling harms majority-class performance greatly.

Finally, we compare the ensembles without using OAA and the ones using OAA with the weighted combination method. The result suggests that it is not necessary to use class decomposition, and learning from the whole data set directly is sufficient for class imbalance learning techniques to achieve good performance.

Future work of this study includes the following: 1) an in-depth study of conditions, including parameter values, under which an ensemble approach, such as AdaBoost.NC, is able to improve the performance of multiclass imbalance learning; currently, the parameter $\lambda$ in AdaBoost.NC is predefined, and a large $\lambda$ shows greater benefits; some parameter-optimizing methods might be helpful here [39]; 2) an investigation of how the effectiveness of other two-class imbalance learning methods, such as RAMOBoost [36], RareBoost [45], JOUS-Boost [37], and cost-sensitive methods, is affected by multiclass and of their potential advantages; 3) a theoretical study of the advantages and disadvantages of the proposed methods for multiclass imbalance problems and how they handle the multiclass imbalance; 4) an investigation of new ensemble algorithms that combine the strength of AdaBoost.NC and SMOTEBoost; 5) a theoretical framework for analyzing multiclass imbalance problems since it is unclear how an imbalance rate could be more appropriately defined.

## Footnotes

This work was supported by an ORS Award, an EPSRC Grant (No. EP/D052785/1), and a European FP7 Grant (No. 270428). This paper was recommended by Associate Editor N. Chawla.

The authors are with the Centre of Excellence for Research in Computational Intelligence and Applications, School of Computer Science, University of Birmingham, B15 2TT Birmingham, U.K. (e-mail: S.Wang@cs.bham.ac.uk; X.Yao@cs.bham.ac.uk).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
