Feature Learning Viewpoint of AdaBoost and a New Algorithm

The AdaBoost algorithm is well known for its resistance to overfitting, and understanding this phenomenon is a fundamental theoretical problem. Many studies explain it from the statistical view or through margin theory. In this paper, we explain it from a feature learning viewpoint by means of the proposed AdaBoost+SVM algorithm. First, we use AdaBoost to learn the base classifiers. Then, instead of combining the base classifiers directly, we regard their outputs as features and feed them to an SVM classifier. This yields a new coefficient vector and bias, which are used to construct the final classifier. We explain the rationale for this construction and show that, when the dimension of these features increases, the performance of the SVM does not get worse, which offers an explanation for AdaBoost's resistance to overfitting.

According to Occam's razor [11], a classifier that is trained to be too complex tends to perform worse rather than better. This phenomenon is called overfitting: the trained model adapts so closely to the training data that it exaggerates slight fluctuations in them, which leads to poor generalization performance [4], [12]. The AdaBoost algorithm, however, is resistant to overfitting, as many researchers have observed [13]-[15]. Understanding this behavior of AdaBoost is a fascinating fundamental theoretical problem [4], [12]. Many studies are devoted to explaining the success of AdaBoost, and they can be divided into the statistical view and margin theory [16].
In the statistical view, great efforts were made to explain the success of the AdaBoost algorithm. Friedman et al. [17] used well-known statistical principles, namely additive modeling and maximum likelihood, to understand this phenomenon. In addition, many boosting-style algorithms were proposed that optimize a potential loss function by gradient descent [4], [18], [19]. Inspired by this optimization view, some boosting-style algorithms and variants that are Bayes-consistent under different conditions were presented [4], [20]-[26]. However, the main weakness of the statistical view is that it does not explain well why AdaBoost is resistant to overfitting [4], [12].
F. Wang, ZH. Li and W. Yu are with the National Engineering Laboratory for Visual Information Processing and Applications, and the School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China (e-mail: wfx@mail.xjtu.edu.cn, lizhongheng2010@gmail.com and yuwz05@xjtu.edu.cn).
F. He is with the Xi'an Research Institute of Hi-Tech, Xi'an, Shaanxi 710025, China (e-mail: fanghe1107@gmail.com).
Rong Wang and F. Nie are with the Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi'an, Shaanxi 710072, China (e-mail: wangrong07@tsinghua.org.cn and feipingnie@gmail.com).
Margin theory is another direction for explaining this behavior. Schapire et al. [27] were the first to use margin theory to explain AdaBoost's resistance to overfitting. Generally speaking, the margin of an example with respect to a classifier measures how well the classifier handles that example [10]. Schapire et al. [27] demonstrated that AdaBoost produces a good margin distribution, which they regarded as the key to its success. Soon afterwards, Breiman [28] cast doubt on this margin explanation. He proposed a boosting-type algorithm named arc-gv, which directly maximizes the minimum margin of a voting classifier. In experiments, arc-gv generated a larger minimum margin than AdaBoost but had a higher generalization error. Breiman concluded that neither the margin distribution nor the minimum margin determines the generalization error. Later, Reyzin and Schapire [3] found that Breiman had not controlled model complexity well in his experiments. They repeated Breiman's experiments using decision stumps with two leaves, and the results showed that arc-gv had a larger minimum margin but a worse margin distribution. Therefore, a convincing explanation is urgently needed [10].
To support margin theory, Wang et al. [29] proved a bound in terms of a new margin measure called the Equilibrium margin (Emargin). The Emargin bound is uniformly sharper than Breiman's minimum margin bound. Their results suggest that the minimum margin may not be crucial for the generalization error, and that a small empirical error at the Emargin implies a smaller bound on the generalization error. Gao and Zhou [4] proposed the kth margin bound to defend margin theory against Breiman's doubt through a series of mathematical derivations. This bound is uniformly tighter than both Breiman's and Schapire's bounds and considers the same factors as Schapire et al. and Breiman. Zhou et al. [12] proposed the Large margin Distribution Machine (LDM), which achieves better generalization performance by optimizing the margin distribution. The margin distribution is characterized by a first-order statistic, the margin mean, and a second-order statistic, the margin variance; LDM maximizes the margin mean and minimizes the margin variance at the same time. This method achieved satisfactory results. However, completely explaining AdaBoost's resistance to overfitting remains difficult.
In this paper, we explain AdaBoost's resistance to overfitting from a feature learning viewpoint with the help of the SVM classifier. SVM is widely used for its clear principles and competitive accuracy [30]-[37]. We regard the outputs of AdaBoost's base classifiers as features, feed them to an SVM, and explain the rationale for doing so. Under this view, the dimension of the features increases as the number of AdaBoost iterations increases. We show that the margin of the SVM (not the margin of AdaBoost itself) does not become smaller when the feature dimension increases. This implies that the performance of our AdaBoost+SVM model does not degrade as the iterations increase, which explains AdaBoost's resistance to overfitting directly and without a complex proof. In passing, we also remark that the error rate of a binary classifier can always be made no larger than 0.5, a point that is seldom stated explicitly.
The rest of this paper is organized as follows. In Section II, we briefly survey related work on the AdaBoost algorithm. In Section III, we present our AdaBoost+SVM model. In Section IV, we validate our method on several datasets. In Section V, we conclude the paper.

II. RELATED WORK
In this section, we briefly review the general AdaBoost algorithm and the popular theoretical explanation of AdaBoost from the viewpoint of margin theory.

A. AdaBoost Algorithm
The AdaBoost algorithm [1] is a boosting classification algorithm, that is, one that boosts a group of weak classifiers into a strong classifier. Such algorithms usually first use a base classification algorithm, whose classification ability is only slightly better than random guessing, to train a base classifier on the initial training samples. The sample weights are then adjusted according to the result of the base classifier so that misclassified samples receive more attention, and the reweighted samples are used to train the next base learner. After the iterations finish, the base learners are combined with weights to form the final classifier. The AdaBoost algorithm is described next.
Let $S = \{(x_1, y_1), \cdots, (x_n, y_n)\}$ denote the training set, where $y_i \in \{-1, +1\}$ is the class label associated with $x_i$. The AdaBoost algorithm is based on the additive model, which is a linear combination of the base classifiers $h_t(x)$:
$$f(x) = \sum_{t=1}^{T} \alpha_t h_t(x), \tag{1}$$
where $t \in \{1, \cdots, T\}$ denotes the iteration number, $h_t(x)$ are the base classifiers trained by the base classification algorithm $L$, whose classification ability is only slightly better than random guessing, and $\alpha_t$ are the weight coefficients.

Algorithm 1 AdaBoost algorithm
Input: Base classification algorithm $L$; Number of learning rounds $T$.
Initialize the sample weight vector $D_1$;
for $t = 1, \cdots, T$ do
  1. Use the base classification algorithm $L$ and the current weight $D_t$ to learn the base classifier $h_t(x)$ by minimizing the classification error $\epsilon_t$ defined in Eq. (6);
  2. Calculate the coefficient $\alpha_t$ based on Eq. (7);
  3. Update the weight $D_{t+1}$ by Eq. (4);
end
Combine the obtained $h_t(x)$ according to Eq. (10) to form the final classifier.
Output: The final classifier $F(x)$.
In Eq. (1), each $h_t(x)$ is learned by the base classification algorithm $L$ from the training sample weighted by $D_t$ at iteration $t$.
The weight vector $D_t$ gives the weight of each instance in $S$ at iteration $t$. $D_1$ is initialized as
$$D_1(i) = \frac{1}{n}, \quad i = 1, \cdots, n,$$
and $D_{t+1}$ is computed as
$$D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}, \tag{4}$$
where $Z_t$ are the normalization factors, calculated as
$$Z_t = \sum_{i=1}^{n} D_t(i) \exp(-\alpha_t y_i h_t(x_i)). \tag{5}$$
From Eq. (4) and Eq. (5) we see that $D_{t+1}$ is obtained by adjusting $D_t$, so the samples misclassified by $h_t(x)$ receive higher weights at iteration $t + 1$.
Given the training set $S$ and the sample weights $D_t$, the objective of $h_t(x)$ is to minimize the classification error $\epsilon_t$, which is calculated as
$$\epsilon_t = P_{i \sim D_t}[h_t(x_i) \neq y_i] = \sum_{i=1}^{n} D_t(i)\, I[h_t(x_i) \neq y_i], \tag{6}$$
where $P[\cdot]$ denotes probability and $I[\cdot]$ denotes the indicator function, which equals 1 if its argument is true and 0 otherwise.
In Eq. (1), $\alpha_t$ measures the importance of $h_t(x)$ in the final classifier and is calculated as
$$\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}. \tag{7}$$
From Eq. (6) and Eq. (7) we see that $\alpha_t > 0$ when $\epsilon_t < 0.5$, and that $\alpha_t$ increases as $\epsilon_t$ decreases. In fact, AdaBoost minimizes the exponential loss function in this process [17].
We can also notice that in the AdaBoost algorithm $\epsilon_t$ must be smaller than 0.5. On this condition, we make the following remark, which is seldom stated explicitly.
Remark. The error rate of a binary classifier can always be made no larger than 0.5.
Explanation. For binary classification problems, if the error rate $\epsilon$ of a weak classifier $h(x)$ is larger than 0.5, we can replace $h(x)$ with the flipped classifier
$$h^c(x) = -h(x),$$
whose error rate is $\epsilon^c = 1 - \epsilon$. Then $\epsilon^c$ is smaller than 0.5. This remark shows that $\epsilon_t$ can always be taken smaller than 0.5 unless it equals 0.5 exactly, and it is very unlikely that the error rate of a classifier is exactly 0.5.
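The flipping argument in the explanation can be checked numerically. The sketch below is illustrative only: the toy samples, the deliberately poor classifier `h`, and the helper names `error_rate` and `flip` are assumptions made for this example and are not part of the original method.

```python
def error_rate(h, X, y):
    """Fraction of samples on which classifier h disagrees with the label."""
    return sum(h(x) != yi for x, yi in zip(X, y)) / len(X)


def flip(h):
    """Return the classifier that negates every prediction of h."""
    return lambda x: -h(x)


# Hypothetical 1-D toy data with labels in {-1, +1}.
X = [0, 1, 2, 3, 4]
y = [1, 1, 1, -1, -1]

# A deliberately poor weak classifier: it is wrong on four of the five samples.
h = lambda x: -1 if x < 4 else 1

eps = error_rate(h, X, y)            # error rate of h
eps_c = error_rate(flip(h), X, y)    # error rate of the flipped classifier
```

On this toy data `eps` is 0.8 and `eps_c` is 0.2, so the two error rates sum to 1 as the explanation states.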
After the iterations finish, the final classifier is
$$F(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right). \tag{10}$$
This algorithm is summarized in Algorithm 1.
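The steps of Algorithm 1 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes decision stumps as the base classification algorithm, labels in {-1, +1}, and a small hypothetical 1-D dataset; the function names `train_stump`, `stump_predict`, `adaboost`, and `predict` are invented for this example.

```python
import math


def train_stump(X, y, w):
    """Return (weighted error, feature index, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        values = sorted({x[j] for x in X})
        # Candidate thresholds: one below the minimum, then midpoints between values.
        thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
        for thr in thresholds:
            for pol in (1, -1):
                err = sum(wi for x, yi, wi in zip(X, y, w)
                          if (pol if x[j] > thr else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best


def stump_predict(stump, x):
    _, j, thr, pol = stump
    return pol if x[j] > thr else -pol


def adaboost(X, y, T):
    n = len(X)
    D = [1.0 / n] * n                                  # uniform initial weights
    ensemble = []
    for _ in range(T):
        stump = train_stump(X, y, D)                   # minimize the weighted error
        eps = min(max(stump[0], 1e-12), 1 - 1e-12)     # clamp to avoid log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)        # coefficient of this stump
        D = [d * math.exp(-alpha * yi * stump_predict(stump, x))
             for d, x, yi in zip(D, X, y)]             # reweight the samples
        Z = sum(D)                                     # normalization factor
        D = [d / Z for d in D]
        ensemble.append((alpha, stump))
    return ensemble


def predict(ensemble, x):
    """Sign of the weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data that no single stump can classify perfectly.
X = [[float(v)] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(X, y, T=3)
train_error = sum(predict(ensemble, x) != yi for x, yi in zip(X, y)) / len(X)
```

On this toy set the best single stump has error 0.3, while the weighted vote of three stumps classifies every training sample correctly.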

B. Intuition of the Margin Theory
AdaBoost is one of the most influential and successful classification algorithms, and the mystery of its resistance to overfitting has attracted many researchers. Margin theory offers an intuitive explanation of this phenomenon: even after the training error reaches zero, the margin of AdaBoost keeps increasing as the iterations increase.
Schapire et al. [27] were the first to use margin theory to explain this phenomenon. Define $yf(x)$ as the margin of $(x, y)$ with respect to $f$. Let $P_D[\cdot]$ denote probability with respect to the distribution $D$ from which the examples are drawn, and let $P_S[\cdot]$ denote probability with respect to the uniform distribution over the sample $S$. They first proved the following theorem, which bounds the generalization error of every voting classifier.
Theorem 1 Let $S$ be a sample of $n$ examples chosen independently at random according to $D$. Assume that the base hypothesis space $H$ is finite, and let $\delta > 0$. Then with probability at least $1 - \delta$ over the random choice of the training set $S$, every weighted average function $f \in C$ satisfies the following bound for all $\theta > 0$:
$$P_D[yf(x) \le 0] \le P_S[yf(x) \le \theta] + O\left(\frac{1}{\sqrt{n}}\left(\frac{\log n \log |H|}{\theta^2} + \log\frac{1}{\delta}\right)^{1/2}\right). \tag{11}$$
If $H$ instead has finite VC dimension $d$, then
$$P_D[yf(x) \le 0] \le P_S[yf(x) \le \theta] + O\left(\frac{1}{\sqrt{n}}\left(\frac{d \log^2(n/d)}{\theta^2} + \log\frac{1}{\delta}\right)^{1/2}\right). \tag{12}$$
This theorem shows that if a voting classifier generates a good margin distribution, then its generalization error is also small.
They then showed that, if $\theta$ is not too large, the fraction of training examples for which $yf(x) \le \theta$ decreases to zero exponentially fast with the number of base hypotheses.
Theorem 2 Suppose the base learning algorithm, when called by AdaBoost, generates hypotheses with weighted training errors $\epsilon_1, \cdots, \epsilon_T$. Then for any $\theta$,
$$P_S[yf(x) \le \theta] \le 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{1-\theta} (1 - \epsilon_t)^{1+\theta}}. \tag{13}$$
Assuming that, for all $t$, $\epsilon_t \le 1/2 - \gamma$ for some $\gamma > 0$, the upper bound in Eq. (13) simplifies to
$$P_S[yf(x) \le \theta] \le \left(\sqrt{(1 - 2\gamma)^{1-\theta} (1 + 2\gamma)^{1+\theta}}\right)^T. \tag{14}$$
If $\theta < \gamma$, the expression inside the parentheses is smaller than 1, so the probability that $yf(x) \le \theta$ decreases exponentially fast with $T$. That is to say, as $T$ increases, AdaBoost provides a better margin distribution, which seems to explain its resistance to overfitting. This explanation is quite intuitive. Later, Breiman [28] proved a generalization bound that is tighter than Eq. (11) and designed the arc-gv algorithm, which directly maximizes the minimum margin. According to margin theory, arc-gv should perform better than AdaBoost. However, the experimental results showed that arc-gv does produce a uniformly larger minimum margin, yet its test error increases. Breiman therefore concluded that margin theory was in serious doubt.
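The behavior of the simplified bound in Theorem 2 is easy to check numerically. The sketch below evaluates the per-round factor sqrt((1 - 2γ)^(1-θ) (1 + 2γ)^(1+θ)) and its T-th power; the function names and the particular values of γ, θ, and T are illustrative assumptions.

```python
import math


def margin_bound_base(gamma, theta):
    """Per-round factor sqrt((1 - 2*gamma)^(1 - theta) * (1 + 2*gamma)^(1 + theta))."""
    return math.sqrt((1 - 2 * gamma) ** (1 - theta) * (1 + 2 * gamma) ** (1 + theta))


def margin_bound(gamma, theta, T):
    """Upper bound on the fraction of training examples with margin at most theta."""
    return margin_bound_base(gamma, theta) ** T


# Assumed edge gamma = 0.1 of every base classifier over random guessing.
base_small_theta = margin_bound_base(0.1, 0.05)   # theta < gamma: factor below 1
base_large_theta = margin_bound_base(0.1, 0.20)   # theta well above gamma: factor above 1
bound_after_1000 = margin_bound(0.1, 0.05, 1000)  # shrinks exponentially with T
```

With these assumed values the factor is slightly below 1 for θ = 0.05 and above 1 for θ = 0.2, so the bound only becomes informative for small θ and a large number of rounds.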
Several years later, Reyzin and Schapire [3] found that Breiman had not controlled model complexity well in his experiments. They repeated Breiman's experiments using decision stumps with two leaves, and the results showed that arc-gv had a larger minimum margin but a worse margin distribution. Later, Wang et al. [29] proved a bound in terms of a new margin measure called the Equilibrium margin (Emargin). The Emargin bound is uniformly sharper than Breiman's minimum margin bound. Their results suggest that the minimum margin may not be crucial for the generalization error, and that a small empirical error at the Emargin implies a smaller bound on the generalization error. Gao and Zhou [4] proposed the kth margin bound to defend margin theory against Breiman's doubt through a series of mathematical derivations. This bound is uniformly tighter than both Breiman's and Schapire's bounds and considers the same factors as Schapire et al. and Breiman. Zhou et al. [12] proposed the Large margin Distribution Machine (LDM), which achieves better generalization performance by optimizing the margin distribution. The margin distribution is characterized by a first-order statistic, the margin mean, and a second-order statistic, the margin variance; LDM maximizes the margin mean and minimizes the margin variance at the same time. This method achieved satisfactory results. However, completely explaining AdaBoost's resistance to overfitting remains difficult.
In fact, the exact relationship between the margin $yf(x)$ of AdaBoost itself and the iteration number $T$ is still not clear from the works above. That is, these works cannot directly explain why AdaBoost resists overfitting as the iteration number $T$ increases, even after the training error reaches 0. In the next section, we introduce our AdaBoost+SVM model, which gives an explicit relationship between the iteration number $T$ and the SVM margin and thereby explains AdaBoost's resistance to overfitting directly.

III. FEATURE LEARNING VIEWPOINT
In this section, we propose our AdaBoost+SVM model to explain AdaBoost's resistance to overfitting from the feature learning viewpoint, and we explain the rationale for this construction.
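Freund and Schapire [1] proved that the training error of AdaBoost decreases exponentially fast during the learning process. This yields the following theorem.
Theorem 3 The training error of AdaBoost eventually reaches 0 as the number of iterations increases.
This theorem follows from the bound
$$\frac{1}{n}\sum_{i=1}^{n} I[F(x_i) \neq y_i] \le \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)} = \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} \le \exp\left(-2 \sum_{t=1}^{T} \gamma_t^2\right), \tag{15}$$
where $\gamma_t = 0.5 - \epsilon_t$ and $T$ denotes the number of iterations.
We explained in the previous section that $\epsilon_t$ is always smaller than 0.5. Therefore $0 < \gamma_t \le 0.5$.
If there exists $\gamma$ such that $0 < \gamma \le \gamma_t$ for all $t$, then
$$\frac{1}{n}\sum_{i=1}^{n} I[F(x_i) \neq y_i] \le \exp(-2T\gamma^2). \tag{16}$$
From Eq. (16) we see that the training error of AdaBoost decreases at an exponential rate and reaches 0 once the number of iterations is large enough. Based on this theorem, we next present our model, which views AdaBoost from the feature learning viewpoint.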
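The step from an exponentially decreasing bound to a training error of exactly 0 can be made explicit with a short worked calculation. It uses the fact that the training error can only take the values 0, 1/n, 2/n, and so on; the numbers n = 862 and γ = 0.1 are assumed purely for illustration.

```latex
% The training error is a multiple of 1/n, so it must be 0 once the bound is below 1/n.
\frac{1}{n}\sum_{i=1}^{n} I[F(x_i) \neq y_i] \le \exp(-2T\gamma^2),
\qquad
\exp(-2T\gamma^2) < \frac{1}{n}
\;\Longleftrightarrow\;
T > \frac{\ln n}{2\gamma^2}.

% Illustrative values (assumed): n = 862 samples, edge \gamma = 0.1.
T > \frac{\ln 862}{2 \cdot (0.1)^2} \approx \frac{6.759}{0.02} \approx 338 .
```

Under these assumptions roughly 338 rounds would be enough to guarantee zero training error; with a smaller edge γ the required number of rounds grows as 1/γ².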

A. AdaBoost+hard margin SVM
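The boosting part $\sum_{t=1}^{T} \alpha_t h_t(x)$ of the final AdaBoost classifier given in Eq. (10) can be rewritten as
$$\sum_{t=1}^{T} \alpha_t h_t(x) = [\alpha_1, \alpha_2, \cdots, \alpha_T]\,[h_1(x), h_2(x), \cdots, h_T(x)]^{\top}. \tag{17}$$
In Eq. (17), we can regard the right-hand vector as a feature of the sample $x$ learned by AdaBoost,
$$z(x) = [h_1(x), h_2(x), \cdots, h_T(x)]^{\top}, \tag{18}$$
and $\alpha = [\alpha_1, \alpha_2, \cdots, \alpha_T]$ is then the weight vector of this feature provided by the AdaBoost algorithm. In other words, we can regard the process $x \to z(x)$ as a mapping from $R^n$ to $R^T$.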
From this viewpoint, $\alpha$ can be regarded as the normal vector of a hyperplane in the feature space $R^T$ that divides the features into two categories. However, $\alpha$ may not define the best hyperplane for classifying the features. According to Theorem 3, the training error of AdaBoost eventually reaches 0, which means that the features $z(x) \in R^T$ can be linearly separated into two categories. Moreover, for a linearly separable problem the SVM algorithm provides the separating hyperplane with the largest margin, which is an excellent solution [38]. It should therefore be a better choice to use the SVM algorithm to compute the hyperplane in the feature space $R^T$ in place of $\alpha$. This is the theoretical basis of our model. The algorithm is described next.
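Given the training sample $S$, we first obtain the feature function $z(x)$ from the AdaBoost algorithm according to Eq. (18). Then we compute the optimal weight vector $\beta$ and bias $b$ from the following hard margin SVM objective:
$$\min_{\beta, b} \frac{1}{2}\|\beta\|^2 \quad \text{s.t.} \quad y_i(\beta^{\top} z(x_i) + b) \ge 1, \; i = 1, \cdots, n. \tag{19}$$
Finally, the final classifier is
$$F(x) = \mathrm{sign}(\beta^{\top} z(x) + b). \tag{20}$$
We illustrate the feature learning view of AdaBoost in Fig. 1.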
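To make the difference between the AdaBoost weight vector and a maximum-margin hyperplane concrete, the sketch below compares the geometric margins of two hyperplanes on a tiny feature matrix whose rows play the role of z(x). Everything in it is a hypothetical illustration: the feature rows, the labels, and the vector `alpha` are assumed values, and `beta = (1, 1, 1)` with `b = 0` was derived by hand for this toy case rather than produced by an SVM solver.

```python
import math


def geometric_margin(w, b, Z, y):
    """Smallest signed distance y_i * (w . z_i + b) / ||w|| over all samples."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return min(yi * (sum(wj * zj for wj, zj in zip(w, z)) + b) / norm
               for z, yi in zip(Z, y))


# Hypothetical outputs of T = 3 base classifiers on four distinct samples.
Z = [(1, 1, -1), (-1, 1, -1), (-1, 1, 1), (-1, -1, 1)]
y = [1, -1, 1, -1]

alpha = (0.4236, 0.6496, 0.7514)    # illustrative AdaBoost coefficients
beta, b = (1.0, 1.0, 1.0), 0.0      # hand-derived hard margin solution for this toy case

margin_alpha = geometric_margin(alpha, 0.0, Z, y)
margin_beta = geometric_margin(beta, b, Z, y)
```

Both hyperplanes separate the four feature vectors, but the margin of `beta` is about 0.577 while that of `alpha` is about 0.298, which illustrates why replacing α by the SVM solution can be attractive.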

B. AdaBoost+soft margin SVM
Although the training error of AdaBoost eventually reaches 0 as the iterations grow, in practice it often does not, for example because the number of iterations is fixed or the data are large-scale. In this situation, the features $z(x) \in R^T$ may not be linearly separable, so the hard margin SVM is not suitable. To solve this problem, we replace the hard margin SVM with the soft margin SVM, i.e., we add margin violations $\xi_i$ to Eq. (19). The objective function is [39]:
$$\min_{\beta, b, \xi} \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(\beta^{\top} z(x_i) + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \cdots, n, \tag{21}$$
where $C$ controls the penalty on the margin violations $\xi_i$.
In this way, our model can be solved whether or not the training error of AdaBoost reaches 0.
The algorithm of AdaBoost+SVM is described in Algorithm 2.
Fig. 1: The feature learning view of AdaBoost (diagram labels: AdaBoost; linearly separable feature).

Algorithm 2 AdaBoost+SVM algorithm
Input: Training sample $S$; Base learning algorithm $L$; Number of base learners $T$.
1. Use AdaBoost to obtain the feature function $z(x)$ according to Eq. (18);
2. Use the SVM classifier to calculate the new coefficient $\beta$ and the bias $b$ according to Eq. (21);
3. Combine the obtained $\beta$, $b$ and the feature function to form the final classifier according to Eq. (20).
Output: The final classifier $F(x)$.
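The three steps of the AdaBoost+SVM procedure can be sketched with scikit-learn. This is a sketch under stated assumptions rather than the authors' implementation: it assumes scikit-learn and NumPy are installed, uses scikit-learn's `AdaBoostClassifier` (whose default base learner is a depth-1 decision tree) as a stand-in for Algorithm 1, and runs on a synthetic dataset; the helper `base_features` and the settings `n_estimators=50` and `C=1.0` are hypothetical choices for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def base_features(booster, X):
    """z(x) = [h_1(x), ..., h_T(x)]: one column per fitted base classifier."""
    return np.column_stack([h.predict(X) for h in booster.estimators_])


# Synthetic binary data (assumed), relabeled to {-1, +1}.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
y = 2 * y - 1
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

# Step 1: learn the base classifiers with AdaBoost and build the features.
booster = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
Z_tr, Z_te = base_features(booster, X_tr), base_features(booster, X_te)

# Step 2: fit a soft margin linear SVM on the base-classifier outputs.
svm = SVC(kernel="linear", C=1.0).fit(Z_tr, y_tr)

# Step 3: the final classifier is sign(beta . z(x) + b), evaluated on held-out data.
acc_adaboost = booster.score(X_te, y_te)
acc_adaboost_svm = svm.score(Z_te, y_te)
```

After fitting, `svm.coef_` and `svm.intercept_` hold the learned coefficient vector and bias, which can be compared with the coefficients that AdaBoost assigns to the same base classifiers.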

C. The explanation of the resistance to overfitting by our model
Overfitting is a common problem in many classification settings, yet the AdaBoost algorithm can resist it, and understanding why is a fascinating fundamental theoretical problem. Our AdaBoost+SVM model can explain this phenomenon. We use the following theorem to analyze this property from the feature learning viewpoint.
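Theorem 4 SVM is a linear classifier in the feature space. As the dimension of the features increases, the margin of SVM does not become smaller.
Proof. The objective function of SVM is
$$\min_{\beta, b} \frac{1}{2}\|\beta\|^2 \quad \text{s.t.} \quad y_i(\beta^{\top} x_i + b) \ge 1, \; i = 1, \cdots, n.$$
From it the optimal solution $\beta^*$ and $b^*$ can be calculated, and the corresponding hard margin separating hyperplane is
$$\beta^{*\top} x + b^* = 0.$$
If $x$ is extended to $(x, x_t)$, then $\beta$ correspondingly becomes $(\beta, \beta_t)$. The new optimal solution $\beta^*_{new}$ and $b^*_{new}$ can be obtained, and the corresponding hard margin separating hyperplane is
$$\beta^{*\top}_{new} (x, x_t) + b^*_{new} = 0.$$
Since $(\beta^*, 0)$ together with $b^*$ still satisfies every constraint of the extended problem, we have $\|\beta^*_{new}\| \le \|\beta^*\|$, so the margin $2/\|\beta^*_{new}\|$ is at least $2/\|\beta^*\|$. If $\beta_t = 0$, the margin stays the same. If $\beta_t \neq 0$, the margin becomes larger. In short, the margin does not become smaller when the dimension of the features increases. According to [16], the larger the margin, the higher the predictive confidence. Therefore, as the number of features increases, the classification performance will not decrease.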
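A two-point example shows both cases of the proof. The data below are hypothetical and chosen so that the hard margin solutions can be written down by hand; the margin is measured as 2/||β||.

```latex
% One feature: x_1 = -1 (y_1 = -1), x_2 = +1 (y_2 = +1).
\beta^* = 1,\quad b^* = 0,\quad \text{margin} = \frac{2}{\|\beta^*\|} = 2.

% Add an informative feature x_t with the same sign as the label:
% (x, x_t) = (-1, -1) for y = -1 and (+1, +1) for y = +1.
\beta^*_{new} = \left(\tfrac{1}{2}, \tfrac{1}{2}\right),\quad b^*_{new} = 0,\quad
\text{margin} = \frac{2}{\sqrt{1/4 + 1/4}} = 2\sqrt{2} > 2.

% Add an uninformative feature x_t = 0 for both points:
\beta^*_{new} = (1, 0),\quad b^*_{new} = 0,\quad \text{margin} = 2 .
```

In the second case the new coefficient is nonzero and the margin grows; in the third case the new coefficient is zero and the margin is unchanged, so in neither case does the margin shrink.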
Based on this theorem, we regard the outputs of the base classifiers obtained by the AdaBoost algorithm as the features of SVM. As the dimension of the features increases, the margin of the SVM classifier does not shrink and its performance does not degrade, which readily explains why AdaBoost can resist overfitting. Therefore, our AdaBoost+SVM model illustrates AdaBoost's resistance to overfitting from the feature learning viewpoint.
It should be noted that, as $T$ increases, we cannot directly conclude that the $\theta$ in Eq. (13) also increases. In our model, however, increasing $T$ never reduces the SVM margin and is therefore helpful for performance, which explains AdaBoost's resistance to overfitting directly and easily.

IV. EXPERIMENT
In this section, we conduct experiments on four binary benchmark datasets to demonstrate the efficiency and effectiveness of the proposed method. We then analyze the experimental results in detail.

A. Datasets
We use the following four binary datasets to evaluate the performance of our model.
1) fourclass: This dataset has 862 samples and 2 dimensions.
2) ionosphere: This is a UCI dataset with 351 samples and 34 features.
3) chess: This is also a UCI dataset, with 3196 samples and 36 features.
4) monk1: This is also a UCI dataset, with 432 samples and 6 features.
The detailed descriptions of all datasets are also listed in Table I.

B. Comparison Methods
To demonstrate the effectiveness of the proposed approach, we compare it with the classical AdaBoost algorithm. For all methods, we run the experiment 5 times and report the average classification accuracy.
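Averaging accuracy over several class-balanced train/test splits, as done in the experiments of this section, can be sketched as follows. The helper names `stratified_folds` and `cross_val_accuracy`, the `fit` and `predict` callables, and the default seed are assumptions made for this illustration; the sketch also assumes every class has at least `k` samples so that no fold is empty.

```python
import random
from collections import defaultdict


def stratified_folds(y, k, seed=0):
    """Split sample indices into k disjoint folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)   # deal the indices of each class round-robin
    return folds


def cross_val_accuracy(fit, predict, X, y, k=10, seed=0):
    """Mean test accuracy over k folds; each fold serves once as the test set."""
    folds = stratified_folds(y, k, seed)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k
```

Any pair of training and prediction functions can be passed as `fit` and `predict`, so the same routine can evaluate both AdaBoost and AdaBoost+SVM on identical splits.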

C. The effect of the number of weak learners
In most AdaBoost algorithms, the number of weak learners is set empirically. We therefore conduct experiments on the above four binary datasets to observe the effect of the number of weak learners. We adopt 10-fold cross validation [40], in which 90% of the labeled samples in each class are used as training data to construct the classifier and the remaining 10% are used as testing data to evaluate its performance. First, each dataset $S$ is divided into 10 mutually exclusive subsets of similar size, that is, $S = S_1 \cup S_2 \cup \cdots \cup S_{10}$ with $S_i \cap S_j = \emptyset$ $(i \neq j)$, and each subset keeps the data distribution as consistent as possible. Then the union of 9 subsets is used as the training set and the remaining subset as the test set. In this way, 10 training and testing procedures can be conducted, and finally we calculate the mean of the 10 testing results. The number of weak learners varies from 100 to 1000, and the corresponding classification accuracy (%) and time cost (s) on the four datasets are recorded in Table II, Table III, Table IV and Table V. From Table II to Table V, we make the following observations. First, on the fourclass and chess datasets, the classification accuracies calculated by the two models are exactly equal. On the ionosphere dataset, the classification accuracies obtained by the two models differ slightly; however, the gap between the results of AdaBoost and AdaBoost+SVM is not very large. Second, the time costs of AdaBoost and AdaBoost+SVM differ only a little. The reason is that SVM can quickly handle classification problems. Therefore, we can conclude that our AdaBoost+SVM is very close to AdaBoost, which can be regarded as a new explanation of the AdaBoost algorithm.

D. The effect of the number of training data
Next, we conduct experiments on the above datasets to observe the effect of the amount of training data. For all datasets, we fix the number of weak learners to 200. We vary the percentage of labeled samples in each class used as training data from 10% to 90%, and use the remaining samples as testing data. The results are shown in Fig. 2, Fig. 3, Fig. 4 and Fig. 5. As a general trend, both the classification accuracy and the time cost increase with the number of training samples on all datasets; that is, more training data improves accuracy but also takes more time to train the classifier. In terms of classification accuracy and time cost on all datasets, the results of AdaBoost and AdaBoost+SVM remain close, which again verifies that the performance of the two is comparable.

V. CONCLUSION
In this paper, we have presented an AdaBoost+SVM model from the feature learning viewpoint to explain why AdaBoost can resist overfitting. Instead of directly forming a weighted combination of the base classifiers computed by AdaBoost, we regard their outputs as new features and feed them to the SVM classifier. More iterations mean a higher feature dimension, so the margin of the SVM does not decrease and its performance does not degrade, which explains AdaBoost's resistance to overfitting in a simple way. The results on four binary datasets show that AdaBoost+SVM produces results comparable to those of the AdaBoost algorithm, which supports the rationale for understanding AdaBoost from the feature learning viewpoint.

Fig. 2: Classification accuracy (%) and time costs vs the percent of labeled samples on fourclass dataset.

Fig. 3: Classification accuracy (%) and time costs vs the percent of labeled samples on ionosphere dataset.

Fig. 4: Classification accuracy (%) and time costs vs the percent of labeled samples on chess dataset.

Fig. 5: Classification accuracy (%) and time costs vs the percent of labeled samples on monk1 dataset.

TABLE I: Dataset Description

TABLE II

TABLE III

TABLE V: Classification accuracy (%) and time cost (s) vs the number of weak learners on monk1 dataset