Optimal Margin Distribution Additive Machine

In recent years, sparse additive machines have attracted increasing attention in high-dimensional classification due to their flexibility and representation interpretability. However, most existing methods are formulated under Tikhonov regularization schemes associated with the hinge loss, where the distribution information of the observations is usually neglected. To circumvent this problem, we propose an optimal margin distribution additive machine (called ODAM) by incorporating the optimal margin distribution strategy into sparse additive models. The proposed approach can be implemented by a dual coordinate descent algorithm, and its empirical effectiveness is confirmed on simulated and benchmark datasets.


I. INTRODUCTION
Sparse additive machines (SAMs), rooted in generalized additive models [14], [25], have shown promising performance for classification [5], [8], [33]. Following the popular sparse additive model (SpAM) [20] for regression, the sparse additive machine was proposed in [33] for binary classification. Moreover, the group sparse additive machine (GroupSAM) was formulated in [5] based on kernel-based hypothesis spaces and grouped variables, and can be implemented via the proximal gradient descent algorithm [19], [33]. Notice that the above additive classification models are formulated under Tikhonov regularization schemes associated with the hinge loss, which is closely related to the large margin strategy for binary classification [10], [23], [28].
In the machine learning literature, several classification models have been developed from the perspective of the margin distribution. [22] suggests that margin theory may explain the phenomenon that AdaBoost can be resistant to overfitting, and [3] points out the importance of the minimum margin and proposes a corresponding algorithm that maximizes it. Later, [21] conjectures that the margin distribution has a more important influence on generalization performance, which is verified empirically in [11]. As shown in [11], [35], both the margin mean and the margin variance are crucial for characterizing the margin distribution. With respect to the support vector machine (SVM), the optimal margin distribution machine (ODM) is proposed in [26], [36] and has shown competitive performance. Since these previous works focus mainly on linear classification models or kernel-based classifiers, it is natural and important to further incorporate the margin distribution strategy into additive models.
Inspired by the recent works [26] and [33], we propose a new classification algorithm, called the optimal margin distribution additive machine (ODAM), which integrates the optimal margin distribution strategy, data-dependent hypothesis spaces, and sparse additive models. First, the sparse additive machine is used to construct the framework of the model. By the representer theorem of kernel methods, the model can be transformed into a concise expression, which guarantees its interpretability and flexibility. Then, following the strategy of [26], ODAM searches for the optimal model from the perspective of the margin distribution associated with the SAM, which assures strong generalization. Finally, the proposed model can be optimized by a dual coordinate descent method. To support the motivation of the algorithmic design, we evaluate ODAM empirically on simulated and benchmark data. Our main contributions can be summarized as follows.
• A new sparse additive machine, called ODAM, is proposed in this paper. The proposed ODAM seeks the decision rule according to the margin distribution, rather than from the perspective of the loss function alone. To the best of our knowledge, the optimal margin distribution strategy has not previously been investigated in additive models.
• The proposed ODAM shows empirical effectiveness for classification on both simulated and real-world datasets. Besides, it enjoys stability and interpretability of the classification results.
We organize the rest of this paper as follows. After recalling the preliminaries of binary classification and the sparse additive machine in Section II, we formulate the optimal margin distribution additive machine and its optimization algorithm in Sections III and IV, respectively. Then an empirical evaluation is presented in Section V. Finally, we conclude this paper in Section VI.

II. PRELIMINARIES
In this section, we recall the preliminaries of additive models for binary classification.

A. BINARY CLASSIFICATION PROBLEM
We denote by X ⊂ R^p a compact input space and by Y = {−1, 1} the corresponding output set. Let ρ be an unknown distribution over Z = X × Y that generates the input-output pair (x, y). Given a set of samples z = {(x_i, y_i)}_{i=1}^n drawn independently from the unknown ρ on Z, the purpose is to search for a classifier sgn(f) induced by a decision function f : X → R that minimizes the misclassification risk

∫_Z I(sgn(f)(x) ≠ y) dρ(x, y),

where I(A) = 1 if A is true and 0 otherwise.
The most popular surrogate for I(sgn(f)(x) ≠ y) is the hinge loss

ℓ(yf(x)) = (1 − yf(x))_+ = max{0, 1 − yf(x)},

due to its excellent properties with respect to the classification margin and support vectors [10], [23], [28]. Associated with the hinge loss, the expected risk and the empirical risk can be denoted as

E(f) = ∫_Z (1 − yf(x))_+ dρ(x, y)

and

E_z(f) = (1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+.

For support vector machines [29], the margin of an instance (x_i, y_i) is defined as

γ_i = y_i f(x_i).    (1)
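The hinge loss, empirical risk, and margin above can be sketched in a few lines of NumPy (the function names here are illustrative, not from the paper):

```python
import numpy as np

def hinge_loss(y, fx):
    """Hinge loss (1 - y f(x))_+ = max{0, 1 - y f(x)}, elementwise."""
    return np.maximum(0.0, 1.0 - y * fx)

def empirical_risk(y, fx):
    """Empirical hinge risk (1/n) sum_i (1 - y_i f(x_i))_+."""
    return hinge_loss(y, fx).mean()

def margins(y, fx):
    """Margins gamma_i = y_i f(x_i), as in (1)."""
    return y * fx

y = np.array([1.0, -1.0, 1.0])
fx = np.array([0.8, -2.0, -0.5])
print(hinge_loss(y, fx))  # [0.2 0.  1.5]
print(margins(y, fx))     # [ 0.8  2.  -0.5]
```

Note that a correctly classified example still incurs hinge loss whenever its margin is below 1, which is what drives the large-margin behavior.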

B. SPARSE ADDITIVE MACHINE
As nonlinear extensions of linear models (e.g., the Lasso [27] and the ℓ1-norm SVM [2], [34]), sparse additive models have shown much flexibility and adaptivity for prediction and variable selection. In theory, the additive structure of the hypothesis space is crucial to curing ''the curse of dimensionality'' in nonparametric learning [13], [17], [24]. For the binary classification problem, additive models under the Tikhonov regularization scheme have also shown promising performance; see, e.g., [5], [8], [33].

Now, we recall the kernel additive models in [5], [8]. For each coordinate j, let K_j be a Mercer kernel depending only on the j-th input variable, and denote by H_{K_j} the corresponding reproducing kernel Hilbert space (RKHS) [1] with norm ‖·‖_{K_j}. The additive hypothesis space is also an RKHS, with additive kernel K = Σ_{j=1}^p K_j [9], [18], [32]. The support vector machine with additive kernel K = Σ_{j=1}^p K_j in [8] is defined as

f̃_z = arg min_{f ∈ H_K} { (1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+ + η ‖f‖_K² },    (2)

where η > 0 is a trade-off parameter. According to the representation properties of (2), f̃_z always belongs to the following data-dependent hypothesis space

H_z = { f(x) = Σ_{i=1}^n Σ_{j=1}^p α_{ij} K_j(x_i, x) : α_{ij} ∈ R },    (3)

where the α_{ij} are expansion coefficients. It is natural to consider a coefficient-based penalty (see, e.g., [6], [7], [12]) defined as

Ω(f) = Σ_{i=1}^n Σ_{j=1}^p |α_{ij}|.

Then, the sparse additive model in [5] is formulated as follows:

f_z = arg min_{f ∈ H_z} { (1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+ + λ Ω(f) },

where λ > 0 is a regularization parameter.
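To make the additive kernel K = Σ_{j=1}^p K_j concrete, the following sketch builds its Gram matrix when each K_j is a univariate Gaussian kernel acting on coordinate j only (a common choice in this literature; the paper's own components K_j are not fixed here):

```python
import numpy as np

def additive_gaussian_gram(X, sigma=1.0):
    """Gram matrix of the additive kernel K = sum_j K_j, where each K_j is a
    one-dimensional Gaussian kernel on coordinate j.
    X: (n, p) data matrix. Returns an (n, n) Gram matrix."""
    n, p = X.shape
    K = np.zeros((n, n))
    for j in range(p):
        # pairwise differences on coordinate j only
        d = X[:, j:j+1] - X[:, j:j+1].T
        K += np.exp(-d**2 / (2.0 * sigma**2))  # add the K_j contribution
    return K

X = np.random.default_rng(0).uniform(size=(5, 3))
K = additive_gaussian_gram(X)
print(K.shape)  # (5, 5)
```

Since each K_j is a Mercer kernel, the sum K is again symmetric positive semidefinite, so H_K is a valid RKHS; note the diagonal of this Gram matrix equals p, the number of additive components.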

III. OPTIMAL MARGIN DISTRIBUTION ADDITIVE MACHINE
This section formulates the optimal margin distribution additive machine (ODAM). The first-order and second-order statistics, i.e., the mean and the variance of the margin, are the most straightforward metrics for characterizing the margin distribution. Denote y := [y_1, . . . , y_n]^T, and let Y be the n × n diagonal matrix with y_1, . . . , y_n as its diagonal elements. Collect the expansion coefficients of f ∈ H_z into a vector α ∈ R^{np}, and let X ∈ R^{np×n} be the matrix such that f(x_i) = (X^T α)_i for i = 1, . . . , n. According to the definitions in (1) and (3), the margin vector of the sparse additive machine is

γ = Y X^T α.

Then, the margin mean is

γ̄ = (1/n) y^T X^T α,    (4)

and the margin variance is

γ̂ = (1/n) Σ_{i=1}^n (γ_i − γ̄)² = (1/n²) α^T X (n I_n − y y^T) X^T α,    (5)

where I_n is the n-order identity matrix. Following the large margin strategy of SVM in [4], the maximization of the minimum margin in the additive model reads

max_α ι / ‖α‖ s.t. y_i f(x_i) ≥ ι, i = 1, . . . , n,

where ι denotes the minimum margin of the additive model. Since the scale of ι does not influence the optimization, we can simply set ι = 1. Besides, maximizing 1/‖α‖ is equivalent to minimizing ‖α‖²/2, so we get the following optimization:

min_α (1/2) ‖α‖² s.t. y_i f(x_i) ≥ 1, i = 1, . . . , n.    (6)

If the training examples cannot be separated with zero error, (6) can be reformulated as below:

min_{α,ξ} (1/2) ‖α‖² + C Σ_{i=1}^n ξ_i s.t. y_i f(x_i) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, . . . , n,    (7)

where ξ_i is a slack variable and C is the slack parameter. First, we consider the separable case, where the training examples can be separated with zero error. Minimizing the margin variance and maximizing the margin mean lead to the following hard-margin ODAM:

min_α (1/2) ‖α‖² + λ_1 γ̂ − λ_2 γ̄ s.t. y_i f(x_i) ≥ 1, i = 1, . . . , n,    (8)

where λ_1 and λ_2 are trade-off parameters that balance the margin mean, the margin variance, and the model complexity.
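The margin statistics (4) and (5) are easy to check numerically: the sample variance of the margins γ_i = y_i (X^T α)_i coincides with the quadratic form α^T X (n I_n − y y^T) X^T α / n². The sketch below verifies this identity on random data (matrix shapes are illustrative):

```python
import numpy as np

def margin_stats(alpha, Xmat, y):
    """Margin mean and variance for f(x_i) = (Xmat^T alpha)_i,
    i.e., gamma_i = y_i (Xmat^T alpha)_i as in (4) and (5)."""
    gamma = y * (Xmat.T @ alpha)        # margin vector gamma = Y X^T alpha
    mean = gamma.mean()
    var = ((gamma - mean) ** 2).mean()  # (1/n) sum_i (gamma_i - mean)^2
    return mean, var

rng = np.random.default_rng(1)
n, d = 6, 4
Xmat = rng.normal(size=(d, n))
alpha = rng.normal(size=d)
y = np.where(rng.normal(size=n) > 0, 1.0, -1.0)

mean, var = margin_stats(alpha, Xmat, y)
# quadratic-form identity from (5): var == alpha^T X (n I - y y^T) X^T alpha / n^2
quad = alpha @ Xmat @ (n * np.eye(n) - np.outer(y, y)) @ Xmat.T @ alpha / n**2
print(np.isclose(var, quad))  # True
```

The identity holds because y_i² = 1, so Σ_i γ_i² = α^T X X^T α while n γ̄² = (y^T X^T α)²/n; subtracting gives exactly the quadratic form in (5).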
In the non-separable case, similar to the soft-margin SVM, the soft-margin ODAM leads to

min_{α,ξ} (1/2) ‖α‖² + λ_1 γ̂ − λ_2 γ̄ + C Σ_{i=1}^n ξ_i s.t. y_i f(x_i) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, . . . , n.    (9)

Because (8) is a special case of (9), we will focus on the soft-margin ODAM. Unless otherwise stated, ODAM refers to the soft-margin ODAM.

IV. COMPUTING ALGORITHM
In this section, we present the optimization algorithm for ODAM. The dual of ODAM is a convex quadratic optimization problem with simple constraints, which can be solved via a dual coordinate descent method [26], [30]. By substituting (4) and (5) into (9), we get the following quadratic programming problem:

min_{α,ξ} (1/2) α^T ω α − (λ_2/n) y^T X^T α + C e^T ξ s.t. Y X^T α ≥ e − ξ, ξ ≥ 0,    (10)

where e ∈ R^n stands for the all-one vector and ω = I_{np} + X [2λ_1 (n I_n − y y^T)/n²] X^T. We introduce the Lagrange multipliers η ≥ 0 and β ≥ 0 for the first and second constraints, respectively, to obtain the Lagrangian function of (10):

L(α, ξ, η, β) = (1/2) α^T ω α − (λ_2/n) y^T X^T α + C e^T ξ − η^T (Y X^T α − e + ξ) − β^T ξ.    (11)

To solve this problem, we set the partial derivatives with respect to {α, ξ} to 0, which gives

α = ω^{−1} X Y (η + (λ_2/n) e),    (12)

C e − η − β = 0, i.e., 0 ≤ η ≤ C e.    (13)

Substituting (12) and (13) into (11), we get the dual of (10):

max_{0 ≤ η ≤ Ce} −(1/2) (η + (λ_2/n) e)^T Y X^T ω^{−1} X Y (η + (λ_2/n) e) + e^T η.    (14)

According to Lemma 1 in [36], with A = 2λ_1 (n I_n − y y^T)/n² and G = X^T X, we have ω^{−1} X = X (I_n + A G)^{−1}. We denote H = Y G (I_n + A G)^{−1} Y; then the objective function of (14) can be cast as

−(1/2) η^T H η − (λ_2/n) e^T H η + e^T η + const.

Because the constant term does not influence the optimization, we neglect it. Then, the final formulation of ODAM is described as below:

min_{0 ≤ η ≤ Ce} f(η) = (1/2) η^T H η + ((λ_2/n) H e − e)^T η.    (15)

According to (12) and ω^{−1} X = X (I_n + A G)^{−1}, we can obtain the coefficients α from the optimal η* as

α = X (I_n + A G)^{−1} Y (η* + (λ_2/n) e).

For each input x, we predict its output as sgn(f_z(x)) with f_z(x) = Σ_{i=1}^n Σ_{j=1}^p α_{ij} K_j(x_i, x). As shown in [31], a convex quadratic objective function with simple decoupled box constraints can be efficiently solved by the dual coordinate descent method, and the dual (15) of ODAM is exactly of this form. Following [15], at each iteration one of the variables is selected to minimize the objective while the other variables are kept constant:

min_t f(η + t e_i) s.t. 0 ≤ η_i + t ≤ C,

where e_i denotes the vector with 1 in the i-th coordinate and 0 elsewhere. Let H = [h_{ij}]_{i,j=1,...,n}; we have

f(η + t e_i) = f(η) + [∇f(η)]_i t + (1/2) h_{ii} t²,    (16)

where [∇f(η)]_i = [H η]_i + (λ_2/n) [H e]_i − 1 is the i-th component of the gradient ∇f(η). From (16), we know that f(η + t e_i) is a simple quadratic function of t, so we get a closed-form solution:

t* = min{ max{ −[∇f(η)]_i / h_{ii}, −η_i }, C − η_i }.    (17)

We summarize the optimization algorithm of ODAM as below.
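The coordinate update (17) yields a compact solver for any box-constrained quadratic program of the form (15). The sketch below is a minimal, generic implementation (cyclic variable selection for simplicity; the cited works use more refined selection rules), applied to a random positive-definite toy problem rather than the ODAM matrix H itself:

```python
import numpy as np

def dual_coordinate_descent(H, q, C, n_iters=200):
    """Minimize f(eta) = 0.5 eta^T H eta + q^T eta subject to 0 <= eta <= C,
    using the closed-form single-coordinate update t* from (17)."""
    n = H.shape[0]
    eta = np.zeros(n)
    grad = q.copy()                 # gradient H eta + q at eta = 0
    for _ in range(n_iters):
        for i in range(n):
            if H[i, i] <= 0:
                continue            # no curvature along this coordinate
            t = -grad[i] / H[i, i]  # unconstrained minimizer along e_i
            t = np.clip(t, -eta[i], C - eta[i])  # project step onto the box
            if t != 0.0:
                eta[i] += t
                grad += t * H[:, i]  # cheap rank-one gradient maintenance
    return eta

# sanity check on a small positive-definite problem
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
H = M @ M.T + 1e-3 * np.eye(4)      # symmetric positive definite
q = rng.normal(size=4)
eta = dual_coordinate_descent(H, q, C=1.0)
print(eta.min() >= -1e-12 and eta.max() <= 1.0 + 1e-12)  # True
```

Each sweep costs O(n²) via the incremental gradient update, and every coordinate step decreases the objective, so starting from η = 0 the final objective is never worse than f(0) = 0.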

V. EXPERIMENTAL ANALYSIS
In this section, we assess the empirical performance of our approach on simulated and real-world datasets, where the ℓ1-SVM [34], SAM [33], and ODM [36] are employed as baselines. In Section V-A, we introduce the experimental setting. We then conduct the simulated and real-world data experiments in Sections V-B and V-C, respectively, and give related analyses of classification accuracy and interpretability.

A. EXPERIMENTAL SETUP
In all experiments, the RKHS H_K with the Gaussian kernel K(x, x') = exp(−‖x − x'‖²/(2σ²)) is employed as the hypothesis function space. The learning performance is measured by the average classification accuracy with standard deviation. All parameter selections are performed on the training set, as follows. For ODM and ODAM, the regularization parameter C is selected by 5-fold cross-validation from {0.1, 0.5, 1, 5, 10, 50}, and λ_1 and λ_2 are selected from {10^−6, 10^−5, . . . , 10^−1}. In addition, the kernel bandwidth σ is chosen from {1 + 0.1 i : i = 0, 1, . . . , 100}.
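The parameter grids above can be written down directly; the snippet below builds them and the Gaussian kernel (the cross-validation loop itself is omitted, and the function name is illustrative):

```python
import itertools
import numpy as np

def gaussian_kernel(x, xp, sigma):
    """Gaussian kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

# search grids described in the setup above
C_grid = [0.1, 0.5, 1, 5, 10, 50]
lam_grid = [10.0 ** k for k in range(-6, 0)]    # 1e-6, ..., 1e-1
sigma_grid = [1 + 0.1 * i for i in range(101)]  # 1.0, 1.1, ..., 11.0

# all (C, lambda_1, lambda_2) candidates to score by 5-fold CV
n_triples = len(list(itertools.product(C_grid, lam_grid, lam_grid)))
print(n_triples)  # 216
```

Scanning 216 regularization triples per bandwidth is modest, but combined with 101 bandwidth values the full grid is large, which is why the bandwidth is typically tuned separately.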

B. SIMULATED DATA
To illustrate the performance of our method, we consider two synthetic datasets, following the experimental designs in [5], [33].
Example 1: We generate each p-dimensional input x_t = (x_t1, . . . , x_tp)^T by x_tj = (W_tj + V_t)/2, where both W_tj and V_t are drawn from the uniform distribution U(0, 1). A fixed function f is selected as the discriminant function in the first example, and the label is assigned by y_t = sgn(f)(x_t).

Example 2: The second example follows the data-generation procedure of Example 1; the difference lies in the choice of discriminant function.

For each evaluation, we consider training sets with sizes n = 100, 200, 400 and dimensions p = 100, 200, 400, where the training and test sets are generated with identical sample sizes. To ensure reliable results, each evaluation is repeated 50 times. The results are displayed in Table 1. To illustrate the stability of the proposed method, we applied the approach in [16] to conduct another experiment on Example 2 with sample size n = 200, dimension p = 100, and label-flipping rates r = 0.0, 0.025, 0.05, . . . , 0.3. The experimental results are shown in Figures 1 and 2. Finally, we report the coefficient-based norm of ODAM associated with each input variable, ‖α_j‖_1 = Σ_{i=1}^n |α_{ij}|, and the corresponding results are shown in Figure 3.

From Table 1, we can see that the proposed method achieves higher classification accuracy than the compared approaches on both examples; on the more complex second example, its advantage is more pronounced. As shown in Figures 1 and 2, ODAM also remains stable under different flipping rates while maintaining promising classification accuracy. Figure 3 illustrates that the coefficient-based norms associated with the true variables are larger than those of the other variables, which implies that the true variables can be screened out with a suitable threshold.
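The generation scheme of Example 1 (with the optional label flipping used in the stability experiment) can be sketched as follows. The paper's actual discriminant functions are not reproduced here, so `f` is a hypothetical placeholder passed in by the caller:

```python
import numpy as np

def make_example1(n, p, f, flip_rate=0.0, seed=0):
    """Example 1 data: x_tj = (W_tj + V_t) / 2 with W_tj, V_t ~ U(0, 1).
    `f` is a placeholder discriminant function; labels y_t = sgn(f(x_t)),
    optionally flipped at rate `flip_rate` for the robustness experiment."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(size=(n, p))
    V = rng.uniform(size=(n, 1))          # shared across coordinates of x_t
    X = (W + V) / 2.0
    y = np.sign(np.apply_along_axis(f, 1, X))
    y[y == 0] = 1                          # break ties consistently
    flip = rng.uniform(size=n) < flip_rate # random label flipping
    y[flip] *= -1
    return X, y

# toy discriminant function, only for illustration
X, y = make_example1(100, 10, lambda x: x[0] + x[1] - 1.0)
print(X.shape)  # (100, 10)
```

Sharing V_t across coordinates induces correlation between the input variables, which makes variable selection harder than with fully independent features.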

C. REAL-WORLD DATA
We select 12 benchmark datasets from the UCI repository (http://archive.ics.uci.edu/ml) to illustrate the classification performance of ODAM. Summary information for these datasets is given in Table 2. For each dataset, we randomly select 50% of the samples as the test set and use the rest as the training set. All features are normalized to [0, 1], and each experiment is repeated 25 times. The main metric is classification accuracy with standard deviation, and the average results are shown in Table 3. As shown in Table 3, the proposed ODAM performs better on most benchmark datasets and trails by only extremely small gaps on the others. In addition, the small standard deviations on most datasets illustrate the stability of the new method. All these outcomes verify the effectiveness of our method.
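The preprocessing protocol above (min-max normalization to [0, 1] and a random 50/50 split) can be sketched as follows; the helper names are illustrative:

```python
import numpy as np

def minmax_normalize(X):
    """Scale each feature to [0, 1], guarding against constant columns."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (X - lo) / span

def split_half(X, y, seed=0):
    """Random 50/50 train/test split, as in the protocol above."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    train = (X[idx[half:]], y[idx[half:]])
    test = (X[idx[:half]], y[idx[:half]])
    return train, test

X = np.random.default_rng(2).normal(size=(10, 3)) * 5
Xn = minmax_normalize(X)
print(Xn.min() >= 0.0 and Xn.max() <= 1.0)  # True
```

In a faithful replication, the normalization statistics should be computed on the training half only and then applied to the test half, to avoid leaking test information into the model.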

VI. CONCLUSION
This paper proposes a new optimal margin distribution additive machine (ODAM) in an RKHS by integrating the additive machine with the optimal margin distribution strategy. With the help of an optimization scheme based on the dual coordinate descent algorithm, we verified the effectiveness of ODAM on simulated and benchmark datasets.

ACKNOWLEDGMENT
(Changying Guo and Hao Deng contributed equally to this work.)