An Efficient v-Minimum Absolute Deviation Distribution Regression Machine

Support Vector Regression (SVR) and its variants are widely used regression algorithms that have demonstrated high generalization ability. This research proposes a new SVR-based regressor: the v-minimum absolute deviation distribution regression (v-MADR) machine. Instead of merely minimizing structural risk, as v-SVR does, v-MADR aims to achieve better generalization performance by minimizing both the absolute regression deviation mean and the absolute regression deviation variance, which takes into account both the positive and negative values of the regression deviation of sample points. For optimization, we propose a dual coordinate descent (DCD) algorithm for small-sample problems and an averaged stochastic gradient descent (ASGD) algorithm for large-scale problems. Furthermore, we study the statistical property of v-MADR that leads to a bound on the expectation of error. Experimental results on both artificial and real datasets indicate that our v-MADR achieves significantly better generalization performance with less training time than the widely used v-SVR, LS-SVR, ε-TSVR, and linear ε-SVR. Finally, we open-source the code of v-MADR at https://github.com/AsunaYY/v-MADR for wider dissemination.


I. INTRODUCTION
Support vector regression (SVR) [1]-[3] has been widely used in machine learning, since it implements the structural risk minimization principle. SVR realizes regression mainly by constructing linear decision functions in a high-dimensional feature space. Compared with other regression methods, such as least squares regression [4], neural network (NN) regression [5], logistic regression [6], and ridge regression [7], SVR has better generalization ability for regression problems [8]-[10]. In recent years, there have been many studies on SVR-based algorithms. Several SVR approaches have been developed, such as ε-support vector regression (ε-SVR) [1], [11], v-support vector regression (v-SVR) [12], and least squares support vector regression (LS-SVR) [13], [14]. The basic idea of these methods is to find the decision function by maximizing the margin between two parallel hyperplanes. Different from ε-SVR, v-SVR introduces another parameter, v, to control the number of support vectors and to adjust the parameter ε automatically. The parameter v takes values in (0, 1]. When solving the quadratic programming problem (QPP), v-SVR reduces the number of computational parameters by half, which greatly reduces the computational complexity. Besides, some researchers have proposed non-parallel-plane regressors, such as twin support vector regression (TSVR) [15], ε-twin support vector regression (ε-TSVR) [16], parametric-insensitive non-parallel support vector regression (PINSVR) [17], Lagrangian support vector regression [18], and Lagrangian twin support vector regression (LTSVR) [19]. These algorithms demonstrate good ability to capture data structure and boundary information.

Support vector (SV) theory indicates that maximizing the minimum margin is not the only way to construct the separating hyperplane for SVM. Zhang and Zhou [20], Zhou [21], Zhang and Zhou [22], and Gao and Zhou [23] proposed the large margin distribution machine (LDM), which is designed to maximize the margin mean and minimize the margin variance simultaneously. Gao and Zhou [23] proved that the margin distribution, including the margin mean and the margin variance, is more crucial for generalization than a single margin, and optimizing the margin distribution can also naturally accommodate class imbalance and unequal misclassification costs [21]. Inspired by the idea of LDM, Liu et al. proposed minimum deviation distribution regression (MDR) [24], which introduces statistics of the regression deviation into ε-SVR. More specifically, MDR minimizes the regression deviation mean and the regression deviation variance while optimizing the minimum margin. Also inspired by LDM, Reshma and Pritam proposed a large-margin distribution machine-based regression model (LDMR) together with a new loss function [25], [26]. However, the definition of the deviation mean in MDR is not well suited to samples that lie on either side of the regressor, and the speed of the ε-SVR strategy that MDR uses can be further improved.
Considering the above advances in SVR, in this research we introduce this statistical information into v-SVR and propose a v-minimum absolute deviation distribution regression (v-MADR) machine. We give a definition of the regression deviation mean that takes into account both the positive and negative values of the regression deviation of sample points. Inspired by recent theoretical results [20]-[24], v-MADR simultaneously minimizes the absolute regression deviation mean and the absolute regression deviation variance based on the v-SVR strategy, thereby greatly improving the generalization performance [21], [23]. To solve the optimization problem, we propose a dual coordinate descent (DCD) algorithm for small-sample problems and an averaged stochastic gradient descent (ASGD) algorithm for large-scale problems. Furthermore, the bound on the error expectation of v-MADR is studied. The performance of v-MADR is assessed on both artificial and real datasets in comparison with other typical regression algorithms, such as v-SVR, LS-SVR, ε-TSVR, and linear ε-SVR. As previous research has shown that SVR-based algorithms have better generalization ability for regression problems [8]-[10], our experimental results demonstrate that the proposed v-MADR can lead to better performance than these algorithms for regression problems. The main contributions of this paper are as follows:
1) We propose a new regression algorithm that minimizes both the absolute regression deviation mean and the absolute regression deviation variance, taking into account the positive and negative values of the regression deviation of sample points.
2) We propose two optimization algorithms, i.e., the dual coordinate descent (DCD) algorithm for small-sample problems and the averaged stochastic gradient descent (ASGD) algorithm for large-scale problems.
3) We theoretically prove an upper bound on the generalization error of v-MADR and analyze the computational complexity of our optimization algorithms.
As SVR-based algorithms are widely used for regression problems, v-MADR has great application potential.
The rest of this paper is organized as follows. Section II introduces the notation used in this paper and presents a brief review of SVR as well as recent progress in SV theory. Section III introduces the proposed v-MADR, including the kernel version and the bound on the expectation of error. Experimental results are reported in Section IV, and finally, conclusions are drawn in Section V.

II. BACKGROUND
Suppose D = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)} is a training set of n samples, where x_i ∈ χ is an input sample in the form of a d-dimensional vector and y_i ∈ R is the corresponding target value. The objective function is

f(x) = w^T φ(x) + b,

where φ(·) maps x into a (possibly high-dimensional) feature space. To remove the complexity brought by the bias b, we enlarge the dimensions of w and φ(x_i) as in [27], i.e.,

w ← [w^T, b]^T,   φ(x_i) ← [φ(x_i)^T, 1]^T.

Thus, the function f(x) = w^T φ(x) + b becomes

f(x) = w^T φ(x).

In what follows, we only consider problems in this form.
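As a concrete illustration of this bias-absorption step, the following small Python/NumPy sketch (our illustration, not part of the released MATLAB code) appends a constant feature to every sample so that b is folded into w:

import numpy as np

def augment(X):
    # Append a constant 1 to every sample so that f(x) = w^T x absorbs the bias b.
    return np.hstack([X, np.ones((X.shape[0], 1))])

X = np.random.randn(5, 3)          # 5 toy samples with 3 features
X_aug = augment(X)                 # shape (5, 4); the last column is all ones
w_aug = np.zeros(X_aug.shape[1])   # w_aug[-1] plays the role of the bias b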

A. THE SVR ALGORITHMS
There are two traditional formulations of support vector regression (SVR), namely ε-SVR [1], [11] and v-SVR [12]. In order to find the best fitting surface, ε-SVR maximizes the minimum margin containing the data in the so-called ε-tube, in which the distances of the data to the fitting hyperplane are not larger than ε. The soft-margin ε-SVR can therefore be expressed as

min_{w, ξ, ξ*}  (1/2)‖w‖² + C e^T (ξ + ξ*)
s.t.  w^T φ(x_i) − y_i ≤ ε + ξ_i,
      y_i − w^T φ(x_i) ≤ ε + ξ_i*,
      ξ_i ≥ 0, ξ_i* ≥ 0,  i = 1, . . . , n,

where the parameter C controls the tradeoff between the flatness of f(x) and the tolerance of deviations larger than ε; ξ = [ξ_1, ξ_2, . . . , ξ_n]^T and ξ* = [ξ_1*, ξ_2*, . . . , ξ_n*]^T are the slack variables measuring how far the training samples outside the ε-tube lie from the tube itself; and e stands for the all-ones vector of appropriate dimension. The dual problem of ε-SVR is formulated as

min_{α, α*}  (1/2)(α − α*)^T K (α − α*) + ε e^T (α + α*) − y^T (α − α*)
s.t.  0 ≤ α_i, α_i* ≤ C,  i = 1, . . . , n,        (1)

where α and α* are the Lagrange multipliers and K is the kernel matrix with K_ij = φ(x_i)^T φ(x_j). In order to facilitate the calculation, Formula (1) can be rewritten as an equivalent QPP in the stacked variable [α; α*], which we denote Formula (2); it involves 2n computational parameters.

v-SVR [12] is another commonly used algorithm for solving SVR. Compared with ε-SVR, v-SVR uses a new parameter v ∈ (0, 1] to control the number of support vectors and training errors and to adjust the parameter ε automatically. According to Gu et al., the soft-margin v-SVR is the following constrained minimization problem [28]-[30]:

min_{w, ξ, ξ*, ε}  (1/2)‖w‖² + C ( vε + (1/n) e^T (ξ + ξ*) )
s.t.  w^T φ(x_i) − y_i ≤ ε + ξ_i,
      y_i − w^T φ(x_i) ≤ ε + ξ_i*,
      ξ_i ≥ 0, ξ_i* ≥ 0, ε ≥ 0,  i = 1, . . . , n.

According to Chang et al. and Crisp et al., the inequality e^T (α + α*) ≤ Cv in the dual of v-SVR can be replaced by the equality form e^T (α + α*) = Cv under the constraint 0 < v ≤ 1 [11], [31], so the dual can be written with this equality constraint; we denote it Formula (3). We substitute the equation α* = Cve − α into Formula (3), obtaining Formula (4). As one can see by comparing Formulas (2) and (4), the substitution halves the number of computational parameters of v-SVR relative to ε-SVR when solving the QPP, and correspondingly reduces both the time and space complexity of the solver.
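For reference, the ε-insensitive loss that underlies both formulations can be computed as follows (a small Python illustration under the usual definition, not code from the paper):

import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps):
    # Zero loss inside the ε-tube, linear growth outside it.
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)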

B. RECENT PROGRESS IN SV THEORY
Recent SV theory indicates that maximizing the minimum margin is not the only way to construct the separating hyperplane, because it does not necessarily lead to better generalization performance [20]. SVR may also suffer from the so-called data piling problem [32]: the separating hyperplane produced by SVR tends to make the data pile together when they are projected onto it. If the distribution of the boundary data differs from that of the internal data, the hyperplane constructed by SVR will be inconsistent with the actual data distribution, which reduces the performance of SVR. Fortunately, Gao and Zhou have demonstrated that the margin distribution is critical to generalization performance [23]. By using the margin mean and the margin variance, the model becomes robust to different distributions of boundary data and to noise. Inspired by the above research, MDR [24] introduced the statistics of deviation into ε-SVR, which allows more data to have an impact on the construction of the hyperplane.
In MDR, the regression deviation of a sample (x_i, y_i) is formulated as

γ_i = y_i − w^T φ(x_i).        (5)

Accordingly, the regression deviation mean is

γ̄ = (1/n) Σ_{i=1}^{n} γ_i,

and the regression deviation variance is defined as

γ̂ = (1/n) Σ_{i=1}^{n} (γ_i − γ̄)².

MDR minimizes the regression deviation mean and the regression deviation variance simultaneously; its soft-margin primal problem augments the ε-SVR objective with these two statistics, where λ_1 and λ_2 are the parameters trading off the regression deviation variance, the regression deviation mean, and the model complexity.
Here, we can see from Equation (5) that the regression deviation γ_i is positive when the sample (x_i, y_i) lies above the regressor and negative when it lies below the regressor. In fact, for regression, the deviation of the sample (x_i, y_i) should be the distance between the actual value and the estimated one, that is, |y_i − w^T φ(x_i)|. Therefore, the definition of the deviation mean in MDR is not very appropriate.
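As a quick numerical illustration of this point (a toy Python example of ours, not taken from the paper): two samples lying symmetrically above and below the regressor have a signed deviation mean of zero even though neither is fitted well, whereas the absolute deviation reflects the actual error.

import numpy as np

y_true = np.array([2.0, -2.0])   # two samples, one above and one below the regressor
y_pred = np.array([0.0, 0.0])    # both predicted exactly on the regressor
signed = y_true - y_pred         # [ 2., -2.]
print(signed.mean())             # 0.0  -> signed deviations cancel out
print(np.abs(signed).mean())     # 2.0  -> absolute deviations expose the real error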
On the other hand, when solving the QPP, MDR uses the ε-SVR strategy and needs to compute 2n parameters (where n is the number of training samples). Computing a large number of parameters increases the computational complexity and reduces the speed of the algorithm. Considering this, in the remainder of this paper we introduce our approach and address the limitations of the ε-SVR strategy.

III. v-MINIMUM ABSOLUTE DEVIATION DISTRIBUTION REGRESSION
In this section, we first formulate the absolute deviation distribution which takes into account the positive and negative values of the regression deviation of samples. Then we give the optimization algorithms and the theoretical proof.

A. FORMULATION OF v-MADR
The two most straightforward statistics for characterizing the absolute deviation distribution are the mean and the variance of the absolute deviation. In regression problems, the absolute regression deviation of a sample (x_i, y_i) is formulated as

ϕ_i = |y_i − w^T φ(x_i)|.        (6)

ϕ_i is actually the distance between the actual value of the sample (x_i, y_i) and the estimated one. Based on the definition in Equation (6), we give the definitions of the statistics of absolute deviation in regression.

Definition 1: The absolute regression deviation mean is defined as

φ̄ = (1/n) Σ_{i=1}^{n} ϕ_i² = (1/n) Σ_{i=1}^{n} (y_i − w^T φ(x_i))².        (7)

The absolute regression deviation mean represents the expected difference between the actual values of the data and the estimated ones. To facilitate the calculation, the deviations are squared in this definition. In fact, the absolute regression deviation mean can be viewed as the adjusted distances of the data to their fitting hyperplane. Next, we give the concept of the absolute regression deviation variance.

Definition 2: The absolute regression deviation variance is defined as

φ̂ = (1/n) Σ_{i=1}^{n} ( (y_i − w^T φ(x_i)) − (1/n) Σ_{j=1}^{n} (y_j − w^T φ(x_j)) )².        (8)

We can see that the absolute regression deviation variance quantifies the scatter of the regression deviations.
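The two statistics can be computed directly from the residuals; the following Python/NumPy sketch follows Definitions 1 and 2 as written above for the linear case f(x) = w^T x (the released v-MADR code is in MATLAB, so this is only an illustration):

import numpy as np

def abs_deviation_stats(X, y, w):
    # Residuals y_i - w^T x_i for the linear model f(x) = w^T x.
    residual = y - X @ w
    dev_mean = np.mean(residual ** 2)   # squared form of the deviation mean (Definition 1)
    dev_var = np.var(residual)          # scatter of the deviations around their mean (Definition 2)
    return dev_mean, dev_var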
In existing SVR methods, a loss is incurred only when the absolute difference between the actual and estimated values exceeds a threshold, so the fitting hyperplane constructed by SVR is affected only by the distribution of the boundary data. If the distribution of the boundary data deviates largely from that of the internal data, the constructed hyperplane will be inconsistent with the actual overall data distribution. To overcome this issue, v-MADR aims to obtain a tradeoff between the distribution of the boundary data and that of the internal data. This means that the fitting hyperplane constructed by v-MADR is not only determined by the distribution of the boundary data; by simultaneously minimizing the absolute regression deviation mean and the absolute regression deviation variance, it also accounts for the influence of the overall data distribution on the fitting hyperplane, which is closer to the real distribution of many datasets and more robust to noise.
Therefore, similar to the soft-margin v-SVR [28], the final optimization problem with the soft-margin has the following form:

min_{w, ξ, ξ*, ε}  (1/2)‖w‖² + λ_1 φ̄ + λ_2 φ̂ + C ( vε + (1/n) e^T (ξ + ξ*) )
s.t.  w^T φ(x_i) − y_i ≤ ε + ξ_i,
      y_i − w^T φ(x_i) ≤ ε + ξ_i*,
      ξ_i ≥ 0, ξ_i* ≥ 0, ε ≥ 0,  i = 1, . . . , n,        (9)

where the parameters λ_1 and λ_2 control the tradeoff among the absolute regression deviation mean, the absolute regression deviation variance, and the model complexity. It is evident that the soft-margin v-MADR subsumes the soft-margin v-SVR when λ_1 and λ_2 both equal 0. The meanings of the other variables have been introduced in the previous formulas.

B. ALGORITHMS FOR v-MADR
Solving Formula (9) is the key point for v-MADR in practical use. In this section, we first design a dual coordinate descent (DCD) algorithm for kernel v-MADR, and then present an averaged stochastic gradient descent (ASGD) algorithm for large-scale linear kernel v-MADR.

1) KERNEL v-MADR
By substituting the absolute regression deviation mean φ̄ (Definition 1) and the absolute regression deviation variance φ̂ (Definition 2) into Formula (9), we obtain Formula (10). The y^T y and y^T e e^T y terms in φ̄ (Definition 1) and φ̂ (Definition 2) are constants in the optimization problem, so we omit them. However, Formula (10) is still intractable because of the high dimensionality of φ(x) and its complicated form. Inspired by [20], [33], we give the following theorem to state the optimal solution w for Formula (10).
Theorem 1: The optimal solution w of Formula (10) can be represented in the following form:

w = Σ_{i=1}^{n} α_i φ(x_i) = Xα,        (11)

where X = [φ(x_1), . . . , φ(x_n)] and α = [α_1, . . . , α_n]^T.

Proof: Suppose that w can be decomposed into a component in the span of the φ(x_i) and an orthogonal vector, that is,

w = Σ_{i=1}^{n} α_i φ(x_i) + z = Xα + z,

where z satisfies φ(x_j)^T z = 0 for all j, that is, X^T z = 0. Then, for every sample, we obtain the following equation:

w^T φ(x_j) = (Xα + z)^T φ(x_j) = α^T X^T φ(x_j).        (12)

According to Equation (12), the second and third terms and the constraints of Formula (10) are independent of z. Besides, the last term of Formula (10) can also be considered as independent of z. For the first term of Formula (10), using X^T z = 0 we get

‖w‖² = ‖Xα‖² + ‖z‖² ≥ ‖Xα‖²,

where equality holds if and only if z = 0. Thus, setting z = 0 does not affect the rest of the terms and strictly reduces the first term of Formula (10) whenever z ≠ 0. Based on all of the above, w in Formula (10) can be represented in the form of Equation (11). Q.E.D.
According to Chang and Lin, the inequality e^T (α + α*) ≤ Cv in v-SVR can be replaced by the equality form e^T (α + α*) = Cv under the constraint 0 < v ≤ 1, and an optimal solution always exists [11]. Based on this conclusion, the analogous equality constraint on the dual variables of Formula (10), stated in Equation (20), can be imposed on the dual problem in Formula (19). We thus substitute β* = Cve − β into Formula (19), obtaining Formula (21). As one can see from Formula (21), substituting β* = Cve − β into Formula (19) halves the number of computational parameters of v-MADR.
Due to the simple box constraint and the convex quadratic objective function, there exist many methods to solve this optimization problem [35]-[38]. To solve Formula (21), we use the DCD algorithm [39], which repeatedly selects one variable for minimization while keeping the others fixed, so that a closed-form solution can be obtained at each iteration. In our situation, we minimize f(β) by adjusting the value of β_i ∈ β with a step size t while keeping the other β_{k≠i} fixed, which leads to the following sub-problem:

min_t  f(β + t d_i)   s.t.  0 ≤ β_i + t ≤ C/n,        (22)

where d_i denotes the vector with 1 in the i-th element and 0s elsewhere. Since f is a convex quadratic function with Hessian P, we have

f(β + t d_i) = f(β) + t [∇f(β)]_i + (1/2) p_ii t²,

where p_ii is the i-th diagonal entry of P. We then calculate the gradient component [∇f(β)]_i appearing in Equation (22); since f(β) does not depend on t, it can be omitted from Equation (22), and f(β + t d_i) reduces to a simple quadratic function of t. If we denote β_i^iter as the value of β_i at the iter-th iteration, then β_i^(iter+1) = β_i^iter + t is the value at the (iter+1)-th iteration. Minimizing Equation (22) over t without the box constraint gives

t = −[∇f(β)]_i / p_ii,

so the value of β_i^(iter+1) is obtained as

β_i^(iter+1) = β_i^iter − [∇f(β)]_i / p_ii.

Furthermore, taking the box constraint 0 ≤ β_i ≤ C/n into account, the update for β_i^(iter+1) becomes

β_i^(iter+1) = min( max( β_i^iter − [∇f(β)]_i / p_ii , 0 ), C/n ).

After β converges, α can be obtained according to Equation (15) and Equation (20). Thus, the final regression function is

f(x) = Σ_{i=1}^{n} ᾱ_i κ(x_i, x),

where ᾱ_i = α_i − α_i* and κ(·, ·) is the kernel function. Algorithm 1 summarizes the procedure of v-MADR with kernel functions. The initial value of β is Cve/2, which simplifies the calculation procedure of v-MADR and satisfies Equation (20). The parameter v is controllable and its range is (0, 1].

Algorithm 1 Dual Coordinate Descent Solver for Kernel v-MADR
Input: Dataset X, λ_1, λ_2, C, v;
Output: α;

We now analyze the computational complexity of Algorithm 1. The complexity of parameter initialization is shown in Table 1, where n represents the number of examples and m represents the number of features. The time complexity of the dual coordinate descent (DCD) iterations is O(maxIter · n²), where the maximum number of iterations maxIter is set to 1000. The total time complexity of the DCD algorithm is the sum of the above terms. In summary, the time complexity of the DCD algorithm is O(n³) and its space complexity is O(n²).
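To make the update rule concrete, the sketch below applies dual coordinate descent to a generic box-constrained convex quadratic program min_β (1/2)β^T P β + q^T β with 0 ≤ β_i ≤ C/n. It is a Python illustration under these assumptions only; P and q stand for whatever Hessian and linear term Formula (21) produces, and it is not the authors' released MATLAB implementation.

import numpy as np

def dcd_box_qp(P, q, C, n, max_iter=1000, tol=1e-6):
    # Dual coordinate descent for min 0.5*b'Pb + q'b  s.t.  0 <= b_i <= C/n.
    beta = np.full(n, 0.5 * C / n)   # a feasible starting point inside the box
    upper = C / n
    for _ in range(max_iter):
        max_change = 0.0
        for i in range(n):
            if P[i, i] <= 0:
                continue                              # skip degenerate diagonal entries
            grad_i = P[i] @ beta + q[i]               # i-th component of the gradient
            new_val = beta[i] - grad_i / P[i, i]      # unconstrained minimizer along d_i
            new_val = min(max(new_val, 0.0), upper)   # clip to the box constraint
            max_change = max(max_change, abs(new_val - beta[i]))
            beta[i] = new_val
        if max_change < tol:
            break                                     # coordinate updates have converged
    return beta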

2) LARGE-SCALE LINEAR KERNEL v-MADR
In regression analysis, processing larger datasets increases the computational cost. Although the DCD algorithm solves kernel v-MADR efficiently for small-sample problems, it is not the best strategy for larger problems. Considering the computational time cost, we apply an averaged stochastic gradient descent (ASGD) algorithm [40] to linear kernel v-MADR to improve its scalability; ASGD solves the optimization problem by computing a noisy unbiased estimate of the gradient from a randomly sampled subset of the training instances rather than from all the data.
We reformulate Formula (10) into the linear kernel v-MADR problem, whose objective we denote g(w) in Formula (23), where X = [x_1, x_2, . . . , x_n] and y = [y_1, y_2, . . . , y_n]^T. The term Cvε in Formula (10) is constant in the optimization problem, so we omit it. For large-scale problems, it is expensive to compute the gradient of Formula (23) because all training samples are needed for the computation. Stochastic gradient descent (SGD) [41], [42] works by computing a noisy unbiased estimate of the gradient from a sampled subset of the training samples. When the objective is convex, SGD is expected to converge to the global optimal solution. In recent years, SGD has been successfully applied to various machine learning problems with powerful computational efficiency [43]-[46].
In order to obtain an unbiased estimate of the gradient ∇g(w), we first present the following theorem, which can be proved by computing ∇g(w) directly.

Theorem 2: If two samples (x_i, y_i) and (x_j, y_j) are drawn from the training set independently and uniformly at random, then the two-sample stochastic gradient ∇g(w, x_i, x_j), constructed from the quantities in the set of equations (25), is an unbiased estimate of ∇g(w).

Proof: Note the form of the gradient of g(w), obtained by direct differentiation of Formula (23). According to the linearity of expectation, the independence between x_i and x_j, and the set of equations (25), we have E[∇g(w, x_i, x_j)] = ∇g(w). Hence ∇g(w, x_i, x_j) is a noisy unbiased gradient of g(w). Q.E.D.
Based on Theorem 2, the stochastic gradient update is

w_(t+1) = w_t − ϕ_t ∇g(w_t, x_i, x_j),        (26)

where ϕ_t is the learning rate at the t-th iteration. Since the ASGD algorithm is more robust than the SGD algorithm [47], we actually adopt the ASGD algorithm to solve the optimization problem in Formula (23). At each iteration, in addition to the normal stochastic gradient update in Equation (26), we also compute the average

w̄_t = (1/(t − t_0)) Σ_{k=t_0+1}^{t} w_k,

where t_0 decides when the averaging operation starts. This average can also be calculated recursively as

w̄_(t+1) = w̄_t + (w_(t+1) − w̄_t)/(t − t_0 + 1).

Algorithm 2 summarizes the procedure of large-scale linear kernel v-MADR.
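The following Python sketch shows a generic ASGD loop of this kind: it runs plain SGD with a user-supplied noisy gradient and, after iteration t_0, maintains the running average of the iterates, which is returned as the solution. The toy gradient below only loosely mirrors the two-sample estimate of Theorem 2 on a least-squares objective; it is our illustration, not the v-MADR gradient.

import numpy as np

def asgd(stochastic_grad, dim, n_iter=10000, t0=1000, lr0=0.01):
    # Averaged SGD: SGD steps plus a running average of the iterates after t0.
    rng = np.random.default_rng(0)
    w = np.zeros(dim)
    w_avg = np.zeros(dim)
    for t in range(1, n_iter + 1):
        lr = lr0 / (1.0 + lr0 * t) ** 0.75     # a common ASGD learning-rate decay
        w = w - lr * stochastic_grad(w, rng)   # SGD step with a noisy gradient
        if t > t0:
            w_avg += (w - w_avg) / (t - t0)    # recursive form of the running average
        else:
            w_avg = w.copy()
    return w_avg

# Toy usage: unbiased two-sample gradient of a least-squares objective.
X = np.random.randn(500, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])
def grad(w, rng):
    i, j = rng.integers(0, len(y), size=2)
    return 0.5 * (X[i] * (X[i] @ w - y[i]) + X[j] * (X[j] @ w - y[j]))
w_hat = asgd(grad, dim=5)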

3) PROPERTIES OF v-MADR
We study the statistical property of v-MADR that leads to a bound on the expectation of error, based on the leave-one-out cross-validation estimate, which is an unbiased estimate of the probability of test error. For the sake of simplicity, we only discuss the linear case of Formula (10) here, in which w can be represented in terms of the dual variables as in Section III.B.1; the result also carries over to the kernel mapping φ. The dual problem of Formula (10), obtained by the same steps as in Section III.B.1, is denoted Formula (27).

Definition 3: The regression error is defined as follows: π(x, y) = |y − f(x)|.
We give the following theorem to state the expectation of the probability of test error.
Theorem 3: Let β be the optimal solution of Formula (27), and let E[R(β)] be the expectation of the probability of test error. Then the bound in Formula (28) holds, where w and w^i denote the solutions of the linear kernel v-MADR obtained on the full training set and on the training set with the i-th sample removed, respectively. According to [48],

E[R(β)] = (1/n) E[ L((x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)) ],        (29)

where L((x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)) is the number of errors in the leave-one-out procedure.
In the process of solving Formula (27) using Lagrange multipliers, every sample must satisfy the KKT conditions. According to the KKT conditions, β_i can take a non-zero value if and only if ε + ξ_i − y_i + x_i^T α = 0, and β_i* can take a non-zero value if and only if ε + ξ_i* + y_i − x_i^T α = 0. In other words, if the sample (x_i, y_i) is not inside the ε-tube in the leave-one-out procedure, β_i or β_i* can take a non-zero value. In addition, ε + ξ_i − y_i + x_i^T α = 0 and ε + ξ_i* + y_i − x_i^T α = 0 cannot hold at the same time, so at least one of β_i and β_i* is zero. The specific breakdown is as follows:
i) If the sample (x_i, y_i) is inside the ε-tube in the leave-one-out procedure, then ε + ξ_i − y_i + x_i^T α ≠ 0 and ε + ξ_i* + y_i − x_i^T α ≠ 0, so we have β_i = 0 and β_i* = 0;
ii) If the sample (x_i, y_i) is outside the ε-tube in the leave-one-out procedure, there are two situations: a) if the sample is above the ε-tube, then ξ_i* = 0 and ε + ξ_i − y_i + x_i^T α = 0, so we have β_i = C/n and β_i* = 0; b) if the sample is below the ε-tube, then ξ_i = 0 and ε + ξ_i* + y_i − x_i^T α = 0, so we have β_i* = C/n and β_i = 0;
iii) If the sample (x_i, y_i) is on the boundary of the ε-tube in the leave-one-out procedure, there are two situations: a) if the sample is on the upper boundary of the ε-tube, then ξ_i = 0, and we have 0 < β_i ≤ C/n and β_i* = 0; b) if the sample is on the lower boundary of the ε-tube, then ξ_i* = 0, and we have 0 < β_i* ≤ C/n and β_i = 0.
Based on the discussion above, we consider the following cases to calculate the test error:
i) If both β_i = 0 and β_i* = 0, the sample (x_i, y_i) is inside the ε-tube in the leave-one-out procedure, and formulas (30) and (31) hold, where d_i denotes the vector with 1 in the i-th element and 0s elsewhere. The left-hand side of formula (30) equals [∇f(β)]_i² / (8 p_ii) = (x_i^T w^i − y_i)² / (2 p_ii), and the right-hand side of formula (31) equals 2 p_ii β_i².
ii) Otherwise, by combining formulas (30) and (31), the sample (x_i, y_i) is not inside the ε-tube in the leave-one-out procedure, so π(x_i, y_i) = ε + ξ̃_i, where ξ̃_i = max{ξ_i, ξ_i*}.
Summing over all samples, L((x_1, y_1), . . . , (x_n, y_n)) is bounded by ε|I_1| plus terms involving 2p_ii and β_i, where I_1 ⊆ {1, . . . , n} is the index set of the corresponding samples and β_i* = Cv − β_i. Taking expectations on both sides and using formula (29), we conclude that formula (28) holds. Q.E.D.

IV. EXPERIMENTAL RESULTS
Since SVR-based algorithms are now widely used for regression problems and demonstrate better generalization ability [8]-[10] than many existing algorithms, such as least squares regression [4], neural network (NN) regression [5], logistic regression [6], and ridge regression [7], we do not repeat those comparisons. In this section, we empirically evaluate the performance of our v-MADR against other SVR-based algorithms, including v-SVR, LS-SVR, ε-TSVR, and linear ε-SVR, on several datasets: two artificial datasets, eight medium-scale datasets, and six large-scale datasets. All algorithms are implemented in MATLAB R2014a on a PC with a 2.00 GHz CPU and 32 GB memory. v-SVR is solved by LIBSVM [49]; linear ε-SVR is solved by LIBLINEAR [50]; LS-SVR is solved by LSSVMlab [51]; and ε-TSVR is solved by the SOR technique [52], [53].
The RBF (Gaussian) kernel and the polynomial kernel are used for nonlinear regression. The values of the parameters are obtained by means of a grid-search method [54]. For brevity, we set c_1 = c_2, c_3 = c_4, and ε_1 = ε_2 for ε-TSVR, and λ_1 = λ_2 for our nonlinear v-MADR. The parameter v in v-MADR is selected from the set {2^-9, 2^-8, . . . , 2^0}, and the remaining parameters of the five methods, as well as the parameters of the Gaussian kernel, are selected from the set {2^-9, 2^-8, . . . , 2^9} by 10-fold cross-validation. Specifically, the parameter d in the polynomial kernel is selected from {2, 3, 4, 5, 6}. In order to evaluate the performance of the proposed algorithm, the performance metrics are specified before presenting the experimental results. Without loss of generality, let n be the number of training samples and m be the number of testing samples, denote ŷ_i as the predicted value of y_i, and denote ȳ = (1/m) Σ_{i=1}^{m} y_i as the average of y_1, y_2, . . . , y_m. The metrics used for assessing the performance of all regression algorithms are given in Table 2.
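For reproducibility, the three metrics can be computed as in the following Python sketch, which uses the common textbook definitions of R², NMSE, and MAPE; the exact forms used in the paper are the ones specified in Table 2.

import numpy as np

def regression_metrics(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)              # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)     # total sum of squares around the mean
    r2 = 1.0 - ss_res / ss_tot                           # coefficient of determination
    nmse = ss_res / ss_tot                               # normalized mean squared error
    mape = np.mean(np.abs((y_true - y_pred) / y_true))   # mean absolute percentage error
    return r2, nmse, mape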
To demonstrate the overall performance of a method, the average rank of each method over all datasets is also reported as a performance metric.

In our experiments, we test the performance of the above methods on two artificial datasets, eight medium-scale datasets, and six large-scale datasets. The basic information of these datasets is given in Table 3. All real-world datasets are taken from UCI (http://archive.ics.uci.edu/ml) and StatLib (http://lib.stat.cmu.edu/), and more detailed information can be accessed from those websites. Before regression analysis, all of these real datasets are normalized to zero mean and unit standard deviation. For the medium-scale datasets, the RBF kernel and the polynomial kernel are used; for the large-scale datasets, only the linear kernel v-MADR is used, considering the computational complexity. Each experiment is repeated for 30 trials with 10-fold cross-validation, and the mean values of R², NMSE, and MAPE and their standard deviations are recorded. In particular, the two datasets ''Diabetes'' and ''Motorcycle'' have smaller numbers of samples and features, so we use leave-one-out cross-validation for them instead.

A. ARTIFICIAL DATASETS
In order to compare our v-MADR with v-SVR, LS-SVR, and ε-TSVR, we choose two artificial datasets with different distributions. First, we consider the function y = x^(2/3). In order to fully assess the performance of the methods, the training samples are corrupted with Gaussian noise with zero mean and standard deviation 0.5; that is, we generate training samples (x_i, y_i) with

x_i ∼ U[a, b],   y_i = x_i^(2/3) + n_i,   n_i ∼ N(0, 0.5²),

where U[a, b] represents the uniform random variable on [a, b] and N(µ, σ²) represents the Gaussian random variable with mean µ and standard deviation σ, respectively. To avoid biased comparisons, ten independent groups of noisy samples are randomly generated, each including 200 training samples and 400 noise-free test samples.
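A minimal Python sketch of this data generation is given below; the sampling interval [a, b] is left unspecified in the text, so the values used here are purely illustrative, and x^(2/3) is computed as |x|^(2/3) so that negative inputs are handled.

import numpy as np

rng = np.random.default_rng(0)
a, b = -2.0, 2.0                    # illustrative choice; the paper writes U[a, b]
x_train = rng.uniform(a, b, size=200)
y_train = np.abs(x_train) ** (2.0 / 3.0) + rng.normal(0.0, 0.5, size=200)   # noisy targets
x_test = np.linspace(a, b, 400)
y_test = np.abs(x_test) ** (2.0 / 3.0)   # 400 noise-free test targets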
The estimated functions obtained by these four methods are shown in Figure 1. All four methods obtain good fitted values, but our v-MADR gives the best approximation. Table 4 shows the corresponding performance metrics and training times. Compared with the other methods, our v-MADR has the highest R² and the lowest NMSE and MAPE, which indicates that our v-MADR achieves good fitting performance and a good representation of the statistical information in the training dataset. In addition, the CPU time of our v-MADR is not much different from that of the other methods and is comparable to that of v-SVR.
The second artificial example is the regression estimation of the Sinc function y = sin(x)/x. The training samples are corrupted with Gaussian noise with zero mean and standard deviation 0.5, giving training samples (x_i, y_i) of the same form as above. The dataset consists of 200 training samples and 400 test samples. Figure 2 illustrates the estimated functions obtained by the four methods, and Table 4 shows the corresponding performance. These results also demonstrate the superiority of our v-MADR. At the bottom of Table 4, we list the average ranks of all four methods on the artificial datasets for different performance metrics. It can be seen that our v-MADR is superior to the other three methods on R² and NMSE, and is comparable to LS-SVR and ε-TSVR in terms of MAPE.
B. MEDIUM-SCALE DATASETS

Table 5 and Table 6 list the experimental results on the eight medium-scale datasets from UCI and StatLib with the RBF and polynomial kernels, respectively. From the average ranks at the bottom of Table 5 and Table 6, our v-MADR is superior to the other three methods. In detail, on most datasets our v-MADR has the highest R² and the lowest NMSE and MAPE. Although on several datasets, such as ''MachineCPU'', our v-MADR does not achieve the best experimental results compared with the other methods, it is not the worst. Our v-MADR also performs well in terms of CPU running time. The above experimental results indicate that v-MADR is an efficient and promising algorithm for regression. Table 7 and Table 8 list the optimal parameters with the RBF and polynomial kernels, respectively.

For further evaluation, we investigate the absolute regression deviation mean and variance of our v-MADR with the RBF kernel, v-SVR, LS-SVR, and ε-TSVR on the medium-scale datasets, as shown in Figure 4. From Figure 4, our v-MADR has the smallest absolute regression deviation mean and variance on most datasets. In addition, v-MADR also has the most compact mean and variance distribution, which demonstrates its robustness. From the above results, it is clear that our v-MADR outperforms the other three methods.
The change of parameter values may have a great effect on the results of regression analysis. For our RBF kernel v-MADR, there are mainly three trade-off parameters, i.e., λ_1, λ_2, and C, and one kernel parameter σ. Figure 5(a) and Figure 5(b) show the influence of λ_1 on NMSE and CPU time, obtained by varying λ_1 from 2^-9 to 2^9 while fixing λ_2, C, and σ at the optimal values found by cross-validation. Figures 5(c)-5(h) show the influence of λ_2, C, and σ on NMSE and CPU time, respectively. As one can see from Figure 5(a), Figure 5(c), and Figure 5(e), the NMSE values on the medium-scale datasets do not change significantly when the values of the three parameters λ_1, λ_2, and C are varied. Figure 5(g) shows that σ has a more obvious influence on NMSE: on most datasets, as σ becomes larger, NMSE becomes smaller and smaller until it converges to a fixed value.

C. LARGE-SCALE DATASETS

Table 9 lists the experimental results on the six large-scale datasets with the linear kernel. We additionally add a comparison with linear ε-SVR, which is solved by LIBLINEAR [50] and can handle large-scale datasets. In this experiment, because the datasets are large, for each dataset 2/3 of the data is randomly selected as the training set and the remaining 1/3 is used as the test set for evaluation. From the average rank at the bottom of Table 9, the overall performance of v-MADR is better than that of the compared methods or is highly competitive. The optimal parameters are listed in Table 10. Figure 6 shows the comparison of CPU time. From Figure 6, linear kernel v-MADR is the fastest learning method; in particular, its CPU time is far superior to that of v-SVR, LS-SVR, and ε-TSVR.

V. CONCLUSION
In this research, we introduce statistical information into v-SVR and propose a novel SVR method called v-MADR. v-MADR improves the performance of SVR and overcomes the limitations of existing SVR algorithms by minimizing both the absolute regression deviation mean and the absolute regression deviation variance, which takes into account both the positive and negative values of the regression deviation of sample points. For optimization, we propose a dual coordinate descent (DCD) algorithm for small-sample problems and an averaged stochastic gradient descent (ASGD) algorithm for large-scale problems, which greatly reduces the computational complexity and thus improves the speed of the algorithm. We also provide a theoretical analysis of the bound on the expectation of error for v-MADR. Experimental results show that v-MADR outperforms several regression methods and demonstrates great application potential. Our v-MADR MATLAB code can be accessed at https://github.com/AsunaYY/v-MADR.
In the near future, we will further investigate the potential of v-MADR for big data problems, e.g., predictive analysis for bioinformatics and systems biology problems, and problems in finance. We envision a great application potential in these problems.