A New L₁ Multi-Kernel Learning Support Vector Regression Ensemble Algorithm With AdaBoost

This paper proposes a new multi-kernel learning ensemble algorithm, called Ada-<inline-formula> <tex-math notation="LaTeX">$L_{1}$ </tex-math></inline-formula>MKL-WSVR, which can be regarded as an extension of multi-kernel learning (MKL) and weighted support vector regression (WSVR). The first novelty is to add the <inline-formula> <tex-math notation="LaTeX">$L_{1}$ </tex-math></inline-formula> norm of the weights of the combined kernel function to the objective function of WSVR, which is used to adaptively select the optimal base models and their parameters. In addition, an accelerated method based on fast iterative shrinkage thresholding algorithm (FISTA) is developed to solve the weights of the combined kernel function. The second novelty is to propose an integrated learning framework based on AdaBoost, named Ada-<inline-formula> <tex-math notation="LaTeX">$L_{1}$ </tex-math></inline-formula>MKL-WSVR. In this framework, we integrate FISTA into AdaBoost. At each iteration, we optimize the weights of the combined kernel function and update the weights of the training samples at the same time. Then an ensemble regression function of a set of regression functions is output. Finally, two groups of the experiments are designed to verify the performance of our algorithm. On the first group of the experiments including eight datasets from UCI machine learning repository, the MAEs and RMSEs of Ada-<inline-formula> <tex-math notation="LaTeX">$L_{1}$ </tex-math></inline-formula>MKL-WSVR are reduced by 11.14% and 9.08% on average, respectively. Furthermore, on the second group of the experiments including the COVID-19 epidemic datasets from eight countries, the MAEs and RMSEs of Ada-<inline-formula> <tex-math notation="LaTeX">$L_{1}$ </tex-math></inline-formula>MKL-WSVR are reduced by 31.19% and 29.98% on average, respectively.


I. INTRODUCTION
Support vector machine (SVM) [1], [2] is a supervised learning algorithm that can be used for data classification, pattern recognition and regression analysis. It has a strong mathematical foundation and theoretical support. SVMs can effectively handle small samples, nonlinearity, overfitting and local minima, and have been successfully applied in various fields, including text classification [3], image classification [4], bioinformatics [5] and medical diagnosis [6]. Support vector regression (SVR) is an important application of SVMs, which introduces an ε-insensitive loss function into SVM to adapt it to the regression problem [7]. To achieve nonlinear regression, SVR uses a kernel function to map the sample set to the feature space. SVR has many advantages in solving small-sample, nonlinear and high-dimensional pattern recognition problems, and has been widely applied to practical problems, including traffic velocity prediction [8], conductivity prediction [9], spatial prediction of landslide susceptibility [10], and stock price forecasting [11]. However, for samples containing heterogeneous information, uneven distribution and irregularity, the traditional SVR with a single-kernel mapping is not necessarily suitable. Therefore, much work has been devoted to multi-kernel learning (MKL) [12], which is a more flexible kernel-based learning method. Using MKL instead of the traditional single-kernel learning can greatly improve the interpretability and generalization performance of the model [13].
The associate editor coordinating the review of this manuscript and approving it for publication was Jingen Ni.
MKL is the process of obtaining the weights of the combined kernel function. There are many effective learning methods for solving this problem. For example, Rakotomamonjy et al. [14] proposed a valid MKL method to select the kernel functions, in which the kernel function is set to be a linear combination of multiple basic kernel functions. Cao et al. [15] proposed multi-kernel feature selection based on the $L_{2,1}$ norm, called $L_{2,1}$MKFS, and designed a proximal optimization algorithm for efficiently learning the model. To solve highly complex issues of convex quadratic programming in SVR, a novel two-phase MKL-SVR based on linear programming (MK-LP-SVR) was proposed by Zhang et al. [16], and used for feature sparsification and forecasting. Moreover, some studies have tried to assign different weights to training samples in SVM or SVR to solve the problem of heteroscedasticity in training samples. Ada-SVR-R, proposed by Gao et al. [17], used a so-called classification-type loss to increase and decrease the weights of misclassified samples and correctly classified samples, respectively. Tao et al. [18] developed a modification of AdaBoost, which was a self-adaptive cost technique for SVM. Elatter et al. [19] combined locally weighted regression (LWR) and SVR (LWSVR) to build a load forecasting model, in which a weighted distance algorithm based on Mahalanobis distance was proposed to optimize the bandwidth of the weighting function. Xu et al. [20] proposed a weighted twin SVR that assigned different penalties to the samples according to their locations. In addition, some algorithms were designed so that the weights of the training samples were added as scaling factors of the slack variables in the objective function of SVR [21]-[23].
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
However, the above algorithms improve only one aspect, and few scholars have considered both the adaptive selection of the kernel function and the updating of the weights of the training samples in the framework of SVR.
Inspired by the existing literature, we propose a new multi-kernel ensemble algorithm based on the $L_1$ norm and weighted support vector regression (WSVR) with AdaBoost, namely, Ada-$L_1$MKL-WSVR. First, to adaptively choose the optimal combined kernel function, $L_1$MKL-SVR is proposed. Moreover, we design an accelerated method to solve the weights of the combined kernel function with the $L_1$ norm. Then, we introduce FISTA into AdaBoost to correct the weights of the training samples, and a new multi-kernel ensemble algorithm is proposed. In this method, the optimization of the weights of the combined kernel function and the updating of the weights of the training samples are both considered. Finally, the subregressors of all iterations are integrated into a strong robust regressor. Extensive experiments have been performed to validate the performance of Ada-$L_1$MKL-WSVR, and the numerical results demonstrate the competitiveness of the algorithm proposed in this paper.
The remainder of this article is arranged as follows. In Section II, we review some basic results pertinent to WSVR and Ada-SVR-R. The details of $L_1$MKL-SVR and Ada-$L_1$MKL-WSVR are presented in Section III. Section IV discusses our simulations and empirical studies, including dataset descriptions, parameter settings, and a comparative analysis of five different algorithms. Finally, some conclusions are drawn in Section V.

II. RELATED WORK
A. WEIGHTED SUPPORT VECTOR REGRESSION
In SVR, there is a basic assumption that the samples come from the same distribution, that is, the random error terms should have the same variance and be independent or uncorrelated. However, the standard SVR often gives unsatisfactory models when there is heteroscedasticity in a regression problem. To solve this problem, Sun et al. [28] proposed the so-called WSVR, which introduces appropriate weights to adjust the role of the training samples in SVR. In what follows, we briefly introduce the basic idea of WSVR. More details can be found in [28]. Let $x_i$, $i = 1, 2, \ldots, N$, and $y = (y_1, y_2, \ldots, y_N)$ with $y_i \in \mathbb{R}$, $i = 1, 2, \ldots, N$, be the inputs of the training samples and the target values, respectively. The purpose of WSVR is to find a regression function $f(x)$ that precisely estimates $y$ for a given input $x$. To obtain $f(x)$, the standard WSVR is transformed into the convex optimization problem (1), in which $C$ is the penalty coefficient, $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_N)$ are the weights of the training samples, $\xi = (\xi_1, \xi_2, \ldots, \xi_N)$ and $\xi^* = (\xi_1^*, \xi_2^*, \ldots, \xi_N^*)$ are the slack variables, $b$ is the intercept term, $\varepsilon$ is the fitting error, and $\phi(\cdot)$ is the map function, which maps the training sample space to a Hilbert space. It should be pointed out that WSVR reduces to the standard SVR if $\lambda_i = 1$ for $i = 1, \ldots, N$. In particular, the weight $\lambda_i$ is set as the reciprocal of the variance $\delta_i^2$ of the error term, i.e., $\lambda_i = 1/\delta_i^2$ [28]. In the Lagrange dual optimization problem (2) associated with the problem (1), $\alpha^*$ and $\alpha$ are the Lagrange multipliers.
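For orientation, the weighted ε-insensitive primal and its Lagrange dual are commonly written as in the sketch below. It is built only from the symbols defined in this subsection. The kernel shorthand $K(x_i, x_j) = \phi(x_i)^{\top}\phi(x_j)$ and the exact layout are assumptions, so the expressions may differ in form from the displayed problems (1) and (2) in [28].

```latex
% Weighted SVR primal (usual textbook form, assumed layout)
\min_{w,\,b,\,\xi,\,\xi^*}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\lambda_i\,(\xi_i + \xi_i^*)
\quad \text{s.t.}\quad
\begin{cases}
y_i - w^{\top}\phi(x_i) - b \le \varepsilon + \xi_i,\\
w^{\top}\phi(x_i) + b - y_i \le \varepsilon + \xi_i^*,\\
\xi_i \ge 0,\ \xi_i^* \ge 0,\quad i = 1,\ldots,N.
\end{cases}

% Corresponding dual: the sample weights only rescale the box constraints
\max_{\alpha,\,\alpha^*}\ -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_i-\alpha_i^*)(\alpha_j-\alpha_j^*)K(x_i,x_j)
 - \varepsilon\sum_{i=1}^{N}(\alpha_i+\alpha_i^*) + \sum_{i=1}^{N}y_i(\alpha_i-\alpha_i^*)
\quad \text{s.t.}\quad
\sum_{i=1}^{N}(\alpha_i-\alpha_i^*) = 0,\qquad 0 \le \alpha_i,\ \alpha_i^* \le C\lambda_i .
```

In this form the weight $\lambda_i$ enters the dual only through the upper bound $C\lambda_i$, which is why setting every $\lambda_i = 1$ recovers the standard SVR.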
It is well known that the regression function $f(x)$ can be expressed in terms of the optimal Lagrange multipliers and the kernel function, as in (3). In practical research, the variance of the error term $\delta_i^2$ ($i = 1, \ldots, N$) in WSVR is usually unknown and needs to be determined according to the actual situation. To overcome this difficulty, Gao et al. [17] proposed an integrated algorithm based on AdaBoost, namely Ada-SVR-R, which can be directly applied to the regression problem by introducing the classification-type loss. At each iteration of Ada-SVR-R, SVR receives the training samples and produces a regression function by training. Then the weights of the training samples are updated by calculating regression errors based on the classification-type loss. This process is repeated until $\mathrm{error}_t$ is larger than 0.5. Finally, the final regression function $F(x)$ is output as in (4). The so-called Ada-SVR-R [17] is given in Algorithm 1.
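The reweighting step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the exact rule of [17]: the classification-type loss is assumed to mark a sample as "misclassified" when its absolute error exceeds the threshold, and the weight update is assumed to be the usual exponential AdaBoost rule. The function names are hypothetical.

```python
import math


def classification_type_loss(y_true, y_pred, threshold):
    """Assumed form: 1 if the absolute error exceeds the threshold, else 0."""
    return [1 if abs(p - y) > threshold else 0 for y, p in zip(y_true, y_pred)]


def boosting_round(weights, y_true, y_pred, threshold):
    """One reweighting round; returns (alpha, error, new_weights).

    alpha is None when error > 1/2, the stopping case described in the text.
    """
    total = sum(weights)
    dist = [w / total for w in weights]  # normalize weights to a distribution
    miss = classification_type_loss(y_true, y_pred, threshold)
    error = sum(d * m for d, m in zip(dist, miss))
    if error > 0.5:
        return None, error, dist
    # Clip only inside the logarithm so that error = 0 does not divide by zero.
    clipped = min(max(error, 1e-12), 1.0 - 1e-12)
    alpha = 0.5 * math.log((1.0 - clipped) / clipped)
    # Assumed exponential rule: raise weights of missed samples, lower the rest.
    raw = [d * math.exp(alpha if m else -alpha) for d, m in zip(dist, miss)]
    norm = sum(raw)
    return alpha, error, [r / norm for r in raw]
```

With this assumed rule, the samples missed in a round carry half of the total weight in the next round, so the following base regressor is pushed towards the samples that were fitted poorly.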

III. METHODOLOGY
In this section, we first introduce the key idea of $L_1$MKL-SVR, which is the basis of our algorithm, and then provide the details of $L_1$MKL-WSVR and Ada-$L_1$MKL-WSVR.
A. $L_1$ MULTI-KERNEL LEARNING SUPPORT VECTOR REGRESSION
MKL [14], [29] is one of the most important research topics in kernel machine learning. MKL selects two or more kernel functions from the set of basic kernel functions to form the optimal kernel function, and assigns a weight to each kernel function. The combined kernel function constructed by MKL takes into account the characteristics of each constituent kernel function, which improves the accuracy of the model to a certain extent. Unlike a single-kernel method such as SVR, MKL assumes that the inputs of the training samples are mapped to feature spaces by $S$ mapping functions $\phi_s(\cdot)$ ($s = 1, \ldots, S$), and the purpose of MKL is to learn the optimal combined kernel function, which is used instead of a single kernel function to obtain better prediction performance.
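As a small illustration of what a combined kernel function means in practice, the sketch below forms a weighted sum of precomputed Gram matrices, one per basic kernel. It assumes the linear-combination form used in [14]. The function name and the toy matrices are hypothetical.

```python
def combine_kernels(gram_matrices, weights):
    """Return the weighted sum of S Gram matrices given as nested lists."""
    n = len(gram_matrices[0])
    combined = [[0.0] * n for _ in range(n)]
    for gram, d in zip(gram_matrices, weights):
        if d == 0.0:
            continue  # a zero weight removes that basic kernel entirely
        for i in range(n):
            for j in range(n):
                combined[i][j] += d * gram[i][j]
    return combined


# Toy example with two 2x2 Gram matrices and equal weights.
K1 = [[1.0, 0.0], [0.0, 1.0]]
K2 = [[1.0, 2.0], [2.0, 4.0]]
K = combine_kernels([K1, K2], [0.5, 0.5])
```

A sparse weight vector therefore acts as kernel selection: every basic kernel whose weight is zero drops out of the combined kernel.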
It is well known that the weights of the combined kernel function obtained by using the $L_1$ norm are sparse, which reduces redundancy and increases the operational efficiency of the model. Therefore, we introduce the $L_1$ norm of the weights of the combined kernel function into the objective function of SVR, namely $L_1$MKL-SVR, which is expressed as the optimization problem (5), where $\gamma$ is the regularization parameter and $D = (d_1, \ldots, d_S)$ are the weights of the combined kernel function, with $d_s$ ($s = 1, \ldots, S$) being the weight of the kernel function $K_s$. The optimization problem (5) is nonconvex due to the products of $d_s$ and $w_s$, and it can be resolved by applying the variable transformation $\tilde{w}_s = \sqrt{d_s}\, w_s$ as in [14], [30], [31]. This yields the optimization problem (6). Similar to the WSVR case, the regression function can be defined by solving the Lagrange dual optimization problem of the optimization problem (6).

Algorithm 1 Ada-SVR-R
Input: Training samples; the parameters of SVR; threshold $\varepsilon > 0$
Output: Final ensemble regression function $F(x)$
1: Initialize the weights of the training samples
2: for $t = 1, 2, \ldots$ do
3:   Set the distribution of the weights of the training samples
4:   Call SVR, providing it with the distribution $\lambda_i^t$, and obtain a regression function $f_t(x)$
5:   Calculate the weighted classification-type loss $\mathrm{error}_t$ of $f_t(x)$
6:   if $\mathrm{error}_t > \frac{1}{2}$ then
7:     stop the iteration
8:   end if
9:   Set the base learner's weight: $\alpha_t = \frac{1}{2}\ln\frac{1-\mathrm{error}_t}{\mathrm{error}_t}$
10:  Update the weights of the training samples
11: end for
To learn the weights of the training samples and the combined kernel function simultaneously, we introduce the weights $\lambda$ of the training samples into $L_1$MKL-SVR, namely $L_1$MKL-WSVR, which is expressed as the optimization problem (7). One can easily verify that if $\lambda_i = 1$ ($i = 1, 2, \ldots, N$), $L_1$MKL-WSVR degenerates to $L_1$MKL-SVR. Furthermore, the optimization problem (7) can be regarded as the composite objective optimization problem (8), whose objective $Z(D)$ consists of a smooth term $M(D)$ and the $L_1$ term $\gamma\|D\|_1$, where $M(D)$ is evaluated at an optimal solution $(\bar{w}, \bar{b}, \bar{\xi}, \bar{\xi}^*)$ of the inner optimization problem for fixed $D$. In addition, for the given weights $D$, $M(D)$ can be directly obtained by solving the classical WSVR in (1). It should be noted that the optimization problem (8) has a composite structure, where $M(D)$ is convex and differentiable, while $\|D\|_1$ is a nondifferentiable convex function on the feasible domain. To solve such a composite objective optimization problem, the common idea is to use the concept of the proximal gradient proposed by Nesterov [25]-[27]: a quadratic function is used to approximate the objective function, and the proximal gradient method is used to solve the resulting optimization problem. In this work, FISTA [24] is used to optimize $D$. By using the quadratic approximation, we obtain the proximal operator (12) of the objective function $Z(D)$ at the point $D$, in which the gradient $\nabla M(D)$ is given by (14). After ignoring the constant term, we obtain the unique minimizer of (12), which gives the iteration (15). Here $D$ is iteratively updated by FISTA, $P_L(D^{(t-1)})$ represents the proximal operator, and $L^{(t-1)}$ is the step size of the internal gradient used to control the convergence rate, which is determined by a line search.
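For reference, the generic quadratic approximation and proximal map used by FISTA [24] for a composite objective of the form $Z(D) = M(D) + \gamma\|D\|_1$ are sketched below. The symbols follow this subsection, with $L > 0$ playing the role of the step-size parameter. This is the standard form from [24] and is not claimed to reproduce the layout of (12)-(15) term by term.

```latex
% Quadratic approximation of Z around a point H
Q_L(D, H) = M(H) + \langle D - H,\ \nabla M(H)\rangle + \frac{L}{2}\,\|D - H\|^2 + \gamma\,\|D\|_1

% Proximal map: the constant terms in D are dropped
P_L(H) = \arg\min_{D}\ Q_L(D, H)
       = \arg\min_{D}\ \Big\{ \gamma\,\|D\|_1 + \frac{L}{2}\,\Big\| D - \Big(H - \frac{1}{L}\nabla M(H)\Big) \Big\|^2 \Big\}
```

Because both terms of the last expression separate over the components of $D$, the minimization reduces to independent one-dimensional problems, one per kernel weight.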
Meanwhile, to speed up the convergence of the iteration (15), the proximal operator $P_L(H^{(t+1)})$ is used as the starting point of the current iteration, in which $H^{(t+1)}$ is a linear combination of the two previous iterates $D^{(t)}$ and $D^{(t-1)}$, with coefficients determined by an auxiliary sequence $k^{(t)}$ that has its own iteration formula. Due to the separability of the $L_1$ norm over the components of $D = (d_1, \ldots, d_S)$, we can update each weight $d_s$ by solving a one-dimensional problem. In addition, a projection operator $P$ is introduced to ensure that each weight $d_s$ remains in the feasible domain. The updating of the weights $D$ of the combined kernel function by FISTA is given in Algorithm 2. Since the optimization problem (8) is convex, its global optimal solution can be obtained. Moreover, Algorithm 2 minimizes the surrogate function in each iteration to ensure that the original objective function decreases iteratively, so that the global optimum over the feasible domain is finally reached. It has been proved theoretically [24] that the convergence rate of such an algorithm is $O(1/t^2)$. By optimizing $D$, we can learn the relative importance of the different kernel functions and perform parameter estimation at the same time. The final regression function is then obtained with the optimized weights $\bar{D}$.
To simultaneously update the weights of the training samples, we embed Algorithm 2 as a hyper parameter optimization method into Algorithm 1. This yields the so-called Ada-$L_1$MKL-WSVR, which is a new boosting algorithm for regression. Furthermore, it is also an integrated algorithm composed of several regression functions, which is given in Algorithm 3.
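The accelerated update of the kernel weights can be illustrated with a minimal FISTA sketch. It makes several simplifying assumptions that are not taken from the paper: the smooth term is supplied through a user-provided gradient function rather than a WSVR solve, the step-size parameter is a fixed constant instead of the line search, and the feasible domain is assumed to be the nonnegative orthant. All names are hypothetical.

```python
import math


def prox_l1_nonneg(v, tau):
    """Minimize tau*d + 0.5*(d - v_s)^2 over d >= 0, component by component."""
    return [max(x - tau, 0.0) for x in v]


def fista_l1(grad, d0, lipschitz, gamma, tol=1e-8, max_iter=500):
    """FISTA for a smooth term plus gamma*||D||_1 with D >= 0 (assumed domain)."""
    d_prev = list(d0)
    h = list(d0)
    k = 1.0  # auxiliary sequence, started at 1
    for _ in range(max_iter):
        g = grad(h)
        # Gradient step at the extrapolated point, then the proximal map.
        step = [hi - gi / lipschitz for hi, gi in zip(h, g)]
        d = prox_l1_nonneg(step, gamma / lipschitz)
        # Momentum: H is a linear combination of the two latest iterates.
        k_next = (1.0 + math.sqrt(1.0 + 4.0 * k * k)) / 2.0
        h = [di + ((k - 1.0) / k_next) * (di - dp) for di, dp in zip(d, d_prev)]
        if max(abs(di - dp) for di, dp in zip(d, d_prev)) < tol:
            return d
        d_prev, k = d, k_next
    return d_prev


# Toy smooth term 0.5*||D - c||^2, whose gradient is D - c (Lipschitz constant 1).
c = [0.9, 0.05, 0.4]
weights = fista_l1(lambda h: [hi - ci for hi, ci in zip(h, c)],
                   d0=[1.0 / 3.0] * 3, lipschitz=1.0, gamma=0.1)
```

On this toy problem the second weight is driven exactly to zero, which is the sparsity effect that the $L_1$ term is meant to produce. In the paper's setting the gradient would come from the WSVR solution for the current weights.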
As Algorithm 3 shows, Ada-$L_1$MKL-WSVR mainly performs two tasks at each iteration: the adaptive selection of the optimal combined kernel function and the updating of the weights of the training samples. Firstly, MKL-SVR is trained on a set of the training samples to obtain the corresponding regression function. Secondly, the regression errors are calculated based on the classification-type loss. Thirdly, the weights of each training subset are recalculated according to the regression errors. Next, the weight distribution is used to resample the regression samples to form a new training subset. After that, according to the new training subset, the weights $D$ of the combined kernel function are calculated by using Algorithm 2. Finally, the regression functions obtained from all iterations are combined as the final regression function.
There are two contributions from the boosting iteration in Algorithm 3. The first contribution is to add the $L_1$ norm of the weights $D$ to the objective function of WSVR, with an accelerated method based on FISTA used to optimize the weights $D$. The second contribution is to embed FISTA into AdaBoost, that is, during each iteration, the weights $D$ and the weights $\lambda$ are optimized and updated, respectively.

Algorithm 2 Optimize D Based on FISTA
Input: Training samples, $k^{(1)} = 1$, $\lambda = (\lambda_1, \ldots, \lambda_N)$, $C$, $\varepsilon$, $\gamma$ and tol
Output: The weights of the combined kernel function $\bar{D}$
1: $D^{(0)} = (1/S, \ldots, 1/S)$
2: $H^{(1)} = D^{(0)}$
3: for $t = 1, 2, \ldots$ do
4:   Calculate $M(H^{(t)})$ by using WSVR in (1) and $\nabla M(D^{(t-1)})$ according to (14)
5:   Find the smallest nonnegative integer $i_t$ that satisfies the line-search condition and set the step size $L^{(t)}$
6-9: Update $D^{(t)}$ by the proximal operator, then update the auxiliary sequence $k^{(t+1)}$ and the point $H^{(t+1)}$
10:  if $\max(|D^{(t)} - D^{(t-1)}|) < \mathrm{tol}$ then
11:    $\bar{D} = D^{(t)}$, break
12:  end if
13: end for
14: return $\bar{D}$
The algorithm finally obtains a regression function, which can be regarded as a weighted-average ensemble of the optimal separating planes, with $\alpha_t$ as the confidence of the $t$-th optimal separating plane. The final regression function is the result of the weighted votes of the multiple regression functions with a prediction accuracy of more than 50%. Without ignoring the normal samples, the algorithm strengthens the training on the abnormal samples to ensure robustness, that is, the detection of the abnormal samples.

IV. EXPERIMENTAL RESULTS AND ANALYSES
Without causing ambiguity in the context, the prediction model based on Algorithm 3 is still written as Ada-$L_1$MKL-WSVR in this section. In order to test the performance of the proposed Ada-$L_1$MKL-WSVR, we design two groups of experiments and compare it with four regression models (SVR [7], EGWO-SVR [33], MKL-SVR [8], Ada-SVR-R [17]). The first group of experiments consists of eight datasets from the UCI machine learning repository [32], and the COVID-19 epidemic datasets from eight countries are used in the second group of experiments.

Algorithm 3 Ada-$L_1$MKL-WSVR
Input: Training samples, $k^{(1)} = 1$, $C$, $\varepsilon$, $\gamma$ and tol
Output: Final ensemble regression function $F(x)$
1-3: Initialize the weights of the training samples, the weights $D^{(0)}$ and $H^{(1)} = D^{(0)}$
4: for $t = 1, 2, \ldots$ do
5:   Set the distribution of the weights of the training samples
6:   Call MKL-SVR, providing it with the distribution $\lambda_i^t$, and obtain a regression function $f_t(x)$
7:   Calculate the weighted classification-type loss $\mathrm{error}_t$ of $f_t(x)$
8:   if $\mathrm{error}_t > \frac{1}{2}$ then
9:     $\alpha_t = 0$
10:  else
11:    $\alpha_t = \frac{1}{2}\ln\frac{1-\mathrm{error}_t}{\mathrm{error}_t}$
12:  end if
13:  Update the weights of the training samples
14:  Calculate $M(H^{(t)})$ by using WSVR in (1) and $\nabla M(D^{(t-1)})$ according to (14)
15:  Find the smallest nonnegative integer $i_t$ such that $Z(P_{\bar{L}^{(t)}}(H^{(t)})) \le Q_{\bar{L}^{(t)}}(P_{\bar{L}^{(t)}}(H^{(t)}), H^{(t)})$, where $\bar{L}^{(t)}$ is the trial step size
16-19: Update $D^{(t)}$ by the proximal operator, then update the auxiliary sequence $k^{(t+1)}$ and the point $H^{(t+1)}$
20:  if $\max(|D^{(t)} - D^{(t-1)}|) < \mathrm{tol}$ then
21:    $\bar{D} = D^{(t)}$, break
22:  end if
23: end for
24: Output the final ensemble regression function $F(x)$

The criteria of mean absolute error (MAE) and root mean square error (RMSE) [34] are employed to validate the effectiveness of the models in this paper.
MAE and RMSE are defined by $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}|F(x_i) - y_i|$ and $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(F(x_i) - y_i)^2}$, respectively. Here $N$ is the total number of samples, and $F(x_i)$ and $y_i$ denote the predicted and real values of the $i$-th sample, respectively. MAE is the mean of the absolute errors between the predicted and real values, and RMSE is the square root of the mean squared prediction error, which measures the dispersion of the prediction error. In each group of experiments, the smaller the MAE and RMSE, the better the performance of the model.
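The two criteria are straightforward to compute. The following sketch uses plain Python lists, and the function names are illustrative only.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between real and predicted values."""
    return sum(abs(p - y) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error between real and predicted values."""
    return math.sqrt(sum((p - y) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))
```

Because RMSE squares each error before averaging, a single large deviation raises it more than it raises MAE, which is why the two criteria are usually reported together.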

B. DATA PREPROCESSING AND PARAMETERS SETTINGS
It is well known that SVRs produce better models when the data are normalized, so all data should be normalized or standardized before the prediction. In this paper, we preprocess the raw data in each group of experiments by using min-max normalization, i.e., $x'_{ij} = \frac{x_{ij} - \min_i(x_{ij})}{\max_i(x_{ij}) - \min_i(x_{ij})}$, where $x_{ij}$ represents the $j$-th value of the $i$-th attribute, and $\max_i(x_{ij})$ and $\min_i(x_{ij})$ represent the maximum and minimum values of the $i$-th attribute, respectively.
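A minimal sketch of min-max normalization for a single attribute is given below. The handling of a constant attribute (mapped to zeros to avoid division by zero) is an assumption of this sketch and is not specified in the paper.

```python
def min_max_normalize(values):
    """Rescale one attribute to [0, 1] using its own minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant attribute carries no information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

In an experiment with separate training and testing sets, the minimum and maximum would normally be taken from the training set and reused for the other sets.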
It is well known that a reasonable selection of the kernel function and its parameters can improve the prediction ability. The commonly used kernel functions include the Gaussian kernel function and the polynomial kernel function [35], where $\sigma$ represents the width of the Gaussian kernel function, which controls the complexity of the distribution of the feature subspace, and $d$ represents the order of the polynomial kernel function. In this paper, Gaussian and polynomial kernel functions are combined to form the multi-kernel function. The multi-kernel function is composed of 13 different basic kernel functions, including 10 Gaussian kernel functions and 3 polynomial kernel functions with different parameters. In addition, we use the grid search approach to adjust the hyper parameters, and the values of all hyper parameter settings are shown in Table 1.
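The construction of the 13 basic kernels can be sketched as follows. The exact parameterizations are assumptions, since several equivalent conventions exist: the Gaussian kernel is written here as $\exp(-\|x - z\|^2 / (2\sigma^2))$ and the polynomial kernel as $(x^{\top}z + 1)^d$. The width and order values are hypothetical placeholders, not the settings of Table 1.

```python
import math


def gaussian_kernel(x, z, sigma):
    """Assumed form: exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def polynomial_kernel(x, z, degree, coef0=1.0):
    """Assumed form: (<x, z> + coef0) ** degree."""
    return (sum(a * b for a, b in zip(x, z)) + coef0) ** degree


def build_base_kernels(sigmas, degrees):
    """Return one callable k(x, z) per basic kernel."""
    kernels = [lambda x, z, s=s: gaussian_kernel(x, z, s) for s in sigmas]
    kernels += [lambda x, z, d=d: polynomial_kernel(x, z, d) for d in degrees]
    return kernels


# Hypothetical grid: 10 Gaussian widths and 3 polynomial orders.
base_kernels = build_base_kernels([2.0 ** k for k in range(-4, 6)], [1, 2, 3])
```

Each callable can then be evaluated on all pairs of training samples to produce one Gram matrix per basic kernel.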

C. EXPERIMENT I
In this subsection, we test the accuracy of $L_1$MKL-SVR and Ada-$L_1$MKL-WSVR based on the first group of experiments. Each dataset is divided into the training set (60%), the validation set (20%) and the testing set (20%) by using the train_test_split() function in Python 3.7.2. The training set is used to train the models, the validation set is used to adjust the hyper parameters, and the testing set is used to assess the generalization ability of the models. All the experiments are repeated 10 times to demonstrate the robustness of the model (the parameter ''random_state'' in the train_test_split() function is set to the integers from 0 to 9).
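The 60/20/20 protocol with ten seeds can be sketched with the standard library alone. This is a stand-in that shuffles indices, not the train_test_split() call used in the paper, so the resulting partitions will differ from the original ones. The function name is hypothetical.

```python
import random


def split_indices(n_samples, seed, train_frac=0.6, val_frac=0.2):
    """Shuffle sample indices with a seed and cut them into three parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(round(train_frac * n_samples))
    n_val = int(round(val_frac * n_samples))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder forms the testing set
    return train, val, test


# Ten repetitions with seeds 0..9, mirroring the repeated-run protocol.
splits = [split_indices(10, seed) for seed in range(10)]
```

Reporting the mean and spread of MAE and RMSE over such repeated splits is what gives entries of the form "value ± deviation".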
The descriptive information of these datasets is presented in Table 2, and the experimental results of Experiment I are shown in Table 3.
As the results in Table 3 demonstrate, Ada-$L_1$MKL-WSVR achieves the best prediction on most datasets compared with the rest of the models. In particular, the performance of Ada-$L_1$MKL-WSVR is superior to that of the other models on the Pyrim data, where the MAE and RMSE are 0.059 ± 0.005 and 0.079 ± 0.019, which are reduced by 20.38% and 11.28% on average, respectively. Taking the Triazines data as an example, Ada-$L_1$MKL-WSVR has the best regression effect with an MAE and RMSE of 0.097 ± 0.012 and 0.130 ± 0.020, respectively, down 0.49% and 3.64% from the second-best model, i.e., $L_1$MKL-SVR, while the regression effect of SVR is the worst. On the Boston-housing and Forestfires data, the MAE of Ada-$L_1$MKL-WSVR is slightly larger than that of $L_1$MKL-SVR, while the variance of its MAE is smaller, and both models perform better than SVR and Ada-SVR-R. On the Wine Quality data, Ada-$L_1$MKL-WSVR and $L_1$MKL-SVR obtain the best MAE and RMSE, respectively, with little difference between them, and both are significantly better than the other models.
In addition, by analyzing the experimental results of SVR, EGWO-SVR and Ada-SVR-R, we can see that the integrated SVRs are superior to SVR. This is because Ada-SVR-R trains many times by changing the weighted distribution of the training samples, which achieves an effect similar to multi-kernel learning and improves the overall performance. In general, the two proposed models have smaller MAEs and RMSEs than the other models, which shows that $L_1$MKL-SVR can adaptively select the optimal combined kernel function and its parameters. Due to the advantages of AdaBoost, Ada-$L_1$MKL-WSVR can effectively adjust the weights of the training samples and integrate multiple weak regressors. On datasets with abnormal samples, it obtains more robust regression performance than $L_1$MKL-SVR.
In terms of time complexity, SVR is the most efficient model and EGWO-SVR is the least efficient. Compared with SVR, EGWO-SVR has higher prediction accuracy, but it requires repeated iterative updates, so its time complexity is high. In addition, our model has higher time complexity than the majority of the comparative models. The time complexity is negatively correlated with the regularization parameter $\gamma$: the smaller the regularization parameter, the higher the time complexity, and the larger the regularization parameter, the lower the time complexity. This is a shortcoming of our model.

D. EXPERIMENT II
In this subsection, we use the COVID-19 epidemic dataset of eight countries to further verify the performance of our model. Table 4 lists the cumulative confirmed cases and deaths in these eight countries, as well as the first and last reporting periods [39].
In this subsection, we use Ada-L 1 MKL-WSVR instead of SVR. Furthermore, we take the data before April 17, 2020 as the training samples to predict and analyze the existing cases of the eight countries from April 28, 2020 to May 17, 2020. The NCDTRM [38] and INCDTGM [39] are used as the comparative models in this experiment.
In this group of experiments, the MAEs and RMSEs of Ada-$L_1$MKL-WSVR are reduced by 31.19% and 29.98% on average, respectively. This shows the effectiveness of introducing multi-kernel learning and the ensemble algorithm into our model. On the whole, Ada-$L_1$MKL-WSVR achieves higher regression accuracy in the prediction of the COVID-19 epidemic than the rest of the models.

V. CONCLUSION
In this paper, a new multi-kernel learning ensemble algorithm, i.e., Ada-$L_1$MKL-WSVR, is presented based on the $L_1$ norm and WSVR with AdaBoost. The $L_1$ norm of the weights of the combined kernel function is added to the objective function of WSVR, which can effectively select the optimal combined kernel function and its related parameters. Furthermore, we embed FISTA into AdaBoost, rather than using a simple combination or a single model. In each iteration, the algorithm simultaneously optimizes and updates the weights $D$ of the combined kernel function and the weights $\lambda$ of the training samples. Finally, the multiple weak regressors are integrated into a robust regressor. Numerical experiments are designed to verify the effectiveness and reliability of the algorithm in this paper. However, our algorithm has higher time complexity than some other existing algorithms. In addition, the hyper parameters of our algorithm need to be preset, and the forecasting efficiency changes with the hyper parameters.
For future work, we intend (i) to choose appropriate initial hyper parameters for better prediction results, and (ii) to look into advanced optimization algorithms to improve the computational efficiency of Ada-$L_1$MKL-WSVR.