Multiple Kernel SVM Based on Two-Stage Learning

In this paper we introduce the idea of two-stage learning for the multiple kernel SVM (MKSVM) and present a new MKSVM algorithm based on two-stage learning (MKSVM-TSL). The first stage is the pre-learning, whose aim is to gather information about the data so that the samples "important" for classification can be generated in the formal learning stage; these samples form a uniformly ergodic Markov chain (u.e.M.c.). To study the proposed MKSVM-TSL algorithm comprehensively, we estimate the generalization bound of MKSVM based on u.e.M.c. samples and obtain its fast learning rate. To demonstrate the performance of the proposed MKSVM-TSL algorithm, we also carry out numerical experiments on various publicly available datasets. The experimental results show that, compared to three classical multiple kernel learning (MKL) algorithms, the proposed MKSVM-TSL algorithm performs better in three respects: the total time of sampling and training, the accuracy, and the sparsity of the classifiers.


I. INTRODUCTION
Support Vector Machine (SVM) is one of the most widely applied machine learning methods for pattern recognition problems [1]. Besides its good theoretical properties in learning rate and consistency [2]-[4], SVM can solve nonlinear classification problems by introducing kernel functions, an approach known as the kernel method. Although the kernel method has offered plenty of opportunities to solve complicated problems, it has also brought many interdisciplinary challenges in statistics, optimization theory and applications [5]. Choosing the optimal kernel from a set of candidates, together with its parameters, is a central decision, one that usually must be made by a human user relying on prior knowledge of the data. In other words, the classical kernel-based classifiers are based on a single kernel, while in practice an ideal classifier is usually based on a combination of multiple kernels.
Therefore, multiple kernel learning (MKL) has been widely studied. For example, Pavlidis et al. [6] researched gene functional classification from heterogeneous data based on SVM with an unweighted sum of heterogeneous kernels. Ben-Hur and Noble [7] presented unweighted pairwise kernels for improving the performance of predicting protein-protein interactions. Lanckriet et al. [8] introduced a method that learns the kernel matrix with semi-definite programming (SDP) to search for the optimal unrestricted kernel combination weights, and showed that MKL is comparable with the best soft margin SVM with a radial basis function (RBF) kernel. Although SDP with unrestricted and with nonnegative kernel combination weights generated many linear combination algorithms for MKL, its computational complexity is high when the training set is large. The method of convex combination of multiple kernels then appeared in a vast literature aiming to reduce the algorithmic complexity of MKL. For example, Argyriou et al. [9], [10] proposed choosing an appropriate kernel from the convex hull of basic kernels that are continuously parameterized by a compact infinite set, where the chosen kernel minimizes a convex regularization functional. Sonnenburg et al. [11] considered semi-infinite linear programs (SILP) to obtain the optimal convex combination of kernels by iteratively updating the kernel weights and the vector of dual variables.

(The associate editor coordinating the review of this manuscript and approving it for publication was Laurence T. Yang.)
Yu et al. [12] studied semi-infinite programming (SIP) and proved that SIP can be applied well to biomedical data fusion. The results in [12] showed that the least squares SVM (LSSVM) algorithm with SIP has comparable performance to, and is even more efficient than, the conventional MKL method. Unfortunately, this method of convex combination of multiple kernels needs to solve a single kernel SVM many times. To get out of this loop, the method of obtaining the kernel combination weights via kernel similarity measures was introduced by Kandola et al. in [13]. Igel et al. [14] studied gradient-based optimization of kernel-target alignment for sequence kernels and applied it to bacterial gene start detection. Cristianini et al. [15] considered the method of maximizing the kernel alignment. In recent years, the MKL method of iteratively updating kernel weights to obtain the optimal kernel combination has been successfully applied in many fields. For example, Chavaltada et al. [16] proposed a method of automated product categorisation using MKL to improve feature combination in e-commerce. Rahimi and Gönen [17] discriminated early- and late-stage cancers using multiple kernel learning on gene sets, and their proposed algorithm outperformed other algorithms in terms of predictive power. Wilson et al. [18] applied MKL to genomic data mining and prediction. Lauriola et al. [19] enhanced deep neural networks via MKL, and the method proposed in [19] gives an effective way to design the output computation in deep networks.
However, all the MKL methods mentioned above still solve the single kernel SVM one or more times, and recall that the complexity of a single kernel SVM is about O(n^q) [20], where n is the size of the training set and q ∈ [2, 3]. Hence, the algorithmic complexity of MKL, which combines multiple kernels, is very high for large training sets. This implies that although the MKL method has good learning performance, it is usually very time-consuming and even difficult to implement when the scale of the training set is large. This poses a problem: how can we reduce the algorithmic complexity of the MKL method (i.e., multiple kernel SVM) while keeping its good classification accuracy?
To solve this problem, we present the idea of two-stage learning for the multiple kernel SVM (MKSVM) algorithm in this paper. Two-stage learning consists of pre-learning and formal learning. The reasons for introducing two-stage learning are as follows. First, with the advent of the high-tech era, the capacity of data is growing rapidly and the scale of data is getting larger and larger, so the classical MKL methods are usually time-consuming and even difficult to implement for large training sets. Second, the larger the amount of data, the lower its value density, which means that big data contains many noise samples. A large number of machine learning experiments indicate that noise samples not only increase storage requirements but also affect the accuracy and efficiency of learning. In addition, according to statistical learning theory [1], the most ''important'' samples for a classification problem are those close to the boundary between the two classes. Thus the aim of two-stage learning is to generate these ''important'' samples from the given training set in the formal learning stage by making use of the information obtained in the pre-learning stage. To our knowledge, this is the first MKL algorithm based on two-stage learning. Since the Markov samples are drawn ergodically in the formal learning stage, we consider uniformly ergodic Markov chains (u.e.M.c.) in this paper.
In this paper, we first introduce a new MKSVM algorithm based on two-stage learning (MKSVM-TSL) and then we study the generalization bound of MKSVM algorithm based on u.e.M.c. samples and obtain its fast learning rate. We also compare the proposed MKSVM-TSL with three classical MKL algorithms by numerical studies of various publicly available datasets. We highlight some contributions of this paper as follows.
• The generalization bound of MKSVM based on u.e.M.c. samples is obtained and the optimal learning rate is established.
• A new MKSVM algorithm, MKSVM-TSL, is proposed, and its competitive performance is demonstrated by numerical experiments.
The rest of this paper is organized as follows. Section II formulates the classical multiple kernel SVM (MKSVM) algorithm. Section III introduces a new MKSVM algorithm based on two-stage learning and analyzes its algorithmic complexity. Section IV presents the main theoretical analysis of the learning rate and generalization bound of the MKSVM algorithm with u.e.M.c. samples. Section V gives the experimental results comparing MKSVM-TSL with three classical MKL algorithms. Finally, we conclude the paper in Section VI.

II. MULTIPLE KERNEL SVM FORMULATION
Consider a compact metric space (X, d), let Y = {−1, 1} and let ρ be an unknown probability distribution on Z = X × Y, from which the random pair (X, Y) is drawn. The goal of learning is to find a good prediction rule h : X → Y which assigns labels to objects such that, when new objects are given, h forecasts them correctly. The misclassification error of h is defined as the probability of the event {h(X) ≠ Y}, i.e., R(h) = P{h(X) ≠ Y}. In this paper, we assume that h is induced by a Reproducing Kernel Hilbert Space (RKHS). MKSVM is thus given by M different RKHSs {H_p}, p = 1, ..., M, each associated with a Mercer kernel K_p [21]. Let K_p : X × X → R be continuous, symmetric and positive semidefinite, i.e., for any finite set of distinct points {x_1, ..., x_m} ⊂ X, the matrix (K_p(x_i, x_j))_{i,j=1}^m is positive semidefinite. Such a function is called a Mercer kernel. The RKHS H_p associated with the kernel K_p is defined to be the closure of the linear span of the set of functions {K_{p,x} = K_p(x, ·) : x ∈ X} with the inner product <·, ·>_{H_p} = <·, ·>_{K_p} satisfying the reproducing property <K_{p,x}, f>_{K_p} = f(x). Let C(X) be the space of continuous functions on X equipped with the norm ||f||_∞ = sup_{x∈X} |f(x)|, and let κ = sup_{x∈X} sup_{1≤p≤M} (K_p(x, x))^{1/2}; the reproducing property then tells us that ||f_p||_∞ ≤ κ||f_p||_{K_p} for all f_p ∈ H_p.

Given a sample set z = {z_i = (x_i, y_i)}_{i=1}^n, the original optimization problem of MKSVM can be stated as

  min  (1/2) Σ_{p=1}^M (1/d_p)||f_p||²_{K_p} + C Σ_{i=1}^n ξ_i
  s.t.  y_i (Σ_{p=1}^M f_p(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., n,        (1)

where C is a positive trade-off parameter, the ξ_i are slack variables and the d_p are kernel combination weights [22]. In this paper, we set d_p = 1/M. With the hinge loss, the corresponding expected risk and empirical risk are defined as

  E(f) = ∫_Z max{0, 1 − yf(x)} dρ,   E_z(f) = (1/n) Σ_{i=1}^n max{0, 1 − y_i f(x_i)},

and the MKSVM problem (1) can be rewritten in the regularized form

  f_z = arg min_{f = Σ_p f_p, f_p ∈ H_p}  E_z(f) + λ Σ_{p=1}^M ||f_p||²_{K_p},        (2)

where Σ_p ||f_p||²_{K_p} is the regularization term and λ > 0 is the regularization parameter [23]-[24]. The corresponding MKSVM classifier is defined as sign(f_z), where the sign function is defined as sign(f(x)) = 1 for f(x) ≥ 0 and sign(f(x)) = −1 for f(x) < 0.
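As an illustration of the uniform-weight setting d_p = 1/M used in this paper, the combined kernel (1/M) Σ_p K_p can be passed to any SVM solver that accepts a precomputed Gram matrix. The following sketch is illustrative only (not the implementation used in our experiments): it uses scikit-learn's SVC, a small synthetic dataset, and an assumed mix of one linear and three RBF kernels.

```python
# Sketch: MKSVM with fixed uniform weights d_p = 1/M via a precomputed
# combined kernel. Synthetic data and kernel choices are illustrative only.
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(A, B, sigma):
    # K(a, b) = exp(-||a - b||^2 / (2*sigma))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma))

def combined_kernel(A, B, sigmas=(0.5, 1.0, 2.0)):
    # Uniform combination K = (1/M) * sum_p K_p: one linear + three RBF kernels
    Ks = [A @ B.T] + [rbf_kernel(A, B, s) for s in sigmas]
    return sum(Ks) / len(Ks)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] >= 0, 1, -1)   # linearly separable labels

clf = SVC(C=1.0, kernel="precomputed")
clf.fit(combined_kernel(X, X), y)
train_acc = (clf.predict(combined_kernel(X, X)) == y).mean()
```

At prediction time, the cross-kernel between test and training points plays the role of the Gram matrix, i.e., `clf.predict(combined_kernel(X_test, X))`.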

III. ALGORITHM AND COMPLEXITY ANALYSIS
In this section, we present the MKSVM algorithm based on two-stage learning (MKSVM-TSL) and then we analyze the complexity of the MKSVM-TSL algorithm.
A. THE MKSVM-TSL ALGORITHM
Let n be the size of the training sample, drawn randomly from the original training set D_train. In Algorithm 1, N_1 and kN_2 are the sizes of the pre-learning sample and the formal-learning sample, respectively, where k is a positive integer. For simplicity, all the experimental results presented in this paper are based on N_1 = N_2 = n/(k + 1), q = 1.2 and n_1 = 30, where k = 1, 3; more detailed descriptions of q and n_1 can be found in [27].
Algorithm 1 (MKSVM-TSL) consists of two stages:
(Pre-learning) Draw randomly N_1 samples S_r from D_train; the preliminary learning model f_0 is then obtained by MKSVM algorithm (2) with these samples S_r.
(Formal learning) Take randomly a sample z from D_train, let it be z*, and set n_2 ← 0; starting from z*, generate the Markov sample S_mar of size kN_2 according to the transition probabilities p, p_1; finally, train on S_mar by MKSVM algorithm (2) and obtain f_z.

Comparing Algorithm 1 with the corresponding algorithm introduced in [28], we find that although the two are similar, the differences are obvious. First, Algorithm 1 is an MKSVM algorithm while the algorithm presented in [28] is a single-kernel SVM algorithm; that is, Algorithm 1 extends the algorithm of [28] from a single kernel to the case of multiple kernels. In other words, the algorithm presented in [28] is a special case of Algorithm 1 proposed in this paper. Second, the total number of training samples used by the algorithm of [28] is 2n, so compared to the classical single-kernel SVM trained on n random samples it is time-consuming, whereas in Algorithm 1 the total number of training samples equals that of the classical MKSVM algorithm, since N_1 and N_2 satisfy N_1 + kN_2 = n. In this sense, Algorithm 1 improves on the algorithm introduced in [28].
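The two-stage idea can be sketched as follows. This is a minimal illustration rather than the exact Algorithm 1 (the parameters q, n_1 and the transition probabilities p, p_1 are not modeled here): a preliminary model f_0 is trained on N_1 random samples, and a Metropolis-style acceptance rule, assumed here for illustration, then builds a Markov sample that visits high-hinge-loss (boundary) points more often.

```python
# Illustrative sketch of two-stage learning: pre-learn f0 on a random subset,
# then draw a Markov sample favoring points near the decision boundary.
# The acceptance rule below is an illustrative assumption, not Algorithm 1.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)

n, k = 800, 3
N1 = N2 = n // (k + 1)

# Stage 1 (pre-learning): preliminary model f0 from N1 random samples.
idx = rng.choice(len(X), size=N1, replace=False)
f0 = SVC(kernel="linear").fit(X[idx], y[idx])

def hinge(model, i):
    # Hinge loss of sample (x_i, y_i) under the preliminary model
    return max(0.0, 1.0 - y[i] * float(model.decision_function(X[i:i + 1])[0]))

# Stage 2 (formal learning): Metropolis-style Markov sampling; a candidate
# with larger hinge loss (closer to the boundary) is always accepted.
cur = int(rng.integers(len(X)))
S = []
while len(S) < k * N2:
    cand = int(rng.integers(len(X)))
    accept = min(1.0, np.exp(hinge(f0, cand) - hinge(f0, cur)))
    if rng.random() < accept:
        cur = cand
    S.append(cur)

# Final model trained on the Markov sample S_mar.
f_z = SVC(kernel="linear").fit(X[S], y[S])
```

The total number of trained samples is N_1 + kN_2 = n, matching the accounting in the text.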

B. COMPLEXITY ANALYSIS
From Algorithm 1, we know that N_1 = N_2 = n/(k + 1), and the complexity of a single kernel SVM is about O(n^q) [20], where n is the size of the training set and q ∈ [2, 3]. In this paper, we choose the mean weights as the kernel weights of MKSVM. Therefore, the complexity of the MKSVM algorithm is about O(n^q). In Algorithm 1, however, we divide the n training samples into k + 1 pieces, so the complexity of Algorithm 1 is about O((k + 1)(n/(k + 1))^q). If we assume (k + 1) ≈ n^γ for some γ > 0, it is obvious that the complexity of the proposed MKSVM-TSL algorithm is lower than that of the classical MKSVM.
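The saving can be made concrete with a quick calculation: splitting n samples into k + 1 equal pieces replaces one cost of n^q by k + 1 costs of (n/(k + 1))^q each, a reduction by the factor (k + 1)^(q−1). The numbers below are illustrative, with q chosen inside the range [2, 3] cited above.

```python
# Quick arithmetic check: (k+1)*(n/(k+1))^q = n^q / (k+1)^(q-1), so training
# on k+1 pieces is (k+1)^(q-1) times cheaper than one pass over n samples.
n, q = 10000, 2.5
cost = {k: (k + 1) * (n / (k + 1)) ** q for k in (0, 1, 3, 5, 8)}
# cost[0] is the classical O(n^q) cost; cost shrinks as k increases.
speedup_k1 = cost[0] / cost[1]   # equals 2**(q-1)
```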

IV. THEORETICAL ANALYSIS OF MKSVM-TSL
In this section we estimate the generalization bound of MKSVM-TSL and establish its learning rate.
By the definitions of p, p_1 in Algorithm 1, we can see that p, p_1 are positive. In addition, the size n of the training set is finite. By the theory of Markov chains [29], we can conclude that the sample sequence S_mar = {z_1, z_2, ···} generated in Algorithm 1 is a uniformly ergodic Markov chain (u.e.M.c.). We now recall the relevant definitions. Let (A, S) be a measurable space and define P^m(A|ξ_i) as a set of transition probability measures, where A ∈ S and ξ_i ∈ A. A sequence of random variables {ζ_t}_{t≥1} is a Markov chain if P^m(A|ξ_i) := P{ζ_{m+i} ∈ A | ζ_j, j < i, ζ_i = ξ_i}. Here P^m(A|ξ_i) denotes the probability that the state ζ_{m+i} will belong to the set A after m time steps, starting from the initial state ξ_i at time i. The fact that the transition probability P^m(A|ξ_i) does not depend on the values of ζ_j prior to time i is the Markov property, that is, P^m(A|ξ_i) = P{ζ_{m+i} ∈ A | ζ_i = ξ_i}. In words, ''given the present state, the future and past states are independent''. For two probability measures ν_1, ν_2 on the measurable space (A, S), the total variation distance between ν_1 and ν_2 is defined as ||ν_1 − ν_2||_TV = sup_{A∈S} |ν_1(A) − ν_2(A)|. The definition of a u.e.M.c. can then be stated as follows [30]-[32].
Definition 1: [33] Let π(·) be the stationary distribution of {ξ_t}_{t≥1}. Then {ξ_t}_{t≥1} is said to be a uniformly ergodic Markov chain if, for some 0 < α < ∞ and 0 < γ < 1, ||P^m(·|ξ) − π(·)||_TV ≤ αγ^m for all m ≥ 1 and all initial states ξ.

To estimate the generalization bound of MKSVM-TSL, we should estimate the excess misclassification error R(sign(f_z)) − R(f_c). Here f_c is the Bayes rule, defined as f_c = sign(f_ρ), where f_ρ is the regression function of ρ, i.e., f_ρ(x) = ∫_Y y dρ(y|x), x ∈ X. Zhang [2] established the following relation between the excess misclassification error R(sign(f_z)) − R(f_c) and the excess generalization error E(f_z) − E(f_ρ) for the hinge loss:

  R(sign(f_z)) − R(f_c) ≤ E(f_z) − E(f_ρ).        (3)

To estimate the right-hand side E(f_z) − E(f_ρ) of inequality (3), we need the following definitions and assumptions.

Definition 2: [34] For any ε > 0 and a subset F of a metric space, the covering number N(F, ε) is the minimal number of disks of radius ε needed to cover F. For a given R > 0, define B_R = {f ∈ H_K : ||f||_K ≤ R}, viewed as a class of functions in C(X), and write N(ε) = N(B_1, ε) for the covering number of B_1, where ε > 0.

Assumption 1: [35] We say that the RKHS has polynomial complexity exponent s > 0 if ln N(B_1, ε) ≤ C_s(1/ε)^s for all ε > 0.

Extensive studies of the covering number N(ε) can be found in [36]-[37]. Assumption 1 has been proved by Zhou [34] for the case that the kernel is C^{2n/s} on a subset X of R^n, and a C^∞ kernel is valid for any s > 0.
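Definition 1 can be checked numerically on a toy chain: for a finite-state Markov chain with a strictly positive transition matrix, the total-variation distance to the stationary distribution decays geometrically, with γ equal to the modulus of the second-largest eigenvalue. The two-state chain below is a hypothetical example chosen for illustration.

```python
# Two-state chain illustrating uniform ergodicity (Definition 1): the TV
# distance ||P^m(.|xi) - pi(.)||_TV decays like alpha * gamma^m, gamma = 0.7.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])       # transition matrix
pi = np.array([2 / 3, 1 / 3])    # stationary distribution: pi @ P == pi

def tv_to_stationary(start, m):
    # total variation between P^m(.|start) and pi
    row = np.linalg.matrix_power(P, m)[start]
    return 0.5 * np.abs(row - pi).sum()

tvs = [max(tv_to_stationary(s, m) for s in (0, 1)) for m in range(1, 15)]
decay = [tvs[i + 1] / tvs[i] for i in range(len(tvs) - 1)]  # all ~0.7
```

Here γ = 0.7 is the second eigenvalue of P, so the bound of Definition 1 holds with any α ≥ 1.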
Definition 3: [38] On the space of measurable functions f : X → R, the projection operator π is defined as π(f)(x) = 1 if f(x) > 1, π(f)(x) = f(x) if −1 ≤ f(x) ≤ 1, and π(f)(x) = −1 if f(x) < −1.

Definition 4: The regularization error is defined as D(λ) = inf_{f∈H_K} {E(f) − E(f_ρ) + λ||f||²_K}. We assume that for some 0 < β ≤ 1 and a constant C_β > 0, D(λ) ≤ C_β λ^β for any λ > 0.
Theorem 1: Let f_z be defined by (2) based on u.e.M.c. samples. Then for any η ∈ (0, 1) and R ≥ T, the following inequality holds with probability at least 1 − η, provided that n is sufficiently large. The proof of Theorem 1 is given in Appendix A. By Theorem 1 and inequality (3), we also establish the following learning rate of MKSVM-TSL.
Theorem 2: Under the same assumptions as Theorem 1, we have that for any 0 < η < 1, with probability at least 1 − η, the following inequality holds, where C is a constant independent of n. The proof of Theorem 2 is given in Appendix A. To help interpret Theorem 2, we present the following remarks.
Remark 1: (i) From Theorem 2, we have that when β = 1 and s tends to 0, R(sign(π(f_z))) − R(f_c) is arbitrarily close to O(n^{−1}), which is the same rate as that obtained in [23], [24] and [39] for the case of independent random samples. This means that the learning rate obtained in Theorem 2 is optimal for u.e.M.c. samples. In other words, Theorem 2 extends the previously known results on MKSVM algorithm (2) from independent random samples [23], [24], [39] to u.e.M.c. samples.
(ii) Comparing Theorem 2 with the corresponding result obtained in [28], we can see that Theorem 2 also extends that result from a single kernel to the case of multiple kernels. To our knowledge, Theorem 2 is the first work on this topic.

V. NUMERICAL STUDIES
To demonstrate the performance of MKSVM-TSL, we compare the proposed MKSVM-TSL algorithm with three MKL algorithms: the mean-weighted MKSVM [25], the multiple kernel least squares SVM (MKLSSVM) with semi-infinite programming (SIP) [12], and the ratio-weighted MKSVM (Rat-MKSVM) [40], all based on independent random samples.

A. DATASETS AND PARAMETERS SELECTION
Our experiments are based on 9 publicly available datasets: Image, Census (Census-income), Seismic, Connect4, Isbi4 and TV-News,1 and Cod-rnd, A9a and W8a.2 Table 1 summarizes these datasets. All experiments were run on an Intel 2.80GHz E5-1603 v4 CPU with MATLAB 2018a. All the experiments in this paper are based on the following 9 kernels: a linear kernel K(a, b) = aᵀb, three polynomial kernels K(a, b) = (1 + aᵀb)^d with d ∈ {2, 3, 4}, and five RBF kernels K(a, b) = exp(−||a − b||²/(2σ)), where σ is chosen from the set {0.25, 0.5, 1, 2, 4} and aᵀ denotes the transpose of a. The parameter λ of MKSVM algorithm (2) is chosen by 5-fold cross-validation. Besides, all the datasets are normalized before training and testing.
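For concreteness, the 9 base kernels just described can be generated as follows (Gram matrices only, on synthetic inputs; this is a sketch of the kernel set, not the experiment code):

```python
# The 9 base kernels of Section V-A: one linear, three polynomial (d = 2,3,4)
# and five RBF (sigma = 0.25, 0.5, 1, 2, 4), following the formulas in the text.
import numpy as np

def make_kernels():
    kernels = [lambda A, B: A @ B.T]                       # linear: a'b
    for d in (2, 3, 4):                                    # (1 + a'b)^d
        kernels.append(lambda A, B, d=d: (1.0 + A @ B.T) ** d)
    for s in (0.25, 0.5, 1.0, 2.0, 4.0):                   # exp(-||a-b||^2/(2s))
        def rbf(A, B, s=s):
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
            return np.exp(-d2 / (2.0 * s))
        kernels.append(rbf)
    return kernels

X = np.random.default_rng(2).normal(size=(50, 4))
grams = [k(X, X) for k in make_kernels()]
```

With the mean weights of Section III, these combine into the single Gram matrix `sum(grams) / 9`.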

B. EXPERIMENTAL RESULTS
We compare Algorithm 1 with three MKL algorithms (the classical MKSVM, MKLSSVM and Rat-MKSVM) in the following three aspects: the accuracy, the total sampling and training time, and the sparsity of the classifiers. Our experimental procedure can be stated as follows: (i) We randomly take n training samples from the original training set D_train, obtain the corresponding learning models with the three MKL algorithms on these n training samples, respectively, and test them on the test set D_test. (ii) For Algorithm 1, we first obtain a learning model f_z according to Algorithm 1 and then test it on the same test set D_test.
1 http://archive.ics.uci.edu/ml/
2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
(iii) Procedures (i) and (ii) are repeated 50 times, and we record the average accuracy, the total sampling and training time, and the average number of support vectors of each algorithm, respectively.

1) COMPARISON OF THE ACCURACY
In Tables 2-3, we report the average accuracy and standard deviation of each algorithm for n = 8000 and n = 10000, respectively. From Tables 2-3, it is obvious that for n = 8000 (or n = 10000), the average accuracies of MKSVM-TSL with k = 1, 3 are all better than those of MKSVM, MKLSSVM and Rat-MKSVM, and the standard deviations of MKSVM-TSL with k = 1, 3 are smaller than those of the other methods except for Image with n = 10000. To determine which algorithm performs better on the experimental results shown in Table 2, we apply the Wilcoxon signed-rank test at significance level α = 0.05 [41] in Table 4. From Table 4, we can see that Algorithm 1 with k = 1 performs better than the other three algorithms, and Algorithm 1 with k = 1 performs better than Algorithm 1 with k = 3.
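The Table 4 comparison can be carried out mechanically with scipy's Wilcoxon signed-rank test on paired per-dataset accuracies. The accuracy values below are made-up placeholders for the 9 datasets, not the paper's experimental results.

```python
# Wilcoxon signed-rank test at alpha = 0.05, as used for Tables 4 and 10.
# The paired accuracies are hypothetical placeholders, NOT measured results.
from scipy.stats import wilcoxon

acc_tsl   = [0.95, 0.91, 0.88, 0.93, 0.90, 0.89, 0.92, 0.94, 0.90]
acc_mksvm = [0.93, 0.90, 0.86, 0.91, 0.89, 0.87, 0.90, 0.92, 0.88]

# One-sided test: is MKSVM-TSL's accuracy systematically greater?
stat, p = wilcoxon(acc_tsl, acc_mksvm, alternative="greater")
tsl_wins = p < 0.05
```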
2) COMPARISON OF THE TOTAL SAMPLING AND TRAINING TIME
To further examine the performance of MKSVM-TSL, Tables 5-6 report the total sampling and training time. Tables 5-6 show that the total sampling and training time of MKSVM-TSL with k = 1, 3 is shorter than that of the other three algorithms for n = 8000 (or n = 10000), and the total sampling and training time of MKSVM-TSL with k = 3 is much shorter than that of MKSVM-TSL with k = 1. In particular, the sum of the total sampling and training time over all training data of the proposed MKSVM-TSL algorithm with k = 3 is less than one-twelfth that of MKLSSVM for n = 8000, and even less than one-fifteenth that of MKLSSVM for n = 10000.

3) COMPARISON OF SUPPORT VECTOR NUMBERS
A support vector in SVM corresponds to a nonzero dual variable, and a classifier with fewer support vectors is said to be sparser. Tables 7-8 report the average number of support vectors of the corresponding algorithms over the 50 experimental runs. Since the classifier obtained by MKLSSVM does not yield support vectors in this sense, we do not present the number of support vectors of MKLSSVM in Tables 7-8. From Tables 7-8, we can see that the numbers of support vectors for MKSVM-TSL with k = 1, 3 are obviously smaller than those of MKSVM and Rat-MKSVM, which implies that the classifier obtained by Algorithm 1 is sparser than those of the classical MKSVM and Rat-MKSVM. Moreover, the number of support vectors for MKSVM-TSL with k = 3 is much smaller than that with k = 1.

C. DISCUSSIONS
From the above experimental results, we can see that although Algorithm 1 with k = 1, 3 performs much better than the other three MKL algorithms, the choice of k matters considerably in Algorithm 1. Therefore, we take n = 18000 training samples to explore the performance of MKSVM-TSL with k = 1, 3, 5, 8.

1) COMPARISON OF THE ACCURACY
In Table 9, we use ''MK-TSL-1'', ''MK-TSL-3'', ''MK-TSL-5'' and ''MK-TSL-8'' to denote the (average) accuracy of MKSVM-TSL with k = 1, 3, 5, 8 for n = 18000, respectively. Table 9 shows that there is little difference among the average accuracies of MKSVM-TSL with k = 1, 3, 5, while the average accuracy of MKSVM-TSL with k = 8 is obviously smaller than that of MKSVM-TSL with k = 1, 3, 5. To determine which algorithm performs better on the experimental results shown in Table 9, we apply the Wilcoxon signed-rank test at significance level α = 0.05 [41] in Table 10. From Table 10, we can see that the proposed MKSVM-TSL algorithm with k = 3 has the best performance among MKSVM-TSL with k = 1, 3, 5, 8, and MKSVM-TSL with k = 8 has the worst.

2) COMPARISON OF THE TOTAL SAMPLING AND TRAINING TIME
Table 11 shows that the total sampling and training time of MKSVM-TSL with k = 3 is much shorter than that of MKSVM-TSL with k = 1, and the total sampling and training time of MKSVM-TSL with k = 8 is the shortest among k = 1, 3, 5, 8 except for Isbi4 and TV-News. In particular, the sum of the total sampling and training time over all training data of MKSVM-TSL with k = 5 is less than a quarter of that of MKSVM-TSL with k = 1.
Remark 2: From the experimental results of MKSVM-TSL with k = 1, 3, 5, 8, we conclude that as the size of the training set increases, appropriately increasing the value of k in Algorithm 1 not only costs less time but also achieves better accuracy.

VI. CONCLUSIONS
In this paper, we introduced the idea of two-stage learning for MKSVM to improve the generalization ability of the MKL algorithm. To study the generalization ability of MKSVM-TSL systematically, we analyzed the generalization error of MKSVM with u.e.M.c. samples and established the fast learning rate of the MKSVM algorithm with u.e.M.c. samples. We also compared MKSVM-TSL with three classical MKL algorithms (MKSVM, MKLSSVM and Rat-MKSVM) by numerical studies on 9 publicly available datasets. The numerical studies show that the presented MKSVM-TSL not only achieves better accuracy with less total sampling and training time, but also yields a classifier sparser than those of the classical MKSVM and Rat-MKSVM. Moreover, larger values of k allow the method to handle big data. To our knowledge, these studies of MKSVM-TSL are the first works on this topic.
Several open problems deserve further study along the lines of the present research, for example, improving the learning ability of MKSVM-TSL, establishing the generalization bound of the MKL algorithm for regression estimation, and applying our method to deep neural networks. These problems are under our current investigation.

APPENDIX A
In this section, we give the proof of the main results presented in Section IV.
Proposition 1: Let f_z be defined by (2); then the following inequality holds:

Proof: By inequality (3), we can get
Recall that f_z is defined in (2). By the definition of π, for any function f on Z, the inequality |π(f)(x) − y| ≤ |f(x) − y| always holds. Using the two inequalities above, the proposition follows.
According to Definition 4, S_3 is bounded by O(λ^β) for 0 < β ≤ 1. It remains to estimate S_1 and S_2, for which we introduce the following lemmas.
By Lemma 1, we have the following relative uniform convergence for u.e.M.c. samples; the detailed proof can be found in [28].
Lemma 2: Under the same conditions as Lemma 1, for any ε > 0, we have. According to Assumption 2, we have ||f||_∞ ≤ κ||f||_K ≤ R and |g(z)| ≤ R + T := B for any f ∈ B_R. By Lemma 2, we have that for any ε > 0, Moreover, |g_1 − g_2| ≤ ||f_1 − f_2||_∞ holds for any g_1, g_2 ∈ F_R, and by Definition 2 and Assumption 1,
Recall that f_z is the minimizer of the regularized empirical error on B_R:
Setting η equal to the right-hand side of the above inequality, letting R ≥ T and taking n sufficiently large, we have that for any 0 < η < 1, the following inequality is valid with probability at least 1 − η/2.

Applying Lemma 3, the solution of above equation is given by
Proposition 3: For any 0 < η < 1, the following inequality holds. This completes the proof of Proposition 3.
Proof of Theorem 1: Combining the upper bounds of S_1, S_2 and S_3, we have that for any 0 < η < 1, the following inequality holds.