ℓq-Norm Sample-Adaptive Multiple Kernel Learning

Existing multiple kernel learning (MKL) algorithms indiscriminately apply the same set of kernel combination weights to all samples by pre-specifying a group of base kernels. Sample-adaptive MKL (SAMKL) overcomes this limitation by adaptively switching the base kernels on or off for each sample. However, it is restricted to solving MKL problems with pre-specified kernels. Moreover, the formulation of existing SAMKL reduces to an $\ell_1$-norm MKL, which is not flexible. To allow for robust kernel mixtures that generalize well in practical applications, we extend SAMKL to arbitrary norms and apply it to image classification. In this paper, we formulate a closed-form solution for optimizing the kernel weights based on the equivalence between group-lasso and MKL, and derive an efficient $\ell_q$-norm SAMKL algorithm, where $q \geq 1$ denotes the norm applied to the kernel weights. The cutting plane method is used to solve the resulting margin maximization problem. In addition, we propose a framework for solving MKL problems in image classification. Experimental results on multiple data sets show the promising performance of the proposed solution compared with other competitive methods.


I. INTRODUCTION
Kernel methods [1] have been an attractive topic in machine learning [2]-[6]. They introduce nonlinearity into the decision function by mapping the original features to a higher dimensional space. Owing to their decent computational complexity, high usability and solid mathematical foundation, they have been widely used for classification [7], clustering [8] and regression [9] tasks in numerous applications, such as pattern recognition [10] and object detection [11].
In many practical applications, data has multiple representations or data sources, which usually contain complementary and compatible information. For example, in the classification task of Oxford Flower17 [12], flowers can be represented by different features, such as color, shape, and texture. It is difficult to design a single appropriate kernel function for this task. We have multiple kernel candidates because multiple features are derived from images or because different kernel functions (e.g., polynomial, RBF) are used to measure the similarities between samples for a given feature representation [13]. It is of vital importance to find the optimal combination of these kernels for this task. This is exactly the problem that multiple kernel learning (MKL) addresses.
The associate editor coordinating the review of this manuscript and approving it for publication was Nikhil Padhi.
Recently, MKL has attracted much attention. It not only provides an efficient way to learn an optimal kernel but also builds an elegant framework to integrate complementary information with distinct base kernels extracted from multiple heterogeneous data sources or features. Research on MKL has been flourishing and can be roughly categorized into two directions. The first is to improve the computational efficiency of MKL. Using semi-definite programming or alternating approaches, these methods try to make MKL capable of handling large-scale learning tasks [4]-[6], [14]-[18]. The second is to improve the classification performance of MKL by exploring the possible ways of combining base kernels [19]-[25].
While MKL has been studied extensively, it is restricted to learning a global combination for the whole input space. Owing to the characteristics of the data distribution, the set of kernels that are important for discrimination may vary from sample to sample. Conventional MKL nevertheless lets all input samples share the same kernel weights, ignoring the fact that it may be beneficial to assign different kernel weights to different samples. For instance, the data distributions of different images under a given feature representation (e.g., color) may differ considerably [13]. Therefore, kernels derived from different data distributions may have different effects on different images. Introducing local learning into MKL to obtain a localized, sample-specific combination can thus achieve better performance, and various local MKL methods have been proposed [24], [26]-[31].
Nevertheless, the base kernels may be irregularly corrupted across different samples. For instance, when the input features of samples are contaminated by noise, the kernel combination weights predicted via a parametric model will no longer be accurate. To handle this problem, Liu et al. [3] proposed a sample-adaptive MKL algorithm (SAMKL), in which base kernels are allowed to be adaptively switched on/off with respect to each sample. Latent binary variables are introduced for each base kernel to decide whether a particular kernel should operate on a particular sample or not. The kernel combination weights and the latent variables are optimized alternately under the margin maximization principle.
However, previous MKL methods usually rely on pre-specified kernels to improve classification performance. In this work, we propose a framework that constructs kernels from features we extract ourselves, and we derive an efficient $\ell_q$-norm SAMKL algorithm. The cutting plane method [32] is used to optimize the objective function. Extensive experimental results on multiple data sets exhibit the promising performance of the proposed technique compared with other competitive methods. Fig. 1 illustrates the framework of our work. Given an image data set, we extract features using traditional machine learning methods (HOG [33], SIFT [34], LBP [35], etc.) and deep learning methods [36]. Then, we construct Gaussian kernels on the normalized feature matrices. After that, we run MKL algorithms on these kernels to obtain the predicted labels used for performance evaluation. Precision is used as the metric to evaluate the classification performance in our experiments.
The contributions of this study can be summarized as follows:
(a) An efficient $\ell_q$-norm SAMKL is proposed, which is much more flexible than the original SAMKL.
(b) The cutting plane method is used to solve the margin maximization problem. By applying a trick to the constraints of the objective function, we achieve a computational complexity comparable to that of SAMKL.
(c) Comprehensive experimental results on multiple data sets demonstrate the effectiveness and efficiency of the proposed $\ell_q$-norm SAMKL.
The rest of this paper is organized as follows. Section 2 provides a brief overview of related work. Section 3 presents the proposed optimization methods for SAMKL. Section 4 shows the extensive experimental results. Finally, Section 5 concludes this paper.

II. RELATED WORK
Given a labeled dataset $\{(x_i, y_i)\}_{i=1}^{n}$ and $m$ feature mappings $\phi_p(\cdot)$, $p = 1, \dots, m$, the hyperplane (i.e., discriminant function, linear classifier) $f$ can be written as in Eq. 1 [37], where $\omega_p$ denotes the weights of the original feature space corresponding to the $p$-th feature mapping, $\gamma_p$ is the weight of the classifier induced by the $p$-th feature mapping, and $b$ is the bias term of the classifier. MKL learns the classifier by maximizing the margin between classes, which amounts to solving the quadratic optimization problem of Eq. 2, where $H_p$ represents the feature space corresponding to the $p$-th base kernel, $T$ is the dimensionality of the feature space given by $\phi_p(\cdot)$, $\xi$ denotes the slack variables, and $C$ is a regularization parameter.
According to [38], the problem in Eq. 2 is equivalent to the problem in Eq. 3, where $\gamma_p$ is the combination weight of the $p$-th base kernel and controls the smoothness of the kernel function. The primal optimization problem of Eq. 3 is convex and differentiable [6], and solving it is equivalent to solving the min-max dual problem of Eq. 4 [14], where $\alpha = [\alpha_1, \alpha_2, \cdots, \alpha_n]$ are the Lagrange multipliers, $\mathbf{1}$ is a vector of all ones and $(\alpha \circ y)$ denotes the component-wise multiplication between $\alpha$ and $y$. Note that when $\gamma \in \Theta$ lies in a simplex, i.e., $\Theta = \{\gamma : \sum_{p=1}^{m} \gamma_p = 1, \gamma_p \geq 0, \forall p\}$, the constraint is an $\ell_1$-norm on the kernel weights. Correspondingly, when $\Theta = \{\gamma : \|\gamma\|_q \leq 1, \gamma_p \geq 0, \forall p\}$, the constraint is an $\ell_q$-norm on the kernel weights and the resulting model is called $\ell_q$-MKL [4]. After obtaining the optimal $\alpha$, $b$ and $\gamma$, we get $\omega_p = \sum_{i=1}^{n} \alpha_i y_i \phi_p(x_i)$, and the discriminant function can be formulated as in Eq. 5. The problem in Eq. 4 is usually solved by an alternating optimization strategy, which consists of solving a canonical SVM optimization problem with $\gamma$ fixed and then updating $\gamma$ using the gradient calculated via Eq. 6, with $\alpha$ found in the first step [39]. This MKL framework is called SimpleMKL [6].
Most existing MKL algorithms are restricted to learning a global combination of kernel weights for the pre-specified kernel matrices. That is, all input samples share the same kernel weights, ignoring the fact that samples may have an underlying local structure, which in turn degrades MKL performance. Therefore, it is reasonable to assign different kernel weights to different samples by suppressing kernels that are irrelevant to the learning task and selecting kernels that are beneficial for it. Based on this idea, many localized MKL algorithms have been proposed. Xu et al. [4] discussed the connection between multiple kernel learning and the group-LASSO regularizer and proposed an efficient $\ell_p$-norm MKL algorithm. The algorithm generalizes the formulation of MKL to $\ell_q$-norm MKL by replacing $\sum_{p=1}^{m} \gamma_p \leq 1$ with $\sum_{p=1}^{m} \gamma_p^q \leq 1$, where $q > 0$. This algorithm can be applied to the entire family of $\ell_q$ models, and the kernel weights can be calculated in closed form without employing commercial optimization software.
However, the base kernels may be irregularly corrupted across samples. To address this, Liu et al. [3] proposed a sample-adaptive MKL (SAMKL) algorithm for localized MKL, in which base kernels can be adaptively switched on/off at the example level. The optimization problem of SAMKL is given in Eq. 7, where latent binary variables $h_i = [h_{i1}, \cdots, h_{im}] \in \{1, 0\}^m$ with respect to $x_i$ are introduced to decide whether a particular kernel should operate on a particular point or not. Specifically, $h_{ip} = 1$ means that the $p$-th feature mapping $\phi_p(\cdot)$ is beneficial for the classification of the $i$-th sample $x_i$, while $h_{ip} = 0$ indicates the opposite. The optimization problem of Eq. 7 can be solved by a two-stage alternating optimization: the first stage solves an MKL problem for different subspaces simultaneously with the latent variables fixed, and the second stage obtains new values of the latent variables by running an integer program solver. Note that each iteration here involves costly operations (an MKL solver and an integer program solver) compared with the SVM solvers used in other approaches [26]. As can be seen in Eq. 7, the combination of kernel weights falls into the $\ell_1$-MKL model. Following the analysis above, we improve on this by formulating a closed-form solution for optimizing the kernel weights and deriving an efficient $\ell_q$-norm SAMKL algorithm. In addition, the cutting plane method is used to solve this margin maximization problem, and the computational complexity of our algorithm is equivalent to that of Eq. 7.
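As an illustration of such a closed-form weight update, the sketch below uses the form commonly reported for $\ell_q$-norm MKL under the constraint $\sum_p \gamma_p^q \leq 1$, namely $\gamma_p = \|\omega_p\|^{2/(q+1)} / (\sum_k \|\omega_k\|^{2q/(q+1)})^{1/q}$. This expression is stated here as an assumption about the referenced method rather than quoted from it, and the function name is hypothetical.

```python
def lq_kernel_weights(norms, q):
    """Closed-form kernel weights from per-kernel norms ||w_p|| (assumed form).

    gamma_p = ||w_p||^(2/(q+1)) / (sum_k ||w_k||^(2q/(q+1)))^(1/q),
    which makes sum_p gamma_p^q equal to 1.
    """
    if q <= 0:
        raise ValueError("q must be positive")
    denom = sum(w ** (2.0 * q / (q + 1.0)) for w in norms) ** (1.0 / q)
    if denom == 0.0:
        raise ValueError("at least one norm must be non-zero")
    return [w ** (2.0 / (q + 1.0)) / denom for w in norms]


# For q = 1 the weights are proportional to the norms and sum to one.
print(lq_kernel_weights([1.0, 3.0], q=1))
# For larger q the weights become less sparse but still satisfy the norm constraint.
weights = lq_kernel_weights([1.0, 3.0, 0.5], q=2)
print(weights, sum(g ** 2 for g in weights))
```

For $q = 1$ this reduces to weights proportional to the norms, which tends to give sparse combinations, while larger $q$ spreads the weight more evenly across kernels.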
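The effect of the latent variables on the combined kernel can be sketched as follows. The sketch assumes that kernel $p$ contributes to the pair $(x_i, x_j)$ only when it is switched on for both samples, i.e., the entry is masked by $h_{ip} h_{jp}$. This masking rule and the function name are illustrative assumptions, not a transcription of Eq. 7.

```python
def sample_adaptive_kernel(kernels, gamma, H):
    """Combine base kernels with per-sample on/off switches.

    kernels: list of m square matrices (lists of lists), one per base kernel.
    gamma:   list of m kernel weights.
    H:       n x m binary matrix, H[i][p] = 1 if kernel p is on for sample i.
    Assumed rule: K[i][j] = sum_p gamma[p] * H[i][p] * H[j][p] * kernels[p][i][j].
    """
    n = len(H)
    return [[sum(g * H[i][p] * H[j][p] * K[i][j]
                 for p, (g, K) in enumerate(zip(gamma, kernels)))
             for j in range(n)]
            for i in range(n)]


K1 = [[1.0, 1.0], [1.0, 1.0]]
K2 = [[2.0, 2.0], [2.0, 2.0]]
H = [[1, 1],   # sample 0 uses both kernels
     [1, 0]]   # sample 1 has the second kernel switched off
print(sample_adaptive_kernel([K1, K2], [0.5, 0.5], H))
```

With all entries of `H` set to 1 the function reduces to the usual global combination $\sum_p \gamma_p K_p$.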

III. SAMPLE-ADAPTIVE MULTIPLE KERNEL LEARNING
This section introduces the proposed $\ell_q$-norm SAMKL problem. First, the problem formulation of $\ell_q$-norm SAMKL is given. Second, cutting plane based methods are used to optimize the objective function of the proposed problem. A discussion of our work is then provided.

A. PROBLEM FORMULATION
For MKL problems with latent variables, we want to learn a prediction rule built on a joint feature mapping defined on data $X$, labels $Y$ and latent variables $H$. In the objective of latent MKL (Eq. (9)), $h_0$ is a binary vector with all bits set to 1, indicating that all feature mappings are beneficial for the classification of all samples, and $m_0$ is a pre-specified parameter controlling the deviation of each $h_i$ from $h_0$.
We generalize the MKL formulation to arbitrary $\ell_q$-norms by regularizing over the kernel coefficients, so that the optimization problem of Eq. (9) can be rewritten in an equivalent functional form. We use a classical Lagrangian approach [4], [39]-[41] to obtain $\gamma$. Setting the partial derivatives of the Lagrangian of the primal w.r.t. $\gamma$ to zero gives the optimality condition on $\gamma$, and at optimality the KKT conditions (a)-(c) are satisfied. According to (c), for all $p$ either $\gamma_p = 0$ and thus $\omega_p = 0$, or $t_p = 0$. At optimality, we therefore have $t_p = 0$ by the KKT conditions. Combining these conditions with (a) yields a closed-form update for $\gamma_p$. We optimize the upper bound of the problem in Eq. (9). According to the representer theorem [42], the solution can be expressed through kernel matrices $\tilde{K}_p$, where $\tilde{K}_p$ is calculated via kron($K_Y$, $K_p$) and $K_Y$ is a similarity matrix defined on the label set $Y$. The parameter $\alpha$ is defined as $[\alpha_{11} \cdots \alpha_{n1}, \cdots, \alpha_{1c} \cdots \alpha_{nc}] \in \mathbb{R}^{n \times c}$, and the vectors $a_p$ are built from the kernel matrices. Combining Eq. (19) with Eq. (16) and Eq. (17), we obtain the final optimization problem over $\alpha$, $H$ and $\gamma \in \Theta$.
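The role of $m_0$ can be made concrete with a small sketch. It assumes that the deviation of $h_i$ from the all-ones vector $h_0$ is measured by the number of switched-off kernels (the Hamming distance, which coincides with the $\ell_1$ norm for binary vectors); the norm is not specified in the surviving text, so this choice and the function names are assumptions.

```python
from itertools import combinations


def is_feasible(h, m0):
    """Check that h is binary and has at most m0 kernels switched off."""
    return all(bit in (0, 1) for bit in h) and sum(1 - bit for bit in h) <= m0


def feasible_latent_vectors(m, m0):
    """Enumerate all binary vectors of length m with at most m0 zeros."""
    vectors = []
    for k in range(min(m0, m) + 1):
        for zero_positions in combinations(range(m), k):
            h = [1] * m
            for p in zero_positions:
                h[p] = 0
            vectors.append(h)
    return vectors


# With m = 4 base kernels and m0 = 1, a sample may switch off at most one kernel.
for h in feasible_latent_vectors(4, 1):
    print(h)
```

The number of candidates grows combinatorially with $m_0$, which is one reason the latent variables are typically updated by an integer program or a search restricted to small $m_0$.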
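The Kronecker construction kron($K_Y$, $K_p$) can be sketched in a few lines. The example assumes, purely for illustration, that the label similarity matrix $K_Y$ is the identity over $c$ classes; the paper's actual choice of $K_Y$ is not recoverable from this section. The block layout follows the usual Kronecker convention, in which block $(i, j)$ of the result equals $K_Y[i][j] \cdot K_p$.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of lists."""
    rows_b, cols_b = len(B), len(B[0])
    result = [[0] * (len(A[0]) * cols_b) for _ in range(len(A) * rows_b)]
    for i, row_a in enumerate(A):
        for j, a in enumerate(row_a):
            for k in range(rows_b):
                for l in range(cols_b):
                    result[i * rows_b + k][j * cols_b + l] = a * B[k][l]
    return result


# Assumed label similarity: identity over c = 2 classes.
K_Y = [[1, 0], [0, 1]]
# A base kernel on n = 2 samples.
K_p = [[2, 1], [1, 2]]
for row in kron(K_Y, K_p):
    print(row)
```

With an identity $K_Y$ the result is block diagonal, so each class sees its own copy of the base kernel; a non-diagonal $K_Y$ would couple the classes.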

B. OPTIMIZATION
Inspired by [32], we solve the $\ell_q$-norm SAMKL problem by the cutting plane method. In this section, we first use an "n-slack" formulation of the optimization problem. Two different ways of using a hinge loss to convexly upper bound the loss are proposed in [43], namely "margin-rescaling" and "slack-rescaling". Margin-rescaling is used in this section. Combining Eq. (23), Eq. (24) and Eq. (25), we obtain the optimization problem of Eq. (26) over $\alpha$, $\gamma$ and $\xi \geq 0$, where $\xi_i$ is shared among constraints from the same sample.
$\Delta(y_i, \bar{y}_i)$ is a function that quantifies the loss associated with predicting $\bar{y}_i$ when $y_i$ is the ground truth. The ground-truth labels are not excluded from the constraints because they correspond to non-negativity constraints on the slack variables $\xi_i$. Moreover, $\xi_i$ is an upper bound on the empirical risk on the training sample $S = \{(x_1, y_1), \cdots, (x_n, y_n)\}$ [32], [44].
Algorithm 1 Cutting Plane for SAMKL With Margin-Rescaling via the n-Slack Formulation
Input: $C$, $\varepsilon$, $S$.
Output: $(\alpha, \gamma, \xi)$.
The optimization problem in Eq. (26) can equivalently be written in the 1-slack form of Eq. 27, whose constraints range over all $((\bar{y}_1, h_1), \cdots, (\bar{y}_n, h_n)) \in \mathcal{Y}^n \times \mathcal{H}^n$. In this formulation, a single slack variable $\xi$ is shared across all constraints, with one constraint for each possible combination of labels and hidden variables. The optimization problem in Eq. 27 is a non-linear integer program, which can be solved via quadratic programming. We give the cutting-plane algorithm with margin-rescaling via the 1-slack formulation in Alg. 2.
1: $W \leftarrow \emptyset$
2: repeat
where $\tau_p = \gamma_p^2 \alpha^\top K_p \alpha$
10: end for
11: $W \leftarrow W \cup \{((\hat{y}_1, h_1), \cdots, (\hat{y}_n, h_n))\}$
12: until no constraint is violated by more than $\varepsilon$
Alg. 2 iteratively constructs a working set $W$ of constraints. In each iteration, the algorithm computes the solution over $W$ (Line 3), finds the most violated constraint (Lines 4-7) and adds it to the working set. The algorithm stops when no constraint can be found that is violated by more than the desired precision $\varepsilon$ (Line 12).
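The working-set loop described above can be illustrated on a deliberately small problem. The sketch below is not the SAMKL solver: it minimizes $\frac{1}{2} w^2 + C \xi$ over a scalar $w$ subject to $\xi \geq a_k - b_k w$ for a finite list of hypothetical constraints $(a_k, b_k)$, and it solves each restricted problem by ternary search instead of a quadratic program. Only the loop structure (solve over $W$, find the most violated constraint, add it, stop at precision $\varepsilon$) mirrors the 1-slack procedure.

```python
def solve_restricted(working_set, C, lo=-10.0, hi=10.0, iters=200):
    """Minimise 0.5*w^2 + C*xi over the working set by ternary search on w."""
    def slack(w):
        # Smallest xi >= 0 satisfying every constraint in the working set.
        return max([0.0] + [a - b * w for a, b in working_set])

    def objective(w):
        return 0.5 * w * w + C * slack(w)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) <= objective(m2):
            hi = m2
        else:
            lo = m1
    w = 0.5 * (lo + hi)
    return w, slack(w)


def cutting_plane(constraints, C=1.0, eps=1e-6, max_iter=100):
    """Grow a working set until no constraint is violated by more than eps."""
    working_set = []
    w, xi = 0.0, 0.0
    for _ in range(max_iter):
        w, xi = solve_restricted(working_set, C)
        # Most violated constraint at the current solution.
        a, b = max(constraints, key=lambda c: c[0] - c[1] * w)
        if (a - b * w) - xi <= eps:
            break
        working_set.append((a, b))
    return w, xi, working_set


# Each tuple (a, b) encodes the constraint xi >= a - b * w.
toy_constraints = [(1.0, 2.0), (0.5, -1.0), (2.0, 1.0)]
w, xi, W = cutting_plane(toy_constraints)
print(w, xi, W)
```

On this toy problem only a subset of the constraints ever enters the working set, which is the property that makes the 1-slack formulation attractive when the full constraint set is exponentially large.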

C. DISCUSSION
As mentioned before, the n-slack and the 1-slack formulations of our problem are equivalent. Therefore, the objective functions of Alg. 1 and Alg. 2 are equal, which means that the theoretical results for the two algorithms are consistent. The proof can be found in Theorem 1 of [32]. Unlike the n-slack algorithm Alg. 1, where the number of constraints increases exponentially during the solving process, the 1-slack algorithm Alg. 2 adds only a single constraint in each iteration. The 1-slack algorithm is therefore more efficient than the n-slack algorithm. For this reason, we implement Alg. 2 to validate the efficiency and effectiveness of our work in Section 4.
From Fig. 2(a) we can see that the computation time increases linearly with the number of iterations. It can be seen from Fig. 2(c) that the objective function value of the cutting plane based optimization Alg. 2 (denoted as CP-SAMKL) is monotonic, while that of the alternate coordinate descent-based optimization of Eq. 23 (denoted as ACD-SAMKL) is not. Therefore, the classification performance of CP-SAMKL is much more stable than that of ACD-SAMKL as the number of iterations increases. In addition, Fig. 2(d) shows that CP-SAMKL converges more easily than ACD-SAMKL. For these reasons, the cutting plane based optimization of SAMKL is used in Section 4.

IV. EXPERIMENTAL RESULTS
All of the experiments were carried out on a computer with a 3.6GHz Intel Xeon E5-1620 CPU and 48GB of memory with Matlab R2014a (64bit).

A. DATASETS
A wide range of image datasets used in our experiments is summarized in Table 1. The number of classes per dataset ranges from 2 to 37, the number of samples reaches up to 2,600, and the number of views per dataset ranges from 7 to 14. In addition, two datasets used for protein subcellular localization, psortPos and plant, are given in Table 2. These protein datasets have been widely used by MKL algorithms [37] and are available for download.
Caltech256: It is a collection of 256 object categories containing a total of 30,607 images. These categories are grouped into animate and inanimate objects, with further finer distinctions [45]. The animate objects (69 categories in all) tend to be more cluttered than the inanimate objects and are harder to identify. The air animals among the animate objects are used in our experiments, except for iris and hawksbill-101, which are not air animals. That is, a subset of Caltech256 with a total of 1,032 samples in 9 classes is used in our experiments. These categories are depicted in Fig. 3(b).
Birds200: It is an image dataset with photos of 200 bird species (mostly North American). A total of 882 samples in 15 bird categories are selected in our experiments. These 15 categories are easily confused by the human eye, and we obtain this subset by clustering. These categories are depicted in Fig. 3(a).
STL-10: It is an image recognition dataset for developing unsupervised feature learning, deep learning and self-taught learning algorithms. A total of 2,600 samples from this dataset are selected, with 1,300 samples in each of the two classes (dog and cat).
Fig. 2: Comparison between the proposed CP-SAMKL and ACD-SAMKL on the 15birds_confuse_hybrid dataset. The regularization parameter $C$ is set to $10^8$ and 1 for CP-SAMKL and ACD-SAMKL, respectively. $m_0$, $\varepsilon$ and the maximum number of iterations are set to 2, $10^{-6}$ and 500 for these algorithms.
Cifar-100: This dataset has 100 classes containing 600 images each. The 100 classes in CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). A total of 1,100 samples from this dataset are randomly selected in a balanced manner, with 100 samples in each of 11 classes. The 11 classes are easily confused by the human eye. They are aquarium fish, crocodile, dolphin, flatfish, otter, ray, seal, shark, trout, turtle, and whale, and are depicted in Fig. 3(c).
Vgg Pets: It is a collection of pet images covering 37 different breeds of cats and dogs, with roughly 200 images for each class. A total of 1,480 samples from this dataset are selected, with 40 samples in each class.

B. BASELINES
We compare the proposed algorithms with state-of-the-art MKL algorithms.
UMKL: It is a uniformly weighted MKL algorithm. We implement it based on the LIBSVM package.
SimpleMKL: [6] It is a well-known baseline based on the max-margin principle, and its Matlab implementation is publicly available. Its formulation of the MKL problem results in a smooth and convex optimization problem, which is equivalent to other MKL formulations available in the literature. The main added value of the smoothness of the new objective function is that descent methods become practical and efficient means to solve the optimization problem that wraps a single kernel solver. It provides optimality conditions and analyzes convergence and computational complexity issues for binary classification.
$\ell_q$-MKL: [4] It is an efficient algorithm for multiple kernel learning based on the connection between MKL and the group-lasso regularizer. It calculates the kernel weights in closed form, which alleviates the dependency of previous algorithms on complicated or commercial optimization software. It is a general max-margin MKL framework with an $\ell_q$-norm constraint on the kernel weights. We consider $q = 1, 2, 4$ and use its Matlab implementation.
SAMKL: [3] In this algorithm, the base kernels are allowed to be adaptively switched on/off with respect to each sample. A latent binary variable is assigned to each base kernel when it is applied to a sample. The kernel combination weights and the latent variables are jointly optimized via the margin maximization principle.

C. RESULT ANALYSIS
Following [37], the F1-score is used to measure classification performance on the psortPos data set, while the Matthews correlation coefficient (MCC) is used for the plant data set. The results of SAMKL are taken from the original paper [3], while the others are obtained by running the released code ourselves. For these protein data sets, we randomly split the data into 20 groups, each with a 50%:50% partition into training and test sets. For our proposed algorithm, $C$ is chosen from $\{10^3, 10^4, \dots, 10^{12}\}$ by five-fold cross-validation and $m_0$ is chosen adaptively during the optimization. As seen in Table 3, our proposed algorithm achieves superior performance to the baselines on the protein data sets.
Precision is used to measure classification performance on the image datasets used in our experiments. Table 4 shows the classification results of the proposed algorithm and the baselines on each data set. Each cell reports the mean precision and standard deviation, and boldface marks the best result. From this table, we can see that the image datasets using only normal features achieve inferior classification performance compared with those using deep features. It can also be seen that all the algorithms achieve excellent performance on the datasets using deep features, and that the performance achieved by the uniformly weighted MKL is comparable to that of SimpleMKL and $\ell_q$-norm MKL. Our proposed algorithm nevertheless further improves the classification performance over the baselines. Table 5 shows the classification results of the proposed algorithm and the baselines on data sets with rub kernels. These rub kernels are generated by setting a ratio of the samples of several views to 0: 30% of the samples are selected randomly, and their values in a randomly selected 50% of the views are set to 0. From this table, we can see that the proposed algorithm again improves the classification performance over the baselines.
The learned latent variable $h$ is shown in Fig. 4. The $h$ on each classification task is shown as an $n \times m$ matrix, where $n$ and $m$ are the number of training samples and base kernels, respectively. As can be seen, the active latent variables indicating "1" are in blue while the others indicating "0" are in red. The blue color indicates those latent variables which switch off the base kernels whose weights are nonzero. As shown, $h$ switches the base kernels on/off differently across training samples. Due to the constraint $\|h_i - h_0\| \leq m_0, \forall i$, each row of these matrices has a fixed number of "0"s, namely 6 and 10 for 15birds_hybird and STL_dogcat, respectively. It can also be seen that the blue area is on the left side of Fig. 4(a) and Fig. 4(b), which means that most of the kernels extracted from the normal features are switched off. In other words, combining the kernels extracted using deep features yields superior classification performance, which is consistent with Table 4. In addition, our proposed algorithm achieves comparable results on the data sets using hybrid features. These experiments preliminarily demonstrate the effectiveness and the properties of the proposed $\ell_q$-norm SAMKL.
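For reference, the two metrics can be computed from confusion counts as in the following sketch. It covers the binary case only; how the scores are aggregated over classes for the multi-class protein data sets is not stated here, so any averaging scheme would be an additional assumption.

```python
import math


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2.0 * tp / denom if denom else 0.0


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


print(f1_score(tp=8, fp=2, fn=2))
print(matthews_corrcoef(tp=5, tn=5, fp=0, fn=0))
```

Unlike the F1-score, MCC takes true negatives into account and ranges from -1 to 1, which makes it more informative on imbalanced classes.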
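One possible reading of the rub kernel construction is sketched below. It assumes that a single random subset of 30% of the samples and a single random subset of 50% of the views are drawn, and that "setting values to 0" means zeroing the corresponding rows and columns of the kernel matrices of the selected views. The text does not say whether the views are drawn once or per sample, so these choices, the function name and the seed parameter are all assumptions.

```python
import random


def corrupt_kernels(kernels, sample_ratio=0.3, view_ratio=0.5, seed=0):
    """Zero the rows/columns of randomly chosen samples in randomly chosen views."""
    rng = random.Random(seed)
    m, n = len(kernels), len(kernels[0])
    samples = rng.sample(range(n), int(round(sample_ratio * n)))
    views = rng.sample(range(m), int(round(view_ratio * m)))
    corrupted = [[row[:] for row in K] for K in kernels]  # copy, leave input intact
    for p in views:
        for i in samples:
            for j in range(n):
                corrupted[p][i][j] = 0.0
                corrupted[p][j][i] = 0.0
    return corrupted, sorted(samples), sorted(views)


# Four views, ten samples, all-ones kernels for easy inspection.
base = [[[1.0] * 10 for _ in range(10)] for _ in range(4)]
rub, chosen_samples, chosen_views = corrupt_kernels(base)
print(chosen_samples, chosen_views)
```

Zeroing both rows and columns keeps every corrupted kernel matrix symmetric, so it can still be passed to a kernel-based solver.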

D. PARAMETER SELECTION
For the image data sets using only normal features or only deep features, $m_0$ for our proposed algorithm is chosen from $\{0, 1, 2, 3, 4\}$ by 5-fold cross-validation. For the data sets using hybrid features, $m_0$ is selected from $\{0, 2, 4, 6, 8\}$. The penalty parameter $C$ for our proposed 1-slack CP-SAMKL is set to a fixed value of $10^8$. Each base kernel matrix is normalized to have unit trace.
We perform 5-fold cross-validation on the training data sets to select the regularization parameter $C \in \{10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}, 10^{4}\}$ for UMKL, SimpleMKL and $\ell_q$-norm MKL. Fig. 5 shows the effect of the number of iterations on the first split of the 15 birds data set using hybrid features. In this setting, the regularization parameter $C$ and the number of selected channels $m_0$ are fixed to $10^8$ and 2, respectively, for convenience. As Fig. 5(a) shows, the classification precision increases with the number of iterations of the proposed algorithm, and the performance becomes relatively stable once the number of iterations is large. It can be seen from Fig. 5(b) that the objective function value increases monotonically with the number of iterations.
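The kernel preparation can be sketched as follows: a Gaussian kernel is built on a feature matrix and then rescaled to unit trace. The bandwidth parameter `sigma` and the parameterization $\exp(-\|x_i - x_j\|^2 / (2\sigma^2))$ are assumptions, since the bandwidth selection is not described in this section.

```python
import math


def gaussian_kernel(X, sigma):
    """K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)) for rows x_i of X."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            K[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return K


def normalize_unit_trace(K):
    """Rescale a kernel matrix so that its trace equals one."""
    trace = sum(K[i][i] for i in range(len(K)))
    if trace == 0.0:
        raise ValueError("kernel matrix has zero trace")
    return [[value / trace for value in row] for row in K]


features = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = normalize_unit_trace(gaussian_kernel(features, sigma=1.0))
print(sum(K[i][i] for i in range(len(K))))
```

Unit-trace normalization puts all base kernels on a comparable scale, so that the learned weights $\gamma_p$ reflect relevance rather than differences in kernel magnitude.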

V. CONCLUSION
This work proposes an efficient $\ell_q$-norm SAMKL algorithm which jointly performs MKL and infers the base kernel subsets that are useful for the classification of each sample.
By allowing each sample to adaptively switch each base kernel on or off, $\ell_q$-norm SAMKL achieves clear improvement over comparable MKL algorithms in the recent literature. In this paper, we solve the optimization problem using cutting plane methods and construct datasets using mainstream machine learning methods and deep learning methods. Extensive experiments demonstrate the effectiveness of the proposed algorithm. Further improving the classification performance of the proposed SAMKL is left for future work.