Multiple Kernel k-Means With Low-Rank Neighborhood Kernel

Multiple kernel clustering algorithms achieve promising performance by exploring the complementary information in the kernel matrices corresponding to each data view. Most existing methods aim to construct a consensus kernel for the subsequent clustering step. However, they ignore that the desired kernel is supposed to reveal the cluster structure among samples and thus to be low rank. As a consequence, the corresponding clustering performance can degrade. To address this issue, we propose a low-rank kernel learning approach for multiple kernel clustering. Specifically, instead of regularizing the consensus kernel with low-rank constraints, we use a re-parameterization scheme for the kernel matrix. Meanwhile, the consensus kernel is located in the neighborhood of the linear combination of base kernels. An alternate optimization strategy is designed to solve the resulting optimization problem. We evaluate the proposed method on 13 benchmark datasets against 9 state-of-the-art algorithms. As demonstrated by the experimental results, our proposed algorithm achieves superior clustering scores against the compared algorithms on the reported popular multiple kernel datasets.


I. INTRODUCTION
Clustering is one of the major research topics regarding semi-supervised and unsupervised learning tasks, which allows the learning models to automatically uncover the patterns and structures of data and categorize these items into different classes accordingly. Owing to the variety of multimedia data sources, multi-view clustering has attracted increasing attention during the past decade. Although single-view algorithms such as [1], [2] can independently obtain several clustering results using distinct features provided by each view, the clustering accuracy is far from satisfactory. Therefore, researchers have made great efforts to design multi-view clustering algorithms to explore feature patterns provided by different views and fuse them to obtain optimal clustering results.
Research on multi-view clustering methods can be classified into four categories [3]. The first category is the co-training approach. It is based on the assumptions that 1) the features provided by each view are self-sufficient for the clustering task and 2) similar feature patterns lead to the same cluster prediction probabilities. Co-training is a collaborative learning approach which usually involves two successive stages: first applying different algorithms to each view, and then joining the separate results together [4]-[8]. The second category is multiple kernel clustering, which employs different kernels to introduce non-linearity into the clustering process [9], [10]. The difficulty of this approach is how to combine the different kernels, which [11]-[13] and [14] address by matrix-induced regularization and late-fusion alignment, respectively. These solutions fuse information either in different ways or at different stages of the training process. The third category is multi-view graph learning, where data items are represented by vertices and their relationships by edges of a graph [15], [16]. Formally, the graph matrix is often constructed via a similarity matrix which quantifies the affinity among different objects [17]. Graph fusion is the fundamental task in multi-view graph clustering; it assumes that each single view reveals a fraction of the latent clustering structure, which can be regarded as the consensus graph of the different views. Xue et al. [18] explore the fusion method and take into account the learnable weights of similar groups. For simplicity, Nie et al. [19] propose a parameter-free approach to automatically learn the weights of multiple kernels.
The fourth category is multi-view subspace clustering, which presumes that high-dimensional data lies near low-dimensional subspaces and can be represented linearly within them. Representative approaches fall into two general types, namely low-rank representation (LRR) [20] and sparse subspace clustering (SSC) [21]. Both approaches follow a similar algorithmic routine: first aligning the representation with the original data, then applying the reconstruction matrix to spectral clustering to acquire the final clustering result. The conformity of the alignment term makes it straightforward to extend to multi-view learning tasks, where alignment operations are optimized synchronously among multiple views [22]-[24]. Multiple kernel algorithms are developed to enlarge the search space of feasible kernel mappings. One representative work in multiple kernel clustering is optimal neighborhood kernel clustering (ONKC) [25], [26], which extends the multiple kernel k-means clustering (MKKM) method [27] to a general formulation by mediating between the optimization procedure and the clustering process. Specifically, ONKC is designed to go beyond MKKM, which merely allocates the optimal kernel as the linear combination of base kernels from different views. In addition to the simple assumption that the ideal kernel can be learned linearly, ONKC introduces an alternative kernel which lies in the neighborhood of the linearly combined kernel, thus acquiring the property of non-linearity. This method expands the search scope of the possible optimal kernel for clustering.
In addition, the injection of the alternative kernel builds a bridge between the cluster indicating matrix and the coefficient vector of the base kernels, because the optimization of the alternative kernel resorts to both the combined kernel and the indicating matrix. Accordingly, the clustering results using the neighborhood kernel as the input representation matrix outperform those of the traditional MKKM method.
However, the only constraint on the neighborhood kernel matrix in ONKC is positive semi-definiteness (PSD), which merely guarantees its validity as a kernel matrix. In [25], the neighborhood kernel combines the two processes (kernel refinement and clustering) into a unified framework; it can depict non-linear structures and capture non-uniformly shaped clusters. Nevertheless, it takes no account of any rank constraint, so the clustering structure described by the kernel is indistinct and bears noise which inevitably brings distractions during the clustering process.
To address this issue, we present an algorithm named multiple kernel k-means with low-rank neighborhood kernel (MKKM-LR) in this paper. Our assumption is that a kernel matrix that is low-rank, with its rank in the vicinity of the number of clusters, can depict the clustering structure and thus serve the clustering task better. Specifically, we first acquire a linear combination of base kernels by assigning equal weights to each kernel. In the vicinity of this combined kernel, we then learn a neighborhood kernel that approximates the linear kernel, imposing the property of non-linearity on the similarity matrix. We further re-parameterize the neighborhood matrix using an equivalent factorized formulation to refine the kernel matrix with an auxiliary low-rank constraint. Subsequently, the neighborhood kernel matrix is applied to traditional MKKM to acquire the cluster representatives, from which we obtain the final clustering results by k-means clustering. We extensively evaluate our method on 13 multiple kernel learning (MKL) benchmark datasets and compare it with nine state-of-the-art algorithms to demonstrate its effectiveness.
Our contribution in this work can be summarized as follows: • We propose a multiple kernel clustering algorithm that searches the neighborhood of the linear combination of base kernels and re-parameterizes the kernel matrix. Hence, the search scope of the optimal kernel is enlarged and the kernel matrix naturally acquires the low-rank property.
• A formulation is proposed under the assumption that a kernel with the low-rank property can represent the clustering structure and thus better serve the clustering task. We design a four-step alternate optimization algorithm with proven convergence to learn the optimal kernel matrix. Theoretical analysis guarantees the low-rank property and reveals a relationship between the optimization process and the clustering process.
• We construct an experimental study on 13 benchmark datasets covering a variety of applications and compare the clustering performance with nine state-of-the-art algorithms. Our method demonstrates superior performance on most datasets with fast convergence.

This paper is organized as follows. Section II briefly outlines the kernel k-means task and the basis of our work. Section III formulates the optimization target and designs the four-step optimization algorithm. In Section IV, we analyze the clustering performance, convergence and parameter sensitivity of our method. Section V concludes our work.

II. RELATED WORK
A. KERNEL k-MEANS CLUSTERING (KKM)
k-means [28] is one of the most popular clustering methods; it clusters data into disjoint partitions through an iteratively convergent procedure and has been widely applied in a variety of applications. However, the algorithm is restricted to linearly separable feature spaces because it is based on the Euclidean distance. To tackle this restriction, methods have been proposed to generalize the k-means algorithm by integrating kernel functions [29], which extend the original input space X to a higher-dimensional, linearly separable feature space H. The kernel mapping cannot be calculated directly as it is implicit. Thus an indirect way is used: calculating the inner product K(x_i, x_j) = φ(x_i) · φ(x_j), as explained below.
Let us denote the collection of n samples as {x_i}_{i=1}^{n}, and denote the feature mapping as φ(x). The data is partitioned into k clusters, with each cluster signified as C_i, and the cluster centroids are denoted as {c_i}_{i=1}^{k}. We can then formulate the kernel k-means objective as:

min_{S ∈ {0,1}^{n×k}} Σ_{i=1}^{n} Σ_{j=1}^{k} S_{ij} ‖φ(x_i) − c_j‖²,  s.t. Σ_{j=1}^{k} S_{ij} = 1,   (1)

where

c_j = (1/n_j) Σ_{i=1}^{n} S_{ij} φ(x_i),  with n_j = Σ_{i=1}^{n} S_{ij},   (2)

is the j-th cluster centroid, since it minimizes the within-cluster distance, and the distance from each data point to its cluster centroid is measured by the Euclidean distance:

‖φ(a) − φ(b)‖² = K(a, a) − 2K(a, b) + K(b, b).   (3)

Note that the distance measure reduces the prerequisite knowledge of the feature mapping φ(a) to the knowledge of the dot product φ(a) · φ(b), thus the distance can be measured by the kernel function K(·, ·). Table 1 shows the kernel functions popularly adopted in practice.
Owing to the fact that the discrete S is difficult to optimize, a typical approach is to use a relaxed version of the discrete matrix and let it take real values. In matrix form, the objective (1) can be written as:

min_{S} Tr(K (I_n − S L^{-1} S^T)),  with L = diag([n_1, ⋯, n_k]),   (4)

Specifically, by defining U = S L^{-1/2} and relaxing U to take real values, the formula (4) can be expressed as:

min_{U ∈ R^{n×k}} Tr(K (I_n − U U^T)),  s.t. U^T U = I_k.   (5)

The optimal U can be obtained by concatenating the k eigenvectors corresponding to the largest k eigenvalues of the kernel matrix K.
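As an illustration, the relaxed kernel k-means solution described above (take the top-k eigenvectors of K, then run ordinary k-means on the rows of U) can be sketched as follows. The function name and the small Lloyd loop with farthest-point initialization are ours, not from the paper:

```python
import numpy as np

def relaxed_kernel_kmeans(K, k, n_iter=100):
    """Relaxed KKM sketch: U = top-k eigenvectors of K, then Lloyd's
    k-means on the rows of U to recover discrete cluster labels."""
    # Top-k eigenvectors of the symmetric kernel matrix.
    vals, vecs = np.linalg.eigh(K)          # eigenvalues in ascending order
    U = vecs[:, -k:]                        # k largest eigenvalues' vectors
    # Deterministic farthest-point initialization of the k centers.
    centers = [U[0]]
    for _ in range(k - 1):
        d = np.min([((U - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(U[np.argmax(d)])
    centers = np.array(centers)
    # Plain Lloyd iterations on the rows of U.
    for _ in range(n_iter):
        labels = np.argmin(((U[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([U[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, U
```

On a block-structured kernel (two groups of mutually similar samples), this recovers the two blocks as the two clusters.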

B. MULTIPLE KERNEL k-MEANS CLUSTERING (MKKM)
To benefit from different kernels that are suitable for different types of features, the multiple kernel k-means clustering (MKKM) task is proposed by combining different kernels with coefficients. The kernel combination coefficients can be constrained in linear [30] or quadratic form [31], or by an L_2-norm [32]. This paper is based on the linear form:

K_λ(x_i, x_j) = Σ_{p=1}^{m} λ_p K_p(x_i, x_j),  s.t. λ^T 1_m = 1, λ_p ≥ 0,   (6)

where {K_p(·, ·)}_{p=1}^{m} denote the base kernel functions and λ = [λ_1, ⋯, λ_p, ⋯, λ_m]^T denotes the coefficient vector of the base kernels, which can be optimized through the learning process [30].
Accordingly, by substituting the single kernel matrix K in (5) with the combined matrix K_λ expressed in (6), the optimization target [30] for MKKM is:

min_{U, λ} Tr(K_λ (I_n − U U^T)),  s.t. U^T U = I_k, λ^T 1_m = 1, λ_p ≥ 0.   (7)

The optimization problem (7) can be divided into two steps:
i) Optimizing U with λ fixed. With λ given, the combined kernel K_λ can be regarded as a predefined kernel and the overall optimization problem is equivalent to:

min_{U} Tr(K_λ (I_n − U U^T)),  s.t. U^T U = I_k,   (8)

which can be easily solved by the standard KKM optimization.
ii) Optimizing λ with U fixed. With U given, the coefficient vector λ can be optimized by solving:

min_{λ} Σ_{p=1}^{m} λ_p Tr(K_p (I_n − U U^T)),  s.t. λ^T 1_m = 1, λ_p ≥ 0,   (9)

which is linear in λ and therefore leads to a sparse solution of λ.
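The two-step alternation can be sketched as below. This is our illustrative reading of the linear MKKM updates, not the authors' implementation; note how the purely linear λ step degenerates to placing all weight on a single kernel, the sparsity mentioned above:

```python
import numpy as np

def mkkm_linear(kernels, k, n_iter=10):
    """Alternate between the KKM step (top-k eigenvectors of the combined
    kernel) and the lambda step (a linear program over the simplex, whose
    optimum puts all weight on the cheapest kernel)."""
    m = len(kernels)
    lam = np.full(m, 1.0 / m)               # start from uniform weights
    for _ in range(n_iter):
        K = sum(l * Kp for l, Kp in zip(lam, kernels))
        _, vecs = np.linalg.eigh(K)
        U = vecs[:, -k:]                    # KKM step: top-k eigenvectors
        # lambda step: minimize sum_p lam_p * Tr(K_p (I - U U^T)) on the simplex.
        costs = np.array([np.trace(Kp) - np.trace(U.T @ Kp @ U)
                          for Kp in kernels])
        lam = np.zeros(m)
        lam[np.argmin(costs)] = 1.0         # sparse optimum of the LP
    return lam, U
```

Given one cluster-structured kernel and one uninformative identity kernel, the weight collapses onto the structured one.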

C. OPTIMAL NEIGHBOURHOOD MULTIPLE KERNEL CLUSTERING (ONKC)
By taking into account the non-linearity of the optimal kernel matrix, [25] proposes the concept of the neighborhood kernel, which resides near the linear combination of base kernels. The incorporation of the neighborhood kernel B is fulfilled through alignment with the linear kernel K_λ = Σ_{p=1}^{m} λ_p K_p. Concurrently, the optimal kernel is adapted to learn the cluster representation U in a kernel k-means structure. The overall framework can be implemented by the following optimization problem:

min_{U, B, λ} Tr(B (I_n − U U^T)) + (ρ/2) ‖B − K_λ‖_F²,
s.t. U^T U = I_k, B ⪰ 0, λ^T 1_m = 1, λ_p ≥ 0,   (10)
where U is jointly optimized with the neighborhood kernel matrix B and the coefficient vector λ. It is worth mentioning that the neighborhood kernel B is utilized to acquire an optimal kernel that better serves the clustering task. However, the learning task in (10) neglects constraints on the neighborhood kernel that could further refine the clustering structure. We observe that the only restriction on the optimal kernel B is PSD, which is a very general requirement for a kernel matrix. This can result in abundant noise in the acquired kernel, leading to unsatisfactory clustering results. To address this issue, we propose a low-rank optimal neighborhood kernel clustering method for multiple kernel learning to refine the optimal kernel matrix and improve its representative capability over the multiple base kernels.

III. METHOD
In this section, we reformulate the clustering structure adopted in [25] and adjust the neighborhood kernel using a re-parameterization technique for further refinement (Section III-A). Based on our formulation, a four-step alternate optimization algorithm with proven convergence is presented in Section III-B.

A. THE PROPOSED FORMULATION 1) RE-PARAMETERIZATION
In order to tackle the problems mentioned in Section I, our motivation is to refine the optimal kernel matrix by re-parameterizing it while maintaining it in the neighborhood of the linearly combined kernel. Meanwhile, the linear kernel is one of the optimization variables, varying in coordination with the neighborhood kernel matrix. In addition to the PSD constraint on the optimal kernel, which is acquired in the neighborhood of the linear kernel, the re-parameterization strategy further imposes a low-rank characteristic on the kernel matrix, so that the optimal kernel naturally possesses a low-rank clustering structure.
Since the kernel matrix B is rigorously restricted to be PSD, it can be further re-parameterized as B = W Σ W^T, where W ∈ R^{n×c} is an orthogonal matrix (W^T W = I_c) and Σ is a diagonal matrix with diag(Σ) = [σ_1, σ_2, ⋯, σ_c], where {σ_i}_{i=1}^{c} are the top c eigenvalues of B. As rank(B) = rank(W Σ W^T) ≤ rank(Σ) = c, the rank of the kernel B is constrained to be at most c. By assigning a small c (c ≪ n) to the re-parameterized expression, we constrain the kernel matrix to be low-rank through the regularized mapping matrix W and the eigenvalue matrix Σ. We further enforce the re-parameterized kernel to lie in the locality of the linear kernel matrix K_λ, which lies in the hyperspace spanned by the multiple base kernels. This can be formulated as:

min_{W, Σ, λ} ‖W Σ W^T − K_λ‖_F²,
s.t. W^T W = I_c, Σ = diag([σ_1, ⋯, σ_c]), σ_i ≥ 0, λ^T 1_m = 1, λ_p ≥ 0.   (11)
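A minimal sketch of the re-parameterization: with W drawn as a random orthonormal matrix (a stand-in for the learned mapping, obtained here via QR) and Σ a non-negative diagonal, B = W Σ W^T is PSD with rank at most c by construction:

```python
import numpy as np

def reparameterized_kernel(n, c, seed=0):
    """Build B = W Sigma W^T with orthonormal W (n x c) and a positive
    diagonal Sigma. By construction B is symmetric PSD and rank(B) <= c.
    W and Sigma are random stand-ins for the learned variables."""
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.standard_normal((n, c)))   # orthonormal columns
    sigma = rng.uniform(0.5, 2.0, size=c)              # positive eigenvalues
    B = W @ np.diag(sigma) @ W.T
    return B, W, sigma
```

This is the structural point of the re-parameterization: the low-rank and PSD properties hold automatically, with no explicit rank penalty in the objective.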

2) MATRIX-INDUCED REGULARIZATION
The minimization of a matrix-induced regularization term increases the diversity and diminishes the redundancy of the multiple kernels [11]. Specifically, to utilize the information provided by the multiple kernels, the similarity of different kernels is measured with a pre-defined criterion M(K_p, K_q), with the convention that a higher value represents higher similarity. By minimizing the term Σ_{p,q} λ_p λ_q M(K_p, K_q), where λ_p and λ_q are the weights of K_p and K_q, we effectively avoid assigning large weights to homogeneous kernels simultaneously. Based on this, the regularization term can be formulated as:

λ^T M λ,  with M_{pq} = M(K_p, K_q) = Tr(K_p K_q).   (12)

To merge the clustering process into the optimal kernel learning task, we consolidate the neighborhood alignment structure and the kernel regularization term with the MKKM framework. Thus we obtain the following formulation:

min_{U, W, Σ, λ} Tr(W Σ W^T (I_n − U U^T)) + (ρ/2) ‖W Σ W^T − K_λ‖_F² + (δ/2) λ^T M λ,
s.t. U^T U = I_k, W^T W = I_c, Σ = diag([σ_1, ⋯, σ_c]), λ^T 1_m = 1, λ_p ≥ 0,   (13)

where M ∈ R^{m×m} is the similarity criterion matrix with M_{pq} = Tr(K_p K_q), and ρ and δ are trade-off factors for the neighborhood kernel alignment and the kernel information enhancement term, respectively.
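The criterion M(K_p, K_q) = Tr(K_p K_q) stated above can be computed directly from the base kernels; a small sketch (the function name is ours):

```python
import numpy as np

def matrix_induced_M(kernels):
    """Similarity criterion matrix with M[p, q] = Tr(K_p K_q).
    The regularization term in the formulation is then lambda^T M lambda."""
    m = len(kernels)
    M = np.empty((m, m))
    for p in range(m):
        for q in range(m):
            M[p, q] = np.trace(kernels[p] @ kernels[q])
    return M
```

Two highly similar kernels produce a large off-diagonal entry, so spreading weight over both is penalized relative to concentrating on diverse kernels.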

B. OPTIMIZATION ALGORITHM
We provide a four-step iterative optimization approach for (13), in which each variable is optimized in turn while the other three are fixed. The detailed procedure is derived below.

1) OPTIMIZING Σ WITH λ, U AND W FIXED
Given {λ_p}_{p=1}^{m}, U and W, the linear kernel K_λ in (13) can be calculated as K_λ = Σ_{p=1}^{m} λ_p K_p, and the optimization target can be equivalently expressed by minimizing the following formula w.r.t. Σ:

min_{Σ} Tr(Σ W^T (I_n − U U^T) W) + (ρ/2) [Tr(Σ²) − 2 Tr(Σ W^T K_λ W)].   (14)

Denoting P = W^T (K_λ − (1/ρ)(I_n − U U^T)) W, (14) can be further converted to:

min_{Σ} (ρ/2) Tr(Σ²) − ρ Tr(Σ P).   (15)

Note that for any quadratic optimization problem in the following form:

min_{σ_i} (ρ/2) σ_i² − ρ P_{ii} σ_i,   (16)

the optimal solution is:

σ_i = P_{ii},   (17)

for i varying from 1 to c.
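The closed-form Σ step can be sketched as below. The formula for P follows our reconstruction of the subproblem, and the non-negativity clip is an added safeguard for PSD-ness rather than part of the derivation:

```python
import numpy as np

def update_sigma(W, K_lam, U, rho):
    """Closed-form Sigma step: the objective is separable and quadratic in
    each sigma_i, so sigma_i = P_ii with
    P = W^T (K_lam - (1/rho)(I - U U^T)) W."""
    n = K_lam.shape[0]
    P = W.T @ (K_lam - (1.0 / rho) * (np.eye(n) - U @ U.T)) @ W
    return np.maximum(np.diag(P), 0.0)     # clip keeps B = W Sigma W^T PSD
```

A quick numerical check (perturbing each σ_i and confirming the objective does not decrease) agrees with the closed form.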

2) OPTIMIZING U WITH , λ AND W FIXED
With Σ, {λ_p}_{p=1}^{m} and W given, the optimization problem of (13) w.r.t. U becomes:

min_{U} Tr(W Σ W^T (I_n − U U^T)),  s.t. U^T U = I_k,   (18)

which matches the form of the typical KKM problem (shown in (5)), and the optimal U can be obtained by taking the top k eigenvectors of W Σ W^T.

3) OPTIMIZING W WITH , λ AND U FIXED
With Σ, {λ_p}_{p=1}^{m} and U fixed, optimizing (13) w.r.t. W can be reformulated as follows:

max_{W} Tr(Σ W^T (K_λ − (1/ρ)(I_n − U U^T)) W),  s.t. W^T W = I_c.   (19)

Denoting Q = K_λ − (1/ρ)(I_n − U U^T), the optimization problem in (19) can be simplified as:

max_{W} Tr(Σ W^T Q W),  s.t. W^T W = I_c.   (20)

Likewise, the maximization of Tr(Σ W^T Q W) w.r.t. W is similar to the kernel k-means task and satisfies:

max_{W} Tr(Σ W^T Q W) = Σ_{i=1}^{c} σ_i β_i,   (21)

where {β_i}_{i=1}^{c} are the c largest eigenvalues of Q, attained when the columns of W are the corresponding eigenvectors.
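The W step thus reduces to an eigenvalue problem on Q. A sketch under our reading of the subproblem (for simplicity the eigenvectors are taken in descending eigenvalue order, which pairs the largest σ_i with the largest β_i):

```python
import numpy as np

def update_W(K_lam, U, rho, c):
    """W step: maximize Tr(Sigma W^T Q W) over orthonormal W, where
    Q = K_lam - (1/rho)(I - U U^T). The maximizer stacks the eigenvectors
    of Q belonging to its c largest eigenvalues."""
    n = K_lam.shape[0]
    Q = K_lam - (1.0 / rho) * (np.eye(n) - U @ U.T)
    vals, vecs = np.linalg.eigh(Q)          # ascending eigenvalue order
    idx = np.argsort(vals)[::-1][:c]        # indices of the c largest
    return vecs[:, idx]
```

For the returned W, Tr(W^T Q W) equals the sum of the c largest eigenvalues of Q, consistent with the optimality condition above.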

4) OPTIMIZING λ WITH , W AND U FIXED
Given Σ, W and U, the optimization problem in (13) w.r.t. {λ_p}_{p=1}^{m} can be reformulated as:

min_{λ} (1/2) λ^T (ρ + δ) M λ − z^T λ,  s.t. λ^T 1_m = 1, λ_p ≥ 0,   (22)

where z = [z_1, ⋯, z_m]^T with z_p = ρ Tr(W Σ W^T K_p). This is a standard quadratic programming problem with linear equality and inequality constraints.
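The λ step is a small QP over the simplex. As a self-contained stand-in for an off-the-shelf QP solver, the sketch below uses projected gradient descent with a sort-based simplex projection; the quadratic term A = (ρ + δ)M and the linear term z follow our reconstruction of the subproblem:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {x : x >= 0, sum(x) = 1} (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[idx]) / (idx + 1)
    return np.maximum(v + theta, 0.0)

def update_lambda(M, z, rho, delta, n_iter=2000):
    """lambda step: min (1/2) lam^T (rho+delta) M lam - z^T lam over the
    simplex, via projected gradient descent with a 1/L step size."""
    A = (rho + delta) * M
    lr = 1.0 / (np.linalg.eigvalsh(A)[-1] + 1e-12)   # inverse Lipschitz const
    lam = np.full(len(z), 1.0 / len(z))              # uniform starting point
    for _ in range(n_iter):
        lam = project_simplex(lam - lr * (A @ lam - z))
    return lam
```

On a toy problem with M = I, z = [1, 0], ρ = δ = 1, the constrained optimum can be found by hand (λ = [3/4, 1/4]) and the iteration recovers it.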
The overall optimization procedure is demonstrated in Algorithm 1. It is worth noticing that, as each of the four optimization steps is solved to optimality with the other three variables fixed, the optimization target decreases monotonically at each iteration. Moreover, the whole optimization problem is lower-bounded. The convergence of the algorithm is accordingly verified.
To summarize, since the optimization of the neighborhood kernel matrix depends on both the linear kernel and the cluster representation matrix, the clustering structure and the refinement of the kernel matrix are bound together. Through the iterative learning process, this mutual optimization propagates across all learnable variables, making the method a unified framework for the multiple kernel clustering task.

C. DISCUSSION AND EXTENSION
In this section, we mainly discuss the relationship and difference between our proposed method and other analogical algorithms. We then explore some potential extensions of the proposed method. In the end, the computational cost is analyzed.
Our algorithm adopts a re-parametric scheme that separately and iteratively optimizes two variables instead of one to delicately learn the optimal kernel matrix. With the orthogonality constraint on the multiplier component W, the kernel matrix is representative and distinctive. Additionally, the re-parameterization naturally acquires the low-rank property, which benefits the clustering task. In this case, the representation matrix is expected to be of rank k (the number of clusters) and distinctive in each row vector. Similar to [11], [12], the matrix-induced term regulates the weights of the kernels and utilizes the mutual information of the multiple base kernels, thus reducing redundancy while boosting the diversity of the representation matrix. One difference between our algorithm and most existing MKKM methods is that we use a neighborhood matrix (as adopted in [25]) to expand the search space of the optimal kernel. The re-parametric strategy further provides guidance on the rank of the optimal kernel, which in turn assists the optimization of the linear kernel.
In our algorithm, the rank c of the optimal kernel is a hyper-parameter with c ∈ {n/10, 2n/10, …, n}, where n is the number of samples. The best choice of the rank of the optimal kernel is an open question, and the method is readily extensible to an adaptive-parameter version. We also observe that by setting the rank to multiples of the number of clusters, the clustering results reach a local maximum compared to nearby rank values.
MKKM-LR's effectiveness benefits from the low-rank regulation of the optimal kernel matrix, which is ensured by the re-parametric scheme. In addition to the computational cost of the standard MKKM algorithm, the optimization of our formulation further requires the computation of the orthogonal multiplier matrix W. Updating the diagonal matrix Σ costs O(n²(k + c) + nc²), where k is the number of clusters and c is the rank of Σ. Optimizing U requires O(n³) for the eigenvalue decomposition. Optimizing W requires O(n³ + n²k). The quadratic programming step costs O(2n²c + nc² + m³) while optimizing λ. Overall, the computational complexity is O(n³), comparable with existing MKKM methods. In the next section, experimental results demonstrate superior performance on numerous benchmark datasets.

IV. EXPERIMENTS AND ANALYSIS A. EXPERIMENTAL SETTINGS 1) DATASETS
The benchmark datasets used in our experiments are ProteinFold, Digital, Flower17, psortPos, Plant, warpAR10P, Cornell, Heart, warpPIE10P, Washington, YALE, Orl and Texas. These datasets are of different varieties and cover a wide range of real-world applications, such as images and protein sequences. The detailed information of these benchmark datasets is listed in Table 2. The number of classes in each dataset is given in advance and the base kernels are normalized beforehand.

2) COMPARED ALGORITHMS
In our experiment, MKKM-LR is compared with several state-of-the-art multi-view clustering algorithms listed below.

1) Average Multiple Kernel k-means (A-MKKM): By uniformly weighting all base kernels, a new kernel is generated and taken as the input of kernel k-means.
2) Single Best Kernel k-means (SB-KKM): Each single kernel is separately tested with kernel k-means and the best result is reported.
3) Multiple Kernel k-means (MKKM): As shown in (7), MKKM performs kernel k-means and updates the kernel coefficients alternately to obtain the best combination.
4) Robust Multiple Kernel k-means (RMKKM) [33]: RMKKM replaces the sum-of-squared loss with an L_{2,1}-norm to improve the robustness of MKKM.
5) Co-regularized Spectral Clustering (CRSC) [34]: CRSC provides a co-regularization way to perform spectral clustering.
6) Robust Multi-view Spectral Clustering (RMSC) [35]: RMSC constructs a transition probability matrix from each single view, and uses them to recover a shared low-rank transition matrix for clustering.
7) Multiple Kernel k-means with Matrix-induced Regularization (MKKM-MR) [12]: MKKM-MR learns the optimal combination weights by introducing a matrix-induced regularization to reduce redundancy and enhance diversity among the selected kernels.
8) Optimal Neighborhood Kernel Clustering with Multiple Kernels (ONKC) [25]: While sufficiently considering the correlation between base kernels, ONKC also allows the optimal kernel to reside in the neighborhood of the linear combination of base kernels, effectively enlarging the region from which an optimal kernel can be chosen.
9) Robust Multiple Kernel Clustering (RMKC) [36]: RMKC learns a robust yet low-rank kernel for clustering by capturing the structure of noise in multiple kernels.

B. RESULT ANALYSIS
In our experiments, the clustering accuracy (ACC), normalized mutual information (NMI), and purity are used to evaluate the clustering performance of these ten algorithms.
According to the clustering results and convergence analysis listed in Table 3 and Figure 1, we can draw the following conclusions: • The proposed algorithm outperforms the compared methods on 11 out of 13 benchmark datasets (shown in red), substantially exceeding the second-best methods. In particular, our algorithm's accuracy (55.38%) exceeds that of the second-best algorithm, ONKC (47.69%), on the warpAR10P dataset, a relative improvement of 16.1%. Besides, there are increments of more than 5% w.r.t. the best performance achieved by previous methods on datasets such as ProteinFold, Orl and Cornell.
• On the Washington and Texas datasets, where the proposed method does not achieve the best accuracy, the methods that achieve the best ACC score fail to sustain comparable performance on the other two criteria, namely NMI and purity. In contrast, our method demonstrates stable and well-balanced performance across all three criteria, with the best NMI on Texas, the second best on Washington, and the best purity on both Texas and Washington.
• Without a manifest deficit in any criterion, the proposed algorithm improves ACC, NMI and purity simultaneously, validating the robustness of our method.
• The proposed algorithm has a fast convergence speed (see Figure 1). It generally reaches its predefined convergence condition within 3 iterations, which brings down the time cost by a large margin and improves the efficiency of the algorithm. In summary, the significant improvement of the proposed method verifies the effectiveness of multiple kernel k-means clustering with a low-rank neighborhood kernel, which employs the re-parametric scheme to subtly optimize the optimal kernel matrix, naturally acquiring the low-rank property while maintaining its clustering structure.

C. PARAMETER SENSITIVITY
The proposed algorithm has three hyper-parameters ρ, δ and c, as discussed in Section III-C. ρ and δ are trade-off parameters for the neighborhood matrix alignment term and the kernel information diversity enhancement term, and c is set to different percentages of the sample number. In our experimental settings, we investigate the influence of each pair of hyper-parameters on the clustering result, as demonstrated in Figures 2, 3, 4. The vertical axis shows the accuracy as influenced by each pair of parameters. From those figures, we can conclude that our algorithm exhibits competitive and stable clustering performance across a wide range of hyper-parameters. In addition, we observe that for most benchmark datasets, our algorithm achieves the best clustering performance when the parameter c is small, which means the kernel matrix is low-rank. This supports the effectiveness of our proposed method.

V. CONCLUSION
In this paper, we propose a novel multiple kernel k-means clustering algorithm with a low-rank neighborhood kernel. The algorithm learns a low-rank optimal kernel matrix with a distinct clustering structure while effectively improving the clustering representation matrix with the learned neighborhood matrix. Extensive experiments verify the effectiveness of the proposed method, with notably rapid convergence. In the future, we will explore a heuristic or adaptive method to automatically choose the best rank c of the optimal kernel matrix, further removing a hyper-parameter from the algorithm.