Discriminative Multiple Kernel Concept Factorization for Data Representation

Concept Factorization (CF) improves Non-negative Matrix Factorization (NMF), which can only be performed in the original data space, by conducting factorization within a proper kernel space where the structure of the data becomes much clearer than in the original data space. CF-based methods have been widely applied and have yielded impressive results in optimal data representation and clustering tasks. However, CF methods still face the problem of designing or selecting a proper kernel function in practice. Most existing Multiple Kernel Clustering (MKC) algorithms do not sufficiently consider the intrinsic neighborhood structure of the base kernels. In this paper, we propose a novel Discriminative Multiple Kernel Concept Factorization method for data representation and clustering. We first extend the original kernel concept factorization by integrating the multiple kernel clustering framework to alleviate the problem of kernel selection. For each base kernel, we extract the local discriminant structure of the data via local discriminant models with global integration. Moreover, we linearly combine all these kernel-level local discriminant models to obtain an integrated consensus characterization of the intrinsic structure across base kernels. In this way, our method is expected to achieve better results through more compact data reconstruction and more faithful preservation of local structure. An iterative algorithm with a convergence guarantee is also developed to find the optimal solution. Extensive experiments on benchmark datasets further show that the proposed method outperforms many state-of-the-art algorithms.


I. INTRODUCTION
Data representation is a fundamental topic in machine learning, pattern recognition and data mining. Previous studies have shown that the performance of many learning tasks, such as clustering and classification, can be largely improved with a more faithful and compact representation. Matrix factorization techniques have been widely used to obtain low-dimensional representations. Several methods have been developed, such as Singular Value Decomposition (SVD), Principal Component Analysis (PCA), and Non-negative Matrix Factorization (NMF) [1]-[3]. By keeping the two latent factors non-negative, NMF leads to the well-known parts-based representation, which not only provides better performance in face recognition and document clustering but also enables better semantic interpretation. However, NMF only works in the original non-negative space. As one of the most important extensions of NMF, Concept Factorization (CF) [4] inherits the merit of non-negative representation and conducts factorization in any data space, such as a Reproducing Kernel Hilbert Space (RKHS). It has also been pointed out that the structure of data within a proper kernel space may become much clearer than in the original feature space [5]. Therefore, concept factorization can discover more meaningful concepts and lead to better learning performance than NMF [4]. That is also the primary advantage of CF over NMF.
In recent years, various concept factorization methods have been further developed. Graph regularized concept factorization methods [6]-[11] extract concepts that are consistent with the manifold geometry by exploiting the graph Laplacian as an additional regularization term for smoothness. Sparse concepts can also be obtained with locality constraints [12], [13]. Semi-supervised concept factorization methods [14]-[18] have also been proposed, using the available supervised information to guide the factorization process. Most recently, multi-view concept factorization methods [19], [20] have been proposed to handle the complementary information from multiple views. Most existing works on CF only handle data with a single kernel. However, CF methods still face the problem of designing or selecting a proper kernel function in practice. By leveraging a predefined set of candidate kernels from different functions or views, Multiple Kernel Clustering (MKC) methods have great potential to alleviate the effort of kernel design and to integrate complementary information [21]. It is natural to extend existing single kernel clustering methods to the multiple kernel scenario. Typical methods include K-means based [21]-[27], self-organizing map (SOM) based [28], maximum margin clustering based [29]-[31], local learning based [32], spectral clustering based [33]-[40] and subspace clustering based [41]-[45] algorithms. Compared with their single kernel counterparts, MKC methods should take special effort to handle additional data problems such as noisy and incomplete kernels [24], [27], [44], [46]-[51]. Moreover, only a few efforts [45], [52], [53] have been made to incorporate the local geometric structure of data for MKC. In addition, it has been shown that discriminant information is also important for learning tasks [54], [55].
To alleviate the effort of kernel design and make full use of complementary information, it is imperative to learn an appropriate kernel efficiently so that the performance of concept factorization becomes more stable, or even better, across multiple different kernels. In this paper, we present the novel Discriminative Multiple Kernel Concept Factorization (DMKCF) method for data representation. To achieve this, we first combine multiple base kernels with linear weights to approximate the unknown proper kernel matrix. We then replace the data matrix in kernelized CF with the combined kernel matrix and obtain multiple kernel concept factorization (MKCF). Specifically, for each data point in each base kernel, we construct a local clique comprising this data point and its neighboring data points identified by that base kernel. We use a local discriminant model for each local clique from each base kernel to evaluate how well the samples within the clique are represented and separated. We then integrate the local models of all the local cliques from all the base kernels into a global model to approximate the underlying local and discriminant structure of the data. We incorporate the induced multiple kernel local discriminative regularization on the orthogonal non-negative low-dimensional representation into the above MKCF learning procedure. We then derive the corresponding multiplicative update rules, which reduce the objective function monotonically and yield a unique solution for the proposed DMKCF model. Extensive experimental results on benchmark data sets demonstrate the effectiveness of the proposed method over state-of-the-art multiple kernel learning algorithms.
It is worthwhile to highlight several properties of the proposed DMKCF method.
• The proposed method avoids the problem of kernel selection in concept factorization by integrating multiple candidate kernels under the framework of multiple kernel clustering.
• The proposed method globally integrates the local discriminant models from all the local cliques and all the base kernels to approximate the underlying local and discriminant structure of multiple kernels, which is further used to regularize the concept factorization procedure. The proposed method extracts concepts that respect the local structure, so data samples associated with the same concept can be well clustered.
• We propose an effective iterative strategy with multiplicative update rules to obtain the optimal unique solution, and provide a rigorous convergence and correctness analysis of our method.
The rest of the paper is organized as follows. The preliminaries on non-negative matrix factorization and concept factorization are introduced in Section 2. Section 3 introduces the proposed Discriminative Multiple Kernel Concept Factorization method. The optimization algorithm is presented in Section 4. Extensive experimental results on clustering are presented in Section 5. Finally, we provide some concluding remarks and suggestions for future work in Section 6.

II. RELATED WORK
A. NON-NEGATIVE MATRIX FACTORIZATION
Given a data matrix X = [x_1, ..., x_n] ∈ R^{d×n}, each column of X is a sample vector. NMF [1], [2] aims to extract two non-negative matrices W ∈ R^{d×c} and V ∈ R^{n×c} whose product can well approximate the original matrix X, by solving the following optimization problem:

min_{W≥0, V≥0} ||X − WV^T||_F^2.
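For concreteness, the classical multiplicative updates for this Frobenius-norm objective can be written in a few lines of numpy. The sketch below only illustrates the standard Lee-Seung rules for the formulation above; it is not code from the paper.

```python
import numpy as np

def nmf(X, c, n_iter=200, eps=1e-10, seed=0):
    """Lee-Seung multiplicative updates for X (d x n) ~ W (d x c) @ V.T (c x n)."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    W, V = rng.random((d, c)), rng.random((n, c))
    for _ in range(n_iter):
        W *= (X @ V) / (W @ (V.T @ V) + eps)      # basis update
        V *= (X.T @ W) / (V @ (W.T @ W) + eps)    # encoding update
    return W, V
```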
It can be seen that each data vector x_i is approximated by a linear combination of the columns of W, weighted by the components of V, i.e., x_i ≈ Σ_{k=1}^c v_{ik} w_k.

B. CONCEPT FACTORIZATION
In Concept Factorization (CF) [4], each basis vector (concept) is further required to be a non-negative linear combination of the data points, i.e., W = XU with U ∈ R^{n×c}, U ≥ 0, so that the data are factorized by solving

min_{U≥0, V≥0} ||X − XUV^T||_F^2.

Besides, it can be easily verified that the kernelized concept factorization can be written as

min_{U≥0, V≥0} tr(K) − 2 tr(KUV^T) + tr(VU^T K U V^T),   (3)

where K ∈ R^{n×n} is the kernel matrix, and K = X^T X for the linear case. It has been shown that the optimal U and V of the kernel concept factorization model can be obtained by the following multiplicative update rules:

u_{ik} ← u_{ik} (KV)_{ik} / (KUV^TV)_{ik},   v_{ik} ← v_{ik} (KU)_{ik} / (VU^TKU)_{ik}.

For a kernel matrix with negative entries, the multiplicative update rules become

u_{ik} ← u_{ik} (K^+V + K^-UV^TV)_{ik} / (K^-V + K^+UV^TV)_{ik},
v_{ik} ← v_{ik} (K^+U + VU^TK^-U)_{ik} / (K^-U + VU^TK^+U)_{ik},

where K^+ = (|K| + K)/2 and K^- = (|K| − K)/2.
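A minimal numpy sketch of the kernelized CF updates above, assuming a kernel matrix with only non-negative entries (an illustrative sketch of the standard CF rules, with U and V named as in the text):

```python
import numpy as np

def kernel_cf(K, c, n_iter=200, eps=1e-10, seed=0):
    """Kernel CF: minimize tr(K) - 2 tr(K U V^T) + tr(V U^T K U V^T) with U, V >= 0."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    U, V = rng.random((n, c)), rng.random((n, c))
    for _ in range(n_iter):
        U *= (K @ V) / (K @ U @ (V.T @ V) + eps)   # update concept coefficients
        V *= (K @ U) / (V @ (U.T @ K @ U) + eps)   # update representation
    return U, V
```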

III. DISCRIMINATIVE MULTIPLE KERNEL CONCEPT FACTORIZATION
In this section, we extend kernel concept factorization to automatically learn an appropriate kernel from the combination of several pre-computed kernel matrices within the multiple kernel learning framework. We also present the multiple kernel local discriminative regularization to capture the local structure of the base kernels.

A. MULTIPLE KERNEL CONCEPT FACTORIZATION
Suppose we are given m base kernel functions {K_i(·,·)}_{i=1}^m with the corresponding feature mappings {φ_i(·)}_{i=1}^m and Reproducing Kernel Hilbert Spaces {H_i}_{i=1}^m. To combine these kernels and also ensure that the resulting kernel still satisfies the Mercer condition, we construct an augmented Hilbert space H̃ = ⊕_{i=1}^m H_i by concatenating all feature spaces, φ_µ(x) = [µ_1 φ_1(x); µ_2 φ_2(x); ...; µ_m φ_m(x)]^T, where the weight µ_i (µ_i ≥ 0) acts as the importance factor of the kernel function K_i. It can be verified that clustering in the feature space H̃ is equivalent to employing the following combined kernel function [32]:

K_µ(x_a, x_b) = ⟨φ_µ(x_a), φ_µ(x_b)⟩ = Σ_{i=1}^m µ_i^2 K_i(x_a, x_b).

It is known that a combination of the positive semi-definite kernel matrices {K_i}_{i=1}^m with non-negative weights is still a positive semi-definite kernel matrix. By replacing the single kernel in Eq. (3) with the combined kernel, we obtain multiple kernel concept factorization by solving

min_{U≥0, V≥0, µ} tr(K_µ) − 2 tr(K_µUV^T) + tr(VU^T K_µ U V^T),  s.t. Σ_{i=1}^m µ_i = 1, µ_i ≥ 0,   (9)

where (K_i)_{ab} = K_i(x_a, x_b) is the kernel Gram matrix of the i-th predefined kernel function over the unlabeled dataset X, and (K_µ)_{ab} = K_µ(x_a, x_b) is the kernel matrix of the consensus kernel function K_µ(·,·).
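A short sketch of the combined kernel, assuming the weighted-sum form K_µ = Σ_i µ_i^2 K_i implied by the concatenated feature map; keeping µ on the simplex inside the helper is a convenience assumption.

```python
import numpy as np

def combine_kernels(kernels, mu):
    """K_mu = sum_i mu_i^2 * K_i for base kernel matrices K_i and weights mu_i >= 0."""
    mu = np.asarray(mu, dtype=float)
    mu = mu / mu.sum()                        # keep the weights on the simplex
    return sum(w ** 2 * K for w, K in zip(mu, kernels))
```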

B. LOCALIZED DISCRIMINATIVE MULTIPLE KERNEL REGULARIZATION
In this subsection, we propose a new local discriminative multiple kernel regularization to utilize both manifold information and discriminant information for multiple kernel clustering. For each data point in each base kernel, we extract a local clique comprising this data point and its neighboring points. We build a local discriminant model on each such local clique for better data separation and representation. We then integrate all the local discriminant models, over all points and all base kernels, to obtain the localized discriminative multiple kernel regularization.

Given a centered data set of n data points X = [x_1, ..., x_n], the goal of clustering is to find a disjoint partitioning {π_j}_{j=1}^c of the data, where π_j is the j-th cluster. We define the cluster indicator matrix P = [p_1, p_2, ..., p_c] ∈ {0, 1}^{n×c}, where (p_j)_i = 1 if x_i belongs to π_j and 0 otherwise. We then introduce the scaled cluster indicator matrix Y = P(P^TP)^{-1/2}, where |π_j| is the sample size of the j-th cluster π_j. Denote by g_j = (1/|π_j|) Σ_{x∈π_j} x the mean of the j-th cluster. We can then define the within-cluster, between-cluster, and total scatter matrices as

S_w = Σ_{j=1}^c Σ_{x∈π_j} (x − g_j)(x − g_j)^T,   S_b = Σ_{j=1}^c |π_j| g_j g_j^T,   S_t = Σ_{i=1}^n x_i x_i^T.

It has been pointed out that tr(S_w) captures the intra-cluster distance and tr(S_b) captures the inter-cluster distance, and we have S_t = S_w + S_b. For high-dimensional data, a reliable estimate of the total scatter (covariance) matrix can be obtained by adding a regularization term, giving S_t = XX^T + γI_d, where I_d is the identity matrix of size d and γ > 0 is a regularization parameter.
Intuitively, to cluster the data well, the distance between data points from different clusters should be as large as possible, while the distance between data points within the same cluster should be as small as possible [56]. Inspired by the Fisher criterion and discriminative clustering [57], the optimal scaled cluster assignment matrix Y* can be obtained by minimizing the following linear discriminant model:

min_Y tr( Y^T ( I_n − HX^T (XHX^T + γI_d)^{-1} XH ) Y ).

By using the Woodbury identity, the above problem can be equivalently reformulated (up to a positive scaling factor) as [57]

min_Y tr( Y^T (HKH + γI_n)^{-1} Y ),

where H = I_n − (1/n) 1_n 1_n^T is the centering matrix, for which H = H^T = HH holds, and K = X^T X is the kernel matrix for the linear kernel function.
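The Woodbury-based reformulation above can be checked numerically. The snippet below is a small sanity check (not from the paper) that the two matrices agree up to the factor γ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, gamma = 5, 8, 0.1
X = rng.standard_normal((d, n))
H = np.eye(n) - np.ones((n, n)) / n               # centering matrix, H = H^T = HH
K = X.T @ X                                       # linear kernel

lhs = np.eye(n) - H @ X.T @ np.linalg.inv(X @ H @ X.T + gamma * np.eye(d)) @ X @ H
rhs = gamma * np.linalg.inv(H @ K @ H + gamma * np.eye(n))
print(np.allclose(lhs, rhs))                      # True
```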
Given the p-th candidate kernel matrix K_p, we consider a local clique N^p_i of τ data points comprising the i-th sample and its τ − 1 nearest neighbors determined by K_p, and employ a local kernel discriminant model to evaluate the clustering result on these points. Let Y^p denote the scaled partition matrix determined by the kernel matrix K_p, and let Y^p_{(i)} ∈ R^{τ×c} be the local scaled cluster assignment matrix of the i-th clique under K_p. The localized discriminant model can be written as

min tr( (Y^p_{(i)})^T L^p_i Y^p_{(i)} ),

where L^p_i = (H_τ K^p_i H_τ + γI_τ)^{-1} is the local Laplacian matrix and K^p_i is the sub-matrix of K_p restricted to the clique N^p_i. A smaller value of this objective (equivalently, a larger local discriminant score) indicates that the samples in the local clique from different clusters are better separated.
Moreover, we denote by L^p the aggregated Laplacian matrix induced from K_p, which is obtained as

L^p = Σ_{i=1}^n S^p_{(i)} L^p_i (S^p_{(i)})^T,

where S^p_{(i)} ∈ R^{n×τ} is the local selection matrix with element (S^p_{(i)})_{jj'} = 1 if the j-th sample is the j'-th neighbor of the i-th sample as determined by K_p, and (S^p_{(i)})_{jj'} = 0 otherwise. The overall clustering result can then be obtained by globally optimizing the local discriminant models of all the local cliques, i.e., by minimizing tr((Y^p)^T L^p Y^p).
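The aggregation of the local Laplacians can be sketched as follows. The products S^p_(i) L^p_i (S^p_(i))^T are realized with index slicing, which is equivalent but cheaper; choosing neighbors by sorting kernel values is an assumption about how the cliques are formed.

```python
import numpy as np

def aggregated_laplacian(K, tau, gamma):
    """Aggregate the local discriminant Laplacians of one base kernel K (n x n)."""
    n = K.shape[0]
    H_tau = np.eye(tau) - np.ones((tau, tau)) / tau
    L = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(-K[i])[:tau]             # the i-th sample and its most similar samples
        K_i = K[np.ix_(idx, idx)]
        L_i = np.linalg.inv(H_tau @ K_i @ H_tau + gamma * np.eye(tau))
        L[np.ix_(idx, idx)] += L_i                # same effect as S_(i) L_i S_(i)^T
    return L
```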
Considering that different kernels induce different local neighborhoods, it is desirable to further aggregate these kernel-level aggregated Laplacians. Inspired by the linear combination used in multiple kernel learning, we introduce the multiple kernel aggregated Laplacian as the linear combination of the kernel-specific Laplacian matrices,

L_µ = Σ_{p=1}^m µ_p L^p.

It is believed that the above Laplacian matrix captures well the local information and discriminant information in the multiple kernels. To further improve the clustering performance and the learning of concept factorization, we replace the unknown scaled partition matrix with the non-negative low-dimensional representation V and propose the novel local discriminative multiple kernel regularization, which can be formulated as

min_V tr(V^T L_µ V),  s.t. V^T V = I.

To address the constraint V^TV = I efficiently, we relax this equality condition by integrating a penalty term into the optimization problem and obtain

min_V tr(V^T L_µ V) + ξ ||V^T V − I||_F^2,   (18)

where ξ is a regularization parameter.
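Given the kernel-level Laplacians, the consensus Laplacian and the relaxed regularizer of Eq. (18) are straightforward to evaluate; a short sketch (the helper names are hypothetical):

```python
import numpy as np

def multi_kernel_laplacian(laplacians, mu):
    """L_mu = sum_p mu_p * L^p."""
    return sum(w * L for w, L in zip(mu, laplacians))

def local_discriminative_penalty(V, L_mu, xi):
    """tr(V^T L_mu V) + xi * ||V^T V - I||_F^2."""
    c = V.shape[1]
    return np.trace(V.T @ L_mu @ V) + xi * np.linalg.norm(V.T @ V - np.eye(c), 'fro') ** 2
```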

C. LOCALIZED DISCRIMINATIVE MULTIPLE KERNEL CONCEPT FACTORIZATION
Based on the multiple kernel concept factorization in Eq. (9) and the localized discriminative regularization in Eq. (18), we propose the novel Discriminative Multiple Kernel Concept Factorization (DMKCF) method for data representation and clustering, which can be formulated as follows:

min_{U≥0, V≥0, µ} tr(K_µ) − 2 tr(K_µUV^T) + tr(VU^T K_µ U V^T) + λ tr(V^T L_µ V) + ξ ||V^T V − I||_F^2,
s.t. Σ_{p=1}^m µ_p = 1, µ_p ≥ 0,   (19)

where λ > 0 and ξ > 0 are trade-off parameters.
The objective function in Eq. (19) contains five terms: the first three terms constitute the kernel concept factorization, the fourth term is the local discriminative multiple kernel regularization, and the last term is the orthogonality penalty that promotes a unique solution. It can be seen that the consensus kernel K_µ is generated from the linear combination of the base kernels. Instead of using the consensus kernel K_µ to extract the local structure, we use each base kernel to capture the kernel-level local discriminant structure, so that the discrete neighborhood structure of each base kernel is not changed during the learning procedure. Finally, the consensus local structure L_µ is also generated from the linear combination of the base graph Laplacians. As a result, the integrated graph Laplacian L_µ is not affected by changes in the discrete neighborhood structure of K_µ.

IV. OPTIMIZATION
Because the optimization problem in Eq. (19) involves three different variables, it is hard to derive its closed-form solution directly. We therefore derive an alternating iterative algorithm that converts the problem over the coupled variables (U, V, µ) into a series of sub-problems, each involving only one variable. The convergence and complexity analysis are presented afterwards.
A. UPDATE U
When the other variables are fixed, the optimization problem with respect to U can be formulated as

min_{U≥0} tr(K_µ) − 2 tr(K_µUV^T) + tr(VU^T K_µ U V^T).   (20)

It can be seen that Eq. (20) has the same form as Eq. (3). Therefore, U can be updated by the following multiplicative rule for a non-negative K_µ:

u_{ik} ← u_{ik} (K_µV)_{ik} / (K_µUV^TV)_{ik}.   (21)

For a kernel matrix K_µ with negative entries, the multiplicative update rule becomes

u_{ik} ← u_{ik} (K_µ^+V + K_µ^-UV^TV)_{ik} / (K_µ^-V + K_µ^+UV^TV)_{ik},   (22)

where K_µ^+ = (|K_µ| + K_µ)/2 and K_µ^- = (|K_µ| − K_µ)/2.

B. UPDATE µ
When the other variables are fixed, the optimization problem with respect to µ can be formulated as

min_µ µ^T A µ + f^T µ,  s.t. Σ_{p=1}^m µ_p = 1, µ_p ≥ 0,   (23)

where A is a diagonal matrix with diagonal elements A_{pp} = tr(K_p) − 2 tr(K_pUV^T) + tr(VU^TK_pUV^T) and f_p = λ tr(V^T L^p V). It can be seen that Eq. (23) is a quadratic programming problem with linear constraints, which can be solved by existing off-the-shelf packages.
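Assuming K_µ = Σ_p µ_p^2 K_p and L_µ = Σ_p µ_p L^p as above, the µ sub-problem is a small quadratic program over the simplex. The sketch below solves it with scipy's general-purpose SLSQP solver; the coefficient expressions follow the reconstruction of Eq. (23) given in the text and should be read as an assumption, not the paper's exact implementation.

```python
import numpy as np
from scipy.optimize import minimize

def update_mu(kernels, laplacians, U, V, lam):
    """Solve min_mu sum_p (A_pp * mu_p^2 + f_p * mu_p)  s.t.  sum(mu) = 1, mu >= 0."""
    n = kernels[0].shape[0]
    R = np.eye(n) - U @ V.T                                           # residual operator I - U V^T
    a_diag = np.array([np.trace(R.T @ K @ R) for K in kernels])       # quadratic coefficients A_pp
    f = np.array([lam * np.trace(V.T @ L @ V) for L in laplacians])   # linear coefficients f_p
    m = len(kernels)
    res = minimize(lambda mu: a_diag @ mu ** 2 + f @ mu,
                   np.full(m, 1.0 / m),
                   bounds=[(0.0, 1.0)] * m,
                   constraints=({'type': 'eq', 'fun': lambda mu: mu.sum() - 1.0},),
                   method='SLSQP')
    return res.x
```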

C. UPDATE V
When the other variables are fixed, the optimization problem with respect to V can be formulated as

min_{V≥0} −2 tr(K_µUV^T) + tr(VU^T K_µ U V^T) + λ tr(V^T L_µ V) + ξ ||V^T V − I||_F^2.   (24)

Eq. (24) is a quadratic programming problem with non-negativity and (relaxed) orthogonality constraints. For a kernel matrix K_µ and a Laplacian matrix L_µ with only non-negative entries, we can derive a multiplicative update rule (Eq. (25)) similar to that in [6]. For a kernel matrix K_µ or a Laplacian matrix L_µ with negative entries, we split K_µ = K_µ^+ − K_µ^- and L_µ = L_µ^+ − L_µ^- as before and obtain the corresponding multiplicative update rule (Eq. (26)), in which every factor remains non-negative so that V stays non-negative. In summary, we present the iterative updating algorithm for optimizing Eq. (19) in Algorithm 1.

Algorithm 1 The Algorithm to Solve Eq. (19)
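The body of the algorithm box did not survive extraction. The sketch below is a hedged reconstruction of the alternating scheme described in the text, reusing the helper sketches given earlier (combine_kernels, aggregated_laplacian, multi_kernel_laplacian, update_mu, all hypothetical names). The V step realizes a standard positive/negative-split multiplicative rule derived from the gradient of Eq. (24); it is an assumption about the exact form of Eq. (25)/(26), not the paper's rule.

```python
import numpy as np

def update_V(K_mu, L_mu, U, V, lam, xi, eps=1e-10):
    """Multiplicative V step for Eq. (24); the +/- splits keep V non-negative."""
    Kp, Kn = (np.abs(K_mu) + K_mu) / 2, (np.abs(K_mu) - K_mu) / 2
    Lp, Ln = (np.abs(L_mu) + L_mu) / 2, (np.abs(L_mu) - L_mu) / 2
    num = Kp @ U + V @ (U.T @ Kn @ U) + lam * (Ln @ V) + 2 * xi * V
    den = Kn @ U + V @ (U.T @ Kp @ U) + lam * (Lp @ V) + 2 * xi * (V @ (V.T @ V))
    return V * num / (den + eps)

def dmkcf(kernels, c, tau=8, lam=1.0, gamma=1.0, xi=1.0, n_iter=100, eps=1e-10, seed=0):
    """Alternating optimization of Eq. (19): U step, V step, mu step per iteration."""
    rng = np.random.default_rng(seed)
    m, n = len(kernels), kernels[0].shape[0]
    laplacians = [aggregated_laplacian(K, tau, gamma) for K in kernels]  # fixed per base kernel
    mu = np.full(m, 1.0 / m)
    U, V = rng.random((n, c)), rng.random((n, c))
    for _ in range(n_iter):
        K_mu = combine_kernels(kernels, mu)
        L_mu = multi_kernel_laplacian(laplacians, mu)
        U *= (K_mu @ V) / (K_mu @ U @ (V.T @ V) + eps)   # Eq. (21)-style step (non-negative K_mu)
        V = update_V(K_mu, L_mu, U, V, lam, xi)          # Eq. (25)/(26)-style step
        mu = update_mu(kernels, laplacians, U, V, lam)   # the QP of Eq. (23)
    return U, V, mu
```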
D. CONVERGENCE ANALYSIS
In this subsection, we investigate the convergence of Algorithm 1. We first use the auxiliary function approach [1] to show that the objective function in Eq. (24) with respect to V is reduced monotonically. In the following, we present two theorems which guarantee the convergence of Algorithm 1.
Theorem IV-D3: Let L(V) denote the objective function in Eq. (24). Then there exists a function J(V, V'), constructed in the Appendix by bounding each term of L(V), that is an auxiliary function for L(V). Furthermore, J(V, V') is a convex function in V, and its global minimum is attained at the update given in Eq. (26).
Proof: See Appendix. Theorem IV-D4: The objective function in Eq. (24) will be non-increasing under the update rule in Eq. (26).
Proof: By Lemma IV-D2 and Theorem IV-D3, we have L(V_{t+1}) ≤ J(V_{t+1}, V_t) ≤ J(V_t, V_t) = L(V_t). The convergence of DMKCF under the update rules in Algorithm 1 can be summarized as follows. For fixed µ_t and U_t in the t-th iteration, the objective of the sub-problem w.r.t. V in Eq. (24) is non-increasing under the rules in Eq. (25) or Eq. (26); the proof is given in Theorem IV-D4. For fixed µ_t and V_t in the t-th iteration, the sub-problem w.r.t. U in Eq. (20) has exactly the same form as the standard concept factorization model in Eq. (3); thus the multiplicative update rules in Eq. (21) or Eq. (22) coincide with those of standard concept factorization and reduce the objective function in Eq. (20) (see [58] for details). For fixed U_t and V_t, the objective w.r.t. µ in Eq. (23) is also decreased by the quadratic optimization solver. In summary, the objective function in Eq. (19) is non-increasing under the alternating optimization steps w.r.t. U, V and µ. Since the objective function in Eq. (19) is clearly bounded from below, the overall optimization problem in Eq. (19) converges.

E. ALGORITHM COMPLEXITY ANALYSIS
In this subsection, we discuss the computational complexity of our proposed algorithm using big-O notation. In each iteration, the cost of the multiplicative updates and the quadratic program is O(n^2(m + k) + m^3). If the updating procedure stops after t iterations, the overall cost of the multiplicative updating is O(tn^2(m + k) + tm^3). Because n ≫ m and n ≫ τ, the total cost of DMKCF is O(mn + n^2mt). It can be seen that the computational complexity of DMKCF is linear in the number of kernels and iterations, and quadratic in the number of samples.

V. EXPERIMENT
In this section, four experiments are designed to evaluate the effectiveness of our proposed MKC algorithm. In the first experiment, we construct a synthetic data set to test the robustness of the proposed neighbor kernel against noise and outliers. Second, we compare our proposed algorithm with nine state-of-the-art MKC algorithms on real-world data sets to evaluate its performance. Then, we test the sensitivity of the algorithm with respect to the main hyperparameters. Finally, we apply neighbor kernels to the existing MKC algorithms and test the capacity of the proposed kernel to enhance the performance of these methods.

A. DATA SETS
We perform experiments on 10 different public datasets, including 3 image datasets (USPS49, PIE, COIL20), 3 text corpora (RELATHE, BBC, K1b) and 4 biological datasets (Prostate, ALLAML, SMKCAN, CLLSUB). They have been widely used to evaluate the performance of different clustering methods. The detailed statistics and dimensionality of these datasets are summarized in Table 1.

B. COMPARED ALGORITHMS
To demonstrate how the clustering performance can be improved by the proposed approach, we compare our results with those of the following state-of-the-art multiple kernel clustering algorithms:
• A co-training multi-view spectral clustering method proposed by [33].
• RMSC. The RMSC (Robust Multi-view Spectral Clustering) method is proposed by [39]. We first transform the kernels into probabilistic transition matrices following [39], and then apply RMSC to obtain the final clustering results.
• LKGr. It learns a low-rank kernel matrix from the neighborhood of the candidate kernels [59].

D. EXPERIMENTAL RESULTS
The results of all clustering algorithms depend upon the initialization [56]. For all the clustering algorithms, we independently repeat the experiments 20 times with random initializations to reduce the statistical variation.
For each clustering algorithm, we report the best results for each parameter setting, corresponding to the best objective values, in terms of ACC/NMI/Purity, respectively, from twenty rounds of random initializations in Table 2. We also report the results averaged over all 10 data sets in the last row of Table 2. It can be seen that our method consistently outperforms the other state-of-the-art multiple kernel clustering algorithms. Moreover, our method achieves 14.79%, 31.15% and 13.93% improvements in terms of ACC, NMI and Purity, respectively, on the averaged results. These results demonstrate the effectiveness of the proposed method.
For each clustering algorithm, we also calculate the mean ACC/NMI/Purity over twenty rounds of random initializations for each parameter setting, and we additionally report the best mean ACC/NMI/Purity, together with the standard deviation corresponding to the optimal parameter and the p-value of the paired t-test against the best result, in Tables 3, 4 and 5. Thus, each cell in Tables 3, 4 and 5 includes the best mean ACC/NMI/Purity, the standard deviation and the p-value. The best result and those with no significant difference from it (p > 0.05) are marked in bold. Again, we observe that our method performs better than the other MKC algorithms in most cases, and the improvements are significant in most cases.
For all the compared multiple kernel clustering algorithms, we can observe that the ACC/NMI/Purity values in Table 2, which correspond to the best objective values, are generally higher than the mean ACC/NMI/Purity values in Tables 3, 4 and 5. Since it is still a nontrivial task to obtain globally optimal solutions for these clustering algorithms, it is reasonable to choose the clustering results from the initialization corresponding to the best objective value in practical clustering applications [56].

E. PARAMETER SENSITIVITY
In this subsection, we investigate the sensitivity of our method to the three parameters λ, γ and ξ. Figure 1 plots the clustering accuracy (ACC) for different values of these parameters on K1b, SMKCAN and USPS49, respectively. These figures show that our proposed algorithm is not very sensitive to λ, γ and ξ within relatively wide ranges.

F. CONVERGENCE ANALYSIS
Here we take the BBC data set as an example to empirically investigate the convergence behavior of the proposed method. In this experiment, we fix the three parameters λ = 1, γ = 1 and ξ = 1, and set the maximum number of iterations to 100 for simplicity. We first show how the objective function value of our proposed method changes with an increasing number of iterations in Figure 2(a). Following [61], we then show how U and V approach the optimal values U*, V* with respect to the iteration number t on this data set by computing ||U_t − U*||_F and ||V_t − V*||_F, shown in Figure 2(b) and Figure 2(c), respectively. Although it is still not easy to provide theoretical results on the convergence rate of the proposed optimization scheme, we further provide more empirical results on these sequences in Figure 2(e) and Figure 2(f) to illustrate the convergence rate, as suggested in [62]. It can be seen that the objective function indeed decreases under the updating rules on this data set. It can also be seen that both the U_t and V_t sequences converge within a small number of iterations, which verifies the effectiveness and correctness of the optimization scheme.

VI. CONCLUSION
In this paper, we propose a novel discriminative multiple kernel concept factorization method for data clustering and representation. Our method inherits the merits of concept factorization and extends it to handle the problem of kernel design or selection. Our method also extracts the kernel-level local discriminant models with global integration and builds the local multiple kernel discriminant regularization to further capture the local discriminant structure of the data. An iterative algorithm with a convergence guarantee is developed to find the optimal solution. Extensive experiments on 10 benchmark datasets show that the proposed method outperforms many multiple kernel clustering algorithms.

APPENDIX PROOF OF CONVERGENCE
Lemma VI-1 [63]: For any non-negative matrices A ∈ R^{n×n}, B ∈ R^{k×k}, S ∈ R^{n×k} and S' ∈ R^{n×k}, where A and B are symmetric, the following inequality holds:

Σ_{i=1}^n Σ_{p=1}^k (A S' B)_{ip} S_{ip}^2 / S'_{ip} ≥ tr(S^T A S B).

The objective function with respect to V in Eq. (24) can be rewritten as a sum of terms that are quadratic in V and terms that are linear in V. By applying Lemma VI-1, the quadratic terms can be bounded from above. Moreover, by the inequality a ≤ (a^2 + b^2)/(2b), ∀a, b > 0, we obtain upper bounds for the remaining quadratic terms. To obtain lower bounds for the remaining terms, we use the inequality z ≥ 1 + log z, ∀z > 0. By summing over all these bounds, we obtain the auxiliary function J(V, V'). To find the minimum of J(V, V'), we take its first derivative with respect to V; the Hessian matrix of J(V, V') is a diagonal matrix with positive diagonal elements, so J(V, V') is a convex function of V. By setting the derivative to zero, we obtain the global minimum of J(V, V'), which is exactly the update rule in Eq. (26).
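As a small sanity check (not part of the paper), the inequality in Lemma VI-1 can be verified numerically with random non-negative matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3
A = rng.random((n, n)); A = (A + A.T) / 2          # non-negative symmetric
B = rng.random((k, k)); B = (B + B.T) / 2
S = rng.random((n, k))
S_prime = rng.random((n, k)) + 0.1                 # bounded away from zero

lhs = np.sum((A @ S_prime @ B) * S ** 2 / S_prime)  # sum_{i,p} (A S' B)_{ip} S_{ip}^2 / S'_{ip}
rhs = np.trace(S.T @ A @ S @ B)
print(lhs >= rhs)                                   # True
```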