Robust Similarity-Based Concept Factorization for Data Representation

Non-negative matrix factorization (NMF), known for learning parts-based representations, has become a standard data analysis tool for clustering tasks. It provides an alternative learning paradigm for non-negative data clustering. Within this paradigm, concept factorization (CF) and symmetric non-negative matrix factorization (SymNMF) are two representative models with distinct behaviors: CF models each cluster as a linear combination of samples and each sample as a linear combination of clusters, i.e., sample reconstruction, while SymNMF, built on a pair-wise sample similarity measure, preserves the similarity of samples in a low-dimensional subspace, namely similarity reconstruction. In this paper, we propose similarity-based concept factorization (SCF) as a synthesis of the two behaviors. This design can be formulated as: the similarity of the samples reconstructed by CF should be close to that of the original samples. To optimize it, we develop an optimization algorithm which leverages the alternating direction method of multipliers (ADMM) to solve each sub-problem of SCF. Moreover, we take a further step to address the robustness of similarity reconstruction and explore a robust SCF model (RSCF), which penalizes the hardest pair-wise similarity reconstruction via the $l_\infty$ norm. Thus, RSCF enjoys similarity preservation, robustness to similarity perturbation, and the ability to reconstruct samples. Extensive experiments validate these properties and show that the proposed SCF and RSCF achieve large performance gains compared to their counterparts.


I. INTRODUCTION
Non-negative matrix factorization (NMF, [1], [2]) approximates a non-negative data matrix by the product of two low-rank non-negative matrices, namely the basis matrix and the coefficient matrix. Of the two, the coefficient matrix provides an effective way to represent the original high-dimensional data. More importantly, NMF is known for learning parts-based representations, i.e., the whole is composed of parts, which is in line with human intuition.
Building upon this intuition, NMF has become an important data analysis tool for various tasks such as clustering [3], [4], recommendation systems [5], [6], and speech separation [7]. (The associate editor coordinating the review of this manuscript and approving it for publication was Hao Luo.) Specifically, in theory, square-loss NMF is equivalent to K-means [8], a classical clustering method. Thus, the learnt basis matrix can be interpreted as the clusters while the coefficient matrix indicates the cluster identity of each sample. Obviously, NMF is a linear method based on the original feature space. To overcome this limitation, many nonlinear methods such as kernel NMF [9] have been explored. A representative method is concept factorization (CF, [10]), which assumes that each cluster is modeled as a linear combination of samples, and each sample as a linear combination of clusters. This assumption can be called sample reconstruction. Besides, similarity measures are critical for many clustering tasks, such as community detection [11], person re-identification, and image retrieval. A typical similarity-based method is symmetric NMF (SymNMF, [12]), which builds on a similarity measurement to preserve the similarity among samples via so-called similarity reconstruction. That is, the coefficients learned in the low-dimensional subspace preserve the similarity relationships among the original samples. Owing to this property, SymNMF has achieved sound clustering performance. Kang et al. [13] proposed a self-tuning similarity learning method relying on kernel-based embedding, called SLKE. They argued that similarity preservation helps the new representation rely on the overall relations among the data, and SLKE achieves good clustering performance. SLKE and SymNMF emphasize similarity information while CF has the ability to reconstruct data.
This inspires us to ask whether we can combine the advantages of the two ideas to achieve higher clustering performance. Thus, this work pursues a synthesis of the two by jointly integrating both sample reconstruction and similarity reconstruction into a unified formulation.
Different from SymNMF, which shows the potential of similarity preservation, graph-based NMF methods treat the manifold structure within the dataset as a regularizer to achieve the same goal. For instance, graph regularized NMF (GNMF, [14]) utilizes a Laplacian regularization term to smooth the learnt low-dimensional representation. According to [15], LCCF implements a locally invariant CF method. LCCF may appear to be a joint product of similarity preservation and sample reconstruction; in fact, its graph regularizer acts as sample smoothing rather than similarity reconstruction. Thus, GNMF and LCCF differ from our goal.
Besides the aforementioned issues, another key focus is robustness. In general, most existing works focus on sample reconstruction distortion in the case where samples are corrupted with noise [16]. To name a few, Kong et al. assumed the noise obeys a Laplacian distribution, thus equipping NMF with the $L_{2,1}$ loss and yielding robust NMF, namely $L_{2,1}$NMF [17]; Liutkus et al. assumed the noise follows an isotropic Cauchy distribution and developed CauchyNMF [18]; based on the Gaussian-kernel correntropy induced metric, Du et al. proposed CIMNMF [19]. Distinct from these, we study the robustness of similarity reconstruction.
To incorporate the aforementioned insights, this paper proposes a similarity-based CF method (SCF) and its robust variant (RSCF). In particular, SCF aims to reconstruct the similarity of the reconstructed data while preserving the representation of the original data. This provides a synthesis of two behaviors, i.e., sample reconstruction and similarity reconstruction. Fig. 1 and Fig. 2 show that the similarity of the samples reconstructed by SCF is close to that of the original samples. To address the robustness of similarity measurement, we introduce the $l_\infty$ norm instead of the $L_2$ norm to measure the distortion levels of similarity reconstruction errors, which considers similarity reconstruction under the worst-case assumption. This complements existing robust NMF methods. To solve both models, we develop an optimization algorithm based on ADMM. To sum up, this study has the following contributions: 1) A concept factorization method based on similarity reconstruction, called SCF, is proposed. It reconstructs the pair-wise similarity to obtain a better data representation. 2) We incorporate the $l_\infty$ norm into the proposed SCF to enhance robustness to small distortions of similarity reconstruction. 3) Since the objective functions of SCF and RSCF are fourth-order and non-convex, we develop an optimization algorithm based on ADMM to solve them.
The rest of this paper falls into four sections. In Section II, we briefly review the most related work, including NMF, CF, robust methods, and the $l_\infty$ norm. Our proposed methods and the corresponding optimization algorithms are presented in Section III. Section IV details clustering experiments conducted on several facial datasets. Finally, we draw a conclusion in Section V.

II. RELATED WORK
A. NON-NEGATIVE MATRIX FACTORIZATION
NMF is a classical clustering method which uses a non-negativity constraint and a purely additive perception process to characterize how data is composed of parts. NMF seeks to factorize high-dimensional data via a linear representation of clusters. Lee and Seung [1] utilized the Euclidean distance to measure the quality of reconstruction. A standard NMF can be formulated as:

$$\min_{U \geq 0, V \geq 0} \|X - UV\|_F^2, \quad (1)$$

where $X = [x_1, x_2, \cdots, x_n] \in R^{d \times n}$ is the non-negative input data with samples $x_i \in R^{d \times 1}$ ($i = 1, 2, \cdots, n$), and $U \in R^{d \times r}$ and $V \in R^{r \times n}$ are two low-rank matrices. The optimization problem of Eq. (1) is non-convex in $U$ and $V$ jointly, so it is difficult to obtain a global optimal solution with a nonlinear optimization method. However, each sub-problem in $U$ or $V$ alone is convex (i.e., the problem is convex when one variable is fixed and the other is optimized). Lee and Seung used the multiplicative update rule (MUR, [2]) to update $U$ and $V$ alternately. The non-increasing update rules are:

$$U_{ij} \leftarrow U_{ij} \frac{(XV^T)_{ij}}{(UVV^T)_{ij}}, \qquad V_{ij} \leftarrow V_{ij} \frac{(U^T X)_{ij}}{(U^T U V)_{ij}}. \quad (2)$$

It has been shown that these update rules ensure the convergence of the algorithm. More detail can be found in [20].
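As a concrete illustration, the MUR scheme can be sketched in a few lines of NumPy (the function name and parameters below are ours, not from the paper):

```python
import numpy as np

def nmf_mur(X, r, iters=300, eps=1e-10, seed=0):
    """Sketch of standard NMF with multiplicative update rules (MUR).
    X: (d, n) non-negative data; r: target rank."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    U = rng.random((d, r))          # basis matrix
    V = rng.random((r, n))          # coefficient matrix
    for _ in range(iters):
        # U_ij <- U_ij (X V^T)_ij / (U V V^T)_ij
        U *= (X @ V.T) / (U @ V @ V.T + eps)
        # V_ij <- V_ij (U^T X)_ij / (U^T U V)_ij
        V *= (U.T @ X) / (U.T @ U @ V + eps)
    return U, V
```

The small `eps` guards against division by zero; the updates keep both factors non-negative by construction.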

B. CONCEPT FACTORIZATION
Xu and Gong observed that NMF is only applicable in the original feature space [10]. For highly non-linearly distributed datasets, NMF does not perform well, while kernel methods can improve the factorization results [21]. Their proposed method, concept factorization (CF), allows kernel methods to be applied to matrix factorization. CF regards the basis matrix as a linear combination of the data, and the objective function can be written as:

$$\min_{W \geq 0, H \geq 0} \|X - XWH\|_F^2. \quad (4)$$

If we regard $XW$ as a whole, it can be interpreted as the cluster centers of the clustering task in the objective function Eq. (4), while $H$ is the corresponding indicator matrix. With the analogous MUR algorithm [22], the update rules of the variables are:

$$W_{ij} \leftarrow W_{ij} \frac{(KH^T)_{ij}}{(KWHH^T)_{ij}}, \qquad H_{ij} \leftarrow H_{ij} \frac{(W^T K)_{ij}}{(W^T K W H)_{ij}}, \quad (5)$$

in which $K = X^T X$.
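A minimal NumPy sketch of CF with gradient-derived multiplicative updates follows (the function name is ours, and the exact update rules in [22] may differ in normalization; note that only inner products enter the updates, which is what makes CF kernelizable):

```python
import numpy as np

def cf_mur(X, r, iters=300, eps=1e-10, seed=0):
    """Sketch of concept factorization: X ~ X W H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    K = X.T @ X                      # inner-product matrix; a kernel
                                     # matrix could be substituted here
    W = rng.random((n, r))           # columns mix samples into clusters
    H = rng.random((r, n))           # cluster indicators of each sample
    for _ in range(iters):
        W *= (K @ H.T) / (K @ W @ H @ H.T + eps)
        H *= (W.T @ K) / (W.T @ K @ W @ H + eps)
    return W, H
```

Here `X @ W` plays the role of the cluster centers, so the reconstruction is `X @ W @ H`.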
C. $l_\infty$ OPTIMIZATION
The $l_\infty$ norm is utilized in our methods, so we review it here. The $l_\infty$ function measures the maximum fitting error rather than a squared loss, which can help yield the global minimum rather than a local one; thus $l_\infty$ is very useful for geometric problems. Indeed, the $l_\infty$ minimization problem measures worst-case distortion. Aghazadeh et al. [23] argued that $l_\infty$ is more robust to small perturbations; they designed a data-dependent hash function for retrieval and achieved state-of-the-art performance. Richard Hartley and Frederik Schaffalitzky claim that the $l_\infty$ cost function is significantly simpler than the $l_2$ cost [24]. They found that $l_\infty$ minimization comes down to minimizing a cost function with a single minimum (local or global) on a convex domain. Their analysis of multi-view triangulation and motion recovery problems demonstrates that the optimal solutions of small-dimension problems are reliable. Assume that $z$ is a signal; an efficient representation $x$ can be solved by [25]:

$$\arg\min_x \|x\|_1 \quad \text{s.t.} \quad \|Ax - z\|_\infty \leq \varepsilon, \quad (7)$$

in which $A$ is a transformation matrix and $\varepsilon$ is the error tolerance. The minimization problem of Eq. (7) arises in signal processing, wireless communications, and various other areas. If $C$ is designed as a domain $\{\upsilon \mid \|\upsilon\|_\infty < \varepsilon\}$, a constrained form of Eq. (7) is:

$$\arg\min_{x, c \in C} \|x\|_1 \quad \text{s.t.} \quad Ax - z = c. \quad (8)$$

After that, we enforce the equality constraint by introducing a Lagrange multiplier $\varphi$ and yield a saddle-point formulation:

$$\arg\min_{x, c \in C} \max_\varphi \|x\|_1 + \langle \varphi, Ax - z - c \rangle, \quad (9)$$

where $\langle a, b \rangle = a^T b$. The Primal-Dual Hybrid Gradient (PDHG) method is a powerful optimization scheme [26] that breaks complex problems into simple sub-steps to solve Eq. (9). Before applying PDHG to the optimization problem, we need to evaluate a proximal minimization first:

$$\arg\min_{x, c \in C} \|x\|_1 + \langle \varphi, Ax - z - c \rangle + \frac{\kappa}{2}\|c - c^k\|^2, \quad (10)$$

where $\kappa$ is a positive penalty parameter and $c^k$ is the previous iterate. A low-cost proximal operator update described in [27] can solve Eq. (10).

D. ROBUST METHODS
In the real world, data is rarely flawless, i.e., pure or with only slight noise. Unwished-for outliers can degrade model performance and even induce unreasonable results. Assume the given data comes with an unknown corruption matrix $\xi$. Taking NMF as an example, the objective function with corrupted data [28] involves the reconstruction error matrix $E = X - UV$. If $\xi$ is tiny or even zero, its limited impact may be neglected; however, a large $\xi$ can severely degrade the performance and even cause disastrous consequences. To remedy this deficiency, many robust methods have been proposed to increase immunity to outliers. Many of them are based on the hypothesis that the corruption is sampled from some distribution. For example, $L_{2,1}$NMF [17] assumed that the corruption in the data space follows the Laplace distribution and that the $L_{2,1}$ norm can model it effectively. CIMNMF [19] applied a Gaussian kernel based on the correntropy induced metric. CauchyNMF [18] employed the isotropic Cauchy distribution to overcome sensitivity to outliers.
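A small numerical illustration of why the $L_{2,1}$ loss is more outlier-tolerant than the squared Frobenius loss (an illustrative sketch of ours, not code from any of the cited papers): the Frobenius loss grows quadratically in an outlier's magnitude, the $L_{2,1}$ loss only linearly.

```python
import numpy as np

def l21_norm(E):
    # ||E||_{2,1}: sum of the l2 norms of the columns (one term per sample)
    return np.linalg.norm(E, axis=0).sum()

rng = np.random.default_rng(0)
E = 0.01 * rng.standard_normal((50, 100))   # small residuals on clean samples
E_out = E.copy()
E_out[:, 0] += 100.0                        # one heavily corrupted sample

# squared Frobenius loss is dominated by the single outlier (~scale^2),
# while the l2,1 loss grows only linearly with the outlier's magnitude
fro_clean, fro_out = np.linalg.norm(E) ** 2, np.linalg.norm(E_out) ** 2
l21_clean, l21_out = l21_norm(E), l21_norm(E_out)
```

In a factorization objective, this means a single corrupted sample cannot dominate the $L_{2,1}$ fit the way it dominates a squared Frobenius fit.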
Recently, Shen et al. explored a non-statistical method called tanhNMF [28]. They adopted the hyperbolic tangent (tanh) function to evaluate the reconstruction error, which is also an approximate truncation method. All of the above seek a proper metric to reduce sensitivity to abnormal data.

III. PROPOSED METHOD
This section introduces our similarity-based concept factorization (SCF) and its robust version (RSCF). Besides, we provide an efficient optimization algorithm based on ADMM to solve them.

A. SCF
As discussed above, CF models each cluster as a linear combination of samples and each sample as a linear combination of cluster centers. In a kernelized setting, this implies the following data reconstruction assumption:

$$\Phi(X) \approx \Phi(X) W H, \quad (11)$$

where $\Phi$ is an implicit nonlinear mapping function which maps the original samples into an infinite-dimensional feature space in the RKHS. The objective function of CF is formulated as:

$$\min_{W \geq 0, H \geq 0} \|\Phi(X) - \Phi(X) W H\|_F^2. \quad (12)$$

As one knows, similarity preservation, or the manifold structure within a dataset, plays a core role in clustering tasks [29]. Representative methods include spectral clustering [30] and symmetric NMF (SymNMF, [12]). Of them, SymNMF minimizes the similarity reconstruction error to learn cluster identities, and outperforms spectral clustering in most situations. In fact, SymNMF implies that the similarity of the learnt representation in a low-dimensional subspace should approach that of the original samples. Thus, the insight of similarity reconstruction can be represented as:

$$\min_{H \geq 0} \|K - H^T H\|_F^2, \quad (14)$$

where $K \in R^{n \times n}$ is the similarity matrix of $n$ samples. The construction of $K$ is an important factor affecting model performance; in general, a good similarity can greatly improve the effectiveness of the model. This scope is not our focus, so we do not detail how to build a ''perfect'' similarity. In our model, we use the classical $k$ nearest neighbors to measure the similarity [31]. Given the two distinct behaviors above, i.e., data reconstruction and similarity reconstruction, can we couple them into a unified reconstruction error like either Eq. (12) or Eq. (14)? The answer is yes! We couple them with the assumption that the similarity of the reconstructed samples is close to that of the original samples. This insight can be written as:

$$K \approx (\Phi(X) W H)^T (\Phi(X) W H) = H^T W^T K W H. \quad (15)$$

In terms of Eq. (15), the reconstruction error is evaluated by the Frobenius norm, and we obtain our similarity-based concept factorization (SCF) as follows:

$$\min_{W \geq 0, H \geq 0} \|K - H^T W^T K W H\|_F^2. \quad (16)$$

From Eq. (16), an equivalent form is recast as:

$$\min_{W \geq 0, H \geq 0} \|(\Phi(X) - \Phi(X) W H)^T (\Phi(X) + \Phi(X) W H)\|_F^2. \quad (17)$$

The minimization of Eq. (17) implies that the matrix $(\Phi(X) - \Phi(X) W H)^T (\Phi(X) + \Phi(X) W H)$ tends to zero. Due to the non-negativity of all the factors, including $\Phi(X)$, $W$, and $H$, $\Phi(X) + \Phi(X) W H$ is always non-negative and cannot be a zero matrix. According to linear algebra, for Eq. (17) to be zero, $\Phi(X) - \Phi(X) W H$ must either be zero or belong to the null space of $\Phi(X) + \Phi(X) W H$. Interestingly, Eq. (17) thus expands the solution space, which might contain much more useful information for clustering tasks. This character might be the reason for the effectiveness of SCF.
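To make the SCF objective concrete, the following sketch evaluates the similarity reconstruction error for a toy $k$-NN similarity (the function names and the binary, symmetrized $k$-NN construction are illustrative assumptions of ours; the paper's exact construction [31] may differ):

```python
import numpy as np

def knn_similarity(X, k=5):
    """Toy binary k-nearest-neighbour similarity, symmetrized."""
    n = X.shape[1]
    # pairwise Euclidean distances between columns of X
    D = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    K = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(D[i])[1:k + 1]   # skip self (distance 0)
        K[i, nn] = 1.0
    return np.maximum(K, K.T)            # symmetrize

def scf_objective(K, W, H):
    # || K - H^T W^T K W H ||_F^2 : the similarity of the reconstructed
    # samples should match that of the original samples
    R = K - H.T @ W.T @ K @ W @ H
    return np.linalg.norm(R) ** 2
```

A sanity check: when $WH$ equals the identity, the reconstructed similarity coincides with $K$ and the objective is zero.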

B. OPTIMIZATION ALGORITHM
Obviously, the optimization problem of Eq. (16) over $W$ and $H$ is non-convex owing to its fourth-order terms. Empirically, using MUR to optimize Eq. (16) fails to converge. To overcome this difficulty, we introduce two auxiliary variables to reduce Eq. (16) to several optimization sub-problems, and then utilize an ADMM-based algorithm [32] to solve them.
By introducing two variables $J$ and $M$ with the constraints $J = WH$ and $M = WH$, the optimization problem Eq. (16) becomes:

$$\min_{W, H, J, M \geq 0} \|K - J^T K M\|_F^2 \quad \text{s.t.} \quad J = WH, \; M = WH. \quad (18)$$

We can write the augmented Lagrangian function [33] of this optimization problem as:

$$L = \|K - J^T K M\|_F^2 + \langle Y_1, J - WH \rangle + \frac{\lambda}{2}\|J - WH\|_F^2 + \langle Y_2, M - WH \rangle + \frac{\gamma}{2}\|M - WH\|_F^2, \quad (19)$$

where $Y_1$ and $Y_2$ signify the Lagrange multipliers, and both $\lambda$ and $\gamma$ are positive penalty parameters. According to Corollary 1, the optimization sub-problems with respect to each variable are convex. Corollary 1: Eq. (19) is a convex sub-optimization problem about $W$, $H$, $J$, $M$ respectively, which can be written as follows:

$$\min_{W \geq 0, H \geq 0} \langle Y_1, J - WH \rangle + \frac{\lambda}{2}\|J - WH\|_F^2 + \langle Y_2, M - WH \rangle + \frac{\gamma}{2}\|M - WH\|_F^2, \quad (20)$$

$$\min_{J, M} \|K - J^T K M\|_F^2 + \langle Y_1, J - WH \rangle + \frac{\lambda}{2}\|J - WH\|_F^2 + \langle Y_2, M - WH \rangle + \frac{\gamma}{2}\|M - WH\|_F^2. \quad (21)$$

The proof can be found in Appendix A.
We now describe the specific optimization process for each variable.

1) OPTIMIZATION W AND H
Corollary 2: According to Eq. (20), the gradient of $L(W, H)$ with respect to $W$ and $H$ is Lipschitz continuous with corresponding Lipschitz constants $P_W = (\lambda + \gamma)\|HH^T\|_2$ and $P_H = (\lambda + \gamma)\|W^T W\|_2$, respectively. The proof of Corollary 2 is deduced in Appendix B. According to Corollary 1 and Corollary 2, we can use the optimal gradient method (OGM, [34]-[36]) to solve these sub-problems. The whole optimization procedures for $W$ and $H$ are summarized in Algorithm 1 and Algorithm 2, respectively.
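A generic sketch of such an accelerated projected-gradient step, where the Lipschitz constant supplies the step size $1/P$ (the function `ogm_nonneg` and its arguments are our naming, and the paper's Algorithms 1 and 2 may differ in detail):

```python
import numpy as np

def ogm_nonneg(grad, W0, P, iters=200):
    """Nesterov-style optimal gradient method with non-negativity projection.
    grad: callable returning the gradient at a point; P: Lipschitz constant."""
    W = Y = W0.copy()
    a = 1.0
    for _ in range(iters):
        W_new = np.maximum(Y - grad(Y) / P, 0.0)      # projected gradient step
        a_new = (1 + np.sqrt(1 + 4 * a * a)) / 2      # momentum schedule
        Y = W_new + ((a - 1) / a_new) * (W_new - W)   # extrapolation point
        W, a = W_new, a_new
    return W
```

For a convex sub-problem with an $P$-Lipschitz gradient, this scheme attains the optimal $O(1/t^2)$ convergence rate among first-order methods.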

2) OPTIMIZATION J AND M
According to Eq. (21), it is easy to find that the optimization problem about $J$ is a quadratic minimization problem, which has an analytic solution. [Algorithms 1 and 2 take as input the Lagrange multipliers $Y_1^k$ and $Y_2^k$ after $k$ iterations, the positive penalty parameters $\lambda$ and $\gamma$, and the stopping criteria $\eta$ and maxiter; they output the basis matrix $W$ and the indicator matrix $H$, respectively.] Taking the partial derivative of Eq. (19) with respect to $J$ and setting it to 0, we obtain:

$$J = \left(2KMM^T K + \lambda I\right)^{-1}\left(2KMK - Y_1 + \lambda WH\right), \quad (22)$$

where $I$ is the identity matrix and $(\cdot)^{-1}$ represents the matrix inverse.
Analogously, setting the derivative with respect to $M$ to 0 yields the closed-form solution:

$$M = \left(2KJJ^T K + \gamma I\right)^{-1}\left(2KJK - Y_2 + \gamma WH\right). \quad (23)$$

The Lagrangian multipliers in ADMM are updated by dual ascent. Thus, $Y_1$ and $Y_2$ have the following two update procedures:

$$Y_1 \leftarrow Y_1 + \lambda (J - WH), \qquad Y_2 \leftarrow Y_2 + \gamma (M - WH). \quad (24)$$

During the optimization process, the two penalty parameters constantly increase toward infinity. The generally-used approach is to multiply them by a constant $s > 1$ at each iteration. In this paper, $\lambda$ and $\gamma$ are updated as $\lambda \leftarrow s\lambda$ and $\gamma \leftarrow s\gamma$, respectively.
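The dual-ascent and penalty updates can be sketched as follows, assuming the ADMM constraints $J = WH$ and $M = WH$ (an assumption of this sketch; the function name and signature are ours):

```python
import numpy as np

def dual_updates(Y1, Y2, J, M, W, H, lam, gam, s=1.1):
    """Dual ascent on the multipliers plus penalty growth by a step s > 1."""
    Y1 = Y1 + lam * (J - W @ H)       # Y1 <- Y1 + lambda (J - WH)
    Y2 = Y2 + gam * (M - W @ H)       # Y2 <- Y2 + gamma (M - WH)
    return Y1, Y2, s * lam, s * gam   # lambda <- s*lambda, gamma <- s*gamma
```

When the constraints are satisfied exactly, the multipliers stay fixed and only the penalties grow, which is what drives the oscillation noted below.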
The whole optimization procedure for SCF is summarized in Algorithm 3. We must acknowledge that Algorithm 3 cannot guarantee that the objective function is monotonically non-increasing at each iteration. The main reason is that the Lagrangian multipliers increase, which causes the objective function to oscillate.

C. ROBUST SCF
In SCF, the similarity matrix is built on the $k$ nearest neighbors, so it is easily influenced by noisy corruptions in the real world. In particular, pairs of similar samples could receive small similarity values, and such values violate the overall data distribution. Thus, we recover the similarity values under the worst-case situation.
Let $Z$ be the gap between the similarity of the original samples and that of the reconstructed samples, that is, $Z = K - H^T W^T K W H$. Different from Eq. (16), minimizing the worst-case reconstruction error means minimizing the maximum reconstruction distortion. Thus, we rewrite Eq. (16) as:

$$\min_{W \geq 0, H \geq 0} \|K - H^T W^T K W H\|_\infty. \quad (28)$$

Eq. (28) uses the Chebyshev norm [37], i.e., the $l_\infty$ norm. As usual, the $l_\infty$ norm of a matrix is different from that of a vector, so the matrix $Z$ is reshaped into a long vector $\tilde{Z}$. To this end, the objective function Eq. (28) can be reformulated as:

$$\min_{W \geq 0, H \geq 0} \|\tilde{Z}\|_\infty, \quad (29)$$

where $\|\cdot\|_\infty$ is the above Chebyshev norm. To optimize Eq. (29), we introduce another two auxiliary variables to obtain the following problem:

$$\min \|Z\|_\infty \quad \text{s.t.} \quad Z = K - J^T K M, \; J = WH, \; M = WH, \; W, H \geq 0. \quad (30)$$

We adopt the same ADMM framework as for SCF to solve the optimization problem Eq. (30). We enforce the equality constraints by introducing three Lagrange multipliers $\Gamma_0$, $\Gamma_1$ and $\Gamma_2$. The augmented Lagrangian function can be recast as:

$$L = \|Z\|_\infty + \langle \Gamma_0, Z - K + J^T K M \rangle + \frac{\rho}{2}\|Z - K + J^T K M\|_F^2 + \langle \Gamma_1, J - WH \rangle + \frac{\lambda}{2}\|J - WH\|_F^2 + \langle \Gamma_2, M - WH \rangle + \frac{\gamma}{2}\|M - WH\|_F^2, \quad (31)$$

where $\rho$, $\lambda$ and $\gamma$ are the positive penalty parameters, which also increase constantly with a step $s$. The update procedures are described as follows.

1) OPTIMIZATION ON Z
With the other variables fixed, the optimization problem corresponding to $Z$ is a proximal problem of the $l_\infty$ norm, which can be written as:

$$\min_Z \|Z\|_\infty + \langle \Gamma_0, Z - K + J^T K M \rangle + \frac{\rho}{2}\|Z - K + J^T K M\|_F^2. \quad (32)$$

For the purpose of optimizing Eq. (32), we reshape $K$, $J$, $M$ and $\Gamma_0$ into their column-vector representations $\tilde{K}$, $\tilde{J}$, $\tilde{M}$ and $\tilde{\Gamma}_0$, respectively, which gives an equivalent vector optimization problem, Eq. (33). We obtain the saddle point of Eq. (33) by using a proximal optimization scheme based on a sorting algorithm. After this sub-problem is optimized, $\tilde{Z}$ is restored to a matrix with the same size as $Z$. We use '$\odot$' to signify the element-wise product, and the sorting-based optimization algorithm is summarized in Algorithm 3.
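One standard sorting-based realization of the $l_\infty$ proximal operator uses the Moreau decomposition $\mathrm{prox}_{\tau\|\cdot\|_\infty}(v) = v - \mathrm{Proj}_{\tau B_1}(v)$, where $\mathrm{Proj}_{\tau B_1}$ is the Euclidean projection onto the $l_1$ ball of radius $\tau$, itself computable by sorting. This is a generic sketch under these standard identities; the paper's algorithm may differ in detail, and the function names are ours:

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of vector v onto {x : ||x||_1 <= radius},
    via the classical sorting construction."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                 # sorted magnitudes, descending
    css = np.cumsum(u)
    # largest index rho with u_rho - (css_rho - radius)/(rho+1) > 0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)    # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def prox_linf(v, tau):
    """prox_{tau * ||.||_inf}(v) = v - Proj_{tau * B1}(v) (Moreau decomposition)."""
    return v - project_l1_ball(v, tau)
```

The effect is to clip the largest-magnitude entries of `v` down to a common level, which is exactly how the $l_\infty$ term suppresses the worst-case reconstruction distortion.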

2) OPTIMIZATION ON J AND M
We can also write the sub-problem with respect to $J$ as:

$$\min_J \langle \Gamma_0, Z - K + J^T K M \rangle + \frac{\rho}{2}\|Z - K + J^T K M\|_F^2 + \langle \Gamma_1, J - WH \rangle + \frac{\lambda}{2}\|J - WH\|_F^2, \quad (34)$$

and analogously for $M$. Their gradients are:

$$\nabla_J = KM\left(\rho(Z - K) + \Gamma_0\right)^T + \rho KMM^T KJ + \Gamma_1 + \lambda(J - WH), \quad (35)$$

$$\nabla_M = KJ\left(\rho(Z - K) + \Gamma_0\right) + \rho KJJ^T KM + \Gamma_2 + \gamma(M - WH). \quad (36)$$

Setting Eq. (35) and Eq. (36) to 0, the closed-form solutions can be obtained:

$$J = \left(\rho KMM^T K + \lambda I\right)^{-1}\left(\lambda WH - \Gamma_1 - KM(\rho(Z - K) + \Gamma_0)^T\right),$$
$$M = \left(\rho KJJ^T K + \gamma I\right)^{-1}\left(\gamma WH - \Gamma_2 - KJ(\rho(Z - K) + \Gamma_0)\right). \quad (37)$$

In summary, the whole procedure of optimizing RSCF is detailed in Algorithm 4.

Algorithm 4 takes as input a nonnegative matrix $X = [x_1, x_2, \cdots, x_n] \in R^{d \times n}$, positive penalty parameters $\rho$, $\lambda$ and $\gamma$, and an increment step $s$; it outputs the basis matrix $W$ and the indicator matrix $H$. The multipliers are initialized as $\Gamma_0 = \Gamma_1 = \Gamma_2 = 0_{n \times n}$, with $J = M = I_{n \times n}$, $k = 0$, and $W^k$, $H^k$ initialized randomly; each iteration ends with the penalty updates $\rho \leftarrow s\rho$, $\lambda \leftarrow s\lambda$, $\gamma \leftarrow s\gamma$, repeated until convergence.

IV. EXPERIMENTS
Additionally, the following criteria are set to conduct fair comparison experiments: 1) For all methods, the original images are subjected to the same normalization process and each image is reshaped into a vector. The basis and coefficient matrices of the compared methods are initialized with those of NMF.
2) To evaluate the clustering performance of the proposed methods, we conduct clustering experiments with the cluster number ranging from 2 to $k$. For a fixed cluster number $k$, we randomly select all the images from the $k$ categories. We run 10 experiments for each cluster number to eliminate the effects of randomization.
3) The average clustering accuracy and normalized mutual information are employed as the clustering performance evaluation metrics, computed from the predicted cluster identities and the ground truth. All results are reported in percentage.

A. DATASETS DESCRIPTION
Our experiments are conducted on five facial datasets as well as MNIST and CIFAR-10. A brief description is listed in Table 3. The important properties of these datasets are as follows: Yale dataset: The Yale face dataset [45], created by Yale University, contains 15 people. Each person has 11 facial images, 165 images in total. The Yale dataset covers different expressions (happy, normal, sad, sleepy, surprised, or wink), poses (with or without glasses) and lighting (center-light, left-light, or right-light). All images are cropped to 32 × 32 in our experiments.
YaleB dataset: The Extended Yale B dataset [46] has 38 categories with images of size 168 × 192. It contains 9 different poses and 64 illumination conditions. All images were captured under various laboratory-controlled lighting conditions and with different facial expressions. An image with background illumination was also captured for every subject in a specific pose. Following the same operation as [47], the dataset has 38 individuals with around 64 near-frontal images under different illuminations per individual, and each image is cropped to 32 × 32. We choose the top 20 classes in our experiments.

UMIST:
The UMIST face database [49] consists of 564 images of 20 people, each 220 × 220 pixels in 256 shades of grey. Each category covers a range of poses from profile to frontal views. In our experiments, we pick a subset of left-profile facial images of the top 10 classes and resize them to 40 × 40.
ORL: The ORL (Olivetti Research Laboratory) Database of Faces includes 400 grayscale images of size 60 × 112 in 40 distinct categories. The images were taken at different times, with varying lighting, facial expressions (open/closed eyes, smiling/not smiling) and facial details (with/without glasses). All images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement) [50]. In our experiments, we cropped each image to 32 × 32.
MNIST: The MNIST dataset is a classic dataset in the field of machine learning [51]. It consists of 60,000 training samples and 10,000 test samples, and is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image; each sample is a 28 × 28 pixel grayscale handwritten digit image. Our clustering experiments are conducted on the test set.
CIFAR-10: CIFAR-10 is a color image dataset closer to pervasive objects [52], a small dataset for identifying universal objects. It has 10 categories of RGB color images: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. The dataset consists of 60,000 32 × 32 color images in 10 classes, with 6,000 images per class. In our experiments, we use the test images and convert them to grayscale.

B. EVALUATION METRICS
After obtaining the clustering results, Accuracy and Normalized Mutual Information, two frequently-used external measures, are applied to evaluate the clustering performance by comparing the clustering result with the ground truth of each sample.
For each image $x_i$, let $r_i$ and $g_i$ be the corresponding clustering result and ground truth, respectively. The clustering accuracy is defined as:

$$\mathrm{ACC} = \frac{\sum_{i=1}^{n} \delta\left(g_i, \mathrm{map}(r_i)\right)}{n},$$

where $n$ is the total number of samples. The permutation mapping function $\mathrm{map}(r_i)$ converts a clustering assignment to its equivalent real assignment, and the Kuhn-Munkres algorithm [53] can help find the best mapping. $\delta(a, b)$ is an indicator function that equals 1 if $a = b$ and 0 otherwise.
Obviously, a higher clustering accuracy corresponds to better performance. Normalized Mutual Information (NMI) measures the degree of coincidence between the obtained clustering and the ground-truth clustering. Assume that $C$ and $C'$ are two sets of clusters; the Mutual Information (MI) between them is defined as:

$$\mathrm{MI}(C, C') = \sum_{c_i \in C, \, c'_j \in C'} p(c_i, c'_j) \log \frac{p(c_i, c'_j)}{p(c_i)\, p(c'_j)},$$

where $p(c_i)$ and $p(c'_j)$ are the probabilities that a data point arbitrarily selected from the dataset belongs to cluster $c_i$ and $c'_j$, respectively, and $p(c_i, c'_j)$ is the joint probability that the arbitrarily selected data point belongs to both clusters at the same time. In our experiments, we use the normalized mutual information:

$$\mathrm{NMI}(C, C') = \frac{\mathrm{MI}(C, C')}{\max\left(H(C), H(C')\right)},$$

where $H(C)$ and $H(C')$ are the entropies of $C$ and $C'$, respectively. NMI ranges from 0 to 1; the larger the NMI, the better the clustering performance. If the two sets of clusters are identical, NMI = 1; if the two sets of clusters are independent, NMI = 0.
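The two metrics can be sketched as follows (function names are ours; for small cluster counts the best label mapping can be found by brute force over permutations, whereas the paper uses the Kuhn-Munkres algorithm [53], which scales to larger $k$):

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(gt, pred):
    """ACC = max over label mappings of sum_i delta(g_i, map(r_i)) / n."""
    gt, pred = np.asarray(gt), np.asarray(pred)
    k = max(gt.max(), pred.max()) + 1
    best = 0
    for perm in permutations(range(k)):          # brute-force mapping (small k)
        mapped = np.array([perm[r] for r in pred])
        best = max(best, int((mapped == gt).sum()))
    return best / len(gt)

def nmi(gt, pred):
    """NMI = MI(C, C') / max(H(C), H(C')), estimated from label counts."""
    gt, pred = np.asarray(gt), np.asarray(pred)
    n, eps = len(gt), 1e-12
    cg = np.bincount(gt) / n                     # marginal of ground truth
    cp = np.bincount(pred) / n                   # marginal of prediction
    joint = np.zeros((len(cg), len(cp)))
    for g, r in zip(gt, pred):
        joint[g, r] += 1.0 / n                   # joint distribution
    mi = np.sum(joint * np.log(joint / (np.outer(cg, cp) + eps) + eps))
    hg = -np.sum(cg * np.log(cg + eps))
    hp = -np.sum(cp * np.log(cp + eps))
    return mi / max(hg, hp)
```

Note that ACC needs the mapping step because cluster labels are arbitrary: a perfect clustering under a permuted labeling still scores 1.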

C. RESULT AND ANALYSIS
To report statistically significant clustering results, we conduct experiments ten times on different categories of the seven datasets to eliminate the influence of random initialization, and take the average clustering results as the final results. Clustering ACC and NMI are then applied to evaluate clustering performance. The average clustering ACC and NMI of all compared methods on the seven datasets are listed in Table 4 ∼ Table 8. Several observations are worth pointing out: 1) From Table 4 ∼ Table 10, the performance of kernel-based similarity methods is better than that of the classical clustering methods or graph-based approaches. This can be attributed to the kernel-based similarity retaining the global relations of the datasets. Our SCF is able to preserve the reconstruction relationship measured by kernel-based similarity.
2) The NMF variants based on similarity reconstruction are always superior to standard NMF. This illustrates that constraints on data reconstruction or similarity can effectively improve the performance of NMF. In our model, we reconstruct the similarity of the raw data while also obtaining the data reconstruction, and our SCF surpasses the other compared methods in most cases. 3) On some datasets, SCF has only slightly higher performance than a few methods such as PNMF. However, from the results in Table 4 ∼ Table 10, it is not difficult to find that, with comparable performance, our SCF has a lower standard deviation. A lower standard deviation means more stable performance, which also illustrates the efficacy and stability of our algorithm from another aspect. 4) All the compared methods achieve lower results on CIFAR-10. This results from two reasons. First, CIFAR-10 is a color image dataset with RGB

5) For comparison, we do not include deep clustering methods, because their powerful performance stems from advanced deep representations [54], [55]. This work seeks to refine CF to better promote the data representation of NMF, so the compared methods are NMF-based. As shown in Table 9, the proposed methods achieve satisfactory performance compared to their counterparts.

D. ANALYSIS ON RSCF
As we know, outliers affect data reconstruction; they destroy similarity reconstruction as well. In this subsection, we discuss the robustness of the baseline methods, including CIMNMF, CauchyNMF, $L_{2,1}$NMF, tanhNMF, tanhNMF+ and our SCF. We conduct the robustness experiments on three datasets: Yale, UMIST and ORL. We set the default parameters of the baseline methods according to the descriptions in their original papers; for SCF and RSCF, we set the parameters according to Table 1 and Table 2, respectively. In addition, we consider the performance of the robust methods after manual corruption on three facial datasets including Yale, UMIST, and PIE. The corrupted data are generated in the same way as [28]: we randomly select 10% ∼ 50% of the images and conduct pixel perturbation and key-part occlusion simultaneously on these images. We perform corruption on the images of the maximum $k$ clusters and select a proportion $r$ of images; in Yale we disturb the pixel area of the mouth using an 8 × 20 random corruption weight matrix, while for UMIST and ORL we disturb the eyes with a 6 × 20 random corruption weight matrix. The other experimental settings are the same. Fig. 3 ∼ Fig. 5 plot a few images with/without manual corruption. Fig. 6 ∼ Fig. 12 illustrate the performance of the compared methods, while Table 11 ∼ Table 13 show the average clustering ACC and NMI versus various proportions of corrupted images. Some observations from the figures and tables are as follows: 1) Our RSCF achieves satisfactory performance against the other robust methods, attaining a higher baseline measured by ACC and NMI. Our model has certain robustness advantages compared with the other models. 2) The clustering performance of our RSCF shown in Fig. 6 ∼ Fig. 12 also surpasses the other robust methods. This benefits from two aspects. First, our RSCF inherits the advantages of SCF, which can reconcile data and similarity reconstruction.
Second, the $l_\infty$ norm handles small errors well, which provides robustness. In the figures, we also notice that on some datasets, such as YaleB, RSCF only slightly surpasses the other robust methods. We suppose that after some distortions have been removed, the $l_\infty$ norm has a limited effect on the residual error. 3) Table 11 ∼ Table 13 describe the clustering performance under different corrupted-image ratios. As $r$ goes up, the performance of the robust methods does not fluctuate largely; however, our RSCF is still superior to the other robust methods, which indicates that RSCF is less sensitive to the outliers. From these results, we can conclude that our method is robust to some degree. 4) As Fig. 12 shows, RSCF is only slightly superior to the other compared robust NMF methods on CIFAR-10. This could be because the gray images lose some important information, which makes it difficult for the robust methods to perform clustering even if they are insensitive to noise. 5) From Table 11 ∼ Table 13, it is worth noticing that our SCF seems to achieve clustering performance similar to the robust methods, and the $l_\infty$ minimization in RSCF can further suppress the maximum reconstruction error.

V. CONCLUSION
In this paper, we propose a similarity-based concept factorization method called SCF. SCF considers similarity reconstruction alongside data reconstruction. However, its non-convex objective function is hard to optimize, so we employ the ADMM optimization framework and the OGM algorithm to deal with it. Furthermore, RSCF benefits from the insensitivity of the $l_\infty$ norm to cope with small distortions of similarity reconstruction, which provides a worst-case perspective. Clustering experiments conducted on five popular face datasets with two metrics and 20 methods verify that SCF and RSCF are superior to the compared methods, including the variants of NMF and CF.

APPENDIXES
APPENDIX A PROOF OF COROLLARY 1
Proof: Given the two Lagrange multipliers $Y_1$ and $Y_2$, Eq. (19) as a sub-optimization problem in the four primal variables $W$, $H$, $J$, $M$ can be formed as an equivalent problem. Obviously, when we optimize one of the variables while the others are fixed, we obtain the two types of sub-optimization problems in Eq. (20) and Eq. (21). We now demonstrate that these two sub-optimization problems are convex.
Theorem 1: A twice continuously differentiable function $f$ is convex if and only if its Hessian of second partial derivatives $\nabla^2 f$ is positive semidefinite (PSD).
The proof of Theorem 1 can be found in [56].
$\nabla_W L(W, H)$ and $\nabla_H L(W, H)$ are the gradients with respect to $W$ and $H$, from which we obtain the second partial derivatives $\nabla^2_W L(W, H)$ and $\nabla^2_H L(W, H)$. It is clear that $\nabla^2_W L(W, H)$ and $\nabla^2_H L(W, H)$ are positive semidefinite, so by Theorem 1 the sub-problems are convex. Theorem 2: A differentiable function $f$ has a Lipschitz continuous gradient if $\|\nabla f(x) - \nabla f(y)\| \leq P\|x - y\|$ for all $x, y$, where $P > 0$ is referred to as a Lipschitz constant for the function $f$.
Gowers [57] proved Theorem 2. For any two non-negative matrices $H_1, H_2 \in R^{k \times n}$, according to the compatibility of the matrix norm, we have:

$$\|\nabla_H L(H_1) - \nabla_H L(H_2)\| \leq (\lambda + \gamma)\|W^T W\|_2 \|H_1 - H_2\|,$$

where the inequality holds because of Frobenius-norm compatibility. From Theorem 2, we obtain that $\nabla_H L(H)$ is Lipschitz continuous with Lipschitz constant $(\lambda + \gamma)\|W^T W\|_2$. Similarly, $\nabla_W L(W)$ is also Lipschitz continuous with Lipschitz constant $(\lambda + \gamma)\|HH^T\|_2$. This completes the proof.