Robust Graph Regularized Nonnegative Matrix Factorization

Nonnegative Matrix Factorization (NMF) has become a popular technique for dimensionality reduction, and been widely used in machine learning, computer vision, and data mining. Existing unsupervised NMF methods impose the intrinsic geometric constraint on the encoding matrix, which only indirectly affects the base matrix. Moreover, they ignore the global structure of the data space. To address these issues, in this paper we propose a novel unsupervised NMF learning framework, called Robust Graph regularized Nonnegative Matrix Factorization (RGNMF). RGNMF constructs a sparse graph imposed on the basis matrix to catch the global structure and preserve the discriminative information. And it models the local structure by building a k-NN graph constrained on the encoding matrix, which gains the compact representation. Consequently, RGNMF not only respects the global structure, but also depicts the local structure. In addition, it employs such a <inline-formula> <tex-math notation="LaTeX">$\text{L}_{2,1}$ </tex-math></inline-formula>-norm cost function to decompose the basis matrix and encoding matrix that its robustness can be improved. Further, it imposes the <inline-formula> <tex-math notation="LaTeX">$\text{L}_{2,1}$ </tex-math></inline-formula>-norm constraint on the basis matrix to choose the discriminative feature. Hence, RGNMF can gain the robust discriminative representation by combining structure learning and <inline-formula> <tex-math notation="LaTeX">$\text{L}_{2,1}$ </tex-math></inline-formula>-norm constraints imposed on the basis matrix and encoding matrix. Extensive experiments on real-world problems demonstrate that RGNMF achieves better clustering results than the state-of-the-art approaches.


I. INTRODUCTION
As a popular technique for data representation, nonnegative matrix factorization (NMF) has been successfully applied in computational intelligence [1], [2], machine learning [3], [4] and data mining [5], [6], [7]. Its power lies in its ability to give meaningful decompositions of data into two nonnegative matrices. Compared with other matrix factorization techniques, NMF can achieve relatively good performance. Moreover, the basis vectors obtained by NMF provide interpretability and clear physical meaning from the nonnegative view. Hence, more and more researchers have paid close attention to NMF. Specifically, given a nonnegative matrix Y = [y 1 , y 2 , . . . y n ] ∈ m×n consisting of n samples, NMF The associate editor coordinating the review of this manuscript and approving it for publication was Berdakh Abibullaev .
focus on finding two low-dimensional nonnegative matrices A and X so that the product of A and X approximates Y . The objective function of NMF with the square of the Frobenius norm is defined as follows [8]: where A represents a basis matrix with m rows and l columns. X ∈ l×n is usually a coefficient matrix (or an encoding matrix) and also considered as the new representation of data with respect to the new basis A. By alternately optimizing A and X in (1), one can gain their optimal solution. Although NMF can provide solid mathematical theory and encouraging performance, it suffers from three limitations in real-world applications. First of all, it ignores the intrinsic geometric structure. It has been shown that the intrinsic geometry of the data distribution is essentially useful for many practical problems [9], [10], [11]. secondly, it fails to apply the supervised information of data. The supervised information can be used to learn more discriminative features [12], [13]. Finally, it is sensitive to noise and outliers [14]. To address these problems, recently, a number of improved NMF algorithms have been proposed and successfully applied in various fields.
According to whether the supervised information is used, we divide existing NMF algorithms into three categories: supervised, semi-supervised and unsupervised. Supervised NMF methods make use of labeled data to explore the discriminative representation that enforces the separability between classes and promotes the performance of NMF to some extent. Zafeiriou et al. [15] introduced well-known linear discriminant criterion into NMF by using the data labels, which increased the separability of samples from different classes and enforced the compactness of samples from the same class. Guan et al. [16] employed the label information of data to formulate between-class k-nearest neighbor (k-NN) and within-class k-NN scatters and then incorporated them into NMF. Similar to [16], An et al. [17] expanded the local regions of the between-class neighborhood and reduced the local regions of the withinclass neighborhood to extract discriminative features of data. Li et al. [18] exploited the data labels to respectively construct a neighbor graph and a penalty graph for capturing both the within-class compactness and the between-class distinctness of data. Nikitidis et al. [19] sought the discriminant projection by incorporating subclass-based constraints into the loss function of NMF. Generally, it is difficult to obtain the class labels of the whole data, but relatively easy to gain a small number of data labels or pairwise constraints. Liu et al. [20] proposed a semi-supervised NMF, which explicitly imposes the label information on the encoding matrix as additional hard constraints to improve the discriminative ability of the representation. Different from [20] that uses a few data labels to guide matrix decomposition, Zhang et al. [21] integrated pairwise constraints in the form of must-link and cannot-link in the objective function of NMF to find an appropriate indicator matrix. Wang et al. [22] propagated both cannotlink and must-link constraints to unlabeled samples for constructing a new data weight matrix. Li et al. [23] applied the labeled and unlabeled data to gain the discriminative representations by investigating the block-diagonal structure learned by the label information. Jiao et al. [64] proposed a novel semi-supervised NMF method by merging the hypergraph regularizer and class label into NMF.
Generally, it is very expensive to gain the class label of data. For example, In the diagnosis of tuberculosis, it is difficult for a doctor to judge whether the tumor is negative or positive according to a medical image of the lung. Therefore, unsupervised NMFs are more suitable for solving real-world problems than supervised and semi-supervised NMF approaches [24]. Dirichlet matrix factorization [25] exploits matrix factorization to enhance the prediction and the reliability of recommender system, so as to achieve a better performance for recommender system. Sparseness constraint-based NMF [26] incorporates sparseness constraints into NMF to explicitly control the sparsity of the factor matrices. In various applications, data are usually contaminated by noise and outliers. To address the issue, Kong et al. [14] minimized the L 2,1 -norm loss to factorize the original matrix into the two matrices which has been shown to handle noise and outliers. Studies [10], [28] have shown that manifold learning technique can be applied to improve the performance of NMF. Robust manifold NMF (RMNMF) [27] depicts the local structure and relaxes the nonnegativity of the basis matrix for seeking the proper factorization. NMF with Adaptive Neighbors (NMFAN) [29] alternately seeks the similar matrix, the basis matrix and the encoding matrix by constructing an adaptive k-NN graph. Low-rank matrix factorization [30] exploits the k-NN graph regularization and low-rank factorization to find these two factor matrices. General subspace constrained NMF [31] regularizes NMF with various subspace constraints formulated into a certain form. Graph regularized low-rank NMF (GNLMF) [32] incorporates the local structure into the nonnegative low-rank matrix factorization framework to get an effective low-rank data representation. Zhang et al. [33] exploited the manifold regularization and matrix factorization to simultaneously solve the affinity matrix and the encoding matrix. Yi et al. [34] constructed a sparse graph and a k-NN graph and merged them into NMF for seeking two factor matrices. Peng et al. [55] proposed a novel robust log-norm regularized sparse NMF method (RLS-NMF). RLS-NMF formulates L 2,log -shrink operator as the solution to the L 2,log -(pseudo) norm, which makes the data with noise subtraction nonnegative. With L 2,log -shrink operator, it develops multiplicative updating rules to gain a robust parts-based representation by sparser solutions. Yu et al. [65] exploited the correntropy measure in the loss function and constructed a hypergraph to preserve the high-order geometric information of the data, which improved the performance and robustness of NMF.
The above-mentioned NMF-based methods are linear, and effectively decompose the data located in the linear space. Recently, with the successful application of deep learning, many researchers have combined deep learning technique with NMF to handle the matrix factorization problem of nonlinear data. Trigeorgis et al. [56] proposed a deep semi-NMF via graph regularization technique, which can learn such hidden representation and interpret clustering according to different unknown attributes of a given data set. Deep NMF based on autoencoders (DNMF) [57] is applied to improve nonlinear data-driven fault detection by combining deep autoencoders into NMF. Zhao et al. [58] proposed a deep NMF framework based on underlying basis image learning, which is used to extract features that reflect the depth positioning features of samples. Sparse dual graph-regularized DNMF [59] respects the geometric structures of feature manifold and data manifold to seek the data information of hidden layers so that it learns a sparse and compact representation. Semi-supervised graph regularized DNMF (SGDNMF) [60] employs a small number of labels to learn a representation from the hidden layer of the deep network. Furthermore, SGDNMF introduces bi-orthogonal constraints on two factor matrices into NMF framework to make the solution unique. By revealing the hierarchical semantics of the input data, Huang et al. [61] proposed a new collaborative deep matrix factorization framework to seek the hidden representation of different attributes.
Obviously, supervised and semi-supervised NMF approaches can depict the global structure with the labeled data. Because of the lack of the supervised information, however, unsupervised ones pay more attention to the local structure of data or the sparsity of the factor matrix to gain the compact representation. Consequently, they fail to respect the global structure so that the discriminative power of the representation is limited to some extent. In addition, many of them constrain the geometrical information on the encoding matrix, which can find a proper encoding matrix. For many tasks, however, especially feature extraction, clustering and classification, the projection matrix constructed by the base matrix is used to project the original data into various subspaces [35], [36], [37]. Clearly, if the geometric constraint is imposed on the encoding matrix, the influence of data geometry on the base matrix is indirect. Recent studies have shown that the sparse graph constructed based on sparse representation technique can capture the discriminative information [38], [39]. For example, by constructing sparse graph, sparsity preserving projections (SPP) [40] not only naturally preserves the global structure of data, but also contains discriminative information. It has been shown that the performance of the learning model can be markedly enhanced if the discriminative information is exploited and the geometric structure is respected [16], [17], [41].
To address the above-mentioned problems, we propose a novel unsupervised NMF framework, called Robust Graph regularized NMF (RGNMF) which combines a spare graph and a k-NN graph construction into matrix factorization to learn a discriminative representation. To be specific, RGNMF captures the global structure and preserves the discriminative information by constructing the spare graph imposed on the basis matrix. And the new model constructs the k-NN graph constrained on the encoding matrix to depict the local geometric structure, which can learn the compact representation. Further, RGNMF exploits the L 2,1 -norm loss function to seek the basis matrix and encoding matrix so that it is insensitive to noise and outliers. Also, it imposes L 2,1 -norm constraint on the basis matrix to choose the discriminative feature. Hence, the proposed algorithm can gain a more discriminative representation for subsequent tasks by combining structure learning and L 2,1 -norm constraints imposed on the basis matrix and encoding matrix. In addition, an optimization scheme is developed to alternately solve such two metrices. Its convergence is proved theoretically and experimentally.
The proposed RGNMF has the following four contributions: (1) Different from existing approaches that ignore the geometric structure or only considers the local structure, our RGNMF algorithm not only respects the global structure, but also depicts the local structure. Moreover, it characterizes local and global structures as two regularization terms integrated into its loss function for respectively seeking the basis matrix and encoding matrix. Hence, RGNMF is particularly suitable for solving real-world problem via learning the intrinsic structure. (2) RGNMF constructs a sparse graph imposed on the basis matrix to catch the global structure and preserve the discriminative information. And it models the local structure by building a k-NN graph constrained on the encoding matrix, which gains the compact representation. Thus, RGNMF cannot only find more discriminative representations, but also project the new samples into the low-dimensional subspace by the learned basis. (3) RGNMF employs such a L 2,1 -norm cost function to decompose the basis matrix and encoding matrix that its robustness can be improved. Further, it imposes L 2,1 -norm constraint on the basis matrix to choose the discriminative feature. Obviously, the proposed algorithm can gain the robust data representation by combining structure learning and L 2,1 -norm constraints imposed on the basis matrix and encoding matrix. This indicates that RGNMF can naturally be used as a preprocessing technique for subsequent tasks, such as classification and clustering. (4) The power of our RGNMF algorithm lies in its ability to integrate feature learning, representation learning and L 2,1 -norm constraints into a general framework. Such a framework can be easily spread to supervised and semisupervised scenarios. This naturally brings about wider application of RGNMF. We organize this paper as follows: In Section 2, we briefly review several algorithms closely related to our work, such as KLS-NMF, GNLMF, and ENMF methods. In Section 3, our algorithm is introduced and the convergence proof of our optimization scheme is described in detail. Section 4 presents the experimental results and analysis. Finally, we conclude this paper in Section 5.

II. RELATED WORKS
In this section, we briefly review existing methods closely related to our work. These algorithms aim to respect the local geometric structure of data in the unsupervised scenario.

A. NMF WITH LOCAL LEARNING
Recent studies have shown that the local structure of data can be employed to enhance the quality of the learned representation [42], [43]. However, NMF fails to discover the intrinsic structure in its model. To this end, Cai et al. [10] proposed a graph regularized NMF (GNMF) to respect the intrinsic geometry of data. GNMF depicts the local structure by setting up a nearest neighbor graph and integrate it into NMF to seek 86964 VOLUME 10, 2022 the compact representation. RMNMF [27] and MNMFL 2,1 [28] extend GNMF by exploiting the L 2,1 -norm loss function to replace the least square error function. different from the above approaches that depict the local structure by building a nearest neighbor graph, a kernel local similaritybased NMF algorithm (KLS-NMF) [9] is proposed, which merges kernel local similarity learning and self-expressive property into matrix factorization for clustering. Specifically, KLS-NMF introduces self-expressive property to formulate a new basis matrix. Thus, the data matrix Y can be approximately expressed as the product of Y and A. KLS-NMF formulates the weight of the similarity between samples as the product of the basis matrix and encoding matrix. Moreover, it enforces an orthogonality constraint on the encoding matrix for enhancing the clustering performance. KLS-NMF solves the following problem: where Tr(•) denotes the trace function of a matrix and Tr(A T MX) is a regularizer of the local similarity. M is a metric matrix with S ij = x i − x j 2 2 and Z =AX T denotes a similarity matrix. To address the nonlinear problem, KLS-NMF introduces the kernel function into the model (2) and thus obtains the following loss function: (3)

B. GRAPH REGULARIZED LOW-RANK NMF
Li et al. [31] incorporated NMF and the graph regularizer into a low-rank recovery algorithm for finding the essential representation of the data and thus proposed a graph regularized NLMF (GNLMF). Specifically, GNLMF decomposes its model into two subproblems: low-rank recovery and matrix factorization. It first applies the low-rank recovery technique to remove blur or noise in the original data, which optimizes the following objective subproblem: where G is the low-rank part of Y and B denotes the blur or noise. After obtaining G, GNLMF encodes the geometric information by constructing the nearest neighbor graph and extends the objective function of NMF. Thus, the cost function of GNLMF is defined as the following optimization problem:

C. ELASTIC NMF
To address the problem of noise and outliers in the data, Xiong et al. [3] proposed a novel graph-regularized ENMF, which is adapted in Frobenius norm and L 2,1 -norm. The elastic loss is used to fit matrix factorization for giving the data and is defined as follows: where according to the scale parameter δ, the first term is called L 2 pseudo loss and the last term is called L 1 pseudo loss. ENMF also considers the geometric structure of data by constructing the affinity graph and merges it into the elastic loss function as the regularization term. Thus, it optimizes the following objective function: min The above-mentioned algorithms improve the performance of NMF from different perspectives. However, these algorithms usually impose the intrinsic geometric information on the encoding matrix which indirectly affects the basis matrix. In addition, they fail to find the discriminative mapping, which is beneficial to obtain a better subspace representation. To address these issues, we propose a novel algorithm called RGNMF in this paper.

III. ROBUST GRAPH REGULARIZED NONNEGATIVE MATRIX FACTORIZATION (RGNMF) A. MODEL FORMULATION
As analyzed above, the intrinsic geometric and discriminative information of the data space plays an essential role in finding such a more discriminative representation [12], [18], [23]. In fact, the quality of the representation obtained by most approaches is relatively poor due to the omission of the important information. Clearly, if the proposed algorithm satisfies the following three conditions, it can improve the quality of the data representation.
(1) It should meet the local invariance. In other words, if two samples are neighbor in high-dimensional space, their corresponding representations are also neighbor in the low-dimensional space. (2) It should be able to discover the global structure, which is used to enhance the discriminative power of the representation. (3) It should choose the discriminative feature and make the model more robust. When the data are projected into the low-dimensional space, one needs to preserve the local structure for finding the compact representation. To this end, samples from the high-dimensional space are close, their representations in the latent space should be close. For example, if two samples y i and y j are close, their corresponding low-dimensional representations x i and x j should be close. Therefore, this neighbor relationship is achieved by the following problem: where the graph Laplacian L S can be computed by D-S. Accordingly, any diagonal element of the diagonal matrix D is the sum of the corresponding column (or row) and the similarity matrix S can be obtained by Gaussian kernel function. Clearly, the first condition can be realized by optimizing the problem (8). Because of the complexity of data structure in real-world applications, it is difficult to enhance the discriminative power of the representation by considering only one structure. For example, the face images of one person forms a subspace, and those of different persons consist of multiple subspaces. Because of the influence of illumination and posture, the data structures composed of these face images are very complex. If only the local structure is exploited, all images are not easy to be segmented into corresponding subspaces [44]. Hence, the global structure needs to be taken into account. Modified sparse representation (MSR) technique has been proved to be able to capture the global structure of data and find the discriminative mapping, which is helpful for the subsequent classification and clustering tasks [38], [40], [45]. Hence, we introduce MSR to respect the global structure and find the discriminative mapping by constructing the sparse graph. Specifically, we use as few samples as possible to reconstruct each sample y i . We look for a sparse reconstructive weight z i for each sample y i by solving the following L 1 -norm optimization problem: min where , . . . , z in ] T denotes an ndimensional reconstructive vector where the i-th element is zero, and · 1 is the L 1 -norm. 1 ∈ n is a column vector whose entries are 1. The problem (9) is the widely used selfexpression property of data, which represents each sample as a linear combination of other samples in the same group. It can reflect the intrinsic geometric characteristics of data and preserve potential discriminative information [38], [40], [68]. Lin et al. [69] exploited the self-expression property of data to seek a smooth node representation, so as to achieve multi-view graph clustering. The alternating direction method of multipliers (ADMM) [49] has been proven to be effective in solving L 1 -norm optimization problems [50], [51], [52]. Thus, we exploit it to solve the problem (9) with regard to Z . Following methods [50], [51], [52], we make use of ADMM to solve the sparse weight Z . We first transform the problem (9) into the following equivalent problem: min We define an augmented Lagrange formula to solve the problem (10): where C and H are two Lagrange multipliers, µ >0 denotes a penalty parameter. Since there are two variables Z and V in the problem (11), we need to solve them alternately. Hence, we design the following three steps to deal with these two variables.
Step 1: Computing Z . When V is fixed, Z is updated by solving the problem we define the following Lagrange function to obtain Z : where η and ς i are two nonnegative Lagrange multipliers. We take the partial derivative of Eq. (13) with respect to z i and set it to 0, thus obtaining: We further simplify Eq. (14) and gain: where According to KKT condition, we have ς ij z ij = 0 and thus get where (τ ) + = max(0, τ ). Without loss of generality, we assume that δ i1 , δ i2 , . . . , δ in are sorted from small to large. If k elements of the optimal z i are nonzero, then we get z ik > 0 and z i,k+1 = 0 in the light of Eq. (17). It follows that Step 2: Computing V . When Z is fixed, V is updated by solving the problem We update V by solving the optimization problem Therefore, we can gain the closed solution of V : where denotes the shrinkage operator [53].
Step 3: Computing two Lagrange multipliers and the penalty parameter. C, H and µ can be updated as follows: where ρ and µ max are two nonnegative constants. We summarize the process of solving problem (9) in Algorithm 1. The theoretical results of ADMM [49], [50] have proved that the iterative process of solving two variables is convergent. As formulated in Algorithm 1, we exploit the ADMM method to iteratively solve two variables. Hence, Algorithm 1 converges.
After obtaining the optimal weight vector z i , the sparse reconstructive weight matrix is denoted as Z = [z 1 , z 2 , . . . , z n ]. As demonstrated in [38], [40], and [45], the sparse graph has three important advantages: 1) Since z i is constructed by exploiting all the samples, it characterizes the global structure. 2) Although there are no class labels available, the discriminative information can be naturally preserved in the weight matrix. 3) Because of its sparsity, the weight matrix Z is robust to noise and outliers in the data. Since NMF is a popular dimensionality reduction technique, A is also used as a projection matrix [10], [18], [22]. Our proposed RGNMF aims to project samples with the same structure into the same cluster and samples with different structures into different clusters. Hence, it can preserve the discriminative information through the basis matrix. To achieve this, after obtaining the reconstructive weight matrix Z , we can optimize the following problem to gain the basis matrix: For the convenience of calculation, we simplify (26) as follows: where the i-th entry of the n-dimensional column vector e i is 1, and the other entries are 0. L Z = I − Z T − Z + Z T Z . By minimizing the problem (27), we respect the global structure and preserve the discriminative information. Therefore, we can fulfil the second property.
Discriminative Ridge Machine (DRM) [54] is a supervised classification method by in introducing a discriminant ridge regression. DRM makes use of data labels to investigate the between-class scatter and within-class scatter, respectively. Therefore, it can accurately derive class information and obtain an appropriate representation model by taking into account the discriminativeness between classes. Although DRM and our RGNMF can learn a discriminative representation, they have two main differences: (1) With data labels, DRM can learn the global discriminativeness by maximizing between-class separability. Since our RGNMF is an unsupervised algorithm, we solve the problem (9) to respect the global structure. After arriving at the reconstructive weight matrix, our algorithm can learn the global discriminativeness by minimizing the problem (26). It is worth noting that, when DRM does not use data labels, it uses the same method as RGNMF to learn the global structure.
(2) DRM seeks the local discriminativeness by maximizing within-class similarity. Different from DRM, our RGNMF considers the local geometrical structure by minimizing the problem (8). Actually, the difference between DRM and RGNMF in learning the local discriminativeness is a typical difference between supervised and unsupervised methods. NMF and its variants usually employ the least squares loss function to minimize the two nonnegative factors. However, their performance declines when the data are contaminated by noise and outliers. On the other hand, the L 2,1 -norm loss function can effectively deal with contaminated data [27], [28], [32]. Therefore, we introduce a L 2,1 -norm loss to handle the problem of noise and outliers as where the L 2,1 -norm of the matrix A is equivalent to Tr(A T PA), that is, A 2,1 = Tr(A T PA). The diagonal matrix P can be expressed as [P ii ] = 1/( a i 2 ). In (28), the first term is used to enhance the robustness of matrix factorization, and the second term is used to choose discriminative features [43], [46]. The advantage of using the L 2,1 -norm objective function is that our RGNMF can well address the problems of noise and outliers. Such an advantage has been proved by many NMF-based methods [7], [14], [27], [28]. To demonstrate the robustness of the L 2,1 -norm objective function, we display the experimental results of the L 2,1 -norm and L 2 -norm objective functions on the toy data set. The toy data set consists of ten data points, two of which are outliers. The experimental results are shown in Fig. 1. When the data set is contaminated by the outlier y r , the residual o r = y r − Ax r 2 2 is larger than those of other data points. Thus, the outlier can easily control the L 2 -norm objective function. As can be seen from the left panel of Fig. 1, the L 2 -norm results are closer to the outliers than the L 2,1 -norm results. Further, we assign larger values to two outliers. In other words, two outliers are farther away from other points. We can observe from the right panel of Fig. 1 that the L 2 -norm results are greatly affected. However, the L 2,1 -norm results seem to remain unchanged. This indicates that the L 2,1 -norm objective function is more robust than the L 2 -norm objective function.
According to the above formulation, the loss function of our RGNMF can be defined as where α, β and λ are three nonnegative parameters. By optimizing the problem (29), the above three conditions can be met. It is worth noting that although the k-NN graph and L 2,1norm were introduced into existing approaches [10], [27], [28], our algorithm has the following three differences from them: (1) Unlike other methods that use only a single graph, our RGNMF algorithm simultaneously constructs two graphs, namely, the sparse graph and the k-NN graph. The former graph is used to respect the global structure, and the latter one is used to preserve the local structure. Our algorithm characterizes local and global structures as two regularization terms integrated into our objective function. Hence, RGNMF takes advantage of complex data structure, which is particularly suitable for solving real-world problem. (2) Different from the existing methods of imposing the discriminative constraint on the coding matrix, our algorithm constructs the sparse graph imposed on the basis matrix to catch the global structure and preserve the discriminative information. And it models the local structure by building the k-NN graph constrained on the encoding matrix. Thus, RGNMF cannot only find more discriminative representations, but also project the new samples into the low-dimensional subspace by the learned basis. (3) Our algorithm lays special stress on the joint L 2,1 -norm optimization of the cost function and the basis matrix. The L 2,1 -norm loss function is insensitive to noise and outliers. And the L 2,1 -norm optimization on the basis matrix is applied to choose the discriminative feature. In a word, the proposed algorithm can gain the robust data representation by combining structure learning and L 2,1 -norm minimization of the cost function and the basis matrix. This indicates that our algorithm can naturally be used as a preprocessing technique for subsequent tasks, such as classification and clustering.

B. OPTIMAL SOLUTION FOR TWO FACTOR MATRICES
To gain the optimal solution of the two matrices A and X , we update one matrix by fixing the other. To this end, the loss function of RGNMF in (29) can be expressed as where we apply Tr(U ) = Tr(U T ) and Tr(UH) = Tr(HU) to merge similar terms. The diagonal entries of two diagonal matrices P and Q are computed as and Considering inequality constraints A ≥ 0 and X ≥ 0, two Lagrange multipliers ψ ik and φ kj need to be set for these two variables A ik and X kj . Therefore, we can introduce the following Lagrange problem with = [ψ ik ] and = [φ kj ]:

L = Tr(YPY T ) − 2Tr(YPX T A T ) + Tr(AXPX T A T ) + αTr(A T YL Z Y T A) + βTr(XL S X T ) + λTr(A T QA)
We calculate the partial derivatives of (33) concerning two variables A and X as follows: We set ψ ik A ik = 0 and φ kj X kj = 0 by applying the KKT conditions and arrive at two equations with regard to A ik and X kj .
Hence, we gain the following solutions of A and X : Obviously, the solutions of A and X are an iterative updating process. When such an updating process stops, their optimal solutions can be obtained. We formulate the iterative updating process for A and X in Algorithm 2.

C. PROOF OF CONVERGENCE
We need to prove that the objective function in (30) is nonincreasing under the updating rules in (38) and (39). The presented proof will adopt an auxiliary function similar to that used in NMF and its variants. Therefore, we introduce the definition of the auxiliary function.
Definition 1: Auxiliary Function) Give any two function

an auxiliary function of H (b).
Proposition 1: Suppose G(b, b ) is the auxiliary function of H (b), the function H is decreasing via the following optimization problem: Proof of Proposition 1: As seen from (38) and (39), the updating rules for A and X need to be executed alternately. We first demonstrate the convergence of the updating rule for A in (38). To this end, we fix three variables fix X , P, and Q, and define a proper auxiliary function for A. For any element A ik in A, we make use of F(A ik ) to denote the part of (30) with regard to A. F(A ik ) is formulated as: Proposition 2: The function is an auxiliary function of F(A ik ).
Proof of Proposition 2: It is easy to check that G(A, A) = F(A). According to Definition 1, we need to verify G(A, A t ) ≥ F(A). Consequently, we compute the first-order and the second-order derivative of F(A) about A, respectively: With (43) and (44), it is easy to gain a Taylor series problem of F(A), From (42) and (45), we can observe that G(A, A t ) ≥ F(A) is equivalent to We can gain the following inequality according to A ≥ 0 and X ≥ 0: (48) and By combining (47)- (49), (42)

holds and G(A,A t ) ≥ F(A).
We finish the proof of Proposition 2. VOLUME 10, 2022 Theorem 1: When the updating rule of (38) is used to solve the matrix A, the value of our objective function in (30) is decreasing.
Proof of Theorem 1: We substitute (42) into (40) to gain the following equation with the help of Propositions 1 and 2: It is easy to see that (50) and (38) are consistent. We obtain that G (A, A t ) is an auxiliary function of F(A). Thus, the value of F(A) is decreasing under (38) when the other variables are fixed.
Similar to the proof of the updating rule for A in (38), we prove that the value of F(X) is decreasing, in which F(X) is the objective function in (30) by fixing A, P and Q. F(X) can be described as is an auxiliary function of F(X) only related to X. Proof of Proposition 3: Since G(X,X) = F(X) is evident, we just need to prove G(X , X t ) ≥ F(X). Similar to the proof of Proposition 2, we formulate the first and the second derivative of F(X): With (53) and (54), the function F(X) can spread the following Taylor series problem: To verify G(X , X t ) ≥ F(X), we need to demonstrate that the following inequality holds: According to A ≥ 0 and X ≥ 0, the following inequality holds: (57) and The sum of the left-hand side of (57) and (58) is greater than or equal to the sum of their right-hand side. Hence, we have G(X , X t ) ≥ F(X). Proposition 3 is proven.
Theorem 2: When the updating rule of (39) is used to solve the matrix X, the value of our objective function in (30) is decreasing.
Proof of Theorem 2: We substitute (52) into (40) to gain the following equation with the help of Propositions 1 and 3: Because of Proposition 3, the value of F(X ) is decreasing under (39) when the other variables are fixed. We have completed the proof of theorem 2.
Theorem 3: When the updating rules of (38) and (39) are used to solve the nonnegative matrices A and X , the value of our objective function in (30)

is decreasing. RGNMF remains stable if and only if A and X reach the local optimal value.
Proof of Theorem 3: Assuming that Algorithm 1 is executed to the (t+1)-th iteration, we gain the following inequality with Theorem 1: where X t , P t and Q t denote the values of the t-th iteration.
When the values of A t+1 , P t and Q t are fixed, we can gain the following in equality via Theorem 2: According to (42) and (43), we have that is where U = Y -AX whose i-th column vector is u i . From Lemma 1 in [46], the following inequation holds: Further, we have ).
(65) 86970 VOLUME 10, 2022 We get the following inequality by combining (63) and (65): Therefore, According to the above analysis, we have Similar to the proof of (67), we gain Hence, Theorem 3 has proved and thus Algorithm 1 is convergent. When A and X are updated by exploiting (38) and (39), they remain stable if and only if they reach the local optimal value.

D. COMPUTATIONAL COMPLEXITY
We use a big O notation to describe the complexity of the proposed algorithm. It is important to note that we need to compute two weight matrices Z and W . It cost O(mnlog(m)) to compute Z . In fact, Solving Z results in a sparse representation issue, which can be efficiently solved by off-theshelf algorithms. RGNMF also needs O(mn 2 ) to construct the nearest neighbor graph.
From Steps 4 and 5 in Algorithm 1, we need to calculate YP, AXP, XS and ZY T A. Since G is a diagonal matrix, it cost O(mn) and O(lmn) to compute YP and AXP, respectively. RGNMF costs O(ln 2 ) and O(lmn) to compute XS and ZY T A, respectively. Thus, the complexities to update A and X are O(mn+lmn) and O(ln2+lmn), respectively. In addition, since both P and Q are diagonal matrixes, they cost O (mn+n) and O (ml+l), respectively. Assuming that our proposed method converges after t iterations, the overall cost for RGNMF is O (tmnl +t(mn+n) +t(ml+l)+ mnlog(m)+ mn2).

IV. EXPERIMENTAL RESULT
Existing state-of-the-art algorithms can achieve relatively good performance. However, they depicted the graph structure constrained on the encoding matrix, which only indirectly affects the base matrix. In this section, we investigate our RGNMF by conducting extensive experiments.

A. DATA SETS
To make fair comparison with other methods, we use eight commonly-used data sets to evaluate the performance of our algorithm and other related methods. These data sets are composed of three biological data sets, 1 i.e., Carcinom, 1 http://featureselection.asu.edu/datasets.php  [62]. In addition, we use a big data set MNIST 4 to evaluate the performance of all algorithms. The data contained in the above-mentioned data sets are real-world and used by the state-of-the-art algorithms [7], [23], [28], [41], [46], [66], [67]. ORL and Georgia are two relatively new data sets. Details of these data sets are described in Table 1.

B. EVALUATION MEASURES
Two widely used measures, accuracy (ACC) and normalized mutual information (NMI) [10], [18], [20], are adopted to compare the performance of our RGNMF and other related approaches. The values of ACC and NMI are in the interval [0,1]. The closer their values are to 1, the better the learning performance of this algorithm. ACC is applied to measure the percentage of correct groups obtained by one algorithm and formulated as where c i denotes the clustering label and r i is the ground truth label provided by the data set. Here, η(c i ) aims at mapping the clustering labels to the ground truth labels. The indicator function ω(α, β) equals 1 if α = β and equals 0 otherwise. NMI is used to measure the similarity between two sets of clusters and is depicted as follows: where n i denotes the sample number of the cluster C i (1 ≤ i ≤ K ) provided by the clustering algorithm andn j is the number of samples belonging to the j-th ground-truth class (1 ≤ j ≤ K ), and n ij denotes the number of samples that are in the intersection between the cluster C i and the j-th class.

C. COMPARED METHODS
A baseline and state-of-the-art NMF approaches are exploited to compare with our RGNMF. Here, we briefly summarize them as follows: (1) NMF [8]: Standard NMF is considered as a baseline.
(2) L 21 NMF [14]: A L 2,1 -norm loss function in NMF is exploited to decompose the input matrix into tow nonnegative factor matrices. (3) GNMF [10]: It takes advantage of the local structure of data to guide the process of matrix factorization. (4) MNMFL 21 [28]: It merges the local structure into L 21 NMF as a regularizer to enhance the robustness of GNMF. (5) ONMFS [47]: It presents nonnegative PCA algorithm to solve the orthogonal NMF with global approximation guarantees. (6) NMF-LCAG [34]: It integrates graph construction into the processes of matrix factorization and thus the graph structure changes during the NMF procedure. (7) ENMF [3]: Its loss function is intercalated between Frobenius norm and L 2,1 -norm and adds the local structure of data as the regularization term. (8) GLNMF [31]: It uses low-rank recovery technique to obtain the low-rank part of the raw data and then incorporates the local structure into NMF for factorizing the low-rank data. (9) CHNMF [65]: It exploits the correntropy measure in the loss function and constructed a hypergraph to preserve the high-order geometric information of the data.
(10) DSNMF [56]: it outlines a deep framework to learn such hidden representation and to interpret clustering according to different unknown attributes of a given data set.

D. COMPARISON OF CLUSTERING PERFORMANCE
Because nine compared approaches are unsupervised, clustering comparison is performed on the whole sample space. For fair comparison, we exploit a random scheme to initialize two nonnegative factors A and X . Following [3], [23], and [28], the clustering number K is set to the real number of classes belonging to one data set and the classical K -means is applied to cluster the learned representation X for obtaining the clustering results. MNMFL 21 and GNMF have the regularization parameter λ and the nearest neighbor size p. As described in their experiments, λ changes within {10 −3 , 10 −2 , 10 −1 , 10 0 , 10 1 , 10 2 , 10 3 } and the best values of ACC and NMI are shown. In addition to the above two parameters, GNLMF has α and r that need to be assigned in advance. According to [31], α and r are assigned 10 −4 and 0.1×min(m, n), respectively. There are four regularization parameters, α, β, λ, and γ in NMF-LCAG. The values of α, β, λ, and γ are searched from {10 −3 , 10 −2 , 10 −1 , 10 0 , 10 1 , 10 2 , 10 3 } and the best values of ACC and NMI are shown. For ENMF, its three parameters δ, α, and β are searched from {10 −3 , 10 −2 , 10 −1 , 10 0 , 10 1 , 10 2 , 10 3 }. In addition to p, the proposed RGNMF has three regularization parameters, α, β, and λ. Three parameters are selected from the grid {10 −3 , 10 −2 , 10 −1 , 10 0 , 10 1 , 10 2 , 10 3 }. For GNMF, MNMFL 21 , NMF-LCAG, ENMF, GNLMF and RGNMF, the neighborhood size p is set to 5 in all experiments. Experiments are performed 20 runs with different initial points on each data set. We show the average clustering result for each algorithm. Finally, following state-of-the-art methods [3], [28], [47], [48], we set the dimension l to the number of clusters in the experiments. The values of ACC and NMI on eight data sets used in this paper are demonstrated in Tables 2-3. From Tables 2-3, a few  interesting observations can be gained. (1) Obviously, our RGNMF performs differently on different data sets. For example, on Face94 data set, the ACC value of RGNMF is 1.28% higher than that of the second best GLNMF. On LUNG data set, the NMI value of RGNMF is 10.58% higher than that of the second best GLNMF. For larger MNIST data set, our RGNMF is superior to these methods compared in this paper. Obviously, it can be seen that our algorithm always achieves relatively better performance than other ten compared approaches. Thus, RGNMF can find more discriminative representations for the real-world data. Intuitively, the constructed k-NN graph and sparse graph that are imposed on the basis matrix and the encoding matrix play an essential role in finding the discriminative representation. (2) CHNMF is inferior to our RGNMF. The reason is that CHNMF constructs the Hypergraph to preserve the local structure of the data, while ignoring the global structure. It performs better than other methods on ORL, CBCL, and Face95 data sets. CHNMF can also provide the comparative performance on Face94, LUNG, TOX_171 and MNIS. However, its performance is lower than that of MNMFL 21 on Grimace, and Carcinom data sets. In a word, although CHNMF is slightly worse than our algorithm, it is a very good method. DSNMF also achieves good clustering performance. For example, it outperforms other methods on Face94 and Grimace data sets. For ORL, Face95, LUNG and MNIST data sets, however, CHNMF is superior to DSNMF. Generally, deep methods suffer from two limitations: one is that they usually involve a large number of parameters and are easy to fit; the other is that they require very high training cost in running time and space [69], [70]. (3) Although MNMFL 21 , ENMF and GNLMF form the remaining five algorithms including NMF, L 21 NMF, ONMFS, NMF-LCAG and GNMF, they are inferior to the proposed method. The reason is that three algorithms pay attention to the local structure and ignore the global structure. In addition, their regularizers are imposed on the encoding matrix, which fails to directly affect the base matrix. We can see that the three algorithms perform almost well on most data sets. However, both MNMFL 21 and GNLMF perform better than ENMF on the Carcinom data set. (4) GNMF outperforms NMF, L 21 NMF, and NMF-LCAG, because it constructs a nearest neighbor graph to encode the geometric information of the data space. However, its performance is relatively worse than MNMFL 21 , ENMF, GNLMF and RGNMF. As demonstrated in [28], the L 2,1 -norm is helpful to distinguish noise and outliers around clusters and enhance the clustering accuracy. GNMF incorporates the local structure into the least square loss function of NMF. Thus, its performance is limited to some extent. (5) As we can see, the performance of NMF-LCAG is lower than that of other manifold learning approaches, even though it constructs the graph to consider the intrinsic geometric structure. The reason is that NMF-LCAG exploits self-expressive coefficients to depict the local structure of data, which may not be conducive to improving clustering performance. In addition, the optimal solution cannot be gained, since two regularization terms are controlled by one parameter λ [18]. (6) As a baseline, NMF is simple to perform, but it usually performs worse than other eight methods. Thus, it is necessary to introduce different regularization terms into NMF to improve the performance.
The performance of ONMF is better than that of L21NMF in six data sets and NMF in all data sets, which validates that orthogonal constraint in NMF performs well for clustering tasks [3]. (7) It is worth noting that NMF performs almost as well as manifold learning-based algorithms on the Face94 and TOX_171 data sets, but it is much worse than RGNMF, especially on the TOX_171 data set. This indicates that is not enough to improve the performance of NMF only by encoding the local structure, but also by considering other information, such as the global structure.

E. PARAMETER SENSITIVENESS
As mentioned in the previous section, in addition to p, there are three regularization parameters α, β, and λ in RGNMF. Clearly, if all three parameters are set to 0, our RGNMF is converted to L 21 NMF [14]. If α and λ are set to 0, RGNMF is changed to MNMFL 21 [28]. Consequently, our RGNMF is more general than L 21 NMF and MNMFL 21 . In other words, MNMFL 21 and L 21 NMF are two special cases of our RGNMF. To demonstrate the influence of three parameters α, β and λ on the performance of our algorithm, we perform the sensitivity experiments on ORL, CBCL and LUNG data sets. We can gain similar insight on the remaining data sets and thus leave out them. We can observe from Figs. 2-4 that these three parameters have a significant impact on the performance of RGNMF. Actually, such a significant impact also exists in other methods. As seen in experiments, the impact of parameters on the performance of the algorithm is different on different data sets. For example, α, β and λ have a relatively little impact on RGNMF on the ORL data set. When three parameters vary from 0.001 to 1000, the performance of RGNMF has little change. Clearly, the performance of the proposed method is relatively stable with respect to three parameters. It can be seen from Fig. 3 that RGNMF can achieve consistently good performance when α and β vary respectively in [10 −3 , 10 2 ] and [10 −2 , 10 1 ] on the CBCL data set. Compared with α and β, λ has a relatively little influence on RGNMF. RGNMF becomes stable and obtains the good performance when λ is less than 100. As we can see from Fig. 4, β has a relatively large influence on RGNMF on the LUNG data set. RGNMF performs relatively poorly when β is less than 900. However, when β is greater 900, RGNMF becomes very stable and outperforms other algorithms. Similarly, others manifold learning methods, including GMNF, NMF-LCAG, MNMFL 21 , ENMF and GNLMF, perform relatively worse when the regularization parameter of the affinity graph in their objective function is less than 900. The five methods can get the best performance when the value of this parameter is set to 1000. For example, on the LUNG data set, the best ACC and NMI values of GLNMF are 80.79% and 56.9%, respectively. It is worth noting that GMNF and MNMFL 21 have the same clustering accuracy. The best ACC and NMI of the two algorithms are 79.8% and 56.27%, respectively. If the regularization parameter of the local structure in our objective function is set to 1000, the best ACC and NMI of RGNMF are 86.21% and 67.48%, respectively. Obviously, NMI of RGNMF is about 11% higher than that of GLNMF.

F. CONVERGENCE ANALYSIS
To intuitively understand the convergence of our RGNMF algorithm, we discuss the empirical results on its convergence. The convergence curves of RGNMF are shown in Fig. 5 on eight data sets. In each subfigure of Fig. 5, the y-axis denotes the value of the objective function and the x-axis represents the number of iterations. We can observe that RGNMF usually converges within 150 iterations. This indicates that our algorithm converges relatively fast.

V. CONCLUSION
In this paper, we proposed a novel unsupervised NMF algorithm, called RGNMF. Different from existing approaches that ignore the geometric structure or only considers the local structure, first of all, a sparse graph is constructed to model the global structure imposed on the basis matrix, and a nearest neighbor graph is constructed to respect the local structure constrained on the encoding matrix. Secondly, RGNMF exploits the L 2,1 -norm loss function to seek the basis matrix and encoding matrix, which can avoid the interference of noise and outliers. Thirdly, it enforces a L 2,1 -norm regularization on the basis matrix to choose the important features of samples. Hence, the discriminative representations of data are simultaneously learned by explicitly exploiting the intrinsic structure and L 2,1 -norm minimization. Finally, the optimization approach is developed to seek two factor matrices. And its convergence is proven. Experiments on eight real-world data sets demonstrate that RGNMF is better than the other eight algorithms.
In the next work, we will focus on the following questions: 1) There are three parameters α, β and λ which control the smoothness and the sparsity of the new model. Obviously, the proper values of the three parameters are very important to our algorithm. However, it is not clear how to select parameters effectively in theory. 2) Although it is difficult to obtain all the class labels of data, some of the class labels are available. We will consider representation learning from labeled and unlabeled data and naturally extend our RGNMF to semi-supervised scenarios.