Adaptive Graph Regularized Low–Rank Matrix Factorization With Noise and Outliers for Clustering

Clustering, a commonly used tool, has been applied in machine learning, data mining, and other fields, and has received extensive research attention. However, data usually contain noise and outliers, which introduce significant errors into clustering results. In this paper, a robust clustering model with adaptive graph regularization (RCAG) is proposed, in which a sparse error matrix is introduced to model sparse noise, such as impulse noise, dead lines, and stripes, and the $\ell_{1}$ norm is applied to alleviate this sparse noise. In addition, the $\ell_{2,1}$ norm, which has the rotation invariance property, is adopted to mitigate the effect of outliers. Therefore, our RCAG is insensitive to data noise and outliers. More importantly, adaptive graph regularization is introduced into RCAG to improve the clustering performance. For the optimization objective, we propose an iterative updating algorithm based on the Augmented Lagrangian Method (ALM) that updates each optimization variable in turn. The convergence and time complexity of RCAG are also analysed in theory. Finally, experimental results on fourteen datasets from four application scenarios, such as face images, handwriting recognition, and UCI, demonstrate the superiority of the proposed method over seven existing classical clustering methods. The experimental results show that our approach achieves better clustering performance in terms of ACC and Purity, while being slightly less competitive on the other metrics.


I. INTRODUCTION
Clustering is the process of dividing a set of objects into multiple classes composed of similar objects. A cluster generated by clustering is a set of data objects that are similar to objects in the same cluster but distinct from objects in other clusters. Data clustering is thus a valuable data analysis tool in machine learning and data mining. However, reducing the influence of noise and outliers in data clustering remains a major research topic.
In recent years, a variety of clustering methods have been proposed, such as K-means [1], spectral clustering [2]-[4], and NMF [5]. K-means aims to learn c cluster centroids that minimize the within-cluster distances. Spectral clustering is a family of clustering methods based on graph theory, which clusters the sample data by clustering the eigenvectors of the Laplacian matrix of the sample data; it is a low-dimensional embedding of the affinity matrix between samples. There has been a great deal of research on clustering. Reference [6] proposed a new clustering method that takes sample invariance as a prior. Reference [7] proposed a subspace clustering method based on a Structured AutoEncoder (StructAE). Reference [8] proposed to project raw data into a space in which the projection embraces geometric consistency (GC) and cluster assignment consistency (CAC), without requiring intensive parameter selection. Reference [9] built a theoretical connection between Frobenius-norm-based representation (FNR) and nuclear-norm-based representation (NNR). This paper mainly studies the application of matrix factorization to clustering. Some forms of matrix factorization include low-rank representation (LRR) [10], principal component analysis (PCA) [11], and singular value decomposition (SVD) [12]. In the past decades, many clustering methods based on NMF have been proposed [13], [14]. Non-negative matrix factorization was proposed by [15]. It makes all the components after decomposition non-negative and simultaneously realizes nonlinear dimension reduction. The general form of NMF is $X \approx WH^T$ s.t. $W \ge 0$, $H \ge 0$ [16]. Non-negative matrix factorization is the most popular method in this branch. In the clustering setting of NMF, $H \in \mathbb{R}^{n \times c}$ is the cluster assignment matrix and $W \in \mathbb{R}^{m \times c}$ contains the cluster centroids, where c is the number of clusters. Note that the clustering result of non-negative matrix factorization can be obtained by performing K-means or another clustering method on H.
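As a minimal sketch of this NMF clustering pipeline (not the RCAG model proposed later), the following Python snippet factorizes $X \approx WH^T$ and then runs K-means on H; the data and parameter choices are placeholders.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

# Toy illustration: X is (m features) x (n samples) as in the paper;
# scikit-learn's NMF expects (samples x features), so we pass X.T.
rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(64, 200)))   # placeholder non-negative data
c = 3                                    # assumed number of clusters

model = NMF(n_components=c, init='nndsvda', max_iter=500, random_state=0)
H = model.fit_transform(X.T)             # (n x c) assignment-like matrix
labels = KMeans(n_clusters=c, n_init=10, random_state=0).fit_predict(H)
```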
Previous studies have proposed various regularized NMF methods for clustering [17]-[19]. For instance, many works have proposed NMF extensions with graph regularization, where a graph regularization constraint on the cluster assignment matrix is applied to capture the geometric structure of the data [18], [20]-[22]. Reference [23] proposed sparse dual graph-regularized non-negative matrix factorization, which reveals the inherent geometric and discriminative structures of the data space and the feature space. Reference [24] introduced hypergraph Laplacian regularization to consider the intrinsic geometrical structure and introduced the $\ell_{2,1}$ norm to reduce the effects of noise and outliers. Reference [25] used hypergraph regularization to preserve the high-order manifold structure. In the above work, the affinity matrix usually adopts a predefined model, which may not be optimal in practical applications. This can lead to lower-quality graphs being built, and parts of this work do not deal with noise issues.
To improve the performance of NMF, many variants with various regularizations have been proposed, along with various methods for handling noise and outliers [26]. The Huber loss was proposed to handle non-Gaussian noise and outliers, and sparse terms and regularization terms were introduced to enhance the sparsity of the matrix and capture the manifold structure of the data [27]. Reference [28] focused on the complex noise problem by using a finite mixture of exponential power (MoEP) distributions. These works deal with noise or outliers, but they do not use adaptive graph regularization.
Nevertheless, NMF-based clustering still suffers from the following problems: (1) The traditional matrix factorization clustering method is easily dominated by noise and outliers, which produces large errors. (2) The quality of the original graph-based NMF suffers if the computed distances between data samples are not accurate enough.
To address the problems mentioned above, we propose a robust clustering model with adaptive graph regularization (RCAG). In our model, adaptive graph regularization is introduced to improve the accuracy of clustering. In addition, we alleviate the influence of noise and outliers by adopting the $\ell_{2,1}$ norm [29], and the $\ell_1$ norm is used to alleviate the influence of sparse corruption. Figure 1 shows the algorithmic process of RCAG. The main contributions of this paper are summarized as follows: (1) We propose a joint learning framework for clustering.
In this framework, adaptive graph regularization, a sparse error matrix, and non-negative low-rank matrix decomposition are integrated into a unified objective function, shown in Equation 3. (2) To obtain better clustering performance, adaptive graph regularization is introduced. It is parameter-insensitive, scale-invariant, and simple to compute. (3) To address the problem of data corrupted by noise and outliers, we utilize the $\ell_1$ norm and the $\ell_{2,1}$ norm to alleviate the influence of sparse noise (such as impulse noise, dead lines, and stripes) and outliers (such as image occlusion), respectively. (4) To solve the optimization problem, an effective algorithm, described in Algorithm 1 and based on the Augmented Lagrangian Method (ALM), is developed. More specifically, the convergence analysis of the designed optimization algorithm is presented from both a theoretical perspective (Theorem 4) and an experimental perspective (Figure 3). The rest of this paper is organized as follows. Section II gives the definitions of the algorithm-related symbols and of adaptive graph regularization. We propose a new robust clustering framework (RCAG) and present the theoretical properties of the proposed approach in Section III. The experimental environment, procedure, results, and analysis are presented in Section IV. Section V concludes the RCAG algorithm and gives future study directions.

II. RELATED WORK
In this section, we introduce the related work of adaptive graph regularization learning.

A. NOTATIONS
The notations are described in Table 1. Matrices are written in capital letters (e.g., X), and vectors are written in bold lowercase letters. $X_{.j}$ denotes the j-th column, $X_{i.}$ denotes the i-th row, and $X_{ij}$ denotes the entry in the i-th row and j-th column of X.

B. ADAPTIVE GRAPH REGULARIZATION
There are many graph regularized NMF methods in existence, but most do not capture the structure of the data effectively. First, most graph construction methods need to calculate the distances between the data samples; if the calculated distances are not accurate enough, the resulting graph is of very poor quality. Moreover, once the graph is built from the inaccurate calculation, it stays fixed in all subsequent steps. Hence the input graph is not optimal, and the clustering performance of NMF suffers. It is therefore necessary to build a high-quality graph.
We suppose that the probability of each sample $x_i$ being connected to its neighbor $x_j$ is $z_{ij}$, where $z_{ij}$ is an element of the desired similarity matrix Z. Clearly, a similar sample pair with a small distance $\|x_i - x_j\|_2^2$ should be assigned a high probability $z_{ij}$ [30]. Therefore, we optimize a Z that meets this assumption with the following objective function:
$$\min_{Z} \sum_{i,j=1}^{n} \left( \|x_i - x_j\|_2^2 \, z_{ij} + \alpha z_{ij}^2 \right) \quad \text{s.t.}\ \forall i,\ z_i^T \mathbf{1} = 1,\ 0 \le z_{ij} \le 1,$$
where $\alpha$ is the regularization parameter.
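As a sketch of how such a Z can be computed, the snippet below assumes the closed-form k-nearest-neighbor solution of the adaptive-neighbors formulation in [30], where $\alpha$ is set per sample; the function is illustrative, not the paper's exact solver.

```python
import numpy as np

def adaptive_graph(X, k=5):
    """Adaptive neighbor probabilities: each sample connects to its k
    nearest neighbors, with probabilities given in closed form by the
    simplex constraint z_i^T 1 = 1 and exactly k nonzero entries.
    X: (m, n) data matrix with n samples as columns (as in the paper)."""
    n = X.shape[1]
    # pairwise squared Euclidean distances between columns
    sq = np.sum(X**2, axis=0)
    D = sq[:, None] + sq[None, :] - 2 * X.T @ X
    np.maximum(D, 0, out=D)
    Z = np.zeros((n, n))
    for i in range(n):
        d = D[i].copy()
        d[i] = np.inf                   # exclude self-loops
        idx = np.argsort(d)[:k + 1]     # k neighbors + one extra for the bound
        dk = d[idx]                     # sorted ascending
        denom = k * dk[k] - dk[:k].sum() + np.finfo(float).eps
        Z[i, idx[:k]] = (dk[k] - dk[:k]) / denom
    return (Z + Z.T) / 2                # symmetrize for use as an affinity
```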

III. PROPOSED METHOD
In this section, we introduce the proposed RCAG method. First, the model and its three terms are presented. Second, the optimization procedure and the algorithm of RCAG are presented. Third, the time complexity and convergence analysis are given.

A. MODEL OF RCAG
To handle sparse noise and outliers and to improve clustering performance, this paper proposes a novel robust clustering model. The model consists of three terms, as shown in Equation 3. By combining robust matrix factorization with the $\ell_{2,1}$ norm, the $\ell_1$ norm, and adaptive graph regularization, the proposed RCAG can be formulated as
$$\min_{W,H,S,Z}\ \|X - WH^T - S\|_{2,1} + \lambda\|S\|_1 + \underbrace{\sum_{i,j}\left(\|x_i - x_j\|_2^2 + \beta\|h_i - h_j\|_2^2\right) z_{ij} + \gamma\|Z\|_F^2}_{\text{adaptive graph regularization}} \tag{3}$$
$$\text{s.t.}\ H^T H = I,\quad \forall i,\ z_i^T \mathbf{1} = 1,\ 0 \le z_{ij} \le 1.$$

1) OUTLIERS REMOVING REGULARIZATION
To handle data entries that are corrupted by outliers, $X \approx WH^T + S$ serves as the low-rank matrix factorization reconstruction, and its residual is the loss term. The Frobenius norm is known to be sensitive to noise and outliers, so to enhance robustness, the $\ell_{2,1}$ norm is adopted to measure the loss of the matrix factorization.
The $\ell_{2,1}$ norm sums the $\ell_2$ norms of all columns of a matrix. It is rotationally invariant: $\|RX\|_{2,1} = \|X\|_{2,1}$ for any rotation matrix R [31]. Rotational invariance is a fundamental property of Euclidean space with the $\ell_2$ norm. Since the influence function of the $\ell_{2,1}$ norm is bounded, the effect of outliers on the $\ell_{2,1}$ norm can be controlled [32]. The squared $\ell_2$ loss is unbounded, so the $\ell_{2,1}$ norm is more robust.
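A small numerical check of these two properties, under the column-sum convention used here (a sketch, not part of the paper's algorithm):

```python
import numpy as np

def l21_norm(X):
    """Sum of the l2 norms of the columns of X."""
    return np.linalg.norm(X, axis=0).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
# random rotation via QR decomposition of a Gaussian matrix
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
print(np.isclose(l21_norm(Q @ X), l21_norm(X)))  # True: rotation invariance
```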

2) SPARSE NOISE MATRIX REMOVING REGULARIZATION
The noise matrix removing term addresses sparse corruption: we introduce a sparse error matrix $S \in \mathbb{R}^{m \times n}$ constrained by the $\ell_1$ norm, which is the sum of the absolute values of the matrix elements. The $\ell_1$ norm reduces the impact of sparse noise; this term can remove impulse noise, dead lines, and stripes.

3) ADAPTIVE GRAPH REGULARIZATION
The third term is adaptive graph regularization, which is parameter-insensitive, scale-invariant, and simple to compute. It involves only one parameter, the number of nearest neighbors, which makes it parameter-insensitive. When each point is scaled, $z_{ij}$ stays the same, which makes it scale-invariant. It involves only the basic operations of addition, subtraction, multiplication, and division, which makes it simple to compute [33]. Adaptive graph regularization is described in detail in Section II-B.

B. SOLUTION OF RCAG
In recent years, many methods have been proposed to solve this type of optimization problem, such as ALM [34] and LADM [35]. First, three auxiliary variables are introduced: $E = X - WH^T - S$, $G = S$, and $F = H$. The objective function in Equation 3 can then be rewritten as Equation 4, and the augmented Lagrangian of Equation 4 is
$$\mathcal{L} = \|E\|_{2,1} + \lambda\|G\|_1 + \Phi(Z, H, F) + \langle C_1, X - WH^T - S - E\rangle + \langle C_2, S - G\rangle + \langle C_3, H - F\rangle + \frac{\mu}{2}\left(\|X - WH^T - S - E\|_F^2 + \|S - G\|_F^2 + \|H - F\|_F^2\right), \tag{5}$$
where $\Phi(Z, H, F)$ denotes the adaptive graph regularization term, $C_1$, $C_2$, $C_3$ are the Lagrange multipliers, and $\mu$ is the penalty parameter. Following ALM [34], the loss with respect to one variable is minimized while the other variables are fixed. There are seven variables in total, and the iterative updates are as follows.
Update E: Fix the other variables and update E by solving the following subproblem:
$$\min_E \|E\|_{2,1} + \frac{\mu}{2}\left\|E - \left(X - WH^T - S + \frac{C_1}{\mu}\right)\right\|_F^2. \tag{6}$$
To solve Equation 6, we need the following Theorem 1.
Theorem 1 ([21]): Given a matrix $A = [a_1, \ldots, a_n] \in \mathbb{R}^{m \times n}$ and a positive scalar $\lambda$, let $Q^*$ be the optimal solution of
$$\min_Q \lambda\|Q\|_{2,1} + \frac{1}{2}\|Q - A\|_F^2. \tag{7}$$
Then the i-th column of $Q^*$ is
$$Q^*_{.i} = \begin{cases} \dfrac{\|a_i\|_2 - \lambda}{\|a_i\|_2}\, a_i, & \text{if } \|a_i\|_2 > \lambda, \\ 0, & \text{otherwise.} \end{cases}$$
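A direct sketch of this column-wise shrinkage operator (Theorem 1), with names assumed from the derivation above:

```python
import numpy as np

def l21_prox(Y, lam):
    """Column-wise shrinkage: the closed-form minimizer of
    lam * ||Q||_{2,1} + 0.5 * ||Q - Y||_F^2 (Theorem 1)."""
    norms = np.linalg.norm(Y, axis=0)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return Y * scale  # scales each column independently

# E update below would then be: E = l21_prox(X - W @ H.T - S + C1/mu, 1/mu)
```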
Equation 6 can be written as
$$\arg\min_E \frac{1}{\mu}\|E\|_{2,1} + \frac{1}{2}\|E - Y\|_F^2,$$
where $Y = X - WH^T - S + \frac{C_1}{\mu}$. According to Theorem 1, the solution of Equation 6 is
$$E^*_{.i} = \begin{cases} \dfrac{\|y_i\|_2 - 1/\mu}{\|y_i\|_2}\, y_i, & \text{if } \|y_i\|_2 > \frac{1}{\mu}, \\ 0, & \text{otherwise,} \end{cases}$$
where $y_i$ is the i-th column of Y.

Update Z: Fixing the other variables, the problem is separable across rows and can be dealt with individually for each i. Denote $d^x_{ij} = \|x_i - x_j\|_2^2$ and $d^{hf}_{ij} = \|h_i - f_j\|_2^2$, and let $d_i \in \mathbb{R}^n$ be the vector with j-th element $d_{ij} = d^x_{ij} + \frac{\beta}{\gamma} d^{hf}_{ij}$; then the above problem can be rewritten as
$$\min_{z_i^T \mathbf{1} = 1,\ 0 \le z_{ij} \le 1} \left\| z_i + \frac{d_i}{2\alpha} \right\|_2^2,$$
which has a closed-form solution [30].

Update H: Fix the other variables and update H by solving the corresponding subproblem. Using the constraint $H^T H = I$, Equation 15 can be written as the trace maximization
$$\max_{H^T H = I} \operatorname{tr}(H^T N),$$
where N collects the terms of the augmented Lagrangian that are linear in H. Theorem 2 (the orthogonal Procrustes solution) was applied to solve this problem.
So the solution that updates H is
$$H = UV^T,$$
where U and V are the left and right singular vectors of the SVD of N.

Update W: Fix the other variables and update W by solving the corresponding subproblem. This is a classical regression problem, which admits a closed-form least-squares solution.

Update S: Fix the other variables; S can be obtained via the soft-thresholding (shrinkage) operator [34], [36], defined element-wise as $\operatorname{soft}_\tau(x) = \operatorname{sign}(x)\max(|x| - \tau, 0)$.

Update F: Fix the other variables; the resulting subproblem for F admits a closed-form solution.

Update G: Fix the other variables; collecting the corresponding terms, G also admits a closed-form solution.

Update $C_1$, $C_2$, $C_3$, $\mu$: after the variables are updated, the ALM parameters are updated in the standard fashion:
$$C_1 \leftarrow C_1 + \mu(X - WH^T - S - E),\quad C_2 \leftarrow C_2 + \mu(S - G),\quad C_3 \leftarrow C_3 + \mu(H - F),\quad \mu \leftarrow \min(\rho\mu, \mu_{\max}).$$
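A compact sketch of these closed-form steps (the Procrustes H update, the shrinkage operator, and the standard ALM multiplier updates), with all variable names assumed from the derivation above:

```python
import numpy as np

def soft_threshold(A, tau):
    """Element-wise shrinkage: sign(a) * max(|a| - tau, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def procrustes_H(N):
    """H update via Theorem 2: H = U V^T from the SVD of N."""
    U, _, Vt = np.linalg.svd(N, full_matrices=False)
    return U @ Vt

def update_multipliers(C1, C2, C3, mu, X, W, H, S, E, G, F,
                       rho=1.1, mu_max=1e8):
    """Standard ALM dual ascent on the three equality constraints."""
    C1 = C1 + mu * (X - W @ H.T - S - E)
    C2 = C2 + mu * (S - G)
    C3 = C3 + mu * (H - F)
    return C1, C2, C3, min(rho * mu, mu_max)
```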
The procedure of the algorithm is described in Algorithm 1.

C. THEORETICAL ANALYSIS OF RCAG
In this section, we first establish the efficiency of the method by analysing its computation time. We then point out the advantages of RCAG over other approaches.

1) COMPUTATION TIME
The computational complexity of updating E consists of calculating Y and performing the update, which cost $O(mn + c^3)$ and $O(mn)$, respectively. Obtaining the matrix Z requires $O(mn^2)$. The computational complexity of updating W is $O(c^2)$, and that of updating G is $O(c^2)$.

Algorithm 1: The Proposed RCAG Algorithm
Input: Data set $X \in \mathbb{R}^{m \times n}$, the number of clusters c, parameters $\mu$ and $\rho$, and the maximum number of iterations.

The main computational complexity of updating H consists of calculating N and computing its SVD, which cost $O(m^3)$ and $O(nc^2)$, respectively.
The overall cost for each iteration is $O(m^3 + mc^2 + mnc + mn + mc \cdot \max(m, n))$. Thus the computational complexity of RCAG is polynomial.

2) CONVERGENCE ANALYSIS
The convergence of ALM has been proved in [35]. However, there are seven variables in this paper: W, H, E, Z, G, F, and S. Moreover, the objective function in Equation 4 is not smooth everywhere. These factors mean that convergence of our method is not automatically guaranteed. Fortunately, Theorem 3 gives three sufficient conditions for convergence [10].
Theorem 3: A problem of the form $L(x) = f(x) + \lambda h(x)$ can be solved by the ALM method. The three conditions to be satisfied for ALM convergence are as follows: (1) The parameter $\lambda$ of the ALM problem must be upper bounded.
(2) The original data matrix is of full column rank.
(3) The optimality gap produced in each iteration step is monotonically decreasing, namely the error $\vartheta_k$ is monotonically decreasing.
Theorem 4: Algorithm 1 is convergent. Proof: According to Theorem 3, our objective-function solving method satisfies the above conditions: (1) The parameter $\mu$ of Equation 5 is upper bounded.
(2) The data matrix X is of full column rank.
(3) The optimality gap produced in each iteration step, i.e.,
$$\vartheta_k = \left\| (E_k, W_k, S_k, H_k, F_k, Z_k, G_k) - \arg\min_{E,W,S,H,F,Z,G} \mathcal{L} \right\|_F^2,$$
monotonically decreases, where $E_k, W_k, S_k, H_k, F_k, Z_k, G_k$ denote the values of E, W, S, H, F, Z, G at the k-th step, respectively.
The first two conditions have been met [10], but the third condition is hard to prove in theory. Nevertheless, we can verify the third condition experimentally by tracking the value of the objective function at the k-th iteration. It can be seen from Figure 3 that the objective value decreases dramatically and then converges rapidly to a stable value, indicating that the third condition is satisfied in practice. The convergence of the third term can also be argued via [37]. In conclusion, the convergence of the algorithm is guaranteed.
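A minimal sketch of this empirical check: record the objective after each ALM iteration and test for monotone decrease. Here `rcag_step` and `rcag_objective` are placeholders for Algorithm 1's updates and objective, not functions defined in the paper.

```python
import numpy as np

def check_convergence(X, rcag_step, rcag_objective, max_iter=100, tol=1e-6):
    """Track the objective value per iteration and verify condition (3)."""
    state = None  # would hold (W, H, S, E, F, G, Z, C1, C2, C3, mu)
    history = []
    for k in range(max_iter):
        state = rcag_step(X, state)
        history.append(rcag_objective(X, state))
        if k > 0 and abs(history[-1] - history[-2]) < tol:
            break
    diffs = np.diff(history)
    print("monotone decrease:", bool(np.all(diffs <= 1e-8)))
    return history
```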

3) ADVANTAGES OF RCAG
From a theoretical point of view, RCAG combines matrix factorization, the $\ell_{2,1}$ and $\ell_1$ norms, and an adaptive graph for data clustering.
-Interpretability: Unlike other matrix factorization methods, NMF decomposes a non-negative data matrix into two non-negative matrices (a basis matrix and a coefficient matrix) from the perspective that parts constitute the whole. Since the basis matrix and coefficient matrix obtained by the non-negative matrix decomposition are both non-negative, the results of the decomposition are highly interpretable.
-Robustness: RCAG effectively removes sparse noise in a dataset and can handle the influence of outliers. The $\ell_1$ norm is introduced to alleviate the sparse noise, and the $\ell_{2,1}$ norm, which has the rotation invariance property, is introduced to handle outliers. Therefore, our RCAG is insensitive to data noise and outliers.

IV. EXPERIMENTS
In this section, we evaluate the clustering quality of RCAG on fourteen datasets spanning four dataset types, in terms of ACC, NMI, and Purity.

A. EXPERIMENTAL ENVIRONMENT
All algorithms were implemented using Matlab R2014a. The experiment was performed on a computer with 3.2GHz Intel Core CPU, 8.0GB RAM and the Windows 7 operating system.

B. DATASETS DESCRIPTION
In total, fourteen datasets of four types are used in our experiments: face image, handwriting recognition, UCI, and biomedical datasets. Table 2 summarizes the characteristics of the datasets used in the experiments.
• The ORL face database consists of face images taken by the Olivetti laboratory in Cambridge, UK, covering 40 subjects of different ages, genders, and races.
• The Digit database contains 250 samples from 44 writers. We selected 1,797 instances from 10 writers.
• MNIST contains 70,000 handwritten digit images. We used a subset of 1,884 images in our experiment.
• USPS is composed of 9,298 handwritten digit images. Each image is represented by a 256-dimensional vector.
• Ionosphere consists of 351 instances and 34 attributes.
• Wine contains 13 attributes and three types of wine.
• GLIOMA consists of 50 instances and 4,434 attributes.

C. EVALUATION METRIC
Three evaluation indexes, ACC, NMI, and Purity, were used to evaluate the performance of the experiments; their definitions are as follows. Accuracy (ACC) is defined as
$$ACC = \frac{\sum_{i=1}^{n} \delta(l_i, map(r_i))}{n},$$
where $map(r_i)$ is a permutation mapping function that maps the predicted cluster label $r_i$ to the equivalent label in the dataset, n denotes the number of data points, $r_i$ is the predicted cluster label of $x_i$, and $l_i$ is the corresponding true cluster label. $\delta(x, y)$ is the delta function, which equals 1 if $x = y$ and 0 otherwise.
The normalized mutual information (NMI) between two index sets is defined as
$$NMI(Y, C) = \frac{MI(Y, C)}{\sqrt{H(Y)\,H(C)}},$$
where the entropy $H(X)$ is given by
$$H(X) = -\sum_{i} p(i)\log p(i),$$
with $p(i) = |X_i|/N$ the probability that an object selected at random from X falls into class $X_i$. The mutual information (MI) between the ground-truth labels (Y) and the cluster results (C) is given by
$$MI(Y, C) = \sum_{i,j} p(i, j)\log\frac{p(i, j)}{p(i)\,p(j)},$$
where $p(i, j) = |Y_i \cap C_j|/N$ is the joint probability that an object belongs to class $Y_i$ and cluster $C_j$.
The cluster Purity is defined as
$$Purity = \frac{1}{n}\sum_{j=1}^{c} \max_i n_i^j,$$
where $n_i^j$ represents the number of data points of input class i assigned to cluster $C_j$ ($1 \le j \le c$), and c is the number of clusters.
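A sketch of how these three metrics can be computed, assuming integer label arrays; the ACC mapping $map(\cdot)$ is found with the Hungarian method, and NMI is taken from scikit-learn rather than re-implemented.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_acc(true_labels, pred_labels):
    """ACC: best one-to-one mapping between predicted clusters and
    true classes (the map(.) permutation) via the Hungarian method."""
    t = np.asarray(true_labels)
    p = np.asarray(pred_labels)
    d = max(t.max(), p.max()) + 1
    cost = np.zeros((d, d), dtype=np.int64)
    for ti, pi in zip(t, p):
        cost[pi, ti] += 1
    row, col = linear_sum_assignment(-cost)  # maximize matched pairs
    return cost[row, col].sum() / len(t)

def purity(true_labels, pred_labels):
    """Purity: each cluster is credited with its majority true class."""
    t = np.asarray(true_labels)
    p = np.asarray(pred_labels)
    total = sum(np.bincount(t[p == c]).max() for c in np.unique(p))
    return total / len(t)

# NMI is available directly:
# nmi = normalized_mutual_info_score(true_labels, pred_labels)
```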

D. EXPERIMENTAL PROCESS
In this section, we introduce the process and results of the comprehensive experiment.
1) COMPARISON METHODS AND RESULTS
We compared RCAG with seven classical methods:
1) The K-means clustering method, one of the most widely used clustering methods.
2) Robust principal component analysis (RPCA), one of the most widely used dimension reduction techniques.
3) Non-negative matrix factorization (NMF).
4) Graph regularized non-negative matrix factorization (GNMF), which takes the nonlinear structure of the data into account.
5) Robust manifold non-negative matrix factorization (RMNMF), an improved graph-based method that uses the $\ell_{2,1}$ norm to improve robustness.
6) Robust graph regularized non-negative matrix factorization (RGNMF), which introduces a sparse error matrix and applies the $\ell_1$ norm to address unreliable regularization.
7) Low-rank representation with graph regularization (LRRGR), a low-rank representation method that incorporates graph regularization.
Tables 3, 4, and 5 tabulate the clustering results of the different clustering methods. As can be seen, our method achieves good performance on most datasets. For example, on the face image datasets, our method is superior to the other comparison algorithms in terms of ACC, NMI, and Purity. According to Figures 9 to 12, in some cases our method does not achieve the best performance. For example, on the handwriting recognition datasets, our method is slightly lower than some comparison methods in ACC, NMI, and Purity. The reason may be that the decoupling of the factorization steps causes some important connections to be lost. Our method also does not perform well on the biomedical dataset, probably because that dataset is too complex and, owing to its high dimension, requires preprocessing. For the remaining datasets, our method outperforms the comparison methods, which demonstrates the necessity and advantage of the introduced $\ell_{2,1}$ norm, $\ell_1$ norm, and adaptive graph regularization. The $\ell_{2,1}$ and $\ell_1$ norms make our method robust to outliers and noise.

2) EXPERIMENTAL RESULT
The advantages of RCAG are shown in the following two aspects: (1) The objective function uses the $\ell_{2,1}$ norm as the discrepancy measure, which alleviates the outlier problems common in other clustering methods [42]. We also apply the $\ell_1$ norm to the sparse error matrix to alleviate the impact of sparse noise on clustering.
(2) The adaptive graph regularization improves the clustering precision, and it is parameter-insensitive, scale-invariant, and simple to compute.

3) PERFORMANCE ON CORRUPTED DATA
In this subsection, we consider datasets with corruption, such as sparse noise and outliers. For this purpose, we use the ORL dataset and artificially corrupt 20% and 40% of the entries. From Tables 6 and 7, it can be seen that RPCA achieves the best performance in several cases on the 20% corrupted data, where our approach is mostly second best; on the 40% corrupted data, RCAG outperforms RPCA.

4) PARAMETER SENSITIVENESS
In this subsection, we test the sensitivity of our method to its parameters. Parameter $\lambda$ is selected by searching over [0.0001, 100], and parameter $\beta$ likewise varies over [0.0001, 100]. The choice of $\gamma$ and $\alpha$ is based on [33].
From Figure 13, we can observe that our method is not very sensitive to the choice of $\lambda$, but is sensitive to the choice of $\beta$. RCAG achieves good performance when $\beta$ varies from 1 to 100.
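A hypothetical grid-search sketch mirroring these ranges; `run_rcag` (the solver) and `score_fn` (e.g., clustering ACC) are injected placeholders, not APIs from the paper.

```python
import numpy as np
from itertools import product

def grid_search(X, c, y_true, run_rcag, score_fn):
    """Search (lambda, beta) on a log scale over [0.0001, 100]."""
    grid = 10.0 ** np.arange(-4, 3)  # 0.0001, 0.001, ..., 100
    best_params, best_score = None, -np.inf
    for lam, beta in product(grid, grid):
        labels = run_rcag(X, c, lam, beta)   # placeholder solver
        score = score_fn(y_true, labels)
        if score > best_score:
            best_params, best_score = (lam, beta), score
    return best_params, best_score
```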

E. DISCUSSION
In the above subsections, several experiments on different types of datasets have been performed to show the efficiency of our proposed RCAG. The adaptive graph regularization clustering method performs better than the K-means baseline. Adaptive graph regularization not only captures the global structure of the data but also preserves the local geometric structure, i.e., the nonlinear structure.
Compared with NMF, GNMF, RPCA, and similar methods, our proposed model uses the $\ell_{2,1}$ and $\ell_1$ norms. The $\ell_{2,1}$ norm keeps feature rotation invariance and can alleviate the influence of outliers. The $\ell_1$ norm reduces the influence of sparse noise, including impulse noise (salt and pepper), dead lines, and stripes [43].

V. CONCLUSION AND FUTURE WORK
In this paper, a low-rank matrix factorization model with noise and outliers based on adaptive graph regularization is proposed. Our model not only handles sparse noise and outliers but also improves clustering performance through adaptive graph regularization. We introduce a sparse error matrix S and the $\ell_1$ norm to address the sparse noise problem; by using the sparse error matrix, a large amount of data can be reconstructed to obtain robust decomposition results. In addition, the $\ell_{2,1}$ norm is applied to the matrix decomposition, providing a robust solution against outliers. Therefore, RCAG approximates the clean data reconstructed from sparse corruption, constrains outliers with the $\ell_{2,1}$ norm, and constrains sparse noise with the $\ell_1$ norm to achieve robustness. We propose an iterative updating method to optimize the problem and show that it converges. Experimental results demonstrate the effectiveness of RCAG. Nevertheless, there are still some limitations to our approach. For example, because of the shrinkage effect, the $\ell_1$ norm usually results in a biased estimator, which affects the accuracy of the matrix rank approximation.
In future work, we will study various popular regularizations to further improve clustering performance [44], [45]. For example, the normalized $\varepsilon$-penalty handles sparse corruption, including impulse noise, dead lines, and stripes [46], and side information and low-rank constraints are another direction [47]. The normalized $\varepsilon$-penalty can replace the $\ell_1$ norm and can enhance the sparsity of both the intrinsic low-rank structure and the sparse corruptions. Semi-supervised clustering can be realized by adding side information, and a nuclear norm constraint can be added to the objective function.