l2,p-Norm Based Discriminant Subspace Clustering Algorithm

Discriminative subspace clustering (DSC) combines Linear Discriminant Analysis (LDA) with a clustering algorithm, such as K-means (KM), into a single framework that performs dimension reduction and clustering simultaneously. It has been verified to be effective for high-dimensional data. However, most existing DSC algorithms rigidly use the Frobenius norm (F-norm) to define the model, which may not always be suitable for the given data. In this paper, DSC is extended in the sense of the $l_{2,p}$-norm, a general form of the F-norm, to obtain a family of DSC algorithms that provide more alternative models for practical applications. To achieve this goal, firstly, an efficient algorithm for $l_p$-norm based KM (KM$_p$) clustering is proposed. Then, based on the equivalence of LDA and linear regression, an $l_{2,p}$-norm based LDA ($l_{2,p}$-LDA) is proposed, together with an efficient Iteratively Reweighted Least Squares algorithm for solving it. Finally, KM$_p$ and $l_{2,p}$-LDA are combined into a single framework to form an efficient generalized DSC algorithm: $l_{2,p}$-norm based DSC clustering ($l_{2,p}$-DSC).
In addition, the effects of the parameters on the proposed algorithm are analyzed, and based on the theory of robust statistics, a special case of $l_{2,p}$-DSC, which shows better robustness on data sets with noise and outliers, is studied. Extensive experiments are performed to verify the effectiveness of the proposed algorithm.


I. INTRODUCTION
Cluster analysis is a basic method of multivariate statistical analysis, and it is also an important part of unsupervised pattern recognition. The nature of clustering is to assign data with similar patterns to the same cluster by exploring the intrinsic class structure of the data. In the past decades, a great variety of clustering methods have been proposed. For example, the well known K-means (KM) algorithm [1], [2] is simple in principle, easy to implement, and widely used; spectral clustering algorithms [3], [4], which can effectively cluster data with manifold structure, have obtained wide attention from many scholars and have become a milestone in the development of cluster analysis; the k-centroid [5], affinity propagation [6] and sparse subset selection [7] algorithms, which can not only cluster the data but also find the representative elements of each class, are also indispensable clustering methods.
(The associate editor coordinating the review of this manuscript and approving it for publication was Shagufta Henna.)
In recent years, with the rapid development of information technology, a large number of high-dimensional data emerge every day in many fields such as information retrieval, image processing and computational biology. For example, in web text data mining, if we use a vector space model to describe each document, the dimension tends to exceed 5000 because the word vocabulary is often large. In genomics, DNA microarray data measure the expression levels of thousands of genes in a single experiment. Gene expression data usually contain a large number of genes (dimensions) but a small number of samples [32]. The high dimension of data significantly increases the time and space complexity of clustering algorithms, and many dimensions are not helpful or may even worsen the performance of the subsequent clustering. How to effectively and efficiently cluster high-dimensional data has obtained wide attention from many scholars, and many methods have been proposed to handle it. In supervised learning, an effective way to handle high-dimensional data is dimension reduction, including feature selection and feature extraction. However, feature selection in clustering is more challenging in both effectiveness and efficiency, since class labels are unavailable to guide the search for discriminative feature subsets. A common alternative is feature weighting clustering [8]-[10], in which feature selection can be performed by ranking the obtained feature weights. Feature selection clustering has the advantages of interpretability of the selected features and easy implementation on a database [17]. However, the rigidity of the original dimensions does not provide enough flexibility to handle clusters that extend along a mixture of directions [17].
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Unlike feature selection clustering, feature extraction-based clustering transforms the data to achieve dimension reduction, so that new features can be generated. Previous feature extraction-based clustering methods often projected the data onto a low-dimensional subspace by an unsupervised dimension reduction method such as principal component analysis (PCA) [11] or a manifold learning algorithm [12] before clustering [3], [4], [13], [14]. Due to the inherent separation between subspace selection (via dimension reduction) and clustering, the projection may not necessarily improve the separability of the data for clustering [19]. Linear discriminant analysis (LDA) is a classical supervised method for feature extraction and dimension reduction [15]. It computes an optimal linear transformation matrix by simultaneously minimizing the within-class distance and maximizing the between-class distance of the data set in the linearly transformed low-dimensional space. In some recent work [16]-[18], LDA (or its variants) was combined with the clustering process to improve clustering performance. This method, called discriminative subspace clustering (DSC) in this paper, uses LDA to project the data onto an optimal transformation subspace, completes the data clustering in that low-dimensional subspace, and optimizes these two processes alternately to perform clustering and dimension reduction simultaneously. The experimental results have shown the superiority of DSC in comparison with other popular clustering algorithms of the same period [16]-[18]. DSC was first proposed by De la Torre in 2006 [16]. However, the LDA used in the DSC algorithm proposed by De la Torre only uses the between-class information in the data, and is not a full LDA [17].
Ding and Li proposed combining another form of LDA, which involves not only the between-class information but also the within-class information in the data, with the classical KM clustering algorithm to perform DSC [17]. The DSC algorithm proposed by Ding and Li has the advantage of a simple principle and is very easy to implement. In [18], Ye et al. interpreted DSC as a clustering algorithm that can adaptively perform clustering and distance metric learning simultaneously. In [19], Ye et al. also proposed discriminative k-means (DKM), which transforms DSC into kernel KM clustering with a special kernel matrix.
The experimental results in [19] show the superiority of the DKM algorithm in comparison with some manifold learning based clustering algorithms. In addition, an outstanding feature of the DKM algorithm is its high efficiency for high-dimensional, small-sample data. However, the computation of a kernel matrix on a data set with a large number of samples is very time-consuming, so DKM is inconvenient to use on data sets with many samples. In order to extend the scope of application of DSC, fuzzy versions of DSC were proposed by extending the membership function to the fuzzy (or soft) case [20]-[22]. The fuzzy versions of DSC can conveniently use the fuzzy membership function to characterize the degree to which data belong to some class, and can reflect the internal structure of the data [21]. Additionally, Hou et al. replaced the LDA in the DSC method with a general form of the maximum margin criterion (MMC) [24], implicitly avoiding the singularity problem of the scatter matrix, and proposed a variant of DSC: the discriminative embedded clustering (DEC) algorithm [23]. In [25], DSC was shown to be equivalent to another discriminative clustering framework [26], [27], in which the cost function of the support vector machine for linear classification is used as a clustering criterion, and it was further extended to sparse and multi-label versions. Seeing that most DSC algorithms are time-consuming when the data are large and high-dimensional, Zhi et al. recently proposed an efficient DSC algorithm using QR decomposition-based Linear Discriminant Analysis, which enables DSC to be used on larger data sets [28].
Many DSC algorithms have been proposed, and many experimental results have shown the effectiveness of DSC. However, the loss functions in most of the existing DSC algorithms are in fact rigidly defined based on the F-norm of a matrix (corresponding to the Euclidean norm in the vector form), which limits the use of DSC. Since the l2,p-norm of a matrix is a general form of its F-norm, and it has recently been used successfully in many machine learning problems [29]-[33], in this paper a generalized DSC is proposed by extending the DSC algorithm [17] in the sense of the l2,p-norm. The main contributions of this paper are summarized as follows: (1) The Euclidean norm based KM clustering process is generalized in the sense of the lp-norm, and an efficient lp-norm based KM (KMp) algorithm is presented. (2) LDA is generalized in the sense of the l2,p-norm based on the equivalence between LDA and linear regression (LR) [38], and an l2,p-norm based LDA (l2,p-LDA) algorithm is proposed. The main difference between l2,p-LDA and LDA lies in the first stage: the first stage of l2,p-LDA can be transformed exactly into an Iteratively Reweighted Least Squares (IRLS) problem [39], which can be solved efficiently using the LSQR algorithm [40]. (3) Combining the proposed l2,p-LDA and KMp algorithms into a single clustering framework, a generalized DSC algorithm, called l2,p-norm based discriminative subspace clustering (l2,p-DSC), is presented. (4) An intelligent version of the KMp algorithm is used to initialize l2,p-DSC to avoid falling into local optima. The presented theories and algorithms are evaluated through experiments on twelve synthetic and real-world data sets.
The rest of this paper is organized as follows. The notations and definitions used in this paper are introduced in Section II. Related work, including classical KM, intelligent KM (IKM) based on anomalous clusters [41], [42], LDA, DSC [16], [17] and some variants and extensions of DSC, is reviewed in Section III. In Section IV, IKM is extended in the sense of the lp-norm, LDA is extended in the sense of the l2,p-norm based on the equivalence between LDA and linear regression, and the l2,p-norm based DSC algorithm is proposed. Section V presents the experimental results, and conclusions are given in Section VI.

II. NOTATIONS AND DEFINITIONS
This section summarizes the notations and the definitions of the norms used in this paper. Matrices are written as boldface uppercase letters and vectors as boldface lowercase letters. For a matrix M ∈ R^{n×m}, its i-th row and j-th column are denoted by m^i and m_j respectively. The lp-norm of a vector v ∈ R^n is defined as

$$\|v\|_p = \Big(\sum_{i=1}^{n} |v_i|^p\Big)^{1/p},$$

where v_i is the i-th component of the vector v. When p = 1 and p = 2, the lp-norm of v degenerates into its l1-norm $\|v\|_1 = \sum_{i=1}^{n}|v_i|$ and its l2-norm $\|v\|_2 = \big(\sum_{i=1}^{n}|v_i|^2\big)^{1/2}$ respectively; and when p → ∞, it degenerates into its l∞-norm $\|v\|_\infty = \max_i |v_i|$. The lr,p-norm of a matrix M ∈ R^{n×m} is defined as

$$\|M\|_{r,p} = \Big(\sum_{i=1}^{n}\Big(\sum_{j=1}^{m}|m_{ij}|^r\Big)^{p/r}\Big)^{1/p}.$$

When r = p = 2, the lr,p-norm of the matrix M degenerates into its Frobenius norm (F-norm) $\|M\|_F = \big(\sum_{i=1}^{n}\sum_{j=1}^{m}|m_{ij}|^2\big)^{1/2}$. When r = 2, the lr,p-norm of the matrix M degenerates into its l2,p-norm $\|M\|_{2,p} = \big(\sum_{i=1}^{n}\|m^i\|_2^p\big)^{1/p}$. Comparing the definition of the lp-norm of a vector with that of the lr,p-norm of a matrix, it can be seen that the l2,p-norm of a matrix degenerates into the lp-norm when the matrix degenerates into a vector.
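As a concrete illustration, the definitions above can be computed directly. The following is a minimal sketch; the function name `lrp_norm` and the example matrix are our own, not from the paper:

```python
import numpy as np

def lrp_norm(M: np.ndarray, r: float = 2.0, p: float = 2.0) -> float:
    """l_{r,p}-norm of a matrix: the l_p-norm of the vector of row-wise l_r-norms."""
    row_norms = np.sum(np.abs(M) ** r, axis=1) ** (1.0 / r)  # l_r-norm of each row
    return float(np.sum(row_norms ** p) ** (1.0 / p))

M = np.array([[3.0, 4.0], [0.0, 0.0], [5.0, 12.0]])
# r = p = 2 recovers the Frobenius norm.
assert np.isclose(lrp_norm(M, 2, 2), np.linalg.norm(M, "fro"))
# r = 2, p = 1 gives the l_{2,1}-norm: 5 + 0 + 13 = 18.
assert np.isclose(lrp_norm(M, 2, 1), 18.0)
```

The row-wise reduction makes the degeneracy cases in the text immediate: for a one-row matrix the outer sum has a single term, so the l2,p-norm reduces to the lp-norm of that row.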
In the field of machine learning, the l2,1-norm and l2,p-norm are mainly used in two ways. One is to use the l2,1-norm of the transformation matrix, or the p-th power of its l2,p-norm, as a penalty term to implement feature selection [29], [30], [32], [33]. The other is to replace the F-norm in the loss function with the l2,1-norm or the p-th power of the l2,p-norm to improve the adaptability of the algorithm to the data [31]-[37], such as robustness to noise points and outliers. Of course, the l2,1-norm and l2,p-norm can also be used in both ways at the same time: as a penalty term to implement feature selection and in the loss function to improve the algorithm's adaptability to the data [32], [33].
In view of the generality of the l2,p-norm and its successful use in machine learning, DSC is extended in the sense of the l2,p-norm in this paper. For convenience, the notations used in this article are summarized in Table 1. For simplicity, in the following discussion, we assume that x_1, x_2, ..., x_n form the rows of a data matrix X ∈ R^{n×m} and have been centered, so that the total mean of the data set $\bar{x} = \frac{1}{n}X^T e = 0$, where $e = [1, 1, \cdots, 1]^T \in R^n$.

III. RELATED WORK AND MOTIVATION
In this section, firstly, the classical KM [18], [19] and the intelligent KM (IKM) [41], [42] algorithms are reviewed. Then, linear discriminant analysis (LDA) and least squares based LDA [38] are reviewed. Finally, DSC [16], its variants, and the motivation of the proposed work are presented.

A. KM AND IKM
Consider a data set consisting of n data points {x_i}_{i=1}^n and denote the data matrix as X = [x_1, x_2, ..., x_n], whose i-th column is given by x_i. KM clustering finds the partition of the data that minimizes the following objective function:

$$J_{KM}(U, V) = \sum_{i=1}^{n}\sum_{j=1}^{k} u_{ij}\,\|x_i - v_j\|_2^2, \quad (5)$$

where V = [v_1, v_2, ..., v_k] consists of k cluster centers, v_j is the center of the j-th cluster X_j, and the matrix U = {u_ij}_{n×k} is a cluster membership matrix such that u_ij = 1 if x_i belongs to the j-th cluster X_j and u_ij = 0 otherwise, subject to $\sum_{j=1}^{k} u_{ij} = 1$, i = 1, 2, ..., n. The objective function (5) measures the total error of using k cluster centers to represent the k classes of data. The KM algorithm is well known and widely used because of its simplicity and efficiency, but it mainly has the following three defects: (1) The value of k needs to be given in advance as prior knowledge; in many cases, estimating k is very difficult. (2) The algorithm is sensitive to initialization: the clustering results obtained from different random seed points may differ, and it may converge to a local minimum.
(3) Noise and outliers have a great impact on the algorithm. The intelligent version of KM, intelligent k-means (IKM) [41], [42], finds so-called anomalous clusters before running KM itself by extracting the anomalous clusters one by one until no unclustered objects remain, after which the centers of the largest anomalous clusters are used to initialize KM. The detailed IKM algorithm can be found in [41], [42]. To a certain extent, the IKM algorithm solves the problem that KM is sensitive to initialization.
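The anomalous-cluster initialization described above can be sketched as follows. This is a simplified reading of IKM [41], [42]: the data are assumed centered so that the origin coincides with the grand mean, the inner two-class assignment is a plain 2-means step, and all names are our own:

```python
import numpy as np

def ikm_init(X, k):
    """Sketch of the anomalous-pattern initialization of IKM.

    X: (n, m) data, assumed centered so that the origin is the grand mean.
    Returns the centroids of the k largest anomalous clusters.
    """
    remaining = X.copy()
    clusters = []
    while len(remaining) > 0:
        # Step 1: the unclustered point farthest from the origin is the
        # tentative anomalous centroid.
        c = remaining[np.argmax(np.linalg.norm(remaining, axis=1))]
        mask = np.zeros(len(remaining), dtype=bool)
        for _ in range(100):
            # Step 2: two-class assignment, tentative centroid vs. the origin.
            d_c = np.linalg.norm(remaining - c, axis=1)
            d_0 = np.linalg.norm(remaining, axis=1)
            mask = d_c < d_0
            if not mask.any():
                break
            new_c = remaining[mask].mean(axis=0)
            if np.allclose(new_c, c):
                break
            c = new_c
        # Step 3: remove the anomalous cluster and repeat.
        if not mask.any():
            mask[np.argmax(np.linalg.norm(remaining, axis=1))] = True
        clusters.append(remaining[mask])
        remaining = remaining[~mask]
    # Step 4: the centroids of the k largest anomalous clusters seed KM.
    clusters.sort(key=len, reverse=True)
    return np.array([cl.mean(axis=0) for cl in clusters[:k]])
```

On a toy data set with two far-out groups and a few near-origin points, the two largest anomalous clusters recover the two groups, so their centroids make much better KM seeds than random points.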

B. LINEAR DISCRIMINANT ANALYSIS BASED ON LEAST SQUARES
LDA is a classical supervised feature extraction and dimension reduction method [15]. In LDA, given the membership matrix U, the weighted membership matrix [19] can be defined as

$$L = [L_1, L_2, \cdots, L_k], \quad L_j = \frac{1}{\sqrt{n_j}}\, U_j,$$

where U_j is the j-th column of U, and n_j is the number of points in the j-th class. In LDA, the within-class, between-class, and total scatter matrices are defined as

$$S_w = \sum_{j=1}^{k}\sum_{x_i \in X_j} (x_i - v_j)(x_i - v_j)^T, \quad S_b = \sum_{j=1}^{k} n_j\, v_j v_j^T, \quad S_t = \sum_{i=1}^{n} x_i x_i^T,$$

where $v_j = \frac{1}{n_j}\sum_{x \in X_j} x$ is the mean of the data points in the j-th class (recall that the data have been centered, so the total mean is zero). It can be shown that S_t = S_b + S_w. The quantities trace(S_b) and trace(S_w) measure the separation between classes and the compactness of the data points within each class respectively. The goal of LDA is to find an optimal linear transformation G that maps the m-dimensional data into a low-dimensional space in which the class structure of the data set in the original high-dimensional space is optimally preserved. A commonly used objective function of LDA is

$$\max_{G}\; \mathrm{trace}\big((G^T S_w G)^{-1} G^T S_b G\big).$$

The optimal G can be obtained by computing the eigendecomposition of the matrix $S_w^{-1} S_b$ if S_w is nonsingular. However, in many applications the data are so high-dimensional that the scatter matrix is often singular, so classical LDA cannot be used directly. The regularization method [43] is usually used to remove the singularity of the scatter matrix, i.e., the singular scatter matrix S is replaced by $\tilde{S} = S + \lambda I_m$, where I_m is the identity matrix of size m and λ > 0 is a regularization parameter.
The solution of LDA can be transformed into a generalized eigenvalue problem, and the equivalence between this generalized eigenvalue problem and the least squares (LS) based linear regression (LR) problem has been strictly proved in [38]. Based on this equivalence, an efficient algorithm called least squares based LDA (LSLDA) was proposed for LDA [38]. The LSLDA algorithm can be divided into two stages. The first stage solves the following LS problem:

$$\min_{G_1}\; \|G_1^T X - L^T\|_F^2. \quad (11)$$

The second stage uses the transformation matrix G_1 obtained in the first stage to project the original data onto a low-dimensional space and performs an LDA procedure in the dimension-reduced subspace [38]. Since the LS problem in LSLDA can be efficiently solved using the LSQR algorithm [40], LSLDA can be used on large data sets.
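The first stage of LSLDA amounts to a multi-target least squares fit of the weighted membership matrix L. The following is a minimal sketch under our own conventions (rows as samples, so the problem reads min ||XG − L||_F²): a ridge-regularized normal-equation solve stands in for the LSQR solver the paper uses, and the function name is hypothetical:

```python
import numpy as np

def lslda_stage1(X, labels, lam=1e-6):
    """First stage of LSLDA: solve min_G ||X G - L||_F^2 (+ small ridge term).

    X: (n, m) centered data, rows are samples; labels: length-n class ids.
    L is the weighted membership matrix with L_ij = 1/sqrt(n_j) if sample i
    belongs to class j, and 0 otherwise.
    """
    classes = np.unique(labels)
    n = X.shape[0]
    L = np.zeros((n, len(classes)))
    for j, c in enumerate(classes):
        idx = labels == c
        L[idx, j] = 1.0 / np.sqrt(idx.sum())   # weighted membership column
    # Ridge-regularized least squares copes with a singular X^T X; for large
    # sparse data an iterative solver such as LSQR would be used instead.
    G = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ L)
    return G, L
```

The returned G satisfies the stationarity condition X^T(XG − L) + λG = 0 of the regularized problem, which is what a converged LSQR run would also deliver.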

C. DSC
The DSC [16], [17] algorithms can be seen as unsupervised counterparts of LDA. In DSC [16], [17], the transformation matrix G and the cluster membership matrix U are computed by maximizing the objective function

$$\max_{U, G}\; \mathrm{trace}\big((G^T S_w G)^{-1} G^T S_b G\big)$$

with respect to G and U respectively. The algorithms work in an intertwined and iterative fashion. More specifically, for a given U, the optimal G can be obtained by the standard LDA procedure; for a given G, the optimal U can be computed by applying a gradient descent strategy [16] or by solving the KM problem in the lower-dimensional space resulting from the transformation G [17]. Once DSC [16] was proposed, it attracted the attention of many scholars. In [17], Ding and Li pointed out that the LDA used in the DSC algorithm [16] only uses the between-class information in the data and is not a full LDA. They proposed combining another form of LDA, which involves not only the between-class information but also the within-class information in the data, with the classical KM clustering algorithm to perform DSC [17]. In [18], Ye et al. interpreted DSC as a clustering algorithm that adaptively performs clustering and distance metric learning simultaneously. In [19], Ye et al. proposed discriminative KM (DKM), which transforms DSC into kernel KM with a special kernel matrix. However, the computation of a kernel matrix on a data set with a large number of samples is very time-consuming, so DKM does not fit large data sets. In [23], Hou et al. proposed the discriminative embedded clustering (DEC) algorithm, which replaces the LDA in DSC with a general form of MMC [24] controlled by a balance parameter η, implicitly avoiding the singularity problem of the scatter matrix (see [23] for the precise formulation). Seeing that most DSC algorithms are time-consuming when the data are large and high-dimensional, Zhi et al. recently proposed an efficient algorithm for DSC by using QR decomposition-based LDA, which enables DSC to be used on larger data sets [28].

D. MOTIVATION
Many DSC algorithms have been proposed, and many experimental results have shown the effectiveness of DSC [16]-[28]. However, the objective functions in most of the existing DSC algorithms are defined based on the F-norm (corresponding to the Euclidean norm in the vector form). In fact, based on the equivalence between LDA and LS based LR [38], the optimization of DSC can be decomposed into two parts, i.e., the joint regression problem

$$\min_{U, G_1}\; \|G_1^T X - L^T\|_F^2 \quad (15)$$

and the second-stage LDA procedure in the dimension-reduced subspace (16). Eq. (15) constitutes the main body of the optimization of DSC, since the optimization (16) does not change the class labels. So the objective function of DSC can be regarded as being defined based on the F-norm. In [31], [32], [36], it was pointed out that a loss function based on the squared F-norm is sensitive to noise and outliers in the data set. So the squared F-norm based loss function in the objective of DSC may limit the application of DSC; in particular, if the data contain noise and outliers, the performance of DSC may be greatly degraded. On the other hand, compared to the squared F-norm, an l2,1-norm based loss function is more robust to noise and outliers [31], [36]. In [31], the l2,1-norm of a matrix was first introduced in Principal Component Analysis to improve its robustness to noise and outliers. In [32], the l2,1-norm was applied to the construction of both the loss function and the penalty term of linear regression to obtain a noise-resistant supervised feature selection algorithm. In [36], the l2,1-norm was applied to the construction of the loss function of linear discriminant analysis (LDA) to obtain a noise-resistant LDA algorithm. In [33], the algorithm in [32] was extended to the sense of the l2,p-norm to further improve the adaptability of the algorithm to the data.
It is certain that rigidly using the F-norm to define the loss function in DSC limits its application. A large body of existing work shows that replacing the F-norm in the loss function with the l2,p-norm can generalize the original algorithm, which provides more alternative models for practical applications and can greatly improve the flexibility of the original algorithm. So, in this paper, the loss function of the DSC objective is extended in the sense of the l2,p-norm to improve the flexibility of the DSC algorithm.

IV. DISCRIMINANT SUBSPACE CLUSTERING BASED ON l 2,p -NORM
Since DSC is a framework combining LDA and clustering, a direct method of extending DSC in the sense of the l2,p-norm is to extend LDA and the clustering simultaneously. To achieve this goal, in this section, firstly, efficient algorithms for lp-norm based KM (KMp) and its intelligent version are presented. Then, based on the equivalence of LDA and linear regression, an l2,p-norm based LDA (l2,p-LDA) algorithm is proposed. Finally, KMp and l2,p-LDA are combined into a single clustering framework to obtain an l2,p-norm based generalized DSC algorithm: l2,p-DSC. In addition, a special case of the proposed l2,p-DSC algorithm, which shows robustness to noise and outliers in the data set, is studied.

A. INTELLIGENT KM BASED ON l p -NORM
The KM algorithm based on the l1-norm was proposed in [43]. In [42], a feature-weighted KM algorithm based on the more general lp-norm was proposed. The fuzzy k-means algorithm based on the lp-norm was introduced in [44]. However, the KM algorithm based on the lp-norm (KMp) has not been studied directly. In this section, an efficient algorithm for KMp is presented.
Given the data matrix X = [x_1, x_2, ..., x_n], the objective function of the KMp algorithm is

$$J_{KM_p}(U, V) = \sum_{i=1}^{n}\sum_{j=1}^{k} u_{ij}\,\|x_i - v_j\|_p^p.$$

Similarly to KM, KMp can be solved by an alternating optimization method. Given the cluster center matrix V, the optimal membership matrix U can be obtained by assigning each point to its nearest center, i.e., u_ij = 1 if $j = \arg\min_{c} \|x_i - v_c\|_p^p$ and u_ij = 0 otherwise. Given the membership matrix U, for p = 1, the optimal cluster center of the j-th cluster is the coordinate-wise median of the data points in the j-th cluster [43]. For p ≠ 1, it can be shown that the objective function of KMp can be reformulated as

$$J_{KM_p} = \sum_{j=1}^{k}\sum_{l=1}^{m}\sum_{x_i \in X_j} w_{ijl}\,(x_{il} - v_{jl})^2, \quad (19)$$

where X_j is the set of data points in the j-th cluster, x_{il} is the value of x_i on the l-th feature, and $w_{ijl} = |x_{il} - v_{jl}|^{p-2}$. Differentiating Eq. (19) with respect to v_{jl} and setting the derivative to zero, v_{jl} can be computed as

$$v_{jl} = \frac{\sum_{x_i \in X_j} w_{ijl}\, x_{il}}{\sum_{x_i \in X_j} w_{ijl}}. \quad (20)$$

So, given the membership matrix U, for p ≠ 1 the optimal V can be obtained by computing Eq. (20). As a summary, the detailed flow of KMp is described in the following Algorithm 1.

Algorithm 1 KM p Algorithm
Input: Data matrix X, the number of clusters k, an initial membership matrix U, an initial cluster center matrix V, parameter p and the convergence threshold ε.
1: Update the cluster center matrix V by Eq. (20) (for p = 1, take the coordinate-wise median of each cluster).
2: Update the membership matrix U' by assigning each point to its nearest center under the lp-norm induced distance.
3: If ‖U' − U‖ < ε, stop the iteration; else, let U ← U' and return to Step 1.

Output: Membership matrix U.
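Algorithm 1 can be sketched as follows. This is a minimal reading under our own conventions (rows as samples); the small `delta` added inside the weights guards the negative exponent against zero residuals, an implementation detail not in the paper:

```python
import numpy as np

def km_p(X, V, p=1.5, eps=1e-5, max_iter=100, delta=1e-8):
    """l_p-norm k-means (KM_p) sketch. X: (n, m) data; V: (k, m) initial centers."""
    n, m = X.shape
    k = V.shape[0]
    U = np.zeros((n, k))
    for _ in range(max_iter):
        # Center step (Eq. (20)): IRLS-style weighted mean per coordinate;
        # as p -> 1 this approaches the coordinate-wise median.
        labels_prev = U.argmax(axis=1)
        for j in range(k):
            Xj = X[U[:, j] == 1]
            if len(Xj):
                W = (np.abs(Xj - V[j]) + delta) ** (p - 2.0)
                V[j] = (W * Xj).sum(axis=0) / W.sum(axis=0)
        # Assignment step: nearest center under the l_p distance.
        D = (np.abs(X[:, None, :] - V[None, :, :]) ** p).sum(axis=2)  # (n, k)
        U_new = np.eye(k)[D.argmin(axis=1)]
        if np.abs(U_new - U).sum() < eps:
            break
        U = U_new
    return U, V
```

Because the first pass starts from an all-zero U, the initial centers V are left untouched until the first assignment has been made, mirroring the usual seed-then-assign order.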
Based on the KMp algorithm proposed in Section IV.A and the IKM algorithm [41], [42], an lp-norm based intelligent K-means (IKMp) algorithm can be obtained. The following Algorithm 2 presents the detailed IKMp algorithm.

B. LDA BASED ON l 2,p -NORM
Recall that the loss term in the objective function of the first stage of the LSLDA algorithm [38] is $\|G_1^T X - L^T\|_F^2$, which can be rewritten as

$$\|G_1^T X - L^T\|_F^2 = \sum_{i=1}^{n} \|G_1^T x_i - (L^i)^T\|_2^2, \quad (21)$$

where L^i is the i-th row of L.

Algorithm 2 IKMp Algorithm
Input: Data matrix X, the number of clusters k, the convergence threshold ε and parameter p.
1: Take the unclustered point farthest away from the origin 0 as the tentative anomalous cluster's centroid, and initialize the centroids with this tentative centroid and the origin 0.
2: Define a cluster Y to consist of the points that are closer, in terms of the lp-norm induced distance, to the tentative centroid than to the origin 0, by running two-class KMp starting from the centroid found in Step 1 and the origin 0.
3: Remove Y from the data set and repeat Steps 1-2 until all the points are clustered.
4: Select the centroids of the k largest clusters.
5: Run the KMp algorithm (Algorithm 1) starting from the found centroids and obtain U.
Output: Membership matrix U.

Extending the squared F-norm used in Eq. (21) to the p-th power of the l2,p-norm, and adding the squared F-norm penalty of G_1 to it, a more general objective function can be obtained as follows:

$$\min_{G_1}\; \sum_{i=1}^{n} \|G_1^T x_i - (L^i)^T\|_2^p + \gamma \|G_1\|_F^2. \quad (22)$$

The objective function in Eq. (22) is a general form of the one in Eq. (11). So, replacing the optimization (11) by (22) in the LSLDA algorithm [38], a general LDA algorithm in the sense of the l2,p-norm can be obtained. Since finding an analytical solution of the optimization problem (22) is very difficult, a numerical solution is sought instead. In fact, the optimization problem (22) can be rewritten as

$$\min_{G_1}\; \sum_{i=1}^{n} w_i\, \|G_1^T x_i - (L^i)^T\|_2^2 + \gamma \|G_1\|_F^2, \quad (23)$$

where

$$w_i = \frac{p}{2}\, \|G_1^T x_i - (L^i)^T\|_2^{p-2}. \quad (24)$$

Differentiating (23) with respect to G_1 and setting the derivative to zero, an equation that the optimal G_1 must satisfy can be obtained as

$$\sum_{i=1}^{n} w_i\, x_i (x_i^T G_1 - L^i) + \gamma G_1 = 0. \quad (25)$$

It follows that

$$(X D X^T + \gamma I_m)\, G_1 = X D L, \quad (26)$$

where D = diag(w_1, w_2, ..., w_n). Then G_1 can be written as

$$G_1 = (X D X^T + \gamma I_m)^{-1} X D L. \quad (27)$$

From Eq. (27), it can be seen that problem (23) can be solved by an iterative algorithm: given w_i (i = 1, 2, ..., n), G_1 can be updated by computing Eq. (27), which can be done efficiently using the LSQR algorithm [40]; and given G_1, w_i (i = 1, 2, ..., n) can be updated by computing Eq. (24). This algorithm is in essence an Iteratively Reweighted Least Squares (IRLS) method [39], and it can be initialized using the transformation matrix obtained by classical linear regression (corresponding to the proposed algorithm with p = 2). Based on LSLDA [38] and the above IRLS algorithm, an l2,p-norm based LDA (l2,p-LDA) algorithm can be obtained as the following Algorithm 3.

Algorithm 3 l2,p-LDA Algorithm
Input: Data matrix X, the weighted membership matrix L, regularization parameter γ and parameter p.
1: Solve the l2,p-norm regression problem (22) by IRLS (alternately updating G_1 via Eq. (27) using LSQR, and w_i via Eq. (24)) to obtain G_1.
2: Project the data, X̃ ← G_1^T X, and compute the second-stage transformation matrix G_2 by performing LDA in the dimension-reduced subspace.
3: Combine G ← G_1 G_2.
Output: Transformation matrix G.

The time complexity of the l2,p-LDA algorithm can be analyzed as follows. Line 1 takes O(t_1 (t_2 k (3n + 5m + 2mn))) time for using IRLS to solve the l2,p-norm regression problem, where t_1 is the number of iterations of IRLS and t_2 is the number of iterations of the LSQR algorithm. Line 2 takes O(kmn) time for computing X̃, O(k^2 n) time for computing the scatter matrices and O(k^3) time for computing the matrix G_2, so the total computational complexity of Line 2 is O(mnk + k^2 n + k^3). Line 3 takes O(mk^2) for combining G_1 and G_2. Therefore, the total computational complexity of the l2,p-LDA algorithm is O(t_1 (t_2 k (3n + 5m + 2mn)) + mnk + k^2 n + k^3 + mk^2), which can be simplified to O(t_1 t_2 k mn) when k ≪ m and k ≪ n.
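The IRLS procedure for the regression stage can be sketched as follows. A dense normal-equation solve stands in for the LSQR step of the paper, the `delta` guard against zero residuals is an implementation detail of ours, and the function name is hypothetical:

```python
import numpy as np

def l2p_lda_stage1(X, L, p=1.5, gamma=1e-6, n_iter=30, delta=1e-10):
    """IRLS sketch for the l_2,p regression stage:
    min_G sum_i ||G^T x_i - (L^i)^T||_2^p + gamma * ||G||_F^2.

    X: (m, n) data with samples as columns; L: (n, k) weighted membership matrix.
    """
    m = X.shape[0]
    # Warm start: classical ridge regression (the p = 2 case).
    G = np.linalg.solve(X @ X.T + gamma * np.eye(m), X @ L)
    for _ in range(n_iter):
        R = G.T @ X - L.T                                   # (k, n) residuals
        # Weight update, Eq. (24): w_i = (p/2) * ||r_i||^(p-2).
        w = (p / 2.0) * (np.linalg.norm(R, axis=0) + delta) ** (p - 2.0)
        # Solve update, Eq. (27): G = (X D X^T + gamma I)^(-1) X D L, D = diag(w).
        G = np.linalg.solve(X @ (w[:, None] * X.T) + gamma * np.eye(m),
                            X @ (w[:, None] * L))
    return G
```

For p < 2 each solve minimizes a quadratic majorizer of the l2,p objective at the current iterate, so the objective value is non-increasing across iterations (up to the delta smoothing).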

C. DSC BASED ON l 2,p -NORM
In this section, the proposed l2,p-LDA algorithm is introduced into DSC and combined with KMp into a single clustering framework. Thus, a generalized DSC algorithm, called the l2,p-norm based discriminant subspace clustering (l2,p-DSC) algorithm, is obtained. In l2,p-DSC, l2,p-LDA is used to reduce the dimension of the data, and KMp is used to cluster the dimension-reduced data. The dimension reduction and clustering processes are optimized alternately until the optimal clustering and dimension reduction of the data are obtained. The following Algorithm 4 presents the detailed l2,p-DSC algorithm. In Algorithm 4, for distinction, the parameters p used in KMp and l2,p-LDA are denoted p_1 and p_2 respectively.

Algorithm 4 l2,p-DSC Algorithm
Input: Data matrix X, number of clusters k, regularization parameter γ, parameters p_1 and p_2.
1: Run Algorithm 2 (IKMp algorithm) to obtain the initial membership matrix U_0.
2: Run Algorithm 3 (l2,p-LDA algorithm) to get the transformation matrix G.
3: Y ← XG.
4: Run Algorithm 1 (KMp algorithm) on Y to update the membership matrix U. If ‖U − U_0‖ < ε, stop the iteration; otherwise, let U_0 ← U and return to Step 2.
Output: Transformation matrix G and membership matrix U.
The time complexity of the l2,p-DSC algorithm can be analyzed as follows. Line 1 takes O(nm(2 k_0 t_0)) time for IKMp clustering on the initial data, where k_0 is the number of anomalous clusters finally found and t_0 is the maximum number of iterations of the two-class KMp in IKMp. Line 2 takes O(t_1 (t_2 k (3n + 5m + 2mn)) + mnk) time for applying the l2,p-LDA algorithm to compute the transformation matrix G. Line 3 takes O(mnk) to reduce the dimension of the data using the transformation matrix G. Line 4 takes O(n k^2 t_3) time for KMp clustering on the dimension-reduced data, where t_3 is the number of iterations of KMp on the dimension-reduced data. So the total time complexity of the l2,p-DSC algorithm is O(nm(2 k_0 t_0) + t_1 (t_2 k (3n + 5m + 2mn)) + mnk + n k^2 t_3).
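The way Algorithm 4 wires the pieces together can be sketched in one self-contained loop. This is a deliberately simplified reading: random initialization stands in for IKMp, only the IRLS regression stage of l2,p-LDA is used (no second-stage LDA), KMp is inlined, and all names and guards are our own:

```python
import numpy as np

def l2p_dsc(X, k, p1=1.5, p2=1.5, gamma=1e-6, eps=1e-5, n_outer=20, seed=0):
    """Minimal sketch of the l2,p-DSC alternation (Algorithm 4).

    X: (n, m) centered data, rows are samples. Returns a one-hot membership
    matrix U and the transformation matrix G.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U0 = np.eye(k)[rng.integers(0, k, size=n)]      # stand-in for IKM_p init
    G = None
    for _ in range(n_outer):
        # l2,p-LDA stage: weighted membership matrix L, then IRLS regression.
        L = U0 / np.sqrt(U0.sum(axis=0) + 1e-12)
        G = np.linalg.solve(X.T @ X + gamma * np.eye(m), X.T @ L)  # p = 2 start
        for _ in range(10):
            w = (p2 / 2.0) * (np.linalg.norm(X @ G - L, axis=1) + 1e-10) ** (p2 - 2.0)
            G = np.linalg.solve(X.T @ (w[:, None] * X) + gamma * np.eye(m),
                                X.T @ (w[:, None] * L))
        # KM_p stage on the dimension-reduced data Y.
        Y = X @ G
        V = np.array([Y[U0[:, j] == 1].mean(axis=0) if U0[:, j].any() else Y[0]
                      for j in range(k)])
        labels = U0.argmax(axis=1)
        for _ in range(10):
            labels = (np.abs(Y[:, None, :] - V[None]) ** p1).sum(axis=2).argmin(axis=1)
            for j in range(k):
                Yj = Y[labels == j]
                if len(Yj):
                    Wj = (np.abs(Yj - V[j]) + 1e-8) ** (p1 - 2.0)
                    V[j] = (Wj * Yj).sum(axis=0) / Wj.sum(axis=0)
        U = np.eye(k)[labels]
        if np.abs(U - U0).sum() < eps:               # membership unchanged
            break
        U0 = U
    return U0, G
```

The outer loop alternates the two stages exactly as in Algorithm 4: each pass re-fits the subspace to the current partition and then re-partitions in the new subspace, until the membership matrix stops changing.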

D. A SPECIAL CASE OF DSC: ROBUST DSC
Parameters p_1 and p_2 in the l2,p-DSC algorithm have a strong effect on its performance. In this section, based on the theory of Robust Statistics [46], [47], the robustness to noise and outliers of the proposed l2,p-DSC algorithm is studied.
In fact, let {x_1, x_2, ..., x_n} be an observed data set and θ an unknown parameter to be estimated. In robust statistical theory [46], [47], an M-estimate of θ is generated by minimizing a sum of the form

$$\min_{\theta}\; \sum_{i=1}^{n} \rho(x_i - \theta), \quad (28)$$

where ρ is a function that measures the loss of x_i at θ. A location M-estimate is obtained by solving the equation

$$\sum_{i=1}^{n} \psi(x_i - \theta) = 0, \quad (29)$$

where ψ = ρ'. Taking ρ(x) = |x|^{p_1} and defining the weight function W(x) = ψ(x)/x, it follows that the estimate can be written as

$$\theta = \frac{\sum_{i=1}^{n} W(x_i - \theta)\, x_i}{\sum_{i=1}^{n} W(x_i - \theta)}. \quad (30)$$

Eq. (30) shows the estimate as a weighted mean of the observed data points. Since W(x) = p_1 |x|^{p_1 - 2} is a non-increasing function of |x| when p_1 < 2, outlying observations receive smaller weights [47]. So when p_1 < 2, the estimate (30) is robust to noise and outliers. The objective function of KMp can be rewritten as

$$J_{KM_p} = \sum_{j=1}^{k}\sum_{l=1}^{m}\sum_{i=1}^{n} u_{ij}\,|x_{il} - v_{jl}|^{p_1}. \quad (31)$$

Given U, the part $\sum_{i=1}^{n} u_{ij}\,|x_{il} - v_{jl}|^{p_1}$ of Eq. (31) can be seen as an M-estimate of the j-th cluster center on the l-th feature, using the j-th cluster as the observed data set and $|x_{il} - v_{jl}|^{p_1}$ as the loss function. Based on the above analysis, this estimate is robust to noise when p_1 < 2, so KMp is a noise-robust clustering algorithm when p_1 < 2. The use of robust statistical theory in regression leads to robust regression methods [46], [47]. Classical linear regression can be robustified in a straightforward way: instead of minimizing a sum of squares, one minimizes a sum of a less rapidly increasing function of the residuals [46]. Recall that the loss term of the regression (23) is

$$\rho(r_i) = \|r_i\|_2^{p_2}, \quad r_i = G_1^T x_i - (L^i)^T. \quad (32)$$

Compared with $\|r\|_2^2$, ρ(r) = $\|r\|_2^{p_2}$ is a less rapidly increasing function of the residuals when p_2 < 2. So the regression (23), which uses Eq. (32) as its loss function, is a more robust regression model than classical linear regression when p_2 < 2. Since the regression (23) constitutes the main part of l2,p-LDA (Algorithm 3), it follows that l2,p-LDA is a robust version of classical LDA when p_2 < 2.
Both KM_p (with p1 < 2) and l2,p-LDA (with p2 < 2) are robust to noise and outliers, and the proposed l2,p-DSC algorithm is composed of KM_p and l2,p-LDA. Therefore, with p1 < 2 and p2 < 2, the proposed l2,p-DSC algorithm is more robust to noise and outliers than DSC [17] and its variant algorithms [19], [23].

V. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, the performance of the proposed l2,p-DSC algorithm is empirically studied in comparison with the IKM [41], DSC [17], DKM [19] and DEC [23] algorithms.

A. EXPERIMENTAL SETUP
A total of twelve data sets are collected to verify the effectiveness of the proposed l2,p-DSC algorithm in the experiments: eight UCI data sets [48], including Wine, Post-Operative Patient, Statlog, page_blocks, Vehicle and Letter(abc); two synthetic data sets constructed from Wine, namely Wine1 and Wine2 (constructed by adding one and five outliers to the Wine data set, respectively); two gene expression data sets, Brain [49] and Leukemia (http://featureselection.asu.edu/datasets.php); and two text data sets, Cora_OS and WebKB_wisconsin (http://www.escience.cn/people/fpnie/index.html). See Table 2 for more details. All of them are normalized before the experiments.

Clustering accuracy is used to measure the performance of the algorithms. It is defined as

$$\mathrm{Accuracy} = \frac{1}{n} \sum_{j=1}^{c} n_j,$$

where $n_j$ represents the number of data points clustered correctly in the $j$-th class and $n$ is the total number of data points. The regularization parameters in DSC [17], DKM [19] and the l2,p-DSC algorithm are all set to $10^{-6}$. The balance parameter in DEC [23] is set to 2. The DSC [17], DKM [19] and DEC [23] algorithms are all initialized using the IKM [41] algorithm. The maximum number of iterations of the l2,p-DSC algorithm is set to 100, and the iteration threshold is set to $10^{-5}$. All experiments are performed in MATLAB 2014a on an Intel(R) Core(TM) i5-6200 CPU @2.4GHz with 8GB RAM.
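The clustering accuracy measure can be made concrete with a short sketch. This is our own illustrative implementation, not the authors' code: it counts correctly clustered points under the best cluster-to-class mapping, found here by brute force over permutations (adequate for the small class counts in these experiments).

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Accuracy = (sum_j n_j) / n, where n_j is the number of points
    clustered correctly in class j under the best mapping of cluster
    labels to class labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    best = 0
    # try every assignment of cluster labels to class labels
    for perm in permutations(classes):
        mapping = dict(zip(clusters, perm))
        correct = sum(mapping[c] == t for c, t in zip(y_pred, y_true))
        best = max(best, correct)
    return best / len(y_true)
```

For larger numbers of clusters, the permutation search would normally be replaced by the Hungarian algorithm, but the logic of the measure is the same.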

B. CONVERGENCE STUDY
Analyzing the overall convergence of the proposed l2,p-DSC theoretically is difficult. Since the proposed l2,p-DSC algorithm can be decomposed into two stages (the dimension reduction stage and the stage of clustering on the dimension-reduced data), in this section the convergence of each stage of the l2,p-DSC algorithm is studied empirically. Firstly, the convergence of the clustering-stage algorithm (Algorithm 1) is studied. Fig. 1 depicts the evolution of the objective function (OF) of KM_p (Eq. (19)) as a function of the number of iterations (N) on all twelve data sets, where the parameters p1 and p2 are chosen as 1.5 and 2, respectively.

FIGURE 1. Convergence of the clustering process in l2,p-DSC on twelve data sets.

From Fig. 1, it can be observed that the values of the objective function of KM_p show downward trends over the iterations, and KM_p converges within no more than dozens of iterations on each data set. This shows that the KM_p algorithm (Algorithm 1) can minimize the objective function of KM_p effectively and efficiently.
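The clustering stage studied above can be sketched in a few lines. The following is a hedged illustration of an l_p-norm k-means of the KM_p kind, not the paper's Algorithm 1: assignments use the per-center loss $\sum_l |x_{il} - v_{jl}|^{p_1}$, and each center coordinate is then re-estimated by the IRLS weighted mean mirroring Eq. (30). The function name, initialization, and the optional V0 argument are our own.

```python
import numpy as np

def km_p(X, k, p=1.5, iters=50, inner=10, eps=1e-8, seed=0, V0=None):
    """Sketch of l_p-norm k-means: alternate nearest-center assignment
    under the l_p loss with an IRLS update of each center coordinate."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), k, replace=False)].astype(float) if V0 is None \
        else np.array(V0, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        D = np.abs(X[:, None, :] - V[None, :, :]) ** p   # (n, k, d) losses
        labels = D.sum(axis=2).argmin(axis=1)            # assign to cheapest center
        for j in range(k):
            Xj = X[labels == j]
            if len(Xj) == 0:
                continue
            for _ in range(inner):                       # IRLS center update, cf. Eq. (30)
                W = p * np.maximum(np.abs(Xj - V[j]), eps) ** (p - 2)
                V[j] = (W * Xj).sum(axis=0) / W.sum(axis=0)
    return labels, V
```

With p = 2 the IRLS weights are constant and the center update reduces to the ordinary k-means mean, which matches the degeneration noted later in the parameter study.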
Then the convergence of the dimension-reduction-stage algorithm (Algorithm 3) is studied. Fig. 2 depicts the evolution of the objective function (OF) of the l2,p-norm based regression (Eq. (23)) as a function of the number of iterations (N) on all twelve data sets, where the parameters p1 and p2 are chosen as 2 and 1.2, respectively.

FIGURE 2. Convergence of the dimension reduction process in l2,p-DSC on twelve data sets.

From Fig. 2, it is observed that the values of the objective function of the l2,p-norm based regression (Eq. (23)) show downward trends over the iterations, and the l2,p-norm based regression converges within no more than dozens of iterations on all data sets. This shows that IRLS can minimize the objective function of the l2,p-norm based regression effectively and efficiently.
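The IRLS idea behind the dimension-reduction stage can also be sketched compactly. This is an illustrative sketch under our own assumptions (function name, intercept-free model, ridge-style eps stabilizer), not the paper's Algorithm 3: it minimizes $\sum_i \|W^\top x_i - y_i\|_2^{p_2}$ by repeatedly solving a weighted least-squares problem with sample weights $\|r_i\|_2^{p_2-2}$.

```python
import numpy as np

def l2p_regression(X, Y, p=1.2, iters=30, eps=1e-8):
    """IRLS sketch for min_W sum_i ||W^T x_i - y_i||_2^p: each step
    reweights samples by ||r_i||^(p-2) and solves the normal equations."""
    n, d = X.shape
    W = np.linalg.lstsq(X, Y, rcond=None)[0]  # ordinary least-squares start (p = 2)
    for _ in range(iters):
        R = X @ W - Y                                        # residuals, shape (n, c)
        s = np.maximum(np.linalg.norm(R, axis=1), eps) ** (p - 2)
        Xw = X * s[:, None]                                  # reweighted design
        W = np.linalg.solve(X.T @ Xw + eps * np.eye(d), Xw.T @ Y)
    return W
```

With p2 < 2 large residuals receive small weights, so a gross outlier barely perturbs the fit, in line with the robustness analysis of Section IV.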
C. STUDY ON THE EFFECTS OF PARAMETERS p1 AND p2

p1 and p2 are important parameters in the proposed l2,p-DSC algorithm. p1 is used in the KM_p algorithm, both for clustering initialization in the l2,p-DSC algorithm and for clustering on the dimension-reduced data. p2 is used in the l2,p-norm regression for the dimension reduction process of the l2,p-DSC algorithm. In this section, the effects of the parameters p1 and p2 on the performance of the proposed l2,p-DSC algorithm are studied empirically.
Firstly, the effect of the parameter p1 on the performance of the l2,p-DSC algorithm is studied. To this end, p2 is set to 2 in the experiments. At this point, the l2,p-norm based regression used in the l2,p-DSC algorithm degenerates into the commonly used F-norm based regression. Fig. 3 shows the clustering accuracies (y-axis) of l2,p-DSC on the twelve data sets for different p1 values (x-axis), where the clustering accuracies of IKM [41] and DSC [17] are recorded as baselines. From Fig. 3, it can be observed that the clustering accuracies of DSC are higher than those of IKM on eight data sets, including Wine, Wine1, Wine2, Letter(abc), Brain and Leukemia. This shows that DSC is commonly more effective than IKM. It can also be observed that, with proper p1 values, the clustering accuracies of l2,p-DSC are higher than those of IKM and DSC on almost all data sets (l2,p-DSC and DSC achieve the same accuracy on Wine). On Wine1, Wine2, Letter(abc), Brain, page_blocks, Leukemia and WebKB_wisconsin, the clustering accuracies of l2,p-DSC are much higher than those of IKM and DSC. At the same time, comparing the clustering results of l2,p-DSC and DSC on Wine1 and Wine2, it can be seen that when the value of the parameter p1 is chosen from the interval [1, 2), the l2,p-DSC algorithm shows relatively higher clustering accuracy. This shows that the l2,p-DSC algorithm has better noise-robustness when the value of the parameter p1 is between 1 and 2 (excluding 2). So it can be suggested that, given p2 = 2, p1 in the l2,p-DSC algorithm can commonly be selected in the range [1, 2.5]; and, if the data is noisy, p1 can be selected in the smaller range [1, 2).
Then, the effect of the parameter p2 on the performance of the l2,p-DSC algorithm is studied. To this end, p1 is fixed to 2 in the experiments. At this point, the KM_p clustering degenerates into the ordinary KM clustering accordingly. Fig. 4 shows the clustering accuracies (y-axis) of l2,p-DSC on the twelve data sets for different p2 values (x-axis), where the clustering accuracies of IKM [41] and DSC [17] are recorded as baselines. From Fig. 4, it can be observed that, with proper p2 values, the clustering accuracies of l2,p-DSC are higher than those of IKM and DSC on all data sets except for Post-Operative Patient. Compared with IKM and DSC, l2,p-DSC shows much higher clustering accuracy on Letter(abc), page_blocks, Leukemia and WebKB_wisconsin. At the same time, it can also be observed that for too small or too large values of p2, the l2,p-DSC algorithm may show poor clustering accuracy. So it can be suggested that, given p1 = 2, p2 in the l2,p-DSC algorithm can be selected in the range [1, 2.5]. Moreover, comparing the clustering results of l2,p-DSC and DSC on Wine1 and Wine2, it can be seen that when the value of the parameter p2 is chosen from the interval [1, 2), the l2,p-DSC algorithm shows relatively higher clustering accuracy. This shows that the l2,p-DSC algorithm has better noise-robustness when the value of the parameter p2 is between 1 and 2. So it can be suggested that, given p1 = 2, if the data is noisy, p2 in the l2,p-DSC algorithm can be selected in the smaller range [1, 2).

D. CLUSTERING ACCURACY EVALUATION

Table 3 presents the accuracies achieved by all five algorithms on all twelve data sets, where the accuracy achieved by l2,p-DSC is the optimal accuracy obtained by using p1 from the range [1.5, 2.5] and p2 from the range [1, 2.5]. Table 4 presents the corresponding running times.
From Table 3, it can be seen that, among the five algorithms, l2,p-DSC achieves the best performance on all data sets except for Letter in terms of optimal accuracy. l2,p-DSC yields improvements over DSC on all twelve data sets in terms of optimal accuracy. These results show that the extension of DSC, including its clustering and dimension reduction processes, from the F-norm to the l2,p-norm is beneficial as a whole.

E. CLUSTERING EFFICIENCY EVALUATION
In this section, the efficiency of the l2,p-DSC algorithm in terms of running time is studied. The running times of l2,p-DSC are compared with those of IKM [41], DSC [17], DKM [19] and DEC [23] on the twelve data sets, and are presented in Table 4. From Table 4, it can be observed that the running time of the proposed l2,p-DSC algorithm generally increases with the data size. This phenomenon is more obvious on the following five data sets: Brain, page_blocks, Leukemia, Cora_OS and WebKB_wisconsin. In terms of running time, the efficiency of l2,p-DSC is competitive with the DSC [17] and DEC [23] algorithms on the high-dimensional data sets, including Brain, Leukemia, Cora_OS and WebKB_wisconsin.

VI. CONCLUSION
In this paper, the DSC algorithm [17] is generalized in the sense of the l2,p-norm, and the l2,p-DSC algorithm is proposed. This generalization provides more alternative models for practical applications and improves the flexibility of the original DSC algorithm. Specifically, theoretical analysis and experimental results show that, compared with the DSC algorithm, l2,p-DSC with p1 < 2 and p2 < 2 is more robust to noise and outliers in the data. In addition, the efficient LSQR-based solution to the l2,p-LDA dimension reduction and the iterative solution to the KM_p clustering in l2,p-DSC offer the possibility of using l2,p-DSC on large data sets. Parameter selection is commonly an open problem in unsupervised learning algorithms due to the lack of class labels. Devising a semi-supervised l2,p-DSC algorithm, especially one that considers the selection of the parameters p1 and p2, is interesting, and will be carried out in our next work. In addition, multi-view learning has become a hotspot of machine learning, since the features of data may come from different sources and multi-view learning can handle this kind of data effectively. The DEC algorithm [23], a variant of DSC, has been extended to the multi-view case, and the corresponding algorithm has shown encouraging effects on data with multiple-source features [50]. So another line of further research will be the generalization of l2,p-DSC to the multi-view case.

JIULUN FAN received the B.S. and M.S. degrees in fundamental mathematics from Shaanxi Normal University, in 1984 and 1988, respectively, and the Ph.D. degree in signal and information processing from Xidian University, in 1998. He is currently the President and a Professor with the Xi'an University of Posts and Telecommunications, Xi'an, Shaanxi, China. He is the author of five books and more than 200 articles. His research interests include fuzzy set theory, pattern recognition, and image processing.