Robust and Efficient Linear Discriminant Analysis With L2,1-Norm for Feature Selection

Feature selection and feature transformation are the two main approaches to reduce dimensionality, and they are often presented separately. In this study, a novel robust and efficient feature selection method, called FS-VLDA-L21 (feature selection based on a variant of linear discriminant analysis and the $L_{2,1}$-norm), is proposed by combining a new variant of linear discriminant analysis and $L_{2,1}$ sparsity regularization. Here, feature transformation and feature selection are integrated into a unified optimization objective. To obtain significant discriminative power between classes, all the data in the same class are expected to be regressed to a single vector, and the important task is to explore a transformation matrix such that the squared regression error is minimized. Therefore, we derive a new discriminant analysis from a novel view of least squares regression. In addition, we impose row sparsity on the transformation matrix through an $L_{2,1}$-norm regularized term to achieve feature selection. Consequently, the most discriminative features are selected while the redundant ones are eliminated. To address the $L_{2,1}$-norm based optimization problem, we design a new efficient iterative re-weighted algorithm and prove its convergence. Extensive experimental results on four well-known datasets demonstrate the effectiveness of our feature selection method.


I. INTRODUCTION
Machine learning has been widely applied in many fields of science, such as biology, economics, sociology, and engineering. The data in these domains are often high-dimensional; examples include documents, images, videos, gene expressions, and DNA copy numbers. The time and space complexity required to process such data is extremely high. Moreover, redundant features are not only useless but can also severely degrade the performance of machine learning [1]-[5]. Therefore, dimensionality reduction is extremely important in the data preprocessing stage [6], [7]. Several approaches exist to reduce dimensionality, such as kernel methods [8]-[10], subspace projection [8], and artificial neural networks [11]. In this study, we focus on subspace projection.
Feature selection and feature transformation are the two main approaches to reduce dimensionality. Feature selection is the process of selecting a subset of the relevant original features, whereas feature transformation methods map the original features to a new feature subspace; the two achieve dimensionality reduction in different ways. The current literature generally focuses on one of the two approaches, and few papers combine them [12]. Feature selection retains a subset of features with significant discriminative power and eliminates the noisy features. Thus, the selected features perform better than the original data in classification, clustering, and prediction tasks. Recently, numerous feature selection methods, such as MRMR [13], ReliefF [14], and LS [15], have been proposed. From the perspective of search strategy, feature selection methods fall into three models: filter, wrapper, and embedded. Filter models [16]-[18] are independent of classifiers; all the features are ranked according to a predefined criterion, and the highest-ranked ones are then selected. In wrapper models [19]-[22], the feature subset search algorithm is wrapped around the classification model, and the usefulness of the selected features is measured by the classifier performance. Embedded models [23]-[25] are a trade-off between the previous two: the search for an optimal subset of features is embedded directly in the training process. Compared with filter models, wrapper and embedded models often achieve good performance but at high computational cost. In this study, we focus on filter models.
Linear discriminant analysis (LDA) is one of the most popular supervised dimensionality reduction methods [26], [27]. Its main objective is to obtain an optimal projection matrix, such that the ratio of the between-class distance to the within-class distance is maximized. Over the past years, several variants of LDA have been proposed to achieve dimensionality reduction. S. Niijima and S. Kuhara adopted the maximum margin criterion (MMC), a variant of LDA, to achieve feature selection, proposing a recursive feature selection method using the discriminant vector of the MMC [28]. Z. Zhang et al. proposed a Tensor Locally Linear Discriminative Analysis (TLLDA) method for image representation [29]. Z. Zhao et al. elaborated a pairwise-criteria-based optimized LDA technique by defining new marginal inter- and intra-class scatters and proposed a variant of LDA called robust linearly optimized discriminant analysis [30]. A. Sharma et al. proposed a feature selection method by improving the regularized LDA technique to select important genes, crucial for the human cancer classification problem [31]. F. Yang et al. proposed an LDA-based feature selection method, minority class emphasized linear discriminant analysis (MCE-LDA), which addressed problems such as singularity, overfitting, and overwhelming [32]. Zhao et al. proposed soft-label-based LDA to achieve dimensionality reduction and applied it to image recognition and retrieval [33], [34]. They then integrated the Laplacian regularized least square and semi-supervised discriminant analysis into a constrained manifold regularized least square framework, and proposed a new semi-supervised dimensionality reduction method to address the problem that underlying discriminative information cannot be fully utilized [35]. Lu et al. combined structurally incoherent learning and low-rank learning with NPP to form a unified model called discriminative LR-2DNPP, which enhances the discriminative ability for feature extraction [36]. They then proposed a robust flexible preserving embedding method, in which clean data obtained by low-rank learning is used to learn the projection matrix [37]. In this study, we propose a novel efficient and robust feature selection method based on a new variant of LDA. In this method, the objective function is defined along the idea of least squares regression, such that a transformation matrix that minimizes the loss function is obtained.
Recently, sparsity regularization has been widely investigated as a means to achieve feature selection. A well-known regularization method is the L 1 penalty [38]. Cai et al. proposed multi-cluster feature selection (MCFS), which employs the L 1 -regularized regression model to select features [39]. Bradley and Mangasarian proposed the L 1 -SVM method to perform feature selection using L 1 -norm regularization. The disadvantages of this method are that the number of selected features is upper bounded by the sample size, and that among highly correlated features, only one or a few are picked [40]. Wang et al. proposed a hybrid huberized support vector machine (HHSVM) method and applied it to gene selection. HHSVM combines the L 1 -norm and the L 2 -norm to form a more structured regularization; thus, it performs automatic feature selection and encourages highly correlated features to be selected or eliminated together [41], [42]. Xu et al. proposed the L 1/2 penalty [43], and Huang et al. proposed the hybrid L 1/2+2 regularization (HLR) approach, a linear combination of the L 1/2 and L 2 penalties, in which the L 1/2 penalty performs feature selection [44]. Y. F. Ye et al. proposed robust L p -norm least squares support vector regression (L p -LSSVR) to achieve feature selection that is robust against outliers [45]. Q. L. Ye et al. proposed a new discriminant method that achieves robustness by replacing the L 2 -norm distances in conventional LDA with L p -norm and L S -norm distances [46]. Zhang et al. proposed L 2,p -norm regularization for feature selection and presented a proximal gradient algorithm and a rank-one update algorithm to solve the discrete selection problem [47]. Lu et al. proposed low-rank preserving projections (LRPP) for image classification, in which the L 2,1 -norm is used as a sparse constraint on the noise matrix [48]. C. Hou et al. proposed an unsupervised feature selection framework in which embedding learning and sparse regression are performed simultaneously to achieve feature selection [49].
Z. Lai et al. proposed a series of methods based on the L 2,1 -norm for linear dimensionality reduction. By replacing the L 2 -norm with the L 2,1 -norm in the objective function, these algorithms perform robust image feature extraction for classification [50]. Furthermore, they proposed a robust locally discriminant analysis via the capped norm; in this method, they constructed the robust between-class scatter matrix using the L 2,1 -norm instead of the L 2 -norm and imposed an L 2,1 -norm regularized term on the projection matrix to ensure joint sparsity [51]. Recently, a new generalized robust regression method for jointly sparse subspace learning was proposed. This method imposes the L 2,1 -norm penalty on both the loss function and the regularization term to guarantee joint sparsity and robustness to outliers [52]. Several other studies have been conducted on the L 1 -norm, L 1/2 -norm, L 2 -norm, L p -norm (0 < p < 1), and so on [53]-[56]. The majority of existing sparse dimensionality reduction methods apply an L 1 -norm regularization to the transformation matrix [57], [58], enforcing sparsity on the individual elements of the transformation matrix, which does not necessarily achieve feature selection. To select fewer features, the transformation matrix must be forced to contain more zero rows. Therefore, in this study, we impose sparsity on the rows of the transformation matrix by adding an L 2,1 regularization term to achieve feature selection. The resulting problem is difficult to optimize because the L 2,1 -norm is non-smooth; thus, an efficient iterative algorithm is proposed to solve it. The theoretical analysis is conducted in detail and the convergence of the algorithm is proved. Extensive experiments on four real-world datasets demonstrate the effectiveness of the proposed method.
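To illustrate why row sparsity, rather than element-wise sparsity, performs feature selection, the following sketch (illustrative NumPy code; the names are ours, not the paper's) computes the L 2,1 -norm of a transformation matrix and ranks the original features by the L 2 -norms of its rows. A zero row means the corresponding feature contributes to none of the projected dimensions and can be discarded.

```python
import numpy as np

def l21_norm(W):
    """L2,1-norm: the sum of the L2-norms of the rows of W."""
    return np.sum(np.linalg.norm(W, axis=1))

# A transformation matrix whose second row is zero:
# row norms directly score the importance of each original feature.
W = np.array([[0.9, -0.4],
              [0.0,  0.0],   # zero row -> feature 2 never contributes
              [0.3,  0.8]])

row_scores = np.linalg.norm(W, axis=1)   # one score per original feature
selected = np.argsort(row_scores)[::-1]  # features ranked by importance
print(l21_norm(W))   # sum of the three row norms
print(selected)      # -> [0 2 1]
```

An element-wise L 1 penalty could zero scattered entries of W while leaving every row nonzero, in which case no feature would actually be removed; the row-wise penalty avoids this.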
The contributions of this study are summarized as follows:
1) We propose a novel efficient and robust feature selection method by combining a new variant of LDA and sparsity regularization. LDA and its variants are mostly feature transformation methods, while sparsity regularization has recently been widely applied to feature selection. In this study, we integrate feature transformation and feature selection into a unified optimization objective.
2) We derive a new discriminant analysis for feature extraction from a novel view of least squares regression. To achieve significant discriminative power between the classes, all the data in the same class are expected to be regressed to a single vector, and the key task is to explore a transformation matrix such that the squared regression error is minimized.
3) We impose row sparsity on the transformation matrix of the new variant of LDA through L 2,1 -norm regularization to achieve feature selection, so that the most discriminative features are selected and the redundant ones are removed simultaneously.
4) To solve the L 2,1 -norm regularized optimization problem, we design an efficient iterative re-weighted algorithm. In addition, we analyze the algorithm and prove its convergence.

II. LINEAR DISCRIMINANT ANALYSIS (LDA) REVIEW
LDA is a popular method of feature extraction, wherein the original high-dimensional data is transformed into low-dimensional data by a transformation matrix. The transformation is

$$y = W^T x, \quad (1)$$

where $x \in \mathbb{R}^d$ is the original high-dimensional data, $W \in \mathbb{R}^{d \times m}$ is the transformation matrix ($d > m$), and $y \in \mathbb{R}^m$ is the low-dimensional data obtained after the transformation. It is well known that the main idea of LDA is that points in the same class should be as close as possible, while points in different classes should be as far apart as possible. Let $X = \{x_i \in \mathbb{R}^d \mid i = 1, \cdots, n\} \in \mathbb{R}^{d \times n}$ be the given training dataset, where $d$ is the dimensionality of the input samples and $n$ is the number of samples. The samples are divided into $c$ classes, and each sample $x_i$ corresponds to a class label $k$ ($1 \le k \le c$). $\pi_k$ denotes the set of samples in class $k$ and $n_k$ is the number of data points in class $k$. $\bar{x}_k$ is the average of the data points in class $k$, $\bar{x}$ is the average of all the data points, and $(\cdot)^T$ denotes the transpose of a matrix. The within-class scatter matrix $S_w$, between-class scatter matrix $S_b$, and total scatter matrix $S_t$ are defined as follows:

$$S_w = \sum_{k=1}^{c} \sum_{x_i \in \pi_k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)^T, \quad (2)$$

$$S_b = \sum_{k=1}^{c} n_k (\bar{x}_k - \bar{x})(\bar{x}_k - \bar{x})^T, \quad (3)$$

$$S_t = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T = S_w + S_b. \quad (4)$$
The traditional LDA solves the problem

$$\max_{W} \; \mathrm{tr}\left((W^T S_w W)^{-1}(W^T S_b W)\right), \quad (5)$$

where $\mathrm{tr}(\cdot)$ is the trace of a matrix. The optimal projection matrix $W$ can be obtained by computing the eigenvectors of $S_w^{-1} S_b$ corresponding to the first $m$ largest eigenvalues. In our previous research work, a discriminant analysis method for feature extraction was derived from least squares regression [59]. In that method, the squared loss function is defined as

$$\|W^T X - T\|^2, \quad (6)$$

where $T = [t_{c_1}, t_{c_2}, \cdots, t_{c_n}] \in \mathbb{R}^{c \times n}$, $c_i$ denotes the class label of sample $x_i$, and $t_k = W^T \bar{x}_k$. $\|\cdot\|$ is the Euclidean norm, defined as $\|M\|^2 = \mathrm{tr}(M^T M)$. The goal is to minimize Eq. (6) through the linear transformation $W$. The associated optimization problem is

$$\min_{W} \|W^T X - T\|^2. \quad (7)$$

To solve Eq. (7), a weight matrix $A_W$ is defined as

$$(A_W)_{ij} = \begin{cases} 1/n_k, & x_i, x_j \in \pi_k, \\ 0, & \text{otherwise}, \end{cases} \quad (8)$$

so that $T = W^T X A_W$. Let $I$ denote the identity matrix. Then, Eq. (2) can be rewritten as $S_w = X(I - A_W)X^T$, and the optimization problem in (7) can be rewritten as

$$\min_{W} \mathrm{tr}(W^T S_w W). \quad (9)$$

To avoid trivial solutions, the projection matrix needs to be constrained. In this study, we focus on the orthogonality constraint; thus, the optimization problem in (9) becomes

$$\min_{W^T W = I} \mathrm{tr}(W^T S_w W). \quad (10)$$

The optimal solution can be obtained via the Lagrangian function of the problem in (10); however, we propose another novel efficient and robust method to solve it.
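The scatter matrices of Eqs. (2)-(4) and the classical LDA solution via the leading eigenvectors of $S_w^{-1} S_b$ can be sketched as follows (a minimal NumPy illustration, not the paper's code; the small ridge term added to keep $S_w$ invertible is our assumption, since in practice $S_w$ may be singular):

```python
import numpy as np

def lda_scatter(X, y):
    """Within- and between-class scatter for X with samples as columns (d x n)."""
    d, n = X.shape
    mean_all = X.mean(axis=1, keepdims=True)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for k in np.unique(y):
        Xk = X[:, y == k]                      # samples of class k
        mk = Xk.mean(axis=1, keepdims=True)    # class mean
        Sw += (Xk - mk) @ (Xk - mk).T
        Sb += Xk.shape[1] * (mk - mean_all) @ (mk - mean_all).T
    return Sw, Sb

def lda_projection(X, y, m, reg=1e-6):
    """Columns: top-m eigenvectors of Sw^{-1} Sb (ridge keeps Sw invertible)."""
    Sw, Sb = lda_scatter(X, y)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + reg * np.eye(len(Sw)), Sb))
    order = np.argsort(-evals.real)            # largest eigenvalues first
    return evecs[:, order[:m]].real
```

The identity $S_t = S_w + S_b$ provides a quick correctness check for the scatter computation.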

III. FEATURE SELECTION BASED ON VARIANT OF LDA AND L2,1-NORM
In Eq. (10), $S_w$ is the within-class scatter matrix. By substituting Eq. (2) into Eq. (10), the optimization problem becomes

$$\min_{W^T W = I} \sum_{k=1}^{c} \sum_{x_i \in \pi_k} \|W^T (x_i - \bar{x}_k)\|_2^2, \quad (11)$$

where $\|\cdot\|_2$ denotes the $L_2$-norm, defined for a vector $v \in \mathbb{R}^n$ as $\|v\|_2 = \left(\sum_{i=1}^{n} |v_i|^2\right)^{1/2}$. Furthermore, the optimization problem in (11) can be rewritten as

$$\min_{W^T W = I,\; m_k} \sum_{k=1}^{c} \sum_{x_i \in \pi_k} \|W^T (x_i - m_k)\|_2^2. \quad (12)$$

Note that $m_k$ is also a variable that can be optimized. It can be easily verified that $m_k = (1/n_k) \sum_{x_i \in \pi_k} x_i$ is the optimal solution; thus, Eq. (12) is equal to Eq. (11). It is known that the squared loss function is extremely sensitive to outliers. To improve robustness, we use a non-squared loss function in this study. Thus, the optimization problem in (12) becomes

$$\min_{W^T W = I,\; m_k} \sum_{k=1}^{c} \sum_{x_i \in \pi_k} \|W^T (x_i - m_k)\|_2, \quad (13)$$

where the residuals in Eq. (13) are not squared; thus, outliers carry less weight than in Eq. (12). Then, we add the $L_{2,1}$-norm regularization term with parameter $\gamma$ to achieve feature selection, which leads to the following optimization problem:

$$\min_{W^T W = I,\; m_k} \sum_{k=1}^{c} \sum_{x_i \in \pi_k} \|W^T (x_i - m_k)\|_2 + \gamma \|W\|_{2,1}. \quad (14)$$

Solving this optimization problem is not easy because it is non-smooth. In the next section, this problem is solved using a simple and efficient algorithm.
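As a concrete reading of the final objective in Eq. (14), the following sketch (illustrative NumPy code, not the authors' implementation) evaluates the non-squared within-class loss plus the L 2,1 penalty for a given transformation matrix, taking the class means for the $m_k$:

```python
import numpy as np

def objective(W, X, y, gamma):
    """Robust objective of Eq. (14):
    sum_k sum_{x_i in class k} ||W^T (x_i - m_k)||_2 + gamma * ||W||_{2,1},
    with m_k taken as the class mean (X has samples as columns)."""
    loss = 0.0
    for k in np.unique(y):
        Xk = X[:, y == k]
        mk = Xk.mean(axis=1, keepdims=True)
        # non-squared per-sample residual norms: outliers are not amplified
        loss += np.linalg.norm(W.T @ (Xk - mk), axis=0).sum()
    return loss + gamma * np.sum(np.linalg.norm(W, axis=1))
```

Because each residual enters through its norm rather than its squared norm, a single far-away outlier shifts this objective linearly instead of quadratically, which is the robustness argument made above.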

IV. EFFICIENT ALGORITHM A. ALGORITHM DESIGN
In this section, we propose an iterative re-weighted method to obtain the solution $W$ of Eq. (14). The algorithm is described in Algorithm 1, and its theoretical analysis is presented in the next section. In each iteration, we need to solve the following problem:

$$\min_{W^T W = I,\; m_k} \sum_{k=1}^{c} \sum_{x_i \in \pi_k} s_i \|W^T (x_i - m_k)\|_2^2 + \gamma\, \mathrm{tr}(W^T D W), \quad (15)$$

where

$$s_i = \frac{1}{2\|W^T (x_i - m_k)\|_2}, \qquad d_{ii} = \frac{1}{2\|w^i\|_2},$$

are the weights as calculated in Algorithm 1 from the $W$ and $m_k$ of the previous iteration, $w^i$ denotes the $i$-th row of $W$, and $D$ is a diagonal matrix with $i$-th diagonal element $d_{ii}$.
Taking the derivative of Eq. (15) w.r.t. $m_k$ and setting the derivative to zero, we obtain

$$m_k = \frac{\sum_{x_i \in \pi_k} s_i x_i}{\sum_{x_i \in \pi_k} s_i}. \quad (16)$$

By substituting Eq. (16) into Eq. (15), the problem becomes

$$\min_{W^T W = I} \mathrm{tr}(W^T M W), \qquad M = \sum_{k=1}^{c} \sum_{x_i \in \pi_k} s_i (x_i - m_k)(x_i - m_k)^T + \gamma D. \quad (17)$$

The columns of the optimal solution $W$ in Eq. (17) are the $l$ eigenvectors of $M$ corresponding to the first $l$ minimum eigenvalues.
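Putting Eqs. (15)-(17) together, one pass of the iterative re-weighted scheme can be sketched as below. This is an assumed reading of Algorithm 1 in NumPy, not the authors' code: the random orthogonal initialization, the fixed iteration count, and the epsilon guard against zero norms are our choices.

```python
import numpy as np

def fs_vlda_l21(X, y, l, gamma, n_iter=30):
    """Sketch of the iterative re-weighted solver (assumed form of Algorithm 1).
    X: d x n data (samples as columns), y: labels, l: projected dimension,
    gamma: sparsity weight. Returns W (d x l); features are then ranked by
    the L2-norms of the rows of W."""
    d, n = X.shape
    classes = np.unique(y)
    rng = np.random.default_rng(0)
    W = np.linalg.qr(rng.normal(size=(d, l)))[0]   # orthogonal init, W^T W = I
    centers = {k: X[:, y == k].mean(axis=1) for k in classes}  # init m_k
    eps = 1e-8                                      # guard against division by zero
    for _ in range(n_iter):
        # per-sample weights s_i from the previous W and m_k (Eq. 15)
        s = np.empty(n)
        for k in classes:
            idx = np.where(y == k)[0]
            r = np.linalg.norm(W.T @ (X[:, idx] - centers[k][:, None]), axis=0)
            s[idx] = 1.0 / (2.0 * np.maximum(r, eps))
        # row-sparsity weights d_ii (Eq. 15)
        D = np.diag(1.0 / (2.0 * np.maximum(np.linalg.norm(W, axis=1), eps)))
        # update m_k (Eq. 16) and assemble M (Eq. 17)
        M = gamma * D
        for k in classes:
            idx = np.where(y == k)[0]
            centers[k] = (X[:, idx] * s[idx]).sum(axis=1) / s[idx].sum()
            Z = X[:, idx] - centers[k][:, None]
            M += (Z * s[idx]) @ Z.T
        # update W: eigenvectors of M for the l smallest eigenvalues (Eq. 17)
        evals, evecs = np.linalg.eigh(M)
        W = evecs[:, :l]
    return W
```

Because `np.linalg.eigh` returns orthonormal eigenvectors, the constraint $W^T W = I$ holds automatically at every iteration.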

B. ALGORITHM ANALYSIS
In this section, we prove that the objective function of Eq. (14) is non-increasing in Algorithm 1. First, we consider the following lemma.

Lemma 1: For any nonzero vectors $v, v_t \in \mathbb{R}^c$, the following inequality holds:

$$\|v\|_2 - \frac{\|v\|_2^2}{2\|v_t\|_2} \le \|v_t\|_2 - \frac{\|v_t\|_2^2}{2\|v_t\|_2}. \quad (18)$$

Proof: Obviously, $(\|v\|_2 - \|v_t\|_2)^2 \ge 0$, i.e., $\|v\|_2^2 - 2\|v\|_2\|v_t\|_2 + \|v_t\|_2^2 \ge 0$. Dividing both sides by $2\|v_t\|_2$ and rearranging gives

$$\|v\|_2 - \frac{\|v\|_2^2}{2\|v_t\|_2} \le \frac{\|v_t\|_2}{2} = \|v_t\|_2 - \frac{\|v_t\|_2^2}{2\|v_t\|_2},$$

which completes the proof.

Theorem 1: Algorithm 1 monotonically decreases the objective of Eq. (14) in each iteration until the algorithm converges.
Proof: In the $j$-th iteration, denote the updated $W$ and $m_k$ by $W_{j+1}$ and $m_k^{(j+1)}$, respectively. We have

$$s_i = \frac{1}{2\|W_j^T (x_i - m_k^{(j)})\|_2}, \quad x_i \in \pi_k, \quad (19)$$

$$d_{ii} = \frac{1}{2\|w_j^i\|_2}, \quad (20)$$

where $w_j^i$ denotes the $i$-th row of $W_j$. Because $W_{j+1}$ and $m_k^{(j+1)}$ are the optimal solutions of Eq. (15), the following inequality holds:

$$\sum_{k=1}^{c} \sum_{x_i \in \pi_k} s_i \|W_{j+1}^T (x_i - m_k^{(j+1)})\|_2^2 + \gamma \sum_{i=1}^{d} d_{ii} \|w_{j+1}^i\|_2^2 \le \sum_{k=1}^{c} \sum_{x_i \in \pi_k} s_i \|W_j^T (x_i - m_k^{(j)})\|_2^2 + \gamma \sum_{i=1}^{d} d_{ii} \|w_j^i\|_2^2. \quad (21)$$

By substituting $v$ and $v_t$ in Eq. (18) with $W_{j+1}^T (x_i - m_k^{(j+1)})$ and $W_j^T (x_i - m_k^{(j)})$, respectively, we arrive at

$$\|W_{j+1}^T (x_i - m_k^{(j+1)})\|_2 - s_i \|W_{j+1}^T (x_i - m_k^{(j+1)})\|_2^2 \le \|W_j^T (x_i - m_k^{(j)})\|_2 - s_i \|W_j^T (x_i - m_k^{(j)})\|_2^2. \quad (22)$$

Thus, by summing Eq. (22) over all the samples, the following inequality holds:

$$\sum_{k=1}^{c} \sum_{x_i \in \pi_k} \left(\|W_{j+1}^T (x_i - m_k^{(j+1)})\|_2 - s_i \|W_{j+1}^T (x_i - m_k^{(j+1)})\|_2^2\right) \le \sum_{k=1}^{c} \sum_{x_i \in \pi_k} \left(\|W_j^T (x_i - m_k^{(j)})\|_2 - s_i \|W_j^T (x_i - m_k^{(j)})\|_2^2\right). \quad (23)$$

Similarly, by substituting $v$ and $v_t$ in Eq. (18) with $w_{j+1}^i$ and $w_j^i$, respectively, and summing over all the rows, we arrive at

$$\sum_{i=1}^{d} \left(\|w_{j+1}^i\|_2 - d_{ii} \|w_{j+1}^i\|_2^2\right) \le \sum_{i=1}^{d} \left(\|w_j^i\|_2 - d_{ii} \|w_j^i\|_2^2\right). \quad (24)$$

Parameter $\gamma > 0$; thus, the following inequality holds:

$$\gamma \sum_{i=1}^{d} \left(\|w_{j+1}^i\|_2 - d_{ii} \|w_{j+1}^i\|_2^2\right) \le \gamma \sum_{i=1}^{d} \left(\|w_j^i\|_2 - d_{ii} \|w_j^i\|_2^2\right). \quad (25)$$

By summing Eqs. (21), (23), and (25) on both sides, the squared terms cancel and we obtain

$$\sum_{k=1}^{c} \sum_{x_i \in \pi_k} \|W_{j+1}^T (x_i - m_k^{(j+1)})\|_2 + \gamma \sum_{i=1}^{d} \|w_{j+1}^i\|_2 \le \sum_{k=1}^{c} \sum_{x_i \in \pi_k} \|W_j^T (x_i - m_k^{(j)})\|_2 + \gamma \sum_{i=1}^{d} \|w_j^i\|_2, \quad (26)$$

i.e.,

$$\sum_{k=1}^{c} \sum_{x_i \in \pi_k} \|W_{j+1}^T (x_i - m_k^{(j+1)})\|_2 + \gamma \|W_{j+1}\|_{2,1} \le \sum_{k=1}^{c} \sum_{x_i \in \pi_k} \|W_j^T (x_i - m_k^{(j)})\|_2 + \gamma \|W_j\|_{2,1}. \quad (27)$$

Thus, Algorithm 1 monotonically decreases the objective of the problem in Eq. (14) in each iteration. Because the objective function is bounded below, Algorithm 1 converges.
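Lemma 1 can be sanity-checked numerically. The following illustrative snippet (our own check, not part of the paper) evaluates the gap between the right and left sides of Eq. (18) on random nonzero vectors; algebraically the gap equals $(\|v\|_2 - \|v_t\|_2)^2 / (2\|v_t\|_2)$, so it should never be materially negative.

```python
import numpy as np

def lemma1_gap(v, vt):
    """Right side minus left side of the Lemma 1 inequality (Eq. 18)."""
    nv, nvt = np.linalg.norm(v), np.linalg.norm(vt)
    lhs = nv - nv**2 / (2 * nvt)
    rhs = nvt - nvt**2 / (2 * nvt)   # equals ||v_t||_2 / 2
    return rhs - lhs

rng = np.random.default_rng(0)
gaps = [lemma1_gap(rng.normal(size=3), rng.normal(size=3)) for _ in range(1000)]
print(min(gaps))  # nonnegative up to floating-point rounding
```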

V. EXPERIMENTAL METHOD
In this section, experiments are conducted to evaluate the performance of our proposed algorithm (denoted as FS-VLDA-L21). The comparison of our method with the previous methods, such as MRMR, ReliefF, and LS, is presented in the following section.

A. DATASET DESCRIPTION
In our experiments, four diverse public datasets, namely, ORL, YaleB, Umist, and Coil20, were used to test the performance of the different feature selection approaches.
The ORL face database included 40 distinct individuals and each individual had 10 different images. The images were captured at different times, with varied lighting, different facial expressions (open/closed eyes, smiling/not smiling), and different facial details (glasses/no glasses). The original size of each image was 112 × 92 pixels, with 256 grey-levels.
In our experiments, we resized each image to 28 × 23 pixels.
The YaleB database contained a total of 2,432 face images of 38 distinct subjects. Each subject had approximately 64 near-frontal images under different illuminations. The images were cropped and resized to 32 × 28 pixels.
The Umist database had 575 total face images of 20 different people. The original size of each image was 112 × 92 pixels. In our experiments, they were cropped and resized to 28 × 23 pixels.
The Coil20 database was composed of 1,440 images of 20 different objects. The images of each object were taken 5° apart as the object was rotated, and each object had 72 images. Each image was resized to 32 × 32 pixels.

B. EXPERIMENTAL PROCESS
For each public dataset, we constructed three training datasets consisting of 4, 6, or 8 samples from each class, respectively. These samples were randomly selected each time, and the remaining samples were used for testing. The regularization parameter γ controls the tradeoff between the loss term and the sparsity regularization term; the value of γ that yields the best experimental result is adopted. To improve the efficiency of the experiments, we preprocessed the data using the PCA method. The discriminative features were selected by the traditional MRMR, ReliefF, and LS methods and by our FS-VLDA-L21 method, respectively. The 1-nearest neighbor classifier was used to perform the classification. Each experiment was performed multiple times on different random samples, and the average accuracy was calculated and recorded. A higher classification accuracy indicates a better-performing feature selection method.
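The evaluation protocol described above (random per-class split, keep the top-ranked features, classify with 1-nearest neighbor, average over repetitions) can be sketched as follows. The function names and the synthetic data in the test are illustrative, not from the paper; any real feature-ranking scores (e.g., row norms of the learned W) would be passed in as `scores`.

```python
import numpy as np

def knn1_accuracy(Xtr, ytr, Xte, yte):
    """1-nearest-neighbor accuracy; samples as rows."""
    # pairwise squared Euclidean distances between test and training samples
    d2 = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=2)
    pred = ytr[np.argmin(d2, axis=1)]
    return (pred == yte).mean()

def evaluate(X, y, scores, n_features, train_per_class, rng):
    """Keep the top-ranked features, split per class, report 1-NN accuracy."""
    keep = np.argsort(scores)[::-1][:n_features]   # best-scored features first
    tr, te = [], []
    for k in np.unique(y):
        idx = rng.permutation(np.where(y == k)[0])
        tr.extend(idx[:train_per_class])           # e.g. 4, 6, or 8 per class
        te.extend(idx[train_per_class:])
    tr, te = np.array(tr), np.array(te)
    return knn1_accuracy(X[np.ix_(tr, keep)], y[tr], X[np.ix_(te, keep)], y[te])
```

Repeating `evaluate` over several random splits and averaging the returned accuracies mirrors the averaging procedure described above.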

VI. EXPERIMENTAL RESULTS
In this section, the experimental results are shown in Figs. 1-12 and Tables 1-12. The comparison and analysis of the experimental results are presented in the following section.
Figs. 1-12 depict the classification accuracies computed by the 1-nearest neighbor classifier for the four public datasets and three different training-sample sizes per dataset using the different feature selection algorithms. As illustrated in these figures, for the ORL and Coil20 datasets, all the methods achieve higher classification accuracy as more features are selected, and more often than not, the proposed FS-VLDA-L21 method performs better than the other approaches.

VII. CONCLUSION
In this study, a novel supervised feature selection method, which combines a new variant of LDA and sparsity regularization, was proposed. We derived a new discriminant analysis from a novel view of least squares regression; the key task was to explore a transformation matrix such that the squared regression error was minimized. We imposed row sparsity on the transformation matrix through L 2,1 -norm regularization to achieve feature selection, thereby integrating feature transformation and feature selection into a unified optimization objective. Consequently, the most discriminative features were selected and the redundant ones were eliminated simultaneously. Furthermore, an efficient optimization algorithm was derived to solve the non-smooth objective, and we proved that the proposed algorithm monotonically decreases the objective until it converges. Extensive experiments were performed on four public datasets. Both the theoretical analysis and the empirical results demonstrate that our new feature selection method is robust, effective, and superior to the existing methods.

LIBO … His current research interests include machine learning and its application fields, such as pattern recognition, data mining, computer vision, image processing, and information retrieval.
YANG LIU received the M.S. degree in computer science from the Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu, China, in 2012. She is currently an Associate Professor and a Master's Supervisor of the North China University of Water Resources and Electric Power. She has authored 30 technical articles in refereed journals and proceedings. Among these articles, 20 have been included by EI and SCI. Her current research interests include machine learning and its application fields, such as data mining, intelligent information process, and intelligent water conservancy.