Robust Image Feature Extraction via Approximate Orthogonal Low-Rank Embedding

Feature extraction (FE) plays an important role in machine learning. To handle the "curse of dimensionality," the usual approach is to transform the original samples into a low-dimensional target space in which the FE task is performed. However, real-world data are always corrupted by various noises or interfered with by outliers, making feature extraction extremely challenging. Hence, we propose a novel image FE method via approximate orthogonal low-rank embedding (AOLRE), which adopts an orthogonal matrix to preserve the major energy of the samples, while the introduction of the $\ell_{2,1}$-norm makes the features more compact, discriminative and interpretable. In addition, the weighted Schatten $p$-norm is adopted in this model to fully explore the contributions of different ranks while closely approximating the original low-rank hypothesis. Meanwhile, correntropy is applied in AOLRE as a robust measure, which effectively suppresses the adverse influence of contaminated data and enhances the robustness of the algorithm. Finally, the introduction of a classification loss term allows our model to fit supervised scenarios effectively. Five common datasets are used to evaluate the performance of AOLRE. The results show that the recognition accuracy and robustness of AOLRE are significantly better than those of several advanced FE algorithms, with improvements ranging from 2% to 15%.


I. INTRODUCTION
As we all know, image processing technology has been applied to various areas of our lives, such as face recognition, image analysis, document clustering and medical diagnosis. By denoising, normalizing, enhancing, segmenting, registering an image and extracting its features, the image becomes easier to observe and analyze, and its quality can be significantly improved, which facilitates researchers' follow-up operations on the image. Feature extraction (FE), as the core link in image processing, plays a vital role in the final image processing result. Image feature extraction technology not only creates new scientific and technological achievements, but also greatly improves the accuracy of digitization. The continuous optimization of data acquisition methods makes the dimensionality of collected samples rise sharply. While bringing more detailed feature information, irrelevant and redundant features also follow, which greatly weakens the inherent structural relationships between features and ultimately degrades algorithm performance. Therefore, how to find high-quality low-dimensional representations in massive high-dimensional data has become a focus of researchers' attention [1], [2]. In recent years, scholars have carried out a great deal of research on applications of FE. For example, many feature extraction algorithms are used for their energy compaction property, such as orthogonal polynomials [3]-[5]. This study focuses on the design of a feature extraction algorithm, so applications of FE will not be described further here.

(The associate editor coordinating the review of this manuscript and approving it for publication was Paolo Remagnino.)
As a typical unsupervised FE method, principal component analysis (PCA) [6] achieves optimal data reconstruction by reducing the dimensionality of the data. Another typical supervised FE method is linear discriminant analysis (LDA) [7], whose discriminative projection is obtained by treating inter-class and intra-class scatter differently. Unfortunately, both methods assume that the data obey a Gaussian distribution. Therefore, extended versions of these two methods have been proposed one after another [8]-[12].
Over the years, with the continuous development of manifold learning theory, many locality-based FE methods have been proposed. For example, locality preserving projections (LPP) first finds an embedding that retains local structure, and then obtains a subspace that can better detect the underlying manifold structure [13]. Several improved versions have also been proposed [14]-[16]. In addition, local neighborhood reconstruction information is also used for feature extraction, as in neighborhood preserving embedding [17] and its improved versions [18], [19]. In general, if the target images satisfy the manifold constraint, these methods can achieve good performance. Their core task is to model the structure of the data by constructing a graph. However, real data are often contaminated and corroded by interference from the external environment during acquisition, which makes the distances between samples deviate from their true values. The geometric relationships among samples reflected by the constructed graph then deviate from the real situation, so the effectiveness of the aforementioned methods may be severely reduced. Therefore, improving the robustness of image FE is still a challenging task.
Lately, representation-based FE methods have received widespread attention [20]-[25]; the most typical are sparse representation (SR) and low-rank representation (LRR). As a general classification algorithm, SR classification (SRC) [20] exploits the sparsity inherent in the recognition problem to efficiently improve face recognition.
The key to this method is whether there are enough features and whether the sparse representation can be obtained accurately. Unfortunately, in the face of corrupted data, SRC often fails to achieve the desired results. Xu et al. proposed a two-phase test-sample SR method, which makes coarse-to-fine classification decisions and finally achieves good face recognition [22]. However, these methods are not adept at capturing the unique global information of samples. For this reason, many FE methods based on LRR have been proposed one after another [23], [24], [26]. These methods assume that samples of the same class are gathered in the same subspace, so that LRR can easily capture global features, thereby improving the robustness of FE. SR is adept at capturing the local information of samples, and LRR at capturing their global information. Inspired by this, a method combining SR and LRR has been proposed [27]. Although LRR has been widely used, it still has some limitations [28].
As mentioned earlier, traditional FE methods and representation-based FE methods have their own advantages and disadvantages. This has led many researchers to fuse the two types of methods so as to gain their respective advantages and overcome their respective disadvantages. Many advanced methods have been proposed one after another, which has promoted the continuous development of feature learning theory. Examples include sparse PCA (SPCA) [29], robust PCA (RPCA) [30], latent LRR (LatLRR) [31], collaborative representation-based classification (CRC) [32], structured LRR for image classification (LRRC) [33] and an extended version of LatLRR (SALPL) [34]. In particular, in LatLRR and SALPL, the data are separated into two spaces from the perspective of rows and columns, and the data information is reconstructed by optimizing two low-rank matrices. As a recently proposed and competitive FE method, low-rank embedding (LRE) [35] can complete the latent embedding subspace search while obtaining the optimal low-rank representation. However, this method still has some defects in data reconstruction and noise separation, and as an unsupervised method, its range of application is also limited.
In addition, the aforementioned LRR-based methods handle low-rank problems by using the nuclear norm as a convex relaxation of the optimization target, with which it is almost impossible to obtain the optimal solution [36]. Hence, the Schatten p-norm (SN) has been applied to enforce low-rank regularization [37], [38]. To better approach the optimal solution, Brbić and Kopriva [39] introduced Schatten-0 ($S_0$) and $\ell_0$-regularized objectives. However, these models treat all singular values the same, ignoring differences in rank contribution. Recently, to treat each rank component differently, a weighted nuclear norm minimization method (WNNM) was proposed for low-rank problems [40]. On this basis, to make better use of the low-rank property, a weighted SN method (WSNM) was proposed [38], [41], [42].
Encouraged by LRE, to ameliorate the shortcomings of existing FE methods and obtain more expressive features, we develop a novel FE algorithm via approximate orthogonal low-rank embedding (AOLRE). Our main contributions are summarized as follows: • In our AOLRE model, the data reconstruction term and the regularization terms are cleverly combined. Specifically, an orthogonal matrix is first designed to ensure that good principal components are preserved in the embedding space. In addition, the $\ell_{2,1}$-norm is applied to constrain the embedded projection matrix $F$ to guarantee that the differences among the obtained features are significant. Third, the introduction of the low-rank constraint $\|\cdot\|_{w,S_p}^{p}$ enables AOLRE to better capture the unique internal global information of samples, so as to recover damaged data as much as possible. Finally, correntropy is applied in this model as a robust measure to better handle non-Gaussian noise.
• Our model introduces a classification loss term. In the optimization process, it can skillfully use the label information to continuously obtain the information most favorable for classification.
• Sufficient experimental results confirm that the effectiveness, robustness and reconstruction ability of this method are superior to those of several existing methods; especially in the case of noisy and damaged observations, the advantages of our model are more obvious.
The remainder of this article is divided into four sections. Some related works are briefly reviewed in Section II. Section III presents our model and gives the optimization process. Sufficient experimental results and analysis are given in Section IV. Section V gives the conclusion of this work.

II. RELATED WORKS
Here, some related works are briefly reviewed, i.e., LRR-based FE, weighted Schatten p-norm and correntropy. Some notations used in our work are listed in Tab. 1.

A. LRR-BASED FE
As a recently proposed LRR-based FE method, LatLRR can effectively handle situations where the number of samples is insufficient or samples are severely damaged. In LatLRR, the data information is reconstructed by optimizing two low-rank matrices $L_c \in \mathbb{R}^{n\times n}$ and $L_r \in \mathbb{R}^{m\times m}$. For data $X \in \mathbb{R}^{m\times n}$, LatLRR's mathematical expression is

$$\min_{L_c, L_r, E} \|L_c\|_* + \|L_r\|_* + \lambda\|E\|_1 \quad \text{s.t.} \quad X = XL_c + L_rX + E, \tag{1}$$

where $E \in \mathbb{R}^{m\times n}$ denotes the sparse error.
Obviously, there is no interaction between $L_c$ and $L_r$ in (1), so the obtained result may not be optimal, which makes its recognition performance unpredictable.
In LRE [35], the optimal LRR and the optimal subspace can be sought at the same time. In particular, an orthogonal matrix $F$ is first introduced, then $L$ is obtained in the orthogonal subspace, and the data are reconstructed there. Its mathematical expression is

$$\min_{L, F} \|L\|_* \quad \text{s.t.} \quad F^{\top}X = F^{\top}XL, \; F^{\top}F = I. \tag{2}$$

In fact, when data are converted from one space to another, there must be a loss of information. As shown in (2), measuring the data reconstruction ability only in the embedding space is obviously not ideal. In other words, LRE neglects the loss of information during space conversion.

B. THE WEIGHTED SCHATTEN P-NORM
Given $X \in \mathbb{R}^{m\times n}$, its weighted Schatten p-norm is

$$\|X\|_{w,S_p} = \Big(\sum_{i=1}^{\min(m,n)} w_i \delta_i^p\Big)^{1/p},$$

where $w = [w_1, \ldots, w_{\min(m,n)}] > 0$ is the weight vector and $\delta_i$ is the $i$-th largest singular value of $X$. Usually we use the following form:

$$\|X\|_{w,S_p}^{p} = \sum_{i=1}^{\min(m,n)} w_i \delta_i^p = \operatorname{tr}(W\Delta^p),$$

where $W$ and $\Delta$ are diagonal matrices composed of the $w_i$ and $\delta_i$, respectively.
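As a quick numerical check of the definition above, the weighted Schatten p-norm can be computed directly from the SVD. The following numpy sketch is ours (function name and test matrix are illustrative); with equal weights and p = 1 it reduces to the nuclear norm.

```python
import numpy as np

def weighted_schatten_p(X, w, p):
    """Compute ||X||_{w,Sp}^p = sum_i w_i * delta_i^p over the
    singular values delta_i of X (in descending order)."""
    delta = np.linalg.svd(X, compute_uv=False)  # descending singular values
    return float(np.sum(w * delta**p))

X = np.diag([3.0, 1.0])
# Equal weights and p = 1 reduce to the nuclear norm: 3 + 1 = 4.
print(weighted_schatten_p(X, w=np.ones(2), p=1.0))
```

Choosing larger weights for smaller singular values penalizes minor rank components more heavily, which is the motivation for the weighted form.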

C. CORRENTROPY
In [44], for two vectors $a$ and $b$, their correntropy is

$$V_\sigma(a, b) = E[k_\sigma(a - b)],$$

where $E[\cdot]$ denotes the expectation and $k_\sigma(\cdot)$ is a kernel function (usually Gaussian). Based on this, the correntropy-induced metric (CIM) [44] is

$$\mathrm{CIM}(a, b) = \big(k_\sigma(0) - V_\sigma(a, b)\big)^{1/2}.$$

Thus, if we let $c = a - b$, the similarity between $a$ and $b$ can be measured by the sample correntropy

$$\hat{V}_\sigma(a, b) = \frac{1}{n}\sum_{i=1}^{n} k_\sigma(c_i).$$

In practice, we usually use the maximum correntropy criterion (MCC) [44], which is equivalent to minimizing the CIM. In [45], the performance of MCC is obviously better than that of the minimum mean square error (MMSE), which confirms the robustness of MCC. In addition, we compared the effects of several loss functions, including the absolute error ($\ell_1$-norm), MSE ($\ell_2$-norm) and CIM. Fig. 1 shows that CIM handles large errors better. Hence, correntropy will be used in the proposed model.
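To illustrate why the CIM handles large errors better than the MSE, the following numpy sketch (a toy example of ours, using a Gaussian kernel with σ = 1) compares the two measures on a vector pair that differs by a single large outlier: the MSE is dominated by the outlier, while the CIM saturates.

```python
import numpy as np

def cim(a, b, sigma=1.0):
    """Correntropy-induced metric with Gaussian kernel
    k_sigma(e) = exp(-e^2 / (2 sigma^2)); note k_sigma(0) = 1."""
    e = a - b
    k = np.exp(-e**2 / (2 * sigma**2))
    return float(np.sqrt(np.mean(1.0 - k)))

a = np.zeros(100)
b = np.zeros(100)
b[0] = 1000.0                       # a single large outlier
mse = float(np.mean((a - b)**2))    # blows up to 10000.0
print(mse, cim(a, b))               # CIM saturates near sqrt(1/100) = 0.1
```

However extreme the outlier, each entry contributes at most 1 to the CIM sum, which is exactly the bounded-influence property that makes correntropy robust.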

III. APPROXIMATE ORTHOGONAL LOW-RANK EMBEDDING
Next, we first give the derivation process of model AOLRE, and then develop an efficient optimization algorithm for it.

A. PROBLEM FORMULATION
As mentioned earlier, as an embedding-based method, LRE shows good performance in FE tasks, but it also has some shortcomings. The goal of this work is therefore to propose a FE method that maintains the advantages of LRE while overcoming its shortcomings. As we all know, when acquiring real data it is always difficult to avoid noise pollution caused by equipment or the environment. This pushes us to introduce a noise matrix $E \in \mathbb{R}^{m\times n}$ into the model. Comparing with (2), we can write $F^{\top}X = F^{\top}XL + E$. As a robust measure, correntropy is applied in our model through the term $\sum_{i,j}\phi(E_{i,j})$, where

$$\phi(E_{i,j}) = \sigma^2\big(1 - \exp(-E_{i,j}^2/\sigma^2)\big)$$

is the robust loss induced by the Gaussian kernel. In addition, we use $\|\cdot\|_{w,S_p}^{p}$ to obtain a low-rank representation. Therefore, model (2) can be improved to

$$\min_{F,L,E} \sum_{i,j}\phi(E_{i,j}) + \beta\|L\|_{w,S_p}^{p} \quad \text{s.t.} \quad F^{\top}X = F^{\top}XL + E, \; F^{\top}F = I, \tag{7}$$

where $F \in \mathbb{R}^{m\times d}$ is an orthogonal transformation matrix (or projection matrix).
Since the purpose of introducing $E$ is to model the noise in the original space, while the constraint $F^{\top}X = F^{\top}XL + E$ in (7) clearly shows that $E$ models the data noise in the target space, (7) deviates from the original intention of separating noise. To prevent this, we introduce an additional matrix $P \in \mathbb{R}^{m\times d}$ to perform the data reconstruction, so that $F$ can learn features more flexibly and the principal components of the data are more easily captured by the embedding representation $F^{\top}X$. The target formula can therefore be re-expressed as

$$\min_{F,P,L,E} \sum_{i,j}\phi(E_{i,j}) + \beta\|L\|_{w,S_p}^{p} \quad \text{s.t.} \quad X = PF^{\top}XL + E, \; F^{\top}X = F^{\top}XL, \; F^{\top}F = I. \tag{8}$$

Analyzing problem (8), we can see that the constraint $F^{\top}X = F^{\top}XL$ is exactly the latent LRE. Mathematically, we can therefore simplify (8) to

$$\min_{F,P,L,E} \sum_{i,j}\phi(E_{i,j}) + \beta\|L\|_{w,S_p}^{p} \quad \text{s.t.} \quad X = PF^{\top}XL + E, \; F^{\top}F = I. \tag{9}$$

The orthogonal constraint $F^{\top}F = I$ is introduced in the model to avoid trivial solutions. Through analysis, substituting $P^{\top}P = I$ for $F^{\top}F = I$ has the same effect, and the obtained $F$ and $P$ are more accurate.
Furthermore, given any sample $x_i$, its projected result is $F^{\top}x_i$. Analysis shows that encouraging sparsity in the rows of $F$ helps to obtain a good embedding representation $F^{\top}X$. Considering the characteristics of the $\ell_{2,1}$-norm [46], we introduce $\|F\|_{2,1}$ into the model, and the optimization problem is transformed into

$$\min_{F,P,L,E} \sum_{i,j}\phi(E_{i,j}) + \alpha\|F\|_{2,1} + \beta\|L\|_{w,S_p}^{p} \quad \text{s.t.} \quad X = PF^{\top}XL + E, \; P^{\top}P = I, \tag{10}$$

where $\alpha > 0$ and $\beta > 0$ are balance parameters. Furthermore, the known label information can obviously help to extract more effective features. To this end, we introduce a special discriminant function $\mathcal{L}(X, F)$ into the model.
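The row-sparsity effect of the $\ell_{2,1}$-norm follows from its definition as the sum of row-wise $\ell_2$ norms: penalizing this sum drives whole rows of $F$ toward zero, so the corresponding original features are discarded jointly. A minimal numpy illustration (our own toy matrix):

```python
import numpy as np

def l21_norm(F):
    """l_{2,1}-norm of F: the sum of the l_2 norms of its rows."""
    return float(np.sum(np.linalg.norm(F, axis=1)))

F = np.array([[3.0, 4.0],   # row norm = 5
              [0.0, 0.0],   # a zero row contributes nothing
              [0.0, 2.0]])  # row norm = 2
print(l21_norm(F))  # 5 + 0 + 2
```

Because a zero row adds nothing to the penalty while any nonzero row pays its full $\ell_2$ length, the minimizer prefers projections that use only a few informative rows.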
$$\min_{F,P,L,E} \sum_{i,j}\phi(E_{i,j}) + \alpha\|F\|_{2,1} + \beta\|L\|_{w,S_p}^{p} + \gamma\,\mathcal{L}(X,F) \quad \text{s.t.} \quad X = PF^{\top}XL + E, \; P^{\top}P = I, \tag{11}$$

where $\gamma > 0$ is a parameter balancing the term $\mathcal{L}$. Generally, in the learning process, the stronger the discriminating power of the projection matrix, the better the classification effect. Specifically, for $n$ training samples $X = [x_1, x_2, \cdots, x_n] \in \mathbb{R}^{m\times n}$ from $c$ classes, we construct the zero-one label matrix $B = [b_1, b_2, \cdots, b_n] \in \mathbb{R}^{c\times n}$. In the training phase of supervised FE, a linear classifier is usually used to process the sample features; in this paper, a linear classifier is applied to $F^{\top}X$. Consequently, using multivariate ridge regression, we take

$$\mathcal{L}(X, F) = \frac{1}{2}\|B - F^{\top}X\|_F^2. \tag{12}$$

Then, the final expression of AOLRE is

$$\min_{F,P,L,E} \sum_{i,j}\phi(E_{i,j}) + \alpha\|F\|_{2,1} + \beta\|L\|_{w,S_p}^{p} + \frac{\gamma}{2}\|B - F^{\top}X\|_F^2 \quad \text{s.t.} \quad X = PF^{\top}XL + E, \; P^{\top}P = I. \tag{13}$$

B. SOLUTION FOR AOLRE
Considering the particularity of the model, we design an effective algorithm, HQ-ADMM, to solve our model efficiently by combining ADMM [43] and half-quadratic (HQ) optimization [44].
In [47], the convergence of ADMM for non-convex problems is analyzed. For convenience, we introduce an auxiliary variable $J$ to obtain the following formula:

$$\min_{F,P,L,E,J} \sum_{i,j}\phi(E_{i,j}) + \alpha\|F\|_{2,1} + \beta\|J\|_{w,S_p}^{p} + \frac{\gamma}{2}\|B - F^{\top}X\|_F^2 \quad \text{s.t.} \quad X = PF^{\top}XL + E, \; P^{\top}P = I, \; L = J. \tag{14}$$

Its augmented Lagrangian form is

$$\mathcal{L} = \sum_{i,j}\phi(E_{i,j}) + \alpha\|F\|_{2,1} + \beta\|J\|_{w,S_p}^{p} + \frac{\gamma}{2}\|B - F^{\top}X\|_F^2 + \langle Y_1, L - J\rangle + \langle Y_2, X - PF^{\top}XL - E\rangle + \frac{\lambda}{2}\big(\|L - J\|_F^2 + \|X - PF^{\top}XL - E\|_F^2\big), \tag{15}$$

where $\lambda$ denotes the penalty parameter and $Y_1$, $Y_2$ are Lagrange multipliers.

1) UPDATE E
It can be optimized by fixing $(F, J, L, P)$ as

$$\min_{E} \sum_{i,j}\phi(E_{i,j}) + \frac{\lambda}{2}\Big\|X - PF^{\top}XL - E + \frac{Y_2}{\lambda}\Big\|_F^2. \tag{16}$$

However, the correntropy measurement term $\phi(E_{i,j})$ in (16) is difficult to optimize directly. According to [48], the HQ technique can be applied to optimize $E$. So, we introduce an auxiliary variable $S$ to obtain the following formula:

$$\phi(E_{i,j}) = \min_{S_{i,j}} \; S_{i,j}E_{i,j}^2 + \psi(S_{i,j}), \tag{17}$$

where $\psi(\cdot)$ denotes the dual function of $\phi(\cdot)$.
Substituting (17) into (16) transforms the optimization problem into

$$\min_{E,S} \|S \otimes E \otimes E\|_1 + \sum_{i,j}\psi(S_{i,j}) + \frac{\lambda}{2}\Big\|X - PF^{\top}XL - E + \frac{Y_2}{\lambda}\Big\|_F^2, \tag{18}$$

where $\otimes$ denotes the element-wise (Hadamard) product. The optimal solution of this problem can be obtained by alternately updating $E$ and $S$. In detail, when $E_{i,j}$ is fixed, we obtain

$$S_{i,j}^* = \exp\big(-E_{i,j}^2/\sigma^2\big), \tag{19}$$

where $\sigma^2 = \frac{1}{2m\times n}\|X - PF^{\top}XL\|_F^2$. Then, fixing $S_{i,j} = S_{i,j}^*$, $E$ can be optimized by

$$E = \lambda\Big(X - PF^{\top}XL + \frac{Y_2}{\lambda}\Big) \,./\, (2S + \lambda), \tag{20}$$

where $./$ denotes element-by-element division.
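The alternating steps above admit a compact numerical sketch. The code below is our own illustration, not the paper's implementation: it assumes the Welsch-type loss from Section III-A, writes $C$ for the fixed residual target $X - PF^{\top}XL + Y_2/\lambda$, and uses arbitrary toy values for $C$, $\lambda$ and $\sigma^2$. It shows the key behavior: small residuals are shrunk toward zero, while a large (outlier) residual keeps almost all of its value and is thus absorbed into $E$.

```python
import numpy as np

def hq_update_E(C, lam, sigma2, n_iter=10):
    """Alternate the two closed-form steps:
    S = exp(-E^2 / sigma^2)          (auxiliary HQ weights)
    E = lam * C / (2*S + lam)        (element-wise E update)"""
    E = C.copy()
    for _ in range(n_iter):
        S = np.exp(-E**2 / sigma2)
        E = lam * C / (2.0 * S + lam)
    return E

C = np.array([[0.1, -0.2],
              [5.0,  0.0]])          # one large residual at (1, 0)
E = hq_update_E(C, lam=1.0, sigma2=1.0)
print(E)  # small entries shrunk, the entry 5.0 kept almost intact
```

The weight $S_{i,j}$ is near 1 for small residuals (so they are smoothed) and near 0 for large ones (so they pass through to $E$ unweighted), which is how the correntropy loss isolates gross errors.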

2) UPDATE F
It can be optimized by fixing $(E, J, L, P)$ as

$$\min_{F} \alpha\|F\|_{2,1} + \frac{\gamma}{2}\|B - F^{\top}X\|_F^2 + \frac{\lambda}{2}\|C - PF^{\top}D\|_F^2. \tag{21}$$

By setting $\frac{\partial \mathcal{L}(F)}{\partial F} = 0$, we get the analytical solution of (21):

$$F = \big(2\alpha A + \gamma XX^{\top} + \lambda DD^{\top}\big)^{-1}\big(\gamma XB^{\top} + \lambda DC^{\top}P\big), \tag{22}$$

where $D = XL$ and $C = X - E + \frac{Y_2}{\lambda}$. $A \in \mathbb{R}^{m\times m}$ is a diagonal matrix satisfying

$$A_{i,i} = \frac{1}{2\|F_{i,\cdot}\|_2},$$

where $F_{i,\cdot}$ is the $i$-th row of $F$.

3) UPDATE J
It can be optimized by fixing $(E, F, L, P)$ as

$$\min_{J} \beta\|J\|_{w,S_p}^{p} + \frac{\lambda}{2}\|J - Q\|_F^2, \tag{23}$$

where $Q = L + \frac{Y_1}{\lambda}$. This is a typical WSNM problem. In light of [41], the general solution of (23) can be calculated by Lemma 1; the detailed solution process is given in [41].
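For intuition, in the special case p = 1 with non-decreasing weights, a problem of the form (23) reduces to weighted singular value soft-thresholding, as in WNNM. The following numpy sketch is our own toy illustration of that proximal step, not the general-p solver of [41]:

```python
import numpy as np

def weighted_svt(Q, w, lam):
    """argmin_J sum_i w_i * sigma_i(J) + (lam/2) * ||J - Q||_F^2,
    solved by soft-thresholding Q's singular values by w_i / lam
    (valid when the weights w_i are non-decreasing)."""
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    s_thr = np.maximum(s - w / lam, 0.0)   # weighted soft-threshold
    return U @ np.diag(s_thr) @ Vt

Q = np.diag([5.0, 1.0, 0.1])
J = weighted_svt(Q, w=np.array([1.0, 1.0, 1.0]), lam=1.0)
# Singular values [5, 1, 0.1] are thresholded to [4, 0, 0]:
print(np.linalg.svd(J, compute_uv=False))
```

Small singular values are zeroed out entirely, which is exactly the low-rank-promoting behavior the $\|\cdot\|_{w,S_p}^{p}$ term is meant to induce; larger weights on trailing singular values would suppress them even more aggressively.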

4) UPDATE L
It can be optimized by fixing $(E, F, J, P)$ as

$$\min_{L} \frac{\lambda}{2}\Big\|L - J + \frac{Y_1}{\lambda}\Big\|_F^2 + \frac{\lambda}{2}\|G_3 - PF^{\top}XL\|_F^2, \tag{24}$$

where $G_3 = X - E + \frac{Y_2}{\lambda}$. By setting $\frac{\partial \mathcal{L}(L)}{\partial L} = 0$, we have

$$\big(I + X^{\top}FF^{\top}X\big)L = J - \frac{Y_1}{\lambda} + X^{\top}FP^{\top}G_3, \tag{25}$$

so that

$$L = \big(I + X^{\top}FF^{\top}X\big)^{-1}\Big(J - \frac{Y_1}{\lambda} + X^{\top}FP^{\top}G_3\Big). \tag{26}$$

5) UPDATE P
It can be optimized by fixing $(E, F, J, L)$ as

$$\min_{P} \|G_3 - PF^{\top}XL\|_F^2 \quad \text{s.t.} \quad P^{\top}P = I. \tag{28}$$

Obviously, Eq. (28) is an orthogonal Procrustes problem (OPP) [49]. So, we can get the solution directly:

$$P = UV^{\top}, \tag{29}$$

where $U$ and $V$ come from the SVD $U\Sigma V^{\top} = G_3L^{\top}X^{\top}F$.

6) UPDATE $Y_1$ AND $Y_2$
By fixing the other parameters, they can be updated by

$$Y_1 = Y_1 + \lambda(L - J), \qquad Y_2 = Y_2 + \lambda(X - PF^{\top}XL - E). \tag{30}$$

To clearly explain the optimization process, we assume that $(E^t, F^t, J^t, L^t, P^t)$ and $(E^{t+1}, F^{t+1}, J^{t+1}, L^{t+1}, P^{t+1})$ represent the results of the $t$-th and $(t+1)$-th iterations, respectively. Algorithm 1 reveals the main steps of AOLRE in detail.
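The orthogonal Procrustes step used in the P-update has a simple closed form via the SVD. The following numpy sketch is a self-contained toy of ours (it uses a square orthogonal factor for simplicity, whereas the model's $P$ is rectangular with $P^{\top}P = I$): it solves $\min_P \|G - PH\|_F^2$ subject to orthogonality and recovers a known orthogonal factor exactly.

```python
import numpy as np

def procrustes(M):
    """For min_P ||G - P H||_F^2 s.t. P^T P = I, form M = G H^T and
    take its SVD M = U Sigma V^T; the optimal P is U V^T."""
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

rng = np.random.default_rng(0)
H = rng.standard_normal((3, 5))
P_true = procrustes(rng.standard_normal((3, 3)))  # a random orthogonal matrix
G = P_true @ H                                    # data rotated by P_true
P = procrustes(G @ H.T)                           # solve the OPP
print(np.allclose(P, P_true))
```

Because $GH^{\top} = P_{\text{true}}HH^{\top}$ and $HH^{\top}$ is symmetric positive definite here, the SVD-based solution reproduces the rotation up to numerical precision.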
IV. EXPERIMENTS AND ANALYSIS

A. EXPERIMENTAL SETTINGS
In our experiments, five general datasets taken from different application scenarios are used to evaluate the performance of AOLRE: face images (Extended Yale B [50], AR [51], ORL [52]), digit images (USPS [53]) and object images (COIL-20 [54]). In particular, the Extended Yale B (YaleB) dataset is used for three types of performance verification, namely different illumination, random pixel corruption and block occlusion corruption. Detailed information on these datasets is given in Tab. 2. Similar to the strategy used by several comparison algorithms, all samples are separated into two parts, used for feature extraction and for evaluating feature extraction capability, respectively. The parameters in our model are chosen from $\alpha, \gamma \in \{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1\}$, $\beta \in \{10^{-3}, 10^{-2}, 10^{-1}, 1, 10\}$ and $p \in \{1.0, 0.8, 0.5, 0.1, 0.05\}$. The parameters of the comparison algorithms come from their original articles. In particular, to increase comparability, the unsupervised LRE is extended to the supervised scenario by adding a classification loss term $\|B - F^{\top}X\|_F^2$; for this reason, the dimension of $F$ is reduced to the number of training classes, i.e., $c$. In addition, to ensure the reliability of the results, each experiment is repeated 20 times, and the final results are listed as mean and standard deviation.

B. FACE RECOGNITION USING THE YaleB DATASET
This dataset collects 2414 samples of 38 categories under various illuminations, and each category contains approximately 64 photos. As the light intensity and angle change, all images are damaged to varying degrees. In this work, all samples are resized to 32 × 32 pixels.

VOLUME 8, 2020

1) THE RECOGNITION PERFORMANCE UNDER DIFFERENT ILLUMINATIONS
As shown in Fig. 2 (a), most images are damaged by shadows or noise, which makes the FE task very challenging. Here, the training set is composed of Tr ∈ {10, 15, 20, 25} random samples of each category ("Tr" denotes the number of training samples), and all the other samples constitute the test set. The test results of all methods are revealed in Fig. 3. Obviously, AOLRE achieves the best performance under different numbers of training samples. Taking the Avg value as an example, the recognition rate of AOLRE is 2.23%, 3.78%, 6.70%, 4.71%, 5.02%, 3.35% and 2.22% higher than that of SRC, LRC, LatLRR, CRC, LRRC, LRE and SALPL, respectively.

2) THE RECOGNITION PERFORMANCE WITH BLOCK OCCLUSION
In this sub-experiment, we set Tr = 20 for the training set, and all the other samples constitute the test set. In particular, 5 of the 20 training images per category are corrupted by randomly added occlusion blocks. The position of these occlusion blocks in each sample is random, and their ratio ranges from 5% to 30%. Some examples are revealed in Fig. 2 (b), and the test results are revealed in Tab. 3. Taking the Avg value as an example, the recognition rate of AOLRE is 7.78%, 7.97%, 14.24%, 8.22%, 10.97%, 3.38% and 3.63% higher than that of SRC, LRC, LatLRR, CRC, LRRC, LRE and SALPL, respectively.

3) THE RECOGNITION PERFORMANCE WITH PIXEL CORRUPTION
Similar to the previous sub-experiment, Tr = 20 for the training set, and all the other samples constitute the test set. In particular, the pixels in 5 out of 20 pictures are damaged. These pixels are randomly selected and replaced with uniformly distributed values in [0, 255], and the corruption ratio ranges from 10% to 40%. Some examples are revealed in Fig. 2 (c), and the recognition accuracies are revealed in Tab. 4. Obviously, AOLRE achieves the highest recognition rate, 4.87% higher than the second best method, SALPL.

C. FACE RECOGNITION USING THE AR DATASET
The images in this dataset come from 70 male and 56 female subjects, more than 4000 facial images in total. The 26 samples of each category were gathered in two sessions, 13 per session. Of the 13 images, 6 are covered by sunglasses or scarves and 7 are uncovered. Fig. 4 shows some examples of the AR dataset.
In this work, we use a total of 2600 image samples of 50 female and 50 male subjects to form a sub-dataset, and all samples are resized to 60 × 44 pixels. The 1200 samples in the training set come from 6 samples randomly selected from each person in each session (1 with sunglasses, 1 with a scarf and 4 neutral), and the rest are used for testing. In Fig. 5, the result of AOLRE is 4.01% higher than that of the next best model, SALPL.

D. FACE RECOGNITION USING THE ORL DATASET
This dataset consists of image samples of 40 individuals, each providing 10 different images. When these pictures were collected, the angle and expression of the face were varied. In addition, the facial details of the subjects differ; for example, some wear glasses and some do not. Some examples are revealed in Fig. 6.
Similar to the YaleB dataset, all samples are reduced to 32 × 32 pixels. The training set is composed of Tr = 5 random samples of each category, and all the other samples constitute the test set. The recognition accuracies of all algorithms are shown in Fig. 7. As we can see, AOLRE achieves the best recognition accuracy among all methods, 2.01% higher than the next best model, SALPL.

E. DIGIT RECOGNITION USING THE USPS DATASET
Next, we evaluate the recognition level of AOLRE by using the USPS [53] handwritten digit database. This dataset is composed of handwritten images of the 10 digits from '0' to '9', and each digit has 1100 samples. Some examples are revealed in Fig. 8.
Here, we select 1000 samples from the original dataset to form a sub-dataset, with 100 pictures per category. For this sub-dataset, 40 random samples of each category are used for training, and the others for testing. As we can see from Fig. 9, the recognition accuracy of AOLRE is the best among all methods, 2.54% higher than the second best method, SALPL.

F. OBJECT RECOGNITION USING THE COIL-20 DATASET
This dataset is composed of 20 different categories, and each category has 72 black-background images taken from different views. Here, 10 random samples of each category are used for training, and the others for testing. Images of 10 categories in this dataset are displayed in Fig. 10. The recognition accuracies are revealed in Fig. 11. The analysis reveals that AOLRE obtains the best result among all methods, which proves that AOLRE can effectively extract features of object images.

G. EXPERIMENTAL ANALYSIS
Based on the results of extensive experiments on five datasets, we can summarize the following points:

1. Compared with other advanced representation-based feature learning methods, AOLRE achieves better results in face recognition, object recognition and so on, for the following reasons: (1) in the process of dimensionality reduction, $\|F\|_{2,1}$ plays a key role in retaining key features and can effectively remove redundant information; (2) the introduction of data reconstruction enables AOLRE to retain the main energy of the original data; (3) the classification loss term $\|B - F^{\top}X\|_F^2$ in AOLRE clearly guides the FE and makes the learned features more distinctive. These characteristics enable AOLRE to complete the classification task effectively. The specific experimental results are revealed in Fig. 3, Fig. 5, Fig. 7, Fig. 9 and Fig. 11.

2. In the robustness verification experiments, it can be clearly seen from Tab. 3 and Tab. 4 that the recognition accuracies of LRE, SALPL and AOLRE evidently exceed those of the other comparison models, and the performance of AOLRE is the best. The main reasons are as follows: (1) LRR can mine the global information of the data, which is very helpful for clearing away the corrupted part, and the introduction of $\|\cdot\|_{w,S_p}^{p}$ gives AOLRE better low-rank performance; (2) correntropy is applied in AOLRE, which handles non-Gaussian noise better; (3) as an extended version of LRE, AOLRE seeks important effective features by learning a robust embedding subspace. All in all, the above three aspects respectively focus on damaged blocks, outlier samples and damaged features. The integration of these advantages significantly improves the robustness of AOLRE. In the face of actual data with complex noise, our model is also competent for FE tasks.

H. PARAMETERS SENSITIVITY AND CONVERGENCE OF AOLRE
Next, we analyze the parameter sensitivity and convergence of AOLRE on the YaleB dataset (Tr = 10). In our model, there are three balance parameters (α, β and γ) and one low-rank parameter p. The parameter settings are as described in Section IV-A. For the balance parameters, we first fix α and find a better combination by varying β and γ jointly; then we change α and repeat the above process until the best solution is obtained. Fig. 12 and Fig. 13 show the recognition results of AOLRE on the YaleB dataset when varying the parameters. Obviously, the values of the parameters have different influences on the recognition performance, but overall the results are relatively stable.
Furthermore, the convergence of AOLRE is verified on the YaleB dataset (Tr = 10). In Fig. 14, we plot the curves of the primal residual, computed as $\max(\|L - J\|_F, \|X - PF^{\top}XL - E\|_F)$, and of the objective function value of Eq. (13). Obviously, our model converges within 20 iterations. Similar outcomes can also be obtained on the other datasets; due to space constraints, we do not show those results.

I. COMPUTATIONAL PERFORMANCE
In addition, computation time is also a key criterion for evaluating algorithms, so we test all algorithms on the YaleB dataset (Tr = 10); the results are shown in Fig. 15. Obviously, compared with LatLRR, LRRC and SALPL, AOLRE has a lower computational cost. As for SRC, LRC and CRC, although their computational cost is lower than that of AOLRE, their recognition rates are low; especially for noisy data, the recognition accuracy of AOLRE is significantly higher than that of these three algorithms.

V. CONCLUSION
In this work, a novel robust FE model (AOLRE) is designed, which can be regarded as an evolved version of LRE. In particular, this method uses an orthogonal matrix to retain the primary components of the samples. In addition, the introduction of $\|\cdot\|_{w,S_p}^{p}$ enables AOLRE to achieve better low-rank performance. Meanwhile, correntropy is applied in AOLRE, which successfully improves the model's resistance to complex noise, thereby enhancing the robustness of the algorithm. Finally, the introduction of the classification loss term allows our model to fit supervised scenarios effectively. Adequate experimental results certify that AOLRE is an advanced feature extraction method and is competent for feature extraction from data with complex noise.
Certainly, there is still room for further improvement of AOLRE: 1) the solution of the model (especially the relationship between the balance parameters) needs to be further improved and optimized to suit large-scale data; 2) it is worth trying to integrate the "kernel trick" into the model to deal with the nonlinearity of data; 3) similar to [55], it may be an interesting attempt to introduce a structured auto-encoder into the model.