Robust Kernel Principal Component Analysis With ℓ2,1-Regularized Loss Minimization

Principal component analysis (PCA) is a widely used unsupervised method for dimensionality reduction. Its kernelized version, kernel principal component analysis (KPCA), can capture nonlinear data structure. KPCA is derived from the Gram matrix, which is not robust when outliers exist in the data; the principal axes in the feature space can be deviated by outliers, leading to misinterpretation of the principal components. In this paper, we propose a robust KPCA method based on a reformulation in Euclidean space, where an error measurement is introduced into the loss function and $\ell _{2,1}$-regularization is added to it. The $\ell _{2,1}$-regularization of the proposed method is motivated by sparse PCA via variable projection. However, because orthogonality is not satisfied in the proposed method, orthonormal bases are obtained by using the Gram-Schmidt orthonormalization process. In the experiments, a toy example and real data are used for outlier detection to verify the method's performance and effectiveness. In the toy example, the proposed method reduces the influence of outliers and detects more outliers than KPCA. For the real data, the proposed method improves detection in comparison to other existing methods.


I. INTRODUCTION
Principal component analysis (PCA) [1] is a linear feature extraction method for reducing dimensions from possibly high-dimensional data. A kernel extension of PCA for capturing nonlinear features is kernel principal component analysis (KPCA) [2], where data are mapped into the feature space through a nonlinear mapping. Instead of computing the nonlinear mapping explicitly, the kernel trick is used to calculate the inner products. KPCA has been successfully applied in many applications, such as fault detection [3], novelty detection [4], and outlier detection [5]. It is known that PCA can be obtained from the maximum variance criterion or the mean squared error (MSE) criterion. However, no matter which derivation is used, PCA is sensitive to, or not robust against, outliers. For instance, outliers with large squared errors can dominate the MSE and lead to misleading interpretations. The same problem occurs in KPCA.
To solve this problem, many researchers have proposed robust KPCA methods to mitigate this weakness. (The associate editor coordinating the review of this manuscript and approving it for publication was Sungroh Yoon.)
Robust KPCA methods can be divided into five categories. The first category is based on a robust estimator of the Gram matrix. The rationale for this approach is that the zero mean is not a robust measure of the true center when outliers exist in the data. Therefore, this approach approximates a robust version of the covariance matrix [6]. In [6], a robust principal component (PC) is obtained from the eigendecomposition of the robust covariance matrix. There are many robust estimators for estimating multivariate location and scatter in the presence of outliers from a statistical perspective, such as the M estimator [6] and the minimum covariance determinant method [7]. An overview of several robust estimation methods is presented by Rousseeuw and Hubert [8]. Debruyne and Verdonck [9] presented three robust KPCA algorithms: spherical KPCA, kernel projection pursuit (KPP), and kernel ROBPCA, which are the kernel versions of spherical PCA [10], projection pursuit [11], and robust PCA (ROBPCA) [12], respectively. Spherical KPCA calculates the spatial median in the feature space and computes the Gram matrix with respect to the spatial median. KPP uses the Gram matrix obtained from spherical KPCA to obtain the first PC with the help of the Q estimator [13], and then the remaining PCs are found by implementing projection pursuit based on the Q estimator. Kernel ROBPCA utilizes the Stahel-Donoho estimator to calculate the outlyingness of each sample and selects three-quarters of the samples with the lowest outlyingness to compute KPCA. However, this method is computationally expensive for a large training set.
The second category for robustifying KPCA is to use a robust loss function. One example of this strategy is PCA-L1 [14], where the ℓ1-norm is substituted for the ℓ2-norm in the maximum variance objective function of standard PCA. After the first PC is obtained, the subsequent PCs are derived with a greedy search method. Kwak proposed a robust method for KPCA with the nonlinear projection trick [15]. By using the eigendecomposition of the kernel matrix, the input data are mapped into a reduced-dimensional kernel space with explicit maps, yielding data with new coordinates, and then the PCA-L1 algorithm is performed on these data. Xiao et al. [16] proposed ℓ1-KPCA to robustify KPCA. This method can be seen as a kernel extension of PCA-L1, but the constraint is different from that of PCA-L1. The ℓ1-KPCA algorithm includes an upper layer and a lower layer to obtain orthogonal loading vectors: the upper layer deflates the Gram matrix to preserve the orthogonality of the loading vectors, and the lower layer obtains the optimal coefficients of each loading vector. Kim and Klabjan [17] proposed a simple and fast algorithm for robustifying KPCA. The advantage of this method is that it avoids implementing an eigendecomposition. Although these methods reduce the effect of outliers, one drawback is that the components are obtained sequentially rather than simultaneously. Other methods that include a robust loss function have been proposed [18]-[20]. Alzate and Suykens [18] used an epsilon-insensitive robust loss function and proposed two algorithms. However, the optimization process is complex, and only one component can be obtained at a time. Huang et al. [19] proposed an iteratively reweighted algorithm for robustifying KPCA, in which small weights are assigned to outliers during the iteration process. He et al. [20] presented robust KPCA based on a maximum correntropy criterion.
However, the optimization procedure of these methods is equivalent to the optimization problem of weighted PCA, which leads to expensive computation.
The third category for robustifying KPCA can be considered outlier rejection. Lu et al. [21] introduced robust KPCA by iteratively eliminating outliers from the training set according to reconstruction errors. However, this method fails to cope with the small sample size problem. Ding et al. [22] considered removing undesirable samples with the largest influence indexes. Based on the idea of the influence index, Duan et al. [23], [24] constructed a weight function by using the influence index for the Gram matrix and computed the weights iteratively; an outlier is identified by a weight smaller than a threshold. Xu et al. [25] proposed a robust PCA algorithm and extended it to a kernel version. This approach includes a random step to remove a possible outlier at the price of additional computational cost in each iteration.
Feng et al. [26] proposed a deterministic approach by decreasing the weights of all observations for each iteration to reduce the computation cost of the algorithm in [25]. However, those methods are computationally expensive due to the iterative calculation of the Gram matrix.
The fourth category for robustifying KPCA is based on robust PCA (RPCA) [27], which aims to recover a low-rank matrix and a sparse matrix from a corrupted matrix. Ma et al. [28], [29] extended this model to the feature space and proposed two approaches for optimizing the problem. The first method builds a kernel matrix for a low-rank matrix and a sparse matrix by using the distributive property over matrix addition. The second approach uses the Gram matrix as a surrogate for the original low-rank matrix in the feature space and an ℓ1-norm sparse matrix in the input space. The limitation of these two models is that the former may lose its advantage with a small number of training samples, while the latter is sensitive to the initialization. Fan and Chow [30] proposed to robustify KPCA based on the nuclear norm of a low-rank matrix in the feature space and an ℓ1-norm sparse matrix in the input space. However, this method involves a derivative matrix, which raises the computational complexity because of the large size of the Gram matrix.
The fifth category uses the regularization technique. Nguyen and Torre [31] proposed a robust method for handling noise, missing data, and outliers with the help of a penalty term. However, in the case of noise and missing data, KPCA is calculated on a clean dataset, and the method gives few details about how to deal with outliers. Another option for using the regularization technique is based on the membership values of the input data, which indicate each sample's contribution to the calculation of the covariance matrix. Pang et al. [32] decided the membership values by using a penalty term with discrete values restricted to {0, 1}. The idea of this method is based on a robust technique for PCA [33]. However, the limitation is that the membership value is decided in the context of the first PC only, which is not sufficient to reflect the inherent structure learned from the data. To overcome this limitation, Heo et al. [34] considered more than one PC to decide the membership by using entropy regularization; the fuzzy mean and the fuzzy covariance are then computed with the membership values. The drawback of this algorithm is its sensitivity to the initial membership values. Tao et al. [35] proposed density-sensitive robust fuzzy KPCA to overcome this drawback. However, the effect of the kernel parameter is not discussed in their research. Mateos and Giannakis [36] considered a low-rank bilinear decomposition model with outlier-sparsity regularization to find outliers. The sparsity parameter must be adjusted according to the percentage of outliers, but it is difficult to set it manually so as to perfectly identify outliers in practical applications.
The underlying idea of regularization focuses on robust estimates of the Gram matrix or outlier sparsity to reduce the influence of outliers, but pays little attention to mitigating the contribution of individual samples to the principal axes. Because the Gram matrix is not robust when outliers exist in the data, the principal axes in the feature space can be deviated by outliers, leading to misinterpretation of the PCs. In this paper, we propose a robust method for KPCA. The idea of the proposed method comes from sparse PCA via variable projection [37]. In detail, we show that KPCA can be reformulated in Euclidean space. Then, a robust loss function based on the idea of RPCA [27] is introduced, and ℓ2,1-norm regularization is added to the loss term. Different from the other regularization methods, our idea is to restrain the effect of outliers expressed in the principal axes. The goal of the proposed method is analogous to the ℓ1-norm methods of the second category, but it is based on ℓ2,1-regularized minimization. In addition, as orthogonality is not satisfied in the proposed method, we use the Gram-Schmidt orthonormalization process to obtain orthonormal bases.
The rest of the paper is organized as follows. In Section II, we describe KPCA, an expression of KPCA in Euclidean space, and the drawback of KPCA. Then, in Section III, we provide the proposed algorithm, and in Section IV, we show the experimental results. Finally, in Section V, we present the conclusion.

II. KPCA AND ITS DRAWBACK
In this section, we give some basic notations first and then detail KPCA and its drawback.

A. NOTATIONS
For convenience, some notations are used in this paper. Lowercase bold letters denote vectors, and uppercase bold letters denote matrices. The superscript T represents the transpose of a matrix or vector. The lowercase letters i, j, and k denote indexes, and t represents the iteration. D denotes the dimensionality, N denotes the number of samples, and M stands for the number of retained components. Other uppercase letters represent functions. ||x||_2 denotes the ℓ2-norm of vector x. For a matrix E ∈ R^{N×M}, ||E||_F represents the Frobenius norm, and the ℓ2,1-norm of E is defined as ||E||_{2,1} = Σ_{i=1}^{N} ||e_i||_2, where e_i is the ith row vector of E. Let φ : R^D → H be the kernel mapping that maps the input data into the feature space, where H is a reproducing kernel Hilbert space (RKHS) [38]. Given the mapping φ, each x_i is mapped into the feature space H as φ(x_i). Next, we detail KPCA and its drawback.

B. KPCA
Let Φ = [φ(x_1), . . . , φ(x_N)]. For simplicity, we suppose Φ is a mean-centered matrix. The covariance matrix C in the feature space can be expressed as:

  C = (1/N) Φ Φ^T.   (1)

The eigenvalue λ_k and the eigenvector v_k of C satisfy:

  C v_k = λ_k v_k.   (2)

Note that v_k in the feature space can be written as a linear combination of φ(x_1), . . . , φ(x_N):

  v_k = Σ_{i=1}^{N} a_{ki} φ(x_i) = Φ a_k,   (3)

where a_k = (a_{k1}, . . . , a_{kN})^T is the coefficient vector. Substituting (3) into (2) yields

  (1/N) Φ Φ^T Φ a_k = λ_k Φ a_k.   (4)

As the inner product in the RKHS is given by the kernel function k(x_i, x_j) = φ(x_i)^T φ(x_j), by left-multiplying Φ^T on both sides of (4), we obtain K a_k = N λ_k a_k, where K = Φ^T Φ is the Gram matrix. Solution a_k can therefore be obtained from the kth eigenvector of K corresponding to the kth largest eigenvalue. By using the eigendecomposition, K can be decomposed into

  K = E D E^T,   (5)

where E = [e_1, . . . , e_N], whose columns are orthonormal, and D is a diagonal matrix storing the eigenvalues λ_1, . . . , λ_N of K in descending order. Then, with the normalization ||v_k||_2 = 1, solution a_k is given as a_k = λ_k^{-1/2} e_k, or in matrix form,

  A = [a_1, . . . , a_M] = E_M D_M^{-1/2},   (6)

where E_M = [e_1, . . . , e_M] and D_M = diag(λ_1, . . . , λ_M). The principal axis v_k can be expressed as v_k = λ_k^{-1/2} Φ e_k. The projection of Φ onto the subspace spanned by the first M principal axes is given by

  Y = A^T K = D_M^{-1/2} E_M^T K = D_M^{1/2} E_M^T.   (7)

However, an alternative interpretation of KPCA can be obtained in the least-squares sense. Jenssen [39] pointed out that the KPCA procedure minimizes the Frobenius norm

  min_Y ||K − Y^T Y||_F^2,   (8)

where Y ∈ R^{M×N}. The underlying idea behind this derivation is based on the connection between KPCA and classical metric multidimensional scaling [40]. The matrix D_M^{1/2} E_M^T gives the coordinates of Φ in the low-dimensional space [15]. Projection (7) can thus be rewritten as Y = D_M^{1/2} E_M^T, which is the solution to minimization problem (8) (see [39] for details).
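The coordinates D_M^{1/2} E_M^T can be sketched numerically as follows. This is an illustrative sketch, not the authors' code; the function name is ours, and a linear kernel is used only so the result can be checked against K.

```python
import numpy as np

def kpca_coordinates(K, M):
    """Return the M-dimensional KPCA embedding Y = D_M^{1/2} E_M^T
    from a (centered) Gram matrix K, so that Y^T Y approximates K."""
    vals, vecs = np.linalg.eigh(K)            # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:M]          # top-M eigenpairs
    D_M, E_M = vals[idx], vecs[:, idx]
    return np.sqrt(np.maximum(D_M, 0))[:, None] * E_M.T

# Check with a linear kernel: rank(K) <= 3, so M = 3 recovers K exactly.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
H = np.eye(50) - np.ones((50, 50)) / 50       # centering matrix
K = H @ (X @ X.T) @ H
Y = kpca_coordinates(K, 3)
print(np.allclose(Y.T @ Y, K, atol=1e-8))
```

Since Y^T Y = E_M D_M E_M^T is the best rank-M approximation of K, keeping all nonzero eigenpairs reproduces K, which is the least-squares interpretation in (8).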

C. DRAWBACK OF KPCA
One of the main drawbacks of KPCA is its sensitivity to outliers. Note that the projection of φ(x_i) onto v_k is given by v_k^T φ(x_i) = a_k^T k_i, where k_i is the ith column of K; outliers distort the direction of v_k, leading to misinterpretation of the PCs. From the formulation above, coefficient vector a_k establishes a relationship between principal axis v_k and orthonormal basis e_k, and the row index of a_k corresponding to a sample index coincides with the row index of e_k. This implies that it is possible to reduce the influence of outliers by applying a robust method to e_k. Therefore, we consider optimizing with respect to e_k instead of a_k. Because each column vector of E_M comes from E, E_M satisfies K E_M = E_M D_M according to the eigendecomposition. Thus, we have E_M E_M^T K = E_M D_M E_M^T, which is the best rank-M approximation of K. Note that E_M^T E_M = I_{M×M}. This approximation can be interpreted as a PCA problem in Euclidean space with dataset {k_i}_{i=1}^{N}:

  min_{E_M} Σ_{i=1}^{N} ||k_i − E_M E_M^T k_i||_2^2 = ||K − E_M E_M^T K||_F^2,  s.t. E_M^T E_M = I_{M×M}.   (9)

There are various methods for obtaining a sparse representation based on objective function (9). However, objective function (9) is formulated in the least-squares sense, which is not robust to outliers. A method for robustifying the estimation of E_M must be developed. In the next section, we harness an idea similar to robust sparse PCA [37] to derive the proposed method.

III. PROPOSED METHOD FOR ROBUST KPCA
Zou et al. [41] proved that minimization problem (10) is equivalent to the minimization of (9):

  min_{E_M, F_M} (1/2)||K − K E_M F_M^T||_F^2 + (β/2)||E_M||_F^2,  s.t. F_M^T F_M = I_{M×M},   (10)

where β > 0. The factor of 1/2 is included for convenience. Note that, based on the relationship between E_M and A in (6), Wang and Tanaka [42] considered the application of the ℓ2,1-norm on A to obtain a sparse representation, which means the principal axes can be represented by the same data. However, the robustness problem is not considered in their work.
Recently, Erichson et al. [37] presented a robust formulation for sparse PCA. The underlying idea is based on RPCA [27], in which a robust loss function is introduced and ℓ1-norm regularization is added to the sparse model. Inspired by this method, we note that the norm ||k_i − E_M E_M^T k_i||_2 would be affected by outliers. For this reason, we propose the following objective function:

  min_{E_M, F_M, S} (1/2)||K − K E_M F_M^T − S||_F^2 + (β/2)||E_M||_F^2 + α||E_M||_{2,1} + κ||S^T||_{2,1},  s.t. F_M^T F_M = I_{M×M}.   (11)

The optimization problem (11) can be interpreted as follows. The ℓ2,1-norm is enforced on S^T to restrict the contribution from samples whose errors are caused by outliers. After S is calculated, the expression ||K − K E_M F_M^T − S||_F^2 can be seen as a robust criterion for dampening the contamination of the outliers. Moreover, to guarantee that each principal axis can be expressed by the same data, the ℓ2,1-norm is imposed on E_M. The objective function (11) involves E_M, F_M, and S, which need to be optimized. We optimize them alternately, where each subproblem involves only one variable.

A. OPTIMIZING THE E M SUBPROBLEM
First, we use the Gauss-Seidel method [43], [44] to optimize the E_M subproblem. Fixing variables F_M and S at iteration t, and adding a proximal term containing E_M to guarantee convergence, we have the following problem for solving E_M:

  E_M^{t+1} = argmin_{E_M} L(E_M) + α||E_M||_{2,1} + (1/(2μ))||E_M − E_M^t||_F^2,   (12)

where L(E_M) = (1/2)||K − K E_M (F_M^t)^T − S^t||_F^2 + (β/2)||E_M||_F^2 denotes the first and second terms of (11). Problem (12) can be addressed by the proximal algorithm:

  E_M^{t+1} = argmin_{E_M} α||E_M||_{2,1} + (1/(2μ))||E_M − Θ(E_M^t)||_F^2,   (13)

where Θ(E_M^t) = E_M^t − μ∇L(E_M^t). Note that (1/(2μ))||E_M − Θ(E_M^t)||_F^2 = Σ_{i=1}^{N} (1/(2μ))||e_i − θ(e_i)||_2^2; thus, (13) is equivalent to minimizing

  Σ_{i=1}^{N} [ α||e_i||_2 + (1/(2μ))||e_i − θ(e_i)||_2^2 ].   (14)

The terms in (14) are separable across rows, so the E_M subproblem is equivalent to the following problem for each row:

  min_{e_i} α||e_i||_2 + (1/(2μ))||e_i − θ(e_i)||_2^2,  i = 1, . . . , N,   (15)

where e_i and θ(e_i) are the ith row vectors of E_M and Θ(E_M^t), respectively. The solution of each e_i is given by the vectorial soft-threshold operator [45]:

  e_i^{t+1} = max(1 − μα/||θ(e_i)||_2, 0) θ(e_i).   (16)

Thus, the solution of (15) can be written as E_M^{t+1} = [e_1^{t+1}; · · · ; e_N^{t+1}]. Through the vectorial soft-threshold operator, some rows of the optimal E_M in (15) may shrink to zero, which dampens the effect of the corresponding samples in the principal axes.
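The vectorial soft-threshold operator can be sketched as follows; the function name and the small numerical guard against division by zero are ours.

```python
import numpy as np

def row_soft_threshold(Theta, tau):
    """Apply the vectorial soft-threshold to each row theta_i of Theta:
    scale the row by max(1 - tau / ||theta_i||_2, 0), so rows whose
    l2-norm is below tau are set exactly to zero."""
    norms = np.linalg.norm(Theta, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * Theta

Theta = np.array([[3.0, 4.0],   # norm 5   -> scaled by (1 - 1/5)
                  [0.3, 0.4]])  # norm 0.5 -> shrunk to zero
print(row_soft_threshold(Theta, 1.0))
```

The second row illustrates how a sample with a small contribution is eliminated from the principal axes entirely rather than merely shrunk.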

B. OPTIMIZING THE F M SUBPROBLEM
Next, for fixed E_M^{t+1} and S^t, the F_M subproblem is

  min_{F_M} (1/2)||K − K E_M^{t+1} F_M^T − S^t||_F^2,  s.t. F_M^T F_M = I_{M×M}.

It can be solved as an orthogonal Procrustes problem [46]. The solution F_M^{t+1} is obtained from the singular value decomposition (SVD)

  (K − S^t)^T K E_M^{t+1} = M Σ N^T,   (17)

and we set F_M^{t+1} = M N^T.
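The orthogonal Procrustes step can be sketched as below. The exact matrix whose SVD is taken, (K − S)^T K E_M, is our reading of the update; the function name is ours.

```python
import numpy as np

def procrustes_update(K, S, E):
    """Update F with orthonormal columns maximizing tr(F^T (K - S)^T K E),
    i.e., minimizing ||K - K E F^T - S||_F^2 over F with F^T F = I."""
    U, _, Vt = np.linalg.svd((K - S).T @ (K @ E), full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(1)
K = rng.standard_normal((20, 20)); K = K @ K.T   # a Gram-like matrix
E = np.linalg.eigh(K)[1][:, -3:]                 # three eigenvectors of K
F = procrustes_update(K, np.zeros_like(K), E)
print(np.allclose(F.T @ F, np.eye(3)))           # columns are orthonormal
```

Because ||K E F^T||_F^2 = tr(F^T F E^T K^T K E) is constant under the constraint, the subproblem reduces to maximizing a trace, which the SVD solves in closed form.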

C. OPTIMIZING THE S SUBPROBLEM
Finally, for fixed E_M^{t+1} and F_M^{t+1}, the S subproblem is optimized as follows:

  min_S (1/2)||K − K E_M^{t+1} (F_M^{t+1})^T − S||_F^2 + κ||S^T||_{2,1}.   (18)

To optimize problem (18), we first optimize the equivalent problem on the transpose:

  min_{S^T} (1/2)||Z − S^T||_F^2 + κ||S^T||_{2,1},   (19)

where Z = (K − K E_M^{t+1} (F_M^{t+1})^T)^T. The solution of (S^T)^{t+1} can be obtained by applying the vectorial soft-threshold operator to each row of Z, and the optimal S^{t+1} is the transpose of (S^T)^{t+1}, namely,

  S^{t+1} = ((S^T)^{t+1})^T.   (20)

Based on this analysis, the proposed robust KPCA method is summarized in Algorithm 1.

Algorithm 1 Robust KPCA
1: Initialize E_M and F_M with the matrices obtained from KPCA, and set S to the zero matrix.
2: Set μ = 1/λ_1, where λ_1 is the largest eigenvalue in KPCA.
3: while not converged or within the preset iteration do
4:   Compute E_M^{t+1} according to (15).
5:   Compute the SVD of (17) to update F_M^{t+1}.
6:   Update S^T using (19) and obtain S^{t+1} from (20).
7: end while

The algorithm terminates when either the relative error is smaller than a preset tolerance value or a maximum number of iterations is reached. The relative error at iteration t is defined as

  |J^t − J^{t−1}| / |J^{t−1}| < tol,

where J is objective function (11) and tol is the preset tolerance value.
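Putting the three updates together, Algorithm 1 can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the step size μ, the SVD matrix in the F-step, and the exact roles of κ, α, and β are our assumptions from the text.

```python
import numpy as np

def row_soft(Z, tau):
    """Vectorial soft-threshold on the rows of Z."""
    n = np.linalg.norm(Z, axis=1, keepdims=True)
    return np.maximum(1.0 - tau / np.maximum(n, 1e-12), 0.0) * Z

def l21(Z):
    """l2,1-norm: sum of the l2-norms of the rows of Z."""
    return float(np.sum(np.linalg.norm(Z, axis=1)))

def robust_kpca(K, M, kappa=0.1, alpha=0.001, beta=0.1, iters=300, tol=1e-9):
    """Alternating minimization sketch of the l2,1-regularized objective."""
    vals, vecs = np.linalg.eigh(K)
    E = vecs[:, np.argsort(vals)[::-1][:M]]        # initialize E_M from KPCA
    F = E.copy()
    S = np.zeros_like(K)
    mu = 1.0 / (np.linalg.norm(K, 2) ** 2 + beta)  # proximal step (assumption)

    def J(E, F, S):
        R = K - K @ E @ F.T - S
        return (0.5 * np.sum(R * R) + 0.5 * beta * np.sum(E * E)
                + alpha * l21(E) + kappa * l21(S.T))

    prev = J(E, F, S)
    for _ in range(iters):
        # E-step: gradient step on the smooth part, then row soft-threshold
        G = -K.T @ (K - K @ E @ F.T - S) @ F + beta * E
        E = row_soft(E - mu * G, mu * alpha)
        # F-step: orthogonal Procrustes via SVD
        U, _, Vt = np.linalg.svd((K - S).T @ (K @ E), full_matrices=False)
        F = U @ Vt
        # S-step: row soft-threshold on the transposed residual
        S = row_soft((K - K @ E @ F.T).T, kappa).T
        cur = J(E, F, S)
        if abs(prev - cur) / max(1.0, abs(prev)) < tol:
            break
        prev = cur
    return E, F, S

# Toy usage with a linear kernel
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))
K = X @ X.T
E, F, S = robust_kpca(K, 2, iters=100)
print(E.shape, np.allclose(F.T @ F, np.eye(2)))
```

Each of the three updates either is an exact minimizer of its subproblem (F and S) or is a proximal-gradient step with step size 1/L (E), so the objective is nonincreasing over the loop.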

IV. EXPERIMENTS AND DISCUSSION
In this section, a toy example and an analysis with real data are performed to verify the effectiveness of the proposed method. The Gaussian radial basis function k(x, y) = exp(−||x − y||_2^2 / (2σ^2)) is used as the kernel function in the experiments.
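The Gaussian Gram matrix can be computed as follows; this is a standard vectorized computation, not code specific to the paper.

```python
import numpy as np

def rbf_gram(X, sigma2):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 * sigma2))."""
    sq = np.sum(X**2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-D2 / (2.0 * sigma2))

X = np.array([[0.0, 0.0], [1.0, 0.0]])
print(rbf_gram(X, 0.5))   # diagonal entries are 1; off-diagonal is exp(-1)
```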

A. PARAMETERS IN KERNEL PRINCIPAL COMPONENT ANALYSIS
Two parameters are associated with KPCA: the Gaussian kernel parameter σ and the number of PCs, M. In the toy example, σ is given in Section IV-C. For the real datasets, parameter σ^2 is estimated by the average pairwise squared Euclidean distance between the training samples, that is,

  σ^2 = (1/N^2) Σ_{i=1}^{N} Σ_{j=1}^{N} ||x_i − x_j||_2^2.

The choice of parameter σ depends on the nature of the data, and different parameters might have different levels of effectiveness in outlier detection. We complied with the following recommendation [5], [30]: parameter σ should be of the same order of magnitude as the pairwise distances between the training samples. The pairwise distance and its square are not essentially different for the purpose of choosing σ or σ^2. Therefore, for computational simplicity, we use the average pairwise squared distance to compute σ^2, which can be calculated from the Gram matrix. The number of PCs, M, is determined by the cumulative percent variance (CPV) [47]:

  CPV(M) = (Σ_{i=1}^{M} λ_i / Σ_{i=1}^{N} λ_i) × 100%,

where λ_i is the ith eigenvalue of the Gram matrix sorted in descending order. The CPV measures the percentage of the variance accounted for by the first M PCs. With this criterion, we select the desired CPV for detecting outliers. The proposed method uses the same kernel parameter σ and the same number of PCs chosen by KPCA.
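Both parameter choices can be sketched as follows; the function names are ours, and the σ^2 estimate is shown via the linear Gram matrix as described above.

```python
import numpy as np

def sigma2_avg_sq_dist(X):
    """sigma^2 as the average pairwise squared Euclidean distance,
    computed from the linear Gram matrix X X^T."""
    G = X @ X.T
    d = np.diag(G)
    return float((d[:, None] + d[None, :] - 2.0 * G).mean())

def num_pcs_by_cpv(eigvals, target):
    """Smallest M whose cumulative percent variance reaches target in (0, 1]."""
    lam = np.sort(np.asarray(eigvals))[::-1]
    cpv = np.cumsum(lam) / np.sum(lam)
    return int(np.searchsorted(cpv, target) + 1)

X = np.array([[0.0, 0.0], [1.0, 0.0]])
print(sigma2_avg_sq_dist(X))                        # (0 + 1 + 1 + 0) / 4 = 0.5
print(num_pcs_by_cpv([4.0, 3.0, 2.0, 1.0], 0.9))    # CPV: 0.4, 0.7, 0.9 -> M = 3
```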

B. OUTLIER SCORE AND THRESHOLD VALUE
To detect outliers, the outlier score [5], [48] for each sample is utilized. When KPCA is used, the outlier score can be calculated by

  s(x_i) = Σ_{j=1}^{M} (e_j^T k_i)^2 / λ_j^2.

The outlier score involves orthonormal basis e_j and deviation λ_j along the direction of e_j. To compute the outlier score that corresponds to the proposed method, we need to compute orthonormal basis ē_j and the deviation along the direction of each ē_j. First, orthogonality is sacrificed in the proposed method. Second, if the PCs are correlated, the total variance could be too optimistic. For these reasons, we apply the Gram-Schmidt orthonormalization process to E_M^t to obtain a matrix Ē = [ē_1, ē_2, . . . , ē_M] with orthonormal columns. Orthonormal basis ē_j is then the jth column vector of Ē. Next, we use Ē to replace E_M in (7); namely, the new embedding matrix is D̄_M^{-1/2} Ē^T K, where D̄_M = diag(γ_1^2, . . . , γ_M^2) and γ_j = ||K^T ē_j||_2. The intuition behind this computation is that each k_i, with the corresponding scaling, is projected onto a new orthonormal basis with a high deviation. Therefore, the outlier score corresponding to the proposed method is computed as follows:

  s̄(x_i) = Σ_{j=1}^{M} (ē_j^T k_i)^2 / γ_j^2.   (21)

After the outlier score is calculated for each x_i, outlier threshold value c is decided based on the percentile of the empirical distribution of these outlier scores.
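The computation for the proposed method can be sketched as follows. The score formula is our reading of the text, with γ_j = ||K^T ē_j||_2; the function names are ours.

```python
import numpy as np

def gram_schmidt(E):
    """Orthonormalize the columns of E (assumed linearly independent)."""
    Q = np.zeros_like(E, dtype=float)
    for j in range(E.shape[1]):
        v = E[:, j] - Q[:, :j] @ (Q[:, :j].T @ E[:, j])
        Q[:, j] = v / np.linalg.norm(v)
    return Q

def outlier_scores(K, E):
    """Squared projections of each column k_i of K onto the orthonormalized
    basis vectors, normalized by the deviations gamma_j = ||K^T e_bar_j||_2."""
    E_bar = gram_schmidt(E)
    P = K.T @ E_bar                       # P[i, j] = e_bar_j^T k_i
    gamma = np.linalg.norm(P, axis=0)     # deviation along each basis vector
    return np.sum((P / gamma) ** 2, axis=1)

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 40))
K = A @ A.T                               # a symmetric Gram-like matrix
E = rng.standard_normal((40, 3))          # a (non-orthogonal) basis estimate
s = outlier_scores(K, E)
print(s.shape, np.isclose(s.sum(), 3.0))  # normalized scores sum to M
```

Normalizing by γ_j^2 makes the per-direction contributions comparable, so the scores sum to M regardless of the scale of K.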

C. EXPERIMENT ON THE TOY DATASET
In the toy example, artificial data are generated from Gaussian distributions at three cluster centers, (−0.5, −0.2), (0, 0.6), and (0.5, 0), with a standard deviation of 0.1. Each cluster has 100 samples, and the total of 300 samples is deemed the normal data. Twenty outliers are added to the normal data, with horizontal coordinates generated from a uniform distribution over [−0.5, 0.5] and vertical coordinates from a uniform distribution over [1, 2]. A scatter plot of this dataset is shown in Fig. 1. Parameter σ^2 = 0.05 is used in KPCA. We calculate the CPV from the candidate set {50%, 60%, 70%, 80%, 90%}. The parameters used in the proposed method are κ = 0.1, α = 0.001, β = 0.1, and tol = 10^−9, and the iteration number is set to 300. The distributions of the outlier scores corresponding to KPCA and the proposed method are shown in Fig. 2, where the first row to the fifth row correspond to 50%, 60%, 70%, 80%, and 90% of the CPV, in order. In Fig. 2, the green line represents threshold c, which separates the normal data from the outliers and is determined based on the distribution of the outlier scores. In the case of KPCA, threshold value c is computed at the 6.1th percentile of the distribution of the outlier scores, except for 80% and 90% of the CPV, where c is computed at the 94th percentile. For the proposed method, threshold value c is computed at the 6.1th percentile of the outlier score distribution for the entire CPV candidate set.
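The toy dataset described above can be generated as follows; the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
centers = np.array([[-0.5, -0.2], [0.0, 0.6], [0.5, 0.0]])
# 100 Gaussian samples per cluster with standard deviation 0.1
normal = np.vstack([c + 0.1 * rng.standard_normal((100, 2)) for c in centers])
# 20 outliers: x ~ U[-0.5, 0.5], y ~ U[1, 2]
outliers = np.column_stack([rng.uniform(-0.5, 0.5, 20),
                            rng.uniform(1.0, 2.0, 20)])
X = np.vstack([normal, outliers])
labels = np.r_[np.zeros(300), np.ones(20)]   # 1 marks an outlier
print(X.shape)   # (320, 2)
```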
KPCA can detect outliers correctly at 50% and 60% of the CPV. As the CPV increases, the detection performance worsens at 70% and 80%. When 90% of the CPV is used in KPCA, the outliers can be correctly classified, but a few normal samples are wrongly predicted. However, the performance of the proposed method is superior to that of KPCA, especially with the number of PCs decided by 70%, 80%, and 90% of the CPV. The toy example shows that the distribution of outlier scores in KPCA depends on the number of PCs: different numbers of PCs used in the computation of the outlier scores lead to different distributions, which affects outlier detection. Compared to KPCA, the experimental results reveal that the proposed method reduces the influence of outliers and is more robust for outlier detection.

D. EXPERIMENTS ON REAL DATA
In this section, six real datasets from the Outlier Detection DataSets (ODDS) Library [49] are used to verify the performance of the proposed method. Information about the datasets is given in Table 1. For each dataset, all samples are used in the experiment and preprocessed to zero mean and unit variance.

1) EVALUATION METRICS
The classification result for outlier detection can be expressed in a confusion matrix, as shown in Table 2. True positive (TP) is the number of outliers correctly classified, and true negative (TN) represents normal data correctly predicted. False negative (FN) refers to outliers wrongly classified as normal, and false positive (FP) means normal data incorrectly classified as outliers. To evaluate the classification performance, we use recall = TP/(TP + FN) and precision = TP/(TP + FP). Recall is also referred to as the detection rate or the true positive rate. As outliers are regarded as rare instances in the dataset, the most reliable evaluations are based on a high detection rate and a low false positive rate; one may obtain fewer false negatives, but at the expense of more false positives. To this end, we report the area under the precision-recall curve (AUC-PR) [50]. The AUC-PR is computed with the average precision [51]. For each dataset, we first compute the outlier score under different CPV settings. After the outlier score s(x_i) is calculated for each x_i, the observation is classified as an outlier if s(x_i) exceeds threshold c, where c is determined according to the percentile of the outlier score distribution. We adjust c by changing the percentile within the outlier score range and record the AUC-PR. The maximum AUC-PR corresponding to the specified CPV is reported in the experimental results. The parameters used in the proposed method for each dataset are shown in Table 3. The value in parentheses represents the iteration number that needs to be adjusted for some CPV settings; we explain this point in Section IV-E.
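The AUC-PR can be computed via the average precision; a minimal sketch (equivalent to the usual definition, not the authors' exact evaluation code):

```python
import numpy as np

def average_precision(y_true, scores):
    """Average precision: mean of the precision values at the ranks where a
    true outlier is retrieved, scanning samples by descending score."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(y_true, dtype=float)[order]
    cum_tp = np.cumsum(y)
    precision = cum_tp / (np.arange(y.size) + 1)
    return float(np.sum(precision * y) / np.sum(y))

# A perfect ranking (both outliers scored highest) gives AP = 1.0
print(average_precision([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
```

Sweeping the threshold c over the percentiles of the score distribution traces the precision-recall curve; the average precision summarizes it in a single number.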

2) EXPERIMENTAL RESULTS
We evaluate the proposed method on the six datasets and compare it with several existing methods. Fig. 3 shows the results for the first dataset: the performance of KPCA decreases as the CPV increases, while the other methods perform better than KPCA. Fig. 4 shows the AUC-PR values for all methods at different CPVs for the Ionosphere dataset; the proposed method achieves satisfactory results at 90% of the CPV. Fig. 5, Fig. 6, and Fig. 7 show the results for the BreastW, Cardio, and Musk datasets, respectively. We see that the AUC-PR of KPCA is high only at a few CPVs; as the CPV increases, its poor performance becomes even more apparent. Note that different numbers of PCs used in the outlier score yield different distributions. The most relevant principal axes could contain all the information about the distribution of the data, and the remaining axes are considered to be associated with outliers [52]. This is why KPCA gives a high AUC-PR at a few CPVs. For the Cardio dataset, the proposed method gives low detection at 40% and 50% of the CPV but gradually attains a satisfactory performance in comparison with KPCA. Although the proposed method performs poorly with only a few PCs, a natural way to compensate for this loss is to tune the parameters to reduce the gap, because KPCA gives a better result in these cases; the proposed method is not forced to obtain a similar result with the same parameters and the same number of PCs. For the Musk dataset, the proposed method shows satisfactory detection results among the other methods. Note that there is a big gap at 80% of the CPV. This result reveals that the number of PCs plays an important role in KPCA for detecting outliers. Fig. 8 shows the detection results for the Mnist dataset, where the proposed method achieves a satisfactory AUC-PR compared to the other methods. Table 4 shows a summary of the AUC-PR performance.
The proposed method is feasible and efficient for detecting outliers, and it outperforms the other existing methods on some datasets.

E. DISCUSSION
Note that the evaluation of the proposed method is derived from the outlier score extracted from the new embedding matrix D̄_M^{-1/2} Ē^T K, and the scaling factor D̄_M^{-1/2} does not make any contribution to outlier score s̄(x_i) in (21). Therefore, we can interpret the deviation of dataset {k_i}_{i=1}^{N} along the new orthonormal basis ē_j as γ_j^2 = ē_j^T K K^T ē_j; in other words, the transformed embedding matrix K^T Ē gives the same outlier score as the new embedding matrix. Therefore, we use Ē to replace E_M in (7) to compute the outlier score. The new set of orthonormal bases consists of the columns of Ē, obtained from the ℓ2,1-regularized minimization and the Gram-Schmidt orthonormalization process in tandem. However, there are issues we still need to consider: the convergence of the proposed method and the parameters used in it. It is difficult to theoretically prove the convergence of the proposed algorithm. Instead, we follow the analysis of multilinear sparse PCA [53] to show the convergence property empirically. The objective function values and the corresponding iteration numbers are plotted for each dataset. Fig. 9 displays the behavior of the objective function value at each iteration. All plots are obtained at 70% of the CPV and with the parameters shown in Table 3. It can be observed from the figure that the algorithm monotonically decreases objective (11), and the curve for each dataset becomes flat when either the relative error is smaller than the preset tolerance value or the maximum number of iterations is reached.
Parameters κ, α, β, and tol and the number of iterations in Table 3 are set empirically. Because the number of PCs in KPCA depends on the CPV, a low number of PCs can cause every row of E_M^t to shrink to zero during the iterations of the proposed method. In this case, E_M^t is the zero matrix, and the Gram-Schmidt orthonormalization process is invalid. To avoid this, we need to reduce the number of iterations or change parameters κ, α, and β. In contrast, when more PCs are used in the algorithm, if the loop is limited to a low number of iterations, the algorithm does little more than minimize the objective function over two consecutive iterations, and no outliers or samples would be dampened by the ℓ2,1-regularization. In both cases, we experimentally vary the iteration number or parameters κ, α, and β to monitor the performance. It is difficult to use the same iteration number for all CPV cases. As the proposed method is similar to the sparse PCA model [41], it appears to be sensitive to the choice of the number of PCs and the sparsity parameters [54]. Therefore, for the Musk dataset, the numbers of iterations are 110 and 300 at 80% and 90% of the CPV, respectively. For the Mnist dataset, the iteration number is set to 600 at 90% of the CPV.

V. CONCLUSION
In this paper, we propose a new robust method for KPCA from the perspective of ℓ2,1-regularized loss minimization and apply this approach to detect outliers. Based on the fact that the coefficient vector expressed in the principal axis builds a bridge between the orthonormal eigenvectors of the Gram matrix and the principal axes, we optimize with respect to the orthonormal eigenvectors instead of the coefficients, with ℓ2,1-norm regularization in Euclidean space. Because orthogonality is not satisfied in the proposed method, the Gram-Schmidt orthonormalization process is applied for the computation of the outlier score. In the experiments, a toy example and real data are used to verify the performance. In the toy example, the proposed method gives more robust results than KPCA. For the real datasets, the results indicate that, with the help of ℓ2,1-norm regularization, the proposed method yields more satisfactory and robust performance than KPCA and other existing methods. In future work, we will investigate how the parameters influence the detection performance and aim to apply the proposed method in other fields, such as fault detection and the brain-computer interface. Furthermore, we would like to investigate the relationship between KPCA and graph-based methods [55].