Nonlinear Kernel Dictionary Learning Algorithm Based on Analysis Sparse Model

In the past decades, sparse representation models and their corresponding dictionary learning algorithms have been explored extensively, as they can be applied in various fields. However, most of them focus on the linear model, while the nonlinear one remains far less explored, although there are plenty of nonlinear scenarios in real applications. To address this unmet challenge, in this work we make the following two contributions: (i) we propose a kernel transformation based method that directly transforms the nonlinear analysis problem into a linear one, which takes exactly the standard sparse analysis form while retaining all the nonlinear information of the original problem; (ii) we present a nonlinear dictionary learning algorithm that leverages the kernel trick in a KSVD-like manner, and which has its root in the analysis sparse model rather than the synthesis model. The proposed methods are then employed to address the classification problem. Benchmark experimental results on three well-known datasets show that the algorithm proposed in (ii) outperforms some related linear algorithms and other existing nonlinear dictionary learning algorithms. Moreover, the algorithm remains effective when the data is corrupted by noise or when some pixels are missing, which confirms, to some extent, the theoretical advantages inherited from the analysis sparse model: the equal role of all atoms and the much smaller dimensionality required for signal representation. The classification accuracy of the method proposed in (i) is slightly lower than that of (ii), but better than that of other state-of-the-art methods.


I. INTRODUCTION
Sparse and redundant signal representation has been a research hotspot in recent years, and many scholars have developed strong interests in it across numerous domains [1], [2]. Among them, signal models are fundamental for handling various processing tasks, such as denoising [3], solving inverse problems [4], compression [5], interpolation [6], sampling [7], and more. In general, there are two sparse representation models, namely the synthesis sparse model and the analysis sparse model. In the synthesis sparse model, a signal y ∈ R^n is modeled as the outcome of the multiplication y = Dα, where D ∈ R^{n×K} is a dictionary whose columns are signal prototypes (atoms) that are used to build the signal [8]. As for the analysis sparse model, it relies on a linear operator (a matrix) Ω ∈ R^{p×n}, which we will refer to as the analysis dictionary, and whose rows constitute analysis atoms [8]. Note that there are both connections and differences between these two models. While the analysis model is referred to as the dual form of the synthesis model when dealing with a square (and invertible) matrix Ω, it is in fact very different when dealing with a redundant dictionary (p > n) [8]. Their key difference is that they pursue the signal's sparsity in different domains: the synthesis sparse model seeks sparsity in the space spanned by the dictionary, while the analysis sparse model pursues sparsity in the range of the sparsity transform.
Furthermore, a fundamental ingredient in the definition of both signal models and their deployment to applications is the dictionary D or Ω. Research has already demonstrated that a learned dictionary usually yields better signal reconstruction and classification performance than directly utilizing the original training samples [9], [10]. More and more dictionary learning algorithms based on the above two models have been proposed for classification tasks, such as face recognition [11], image classification [12], object recognition [13], and ear recognition [14].

A. DICTIONARY LEARNING ALGORITHM FOR SYNTHESIS SPARSE MODEL
In this part, we will describe in detail one of the most popular models over the past decades, the synthesis sparse model:

min_{α,D} ||y − Dα||_2^2 + ||α||_0   (1)

where α ∈ R^K is the sparse representation vector, ||α||_0 is the l_0 sparsity measure that counts the number of nonzero elements in the vector α, and ||y − Dα||_2^2 is the mean squared error resulting from the sparse approximation. However, the l_0 norm problem is NP-hard, and it usually needs to be transformed into another problem, such as an l_1 norm problem, to approximate its solution. The optimization problem (1) can then be transformed into the following problem:

min_{α,D} ||y − Dα||_2^2 + ||α||_1   (2)

It is worth noting that the selection of the dictionary D in models (1) and (2) is crucial in the synthesis sparse model. There are mainly two approaches for its selection: a predefined dictionary and the dictionary learning method. Indeed, it has been observed that learning a dictionary directly from the training data, rather than using a predetermined dictionary (e.g. wavelets), usually leads to a more compact representation and hence can provide better results in many image processing applications, such as recovery [1] and classification [15], [16]. Therefore, over the past decades, researchers have developed many effective dictionary learning methods both for the original problem (1) and its relaxation (2), such as the KSVD algorithm [14], the online dictionary learning algorithm [17], the sparse representation-based classifier (SRC) algorithm [18], and so on. Among them, KSVD is the most representative algorithm; it performs singular value decomposition (SVD) on K submatrices and is a generalization of the K-means clustering process. It is worth noting that all the above methods are aimed at linear signal processing, but in practice the linear setting is often too simple and idealized. Therefore, the nonlinear signal processing problem has become a hot research topic recently.
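To make the synthesis model in (1) concrete, the following minimal, numpy-only sketch applies a standard greedy solver for this model, Orthogonal Matching Pursuit; the function name and toy data are our own illustration, not part of the algorithms proposed in this work:

```python
import numpy as np

def omp(D, y, sparsity):
    """Orthogonal Matching Pursuit: greedily approximate
    min_a ||y - D a||_2^2  s.t.  ||a||_0 <= sparsity,
    for a dictionary D with unit-norm columns."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(sparsity):
        # pick the atom most correlated with the current residual
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k not in support:
            support.append(k)
        # least-squares refit of y on the atoms selected so far
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    alpha = np.zeros(D.shape[1])
    alpha[support] = coef
    return alpha

# toy check: a signal synthesized from two atoms
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)
a_true = np.zeros(50)
a_true[[3, 17]] = [1.5, -2.0]
y = D @ a_true
a_hat = omp(D, y, sparsity=2)
```

The returned code a_hat is at most 2-sparse and its synthesis D @ a_hat approximates y, which is exactly the sense of sparsity pursued by the synthesis model.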
Generally speaking, its basic optimization problem is as follows:

min_{α,D} ||Φ(y) − Dα||_2^2 + ||α||_0   (3)

where Φ is a nonlinear mapping. In recent years, many methods have been proposed to deal with the nonlinear signal problem (3), such as the manual nonlinear transformation method [19], the neural network approach [20], the kernel method [21]-[24], and so on. Among them, the kernel method is an important one because it can capture the nonlinearity in the data very well; for example, Hien et al. proposed a new method based on kernelizing the MOD and KSVD algorithms, which not only finds the sparse representation in the feature space, but also learns the dictionary there [22]. Moreover, just like model (2), model (3) can also be relaxed to

min_{α,D} ||Φ(y) − Dα||_2^2 + ||α||_1   (4)

and many kernel-based methods have been proposed to solve it [25], [26]. After many years of extensive study, the synthesis sparse model has become a mature and stable field, with clear theoretical foundations and appealing applications. Coexisting with the synthesis sparse model, there is the analysis sparse model [27], which, despite its similarity to the synthesis alternative, is markedly different. For example, in the analysis sparse model, all atoms take an equal part in describing the signal, thus minimizing the dependence on each individual one and stabilizing the recovery process. Certainly, as Elad et al. pointed out, although these two models are not exactly equivalent, it is hard to say which model is better, since the value of a model depends heavily on the actual problem [27]. Therefore there is a growing interest in the analysis sparse model; as we gain more understanding of and insight into its viewpoint, many works have already concentrated on this model [28], [29].

B. DICTIONARY LEARNING ALGORITHM FOR ANALYSIS SPARSE MODEL
The basic optimization problem of the analysis sparse model is as follows:

min_{x,Ω} ||y − x||_2^2 + ||Ωx||_0   (5)

Here x ∈ R^n is the reconstructed signal. The objective is to find a suitable operator Ω so that the analysis coefficient vector Ωx is sparse. Note that it is precisely when the synthesis and analysis sparse models depart from each other that the analysis model becomes more interesting and powerful; this case of analysis dictionary training is a challenging problem [8]. In the past few years, model (5), as a linear case, has been studied extensively [8], [30], [31]. For example, the analysis KSVD algorithm [8] aims to learn an analysis dictionary, and it parallels the synthesis KSVD in its rationale and structure.
Just like the above synthesis sparse model, the optimization problem (5) can also be relaxed to the following form:

min_{x,Ω} ||y − x||_2^2 + ||Ωx||_1   (6)

Many scholars have also given results on this kind of relaxation [32]-[34]. However, most of the above-mentioned works on the analysis sparse model focus on dictionary learning methods for the linear case. Compared with the linear case, there is much less work discussing nonlinear analysis-based dictionary learning methods, although the nonlinear case is more common in real problems. Therefore, in this work, we combine the nonlinear kernel approach with the analysis sparse model, and propose two effective methods. The first one is a kernel transformation method, which transforms the nonlinear analysis model into a linear form that can then be solved directly by linear dictionary learning algorithms. The second is a nonlinear dictionary learning algorithm, which can effectively obtain the analysis dictionary from given datasets by utilizing the sparsity of the data in a high-dimensional feature space. Moreover, the validity of both methods is demonstrated in a series of experiments on several datasets.
The manuscript is organized as follows. In Section II, we will introduce the kernel transformation method. Section III presents the two-stage method which comprises an analysis pursuit step using the kernel backward-greedy algorithm, and a dictionary update step based on the analysis kernel KSVD. Section IV gives many experimental results to illustrate the effectiveness of our approaches. The final Section concludes this work.

II. KERNEL SPARSE REPRESENTATION TRANSFORMATION METHOD
In this section, we will propose the first work -a kernel transformation method (KTM) that can directly transform a nonlinear problem based on an analysis model into a linear one, but which implies all nonlinear information of the original problem.
Firstly, we introduce a nonlinear mapping Φ(·), which maps the data from the original space R^n into a high-dimensional Hilbert feature space H of dimension ñ, where ñ is generally much larger than n and could even be infinite. The nonlinear-analysis-based sparse representation in the feature space H can then be written as

min_{x,Ω} ||Φ(y) − Φ(x)||_2^2 + ||Ω^T Φ(x)||_0   (7)

where Φ(y) ∈ R^ñ is the given signal, Φ(x) ∈ R^ñ is the reconstructed signal corresponding to the sample Φ(y), and Ω^T ∈ R^{K×ñ} is the sought analysis dictionary. Here we define the analysis dictionary as Ω^T instead of the usual Ω for the convenience of the following calculations.
Next, we will use the kernel technique, so we briefly introduce the kernel function first. A kernel function κ(x, y) satisfies Mercer's condition [35]: for all data, the function gives rise to a positive semidefinite matrix K_ij = κ(x_i, y_j). The kernel function in the Hilbert feature space H can then be expressed as κ(x, y) = Φ(x)^T Φ(y). Kernel functions are often used to implicitly specify the mapping Φ; this helps to avoid the huge computations required to map the data into the high-dimensional feature space. Generally speaking, some commonly used kernels include Gaussian kernels κ(x, y) = exp(−||x − y||^2 / c) and polynomial kernels κ(x, y) = (⟨x, y⟩ + c)^d, where c and d are parameters.
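As an illustration of the kernels above, the following sketch (the helper names are our own) evaluates the polynomial and Gaussian kernels on a batch of column-stacked samples and numerically checks the Mercer property, i.e., that the Gram matrix is symmetric positive semidefinite:

```python
import numpy as np

def polynomial_kernel(X, Y, c=1.0, d=2):
    """kappa(x, y) = (<x, y> + c)^d for all column pairs of X and Y."""
    return (X.T @ Y + c) ** d

def gaussian_kernel(X, Y, c=1.0):
    """kappa(x, y) = exp(-||x - y||^2 / c) for all column pairs."""
    sq = (np.sum(X**2, axis=0)[:, None]
          + np.sum(Y**2, axis=0)[None, :]
          - 2.0 * X.T @ Y)          # pairwise squared distances
    return np.exp(-sq / c)

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 8))          # 8 samples in R^5, one per column
K = polynomial_kernel(X, X, c=1.0, d=2)  # Gram matrix K_ij = kappa(x_i, x_j)
eigvals = np.linalg.eigvalsh(K)          # Mercer => all eigenvalues >= 0
```

Computing K directly from the samples in this way replaces any explicit construction of Φ(X), which for the Gaussian kernel would be infinite-dimensional.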
Using the above kernel technique, all inner products in (7) can be expressed through the kernel matrix K = Φ(Y)^T Φ(Y). Given the SVD of this kernel matrix, the mapped samples admit an equivalent finite-dimensional representation, and the problem can then be transformed into a standard linear optimization problem of exactly the analysis sparse form, while still carrying all the nonlinear information of the original problem. As a consequence, we can use a linear dictionary learning algorithm, such as analysis KSVD, to solve the original problem.
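One plausible instantiation of this transformation, sketched here under our own naming and not taken verbatim from the paper, builds finite-dimensional surrogate samples from the eigendecomposition of the kernel matrix, so that their ordinary (linear) inner products reproduce the kernel values exactly:

```python
import numpy as np

def kernel_transform(K, tol=1e-10):
    """Given only the N x N Gram matrix K[i, j] = <phi(y_i), phi(y_j)>,
    return explicit finite-dimensional vectors (one per column) whose
    ordinary inner products reproduce K, via K = U diag(lam) U^T."""
    lam, U = np.linalg.eigh(K)       # eigenvalues in ascending order
    keep = lam > tol                 # drop numerically-zero directions
    return np.sqrt(lam[keep])[:, None] * U[:, keep].T   # shape (r, N)

rng = np.random.default_rng(2)
Y = rng.standard_normal((4, 6))      # 6 samples in R^4
K = (Y.T @ Y + 1.0) ** 2             # degree-2 polynomial kernel matrix
Y_tilde = kernel_transform(K)        # linear surrogates of the mapped data
```

After this step, a linear analysis dictionary learning algorithm can operate on the columns of Y_tilde while implicitly working in the nonlinear feature space.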

III. KERNEL DICTIONARY LEARNING ALGORITHM
In this section, we are no longer limited to just giving a transformation method, but give a complete nonlinear dictionary learning algorithm. We will first define and formulate the kernel dictionary learning problem. And then we present the kernel backward-greedy algorithm for analysis pursuit in the feature space. Finally, we will propose the analysis kernel KSVD algorithm for learning the analysis dictionary in detail.

A. PROBLEM FORMULATION
The definitions of the nonlinear mapping and kernel functions were introduced in the previous section and will not be repeated here. We therefore directly give the sparse representation based on the nonlinear analysis model in the space H:

min_{X,Ω} ||Φ(Y) − X||_F^2 + Σ_{i=1}^N ||Ωx_i||_0   (11)

where Φ(Y) = [Φ(y_1), . . . , Φ(y_N)], X ∈ R^{ñ×N} is a matrix whose i-th column is the vector x_i corresponding to the sample Φ(y_i), and Ω ∈ R^{K×ñ} is the sought analysis dictionary. Since the dimension of the feature space H may be infinite, traditional optimization methods (such as MOD or KSVD) are not feasible for this kind of problem. In order to solve it, we first set X = DZ, where D ∈ R^{ñ×K} and Z ∈ R^{K×N} is the coefficient matrix that we want to solve for. The following proposition then enables the reformulation of (11).
Proposition 1: There exists an optimal solution D* to (11) that has the form D* = Φ(Y)A for some A ∈ R^{N×K}.
Proof: Using a standard orthogonal decomposition, we can write D* = D_∥ + D_⊥, where D_∥ = Φ(Y)A lies in the column span of Φ(Y) and D_⊥ satisfies Φ(Y)^T D_⊥ = 0. Substituting this decomposition into the cost function, the term involving D_⊥ can be expressed as tr(Z^T D_⊥^T D_⊥ Z) ≥ 0, where the inequality holds because D_⊥^T D_⊥ is a positive semidefinite matrix. Therefore, for the cost function to be optimal, this second term can only be 0. In short, D_⊥ = 0 can be assumed, and D* = D_∥ = Φ(Y)A is an optimal solution.
As a result, we can substitute X = Φ(Y)AZ, and the problem (11) can be rewritten as follows:

min_{Z,Ω} ||Φ(Y) − Φ(Y)AZ||_F^2 + Σ_{i=1}^N ||ΩΦ(Y)Az_i||_0   (16)

Here, we define A as a sparse matrix [36]. Note that the optimization problem (16) will be minimized through the use of Mercer kernels. To introduce the kernel function more smoothly, we carry out some algebraic manipulations on the objective function: the reconstruction term can be rewritten as

||Φ(Y) − Φ(Y)AZ||_F^2 = tr((I − AZ)^T K(Y, Y)(I − AZ))

where K(Y, Y) = Φ(Y)^T Φ(Y) is the kernel matrix. Denote Θ = ΩΦ(Y)A, so that the sparsity term becomes Σ_i ||Θz_i||_0. Note that in the space H, employing the kernel trick avoids explicitly mapping the data to the high-dimensional feature space, which would require a great deal of computation. Therefore, it is more efficient to optimize the problem (16), since it only involves finite-dimensional kernel matrices, avoiding the infinite-dimensional matrices that might appear in the original formula (11). Specifically, we convert the problem of learning the analysis dictionary Ω into learning its alternate Θ. In the following, we will focus on the learning of Θ instead of Ω.

B. KERNEL BACKWARD-GREEDY ALGORITHM (KBG Algorithm)
In this stage, the task called analysis sparse coding or analysis pursuit will be considered, i.e., seeking the coefficient matrix Z while fixing Θ. The formula (16) can be rewritten as a sum over the training samples, so it can be solved through the following N independent problems:

min_{z_i} ||Φ(y_i) − Φ(Y)Az_i||_2^2 + ||Θz_i||_0, i = 1, . . . , N   (20)

Next we will incorporate the kernel trick into the backward-greedy algorithm to solve the above problem (20). Note that in the following we will omit the subscript i for simplification.
In the analysis sparse model, we mainly focus on the number of zeros of Θz, which is defined as the co-sparsity; meanwhile, we define the co-support Λ of z as the set of ℓ = |Λ| rows of Θ that are orthogonal to it. In other words, Θ_Λ z = 0, where Θ_Λ is a sub-matrix of Θ that only contains the rows indexed by Λ. For a given dictionary Θ, we define the co-rank of z as the rank of Θ_Λ, which equals K − r, where r (r ≪ K) represents the dimension of the subspace that z belongs to. Therefore, the problem that we need to solve is either

min_z ||Φ(y) − Φ(Y)Az||_2^2  s.t.  rank(Θ_Λ) ≥ K − r   (22)

or

min_z ||Θz||_0  s.t.  ||Φ(y) − Φ(Y)Az||_2^2 ≤ ε   (23)

Here, ε represents a pre-determined error threshold. Note that if there is a correct correspondence between r and ε, the two problems (22) and (23) are equivalent, and the choice between them depends on the available information about the process generating Φ(y). We refer to both of them as analysis sparse-coding or analysis pursuit problems.
In an oracle setup, the true co-support Λ is known, and ẑ can be obtained by solving the least-squares problem under the constraint Θ_Λ ẑ = 0; the constraint is enforced by Q_Λ = I − Θ_Λ^† Θ_Λ, a projection matrix onto an r-dimensional space. The proposed KBG algorithm parallels the synthesis greedy pursuit algorithms [37]; however, the kernel trick is introduced to handle the infinite-dimensional problems that the synthesis counterparts cannot deal with. The pseudo-code for the KBG algorithm is given in the following. In practice, the proposed algorithm can be implemented efficiently by accumulating an orthogonalized set of the co-support rows: once the index k_j has been found and the row θ_{k_j}^T is about to join the co-support, it is first orthogonalized against the rows already selected.
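For intuition, a simplified linear analogue of this pursuit (explicit inner products in place of kernel evaluations; function name and toy data are our own) that greedily grows the co-support can be sketched as:

```python
import numpy as np

def backward_greedy(Omega, y, cosparsity):
    """Simplified linear analysis pursuit: greedily grow the co-support
    Lambda by picking, at each step, the analysis row with the smallest
    absolute response on the current estimate, then re-project y onto the
    subspace orthogonal to all selected rows.  The KBG algorithm replaces
    these explicit inner products with kernel evaluations."""
    x = y.copy()
    Lambda = []
    for _ in range(cosparsity):
        resp = np.abs(Omega @ x)
        resp[Lambda] = np.inf                    # skip rows already chosen
        Lambda.append(int(np.argmin(resp)))
        O = Omega[Lambda]                        # co-support sub-dictionary
        x = y - np.linalg.pinv(O) @ (O @ y)      # project onto null(O)
    return x, sorted(Lambda)

rng = np.random.default_rng(3)
Omega = rng.standard_normal((12, 8))
y = rng.standard_normal(8)
x_hat, Lambda = backward_greedy(Omega, y, cosparsity=3)
```

By construction, the returned estimate is exactly orthogonal to every selected analysis row, i.e., the selected rows form a valid co-support for x_hat.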

C. THE ANALYSIS KERNEL KSVD ALGORITHM
In this part, we will give the main algorithm -the analysis kernel KSVD algorithm, which include the updates of coefficient matrix Z and the analysis dictionary . Since we have introduced KBG algorithm for updating Z, now we will turn to the update of in the following.
Algorithm 1 KBG Algorithm
Input: signal y, a set of signals Y, analysis dictionary Θ, coefficient matrix A, target co-rank K − r, and kernel function κ.

Firstly, we introduce some notation. Suppose Z is known in advance. We will first learn θ_p, the p-th row of Θ, whose update should only be affected by the columns of Z that are orthogonal to it. Let P denote the indices of the columns of Z orthogonal to θ_p, so that Z_P represents the sub-matrix of Z containing those columns. Meanwhile, Θ_i is used to denote the sub-matrix of Θ containing the rows of Θ that z_i is currently orthogonal to, excluding the row θ_p. The update step of θ_p can be given as follows:

min_{θ_p} ||θ_p^T Z_P||_2^2  s.t.  ||θ_p||_2 = 1   (26)

Here the normalization constraint on θ_p aims to avoid degeneracy, but has no practical influence on the results. Since Z is known in advance, we approximate the above optimization problem by solving it with Z_P held fixed. Its solution is the singular vector corresponding to the smallest singular value of Z_P, which can be efficiently computed from the SVD of Z_P or using inverse-power methods. One advantage of this specific approximation is that it decouples the updates of the rows of Θ, enabling all rows to be updated in parallel. Another desirable feature of the resulting algorithm is that it adopts a structure similar to the KSVD algorithm, except that the smallest singular value is used instead of the largest. The proposed analysis kernel KSVD is described in the following.
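The row-update step just described, extracting the singular vector of the smallest singular value, can be sketched in a linear toy version as follows (function name and toy data are our own):

```python
import numpy as np

def update_row(Z_P):
    """Dictionary-row update: solve min_theta ||theta^T Z_P||_2^2 subject to
    ||theta||_2 = 1.  The minimizer is the left singular vector of Z_P
    associated with its smallest singular value."""
    U, s, _ = np.linalg.svd(Z_P)     # singular values sorted descending
    return U[:, -1]                  # last column pairs with the smallest one

rng = np.random.default_rng(4)
Z_P = rng.standard_normal((6, 30))   # columns deemed orthogonal to the row
theta = update_row(Z_P)
```

Because each row's update depends only on its own Z_P, all rows can be updated independently, which is the parallelism noted above.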

IV. EXPERIMENTS
In this section, we will demonstrate the classification performance of the two algorithms proposed above on some common datasets; the reconstruction error will be taken as our classification index. The overall procedure of the second method iterates two stages:

Algorithm 2 Analysis Kernel KSVD Algorithm
Repeat for J = 1, 2, . . .:
Stage 1: Analysis Pursuit. Use the KBG Algorithm to obtain the coefficient matrix Z_J while fixing Θ^(J−1).
Stage 2: Dictionary Update. Update the rows of Θ to obtain Θ^(J).

• KTM + Analysis KSVD: We first use KTM to preprocess the training and testing data, and then the existing analysis KSVD algorithm is applied for dictionary learning. In this method, the reconstruction error is

r_s = ||ỹ − x̃_s||_2^2, ∀s = 1, . . . , m   (29)

where ỹ is the transformed test sample and x̃_s is its reconstruction under the s-th class dictionary.
• Analysis Kernel KSVD: The algorithm adopts a two-stage iterative method, comprising an analysis pursuit stage and a dictionary learning stage. The reconstruction error of this algorithm is

r_s = ||Φ(y) − Φ(Y_s)A_s z_s||_2^2, ∀s = 1, . . . , m   (30)

where m is the number of classes. Note that the smallest reconstruction error determines which class the data belongs to. We will present a set of experimental results based on the proposed methods, which verify the core competence of the proposed methods and reveal the potential of combining the analysis sparse model with dictionary learning. Specifically, we will first show the classification results on the USPS digit dataset, including results under missing data and noise, and then present classification results on the MNIST dataset and the Extended YaleB (E-YaleB) dataset to validate the efficiency of the proposed methods.
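The minimum-reconstruction-error classification rule shared by both methods can be sketched generically; the per-class models below are hypothetical stand-ins (projections onto random subspaces), not the learned dictionaries:

```python
import numpy as np

def classify(y, models):
    """Assign y to the class whose model reconstructs it with the
    smallest squared error, following the minimum-error rule."""
    errors = [np.linalg.norm(y - m(y)) ** 2 for m in models]
    return int(np.argmin(errors))

# hypothetical per-class "models": orthogonal projectors onto random subspaces
rng = np.random.default_rng(5)
bases = [np.linalg.qr(rng.standard_normal((10, 3)))[0] for _ in range(3)]
models = [lambda v, B=B: B @ (B.T @ v) for B in bases]
y = bases[1] @ rng.standard_normal(3)   # a signal lying in class 1's subspace
label = classify(y, models)
```

A test sample lying in a given class's subspace is reconstructed with zero error by that class's projector, so the rule returns the correct label.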

A. USPS DATASET
The USPS dataset [38] contains 10 classes of 256-dimensional handwritten digits, i.e. n = 256. For each class, we randomly select N_train = 500 samples for training and N_test = 200 samples for testing. We need to choose appropriate parameters for the different algorithms to learn the dictionaries of all classes. The co-sparsity r of the analysis kernel KSVD algorithm is 5, and for KTM, r = 7. The selection of the other parameters is the same: each class dictionary is learned with K = 200 dictionary atoms, the maximum number of training iterations is set to 20, and the degree of the polynomial kernel is 2. The reconstruction errors are calculated by (29) and (30), respectively. In order to visualize the learned kernel dictionary atoms, we find images in the input space that best approximate these atoms in terms of mean squared error. Fig.1 shows the pre-images of the analysis kernel KSVD dictionary atoms for the 10 digits. Next, we verify the classification performance of the proposed methods. In the first set of experiments, we give the results for the situation where different percentages of image pixels are randomly removed from the test samples (replaced by zero values), as shown in Fig.2(a). Similarly, Fig.2(b) shows the results of adding random Gaussian noise with different standard deviations to the test samples. In these two simulations, the classification accuracy obtained by the analysis kernel KSVD algorithm proposed in this work is always better than that of the other linear algorithms; moreover, the gap between our method and the other linear algorithms grows as the degree of distortion increases. We can also see that the trend of our results is similar to that of kernel KSVD, and the classification accuracy is even slightly better than kernel KSVD. This shows that the analysis kernel KSVD algorithm has good robustness.
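The two test-time distortions used here, random pixel removal and additive Gaussian noise, can be reproduced with a small helper (our own function name, applied to synthetic 256-dimensional vectors rather than the actual USPS data):

```python
import numpy as np

def corrupt(images, missing_frac=0.0, noise_std=0.0, rng=None):
    """Apply the two test-time distortions from the USPS experiments:
    zero out a random fraction of pixels and/or add Gaussian noise."""
    rng = rng if rng is not None else np.random.default_rng()
    out = images.astype(float).copy()
    if missing_frac > 0:
        mask = rng.random(out.shape) < missing_frac   # pixels to drop
        out[mask] = 0.0
    if noise_std > 0:
        out = out + rng.normal(0.0, noise_std, out.shape)
    return out

rng = np.random.default_rng(6)
batch = rng.random((200, 256))                  # synthetic 256-d test vectors
half_missing = corrupt(batch, missing_frac=0.5, rng=rng)
noisy = corrupt(batch, noise_std=0.1, rng=rng)
```

Sweeping missing_frac or noise_std over a grid and re-running the classifiers yields robustness curves of the kind plotted in Fig.2.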
For KTM, although the results are much better than the traditional KSVD algorithm in the case of missing pixels, there is still a certain gap compared with the other nonlinear kernel methods. However, in the case of noise interference, the classification results are poor. This shows that the robustness of this method is limited and that there are some restrictions on its practical application.
Note that in these experiments, some important parameters, such as the polynomial degree, the dictionary size, and the co-sparsity, should be selected properly. Therefore, in the second set of experiments we further investigate the influence of parameter selection on the overall recognition performance, thus explaining why the above parameter values were chosen. Fig.3 shows the classification accuracy of KTM and the analysis kernel KSVD algorithm without interference as we vary the polynomial degree. Since the classification accuracy of KTM is poor at degree = 1, only 21.5%, this result is not shown in Fig.3, which starts from degree = 2. From it, we can see that, without external interference, the classification accuracies on the USPS dataset change with the polynomial degree: when the degree is 2, the classification accuracies reach their highest value, and then decrease and fluctuate as the polynomial degree increases further.
Next, we further study the effect of the polynomial degree on classification accuracy under external interference. Due to the poor robustness of KTM, its classification accuracies under interference are not satisfactory, so we only use the analysis kernel KSVD algorithm in these experiments. Fig.4 shows the specific results. The resulting trend of the curves is basically the same as that in Fig.3. Therefore, we fix the polynomial degree at 2 on the USPS dataset.
After determining the polynomial degree, we perform the same experiments for the dictionary size and co-sparsity selections. Fig.5 shows the performance of the analysis kernel KSVD algorithm when the dictionary size varies within the range of 50 to 400. The sample dimension of the USPS dataset is n = 256. If the number of dictionary atoms K > n, then the dictionary is not redundant, which will degrade the classification accuracy. So for KTM, we limit the number of atoms in the dictionary to 50-250. Fig.6 shows the specific influence of the dictionary size on the classification accuracy of the analysis kernel KSVD algorithm under interference.
From it, we can see that when the number of atoms in the dictionary reaches 200, the classification accuracy of KTM reaches its highest level, so we choose 200 as the optimal number of atoms. For the analysis kernel KSVD algorithm, when the number of dictionary atoms reaches 200, the classification accuracy becomes relatively stable, and it starts to drop slightly after the number of atoms exceeds 300. Since increasing the number of dictionary atoms increases the computational burden of the algorithm, we select 200 as the optimal number of dictionary atoms, balancing classification accuracy against running time. From Fig.5 we can also see that when the dictionary size is 200, the best error rate obtained by the analysis kernel KSVD algorithm is 2.2%, and the error rate of KTM is 3.2%, both of which are satisfactory results.
In addition, in Fig.7, we compare the performance of the two algorithms proposed in this paper as the co-sparsity varies from 3 to 15. For analysis kernel KSVD, the co-sparsity has no significant influence on classification accuracy: when the co-sparsity is 5, the classification accuracy reaches its maximum value, and after that, despite small fluctuations, the overall trend is decreasing. For KTM, however, the influence of the co-sparsity on classification accuracy is more pronounced: the classification accuracy is highest when the co-sparsity is 7, and is much higher there than in the other cases.
Furthermore, Fig.8 shows the influence of the co-sparsity on classification accuracy under external interference. In the cases of large external interference, such as 70% of pixels missing, the influence of the co-sparsity on classification accuracy fluctuates greatly, but the overall trend is still similar to Fig.7, so we still set the co-sparsity of the analysis kernel KSVD algorithm to 5 in the experiments of this paper.
In the following, to demonstrate the performance of the proposed methods, we compare them with some state-of-the-art methods, namely SRC, FDDL [39], KSVD, LC-KSVD [40], kernel PCA, kernel KSVD [22], LKDL [41], and DLE [42]. We list the results in Table 1. From this table, we can see that the classification accuracy of the analysis kernel KSVD algorithm is up to 97.8%, which is far superior to all other methods. The classification accuracy of KTM is 96.8%, second only to the analysis kernel KSVD algorithm.
In addition, in order to explain the efficiency of KTM, we further calculate the classification results on the USPS dataset using the analysis KSVD algorithm alone, whose classification accuracy is only 62.0%. Compared with the 96.8% obtained by combining KTM with analysis KSVD, this accuracy is very low, which indicates that KTM is an effective preprocessing transformation for improving classification accuracy.

B. MNIST DATASET
To further demonstrate the performance of the proposed methods, in this part we conduct experiments on the MNIST dataset, which is also a handwritten-digit dataset, but larger and more complex than USPS. The MNIST dataset is composed of 28 × 28 images of the 10 handwritten digits; there are 60,000 training images and 10,000 testing images in this benchmark. Just like the experiments on USPS, the digits are stacked into vectors of dimension n = 784. Fig.9 shows sample images from MNIST.
Note that in these experiments, we also use (29) and (30) to calculate the reconstruction errors. From the experimental results on the USPS dataset, we know that the polynomial degree d and the co-sparsity r have less influence on the classification accuracy, while the dictionary size has a greater influence. Therefore, in the MNIST experiments, we keep the values of d and r from the USPS experiments, and carry out further experiments on the selection of the dictionary size. Fig.10 shows the classification results of the two methods when the dictionary size varies from 200 to 400. Based on these results, we select the dictionary atom number K = 300 for the analysis kernel KSVD algorithm. For KTM, the classification accuracy gradually flattens out as the number of dictionary atoms increases; since increasing K greatly increases the running time, we choose K = 400.
Furthermore, Table 2 compares the recognition accuracy of our algorithms with other common methods and the latest methods. We find that even for a large handwritten digit dataset, the analysis kernel KSVD algorithm still improves the classification accuracy considerably. However, the result of KTM is less encouraging, and its classification result is not outstanding among all the algorithms. We speculate that the reason may be that the error of a dataset with large dimension increases after the kernel transformation.
Besides, we also perform the same classification experiment on the MNIST dataset as on the USPS dataset, in which the classification accuracy of analysis KSVD alone is still very low, only 54.8%, which again demonstrates the effectiveness of the KTM method.

C. E-YaleB DATASET
In this part, we will use the E-YaleB dataset to further verify the proposed methods. The E-YaleB dataset contains 2,414 frontal face images of 38 subjects [43] captured under various laboratory-controlled lighting conditions; there are about 64 images for each person. The original images are cropped to 192 × 168 pixels. This dataset is challenging due to the varying illumination conditions and expressions. Fig.11 shows some sample images from E-YaleB. Following the common setting, random features from a 504-dimensional input domain are used to evaluate the algorithms [44]. For this dataset, half of the samples per class are randomly picked as training samples while the rest are used as testing samples. We still use the reconstruction errors (29) and (30) as the classification criteria in this experiment. For analysis kernel KSVD, we let the co-sparsity r = 5, the number of dictionary atoms K = 250, and the polynomial degree d = 2. For KTM, we set the co-sparsity r = 7, the number of dictionary atoms K = 200, and the polynomial degree d = 2. The classification accuracies of all methods are summarized in Table 3. Besides the algorithms already mentioned in the above experiments, we also add five state-of-the-art algorithms for comparison: EasyDL [45], MCDL [46], IC-DDL [47], SADL [48], and K-LSDSR [49]. We find that the analysis kernel KSVD algorithm achieves 96.8% classification accuracy, which shows that our method is competitive among all compared methods. Meanwhile, the result of KTM is as high as 98.3%, which exceeds all the other algorithms, indicating that although this method is not very robust, it still has some merits, such as excellent classification accuracy on E-YaleB. Just like on the two datasets above, we also use the analysis KSVD algorithm alone for the classification experiment on the E-YaleB dataset, and its classification accuracy is 97.7%.
Compared with the 98.3% obtained using KTM, this accuracy is still slightly lower, although it is already very high. Compared with the USPS and MNIST datasets, the performance of the analysis KSVD algorithm on E-YaleB improves greatly, which may be due to an advantage of the analysis-based method, whose operator captures a structure very common in image processing.

V. CONCLUSION
In this work, we first propose a kernel sparse representation transformation method (KTM), which tactfully transforms a nonlinear problem into a linear one. After this transformation, the original nonlinear problem takes an optimization form similar to the standard analysis sparse problem. Therefore, linear dictionary learning algorithms, such as the analysis KSVD algorithm, can be applied to solve the original problem. It is worth noting that the variables after KTM still carry all the nonlinear information of the original problem rather than merely linear information; this may be one of the reasons why the proposed algorithm is more effective than the original analysis KSVD algorithm. Moreover, experimental results also indicate that exploiting nonlinear sparsity by learning dictionaries in a nonlinear feature space can provide better discrimination than the corresponding linear dictionary learning methods.
However, this method has some limitations; for example, it lacks robustness when classifying noisy datasets. Therefore, another nonlinear dictionary learning algorithm is introduced, which is our main contribution, i.e., the analysis kernel KSVD. More specifically, the proposed algorithm adopts a two-stage iterative approach for obtaining the analysis dictionary from a given dataset in a KSVD-like manner. However, due to the different concern of the initial model, the analysis sparse model, the aim of the proposed method is to find a suitable dictionary so that the analysis coefficients are sparse, which differs essentially from the related synthesis-based methods. Furthermore, the presented algorithm is also parallel to the analysis KSVD in its rationale and structure; for example, both of them learn an l_0 analysis dictionary. However, its root lies in nonlinear, high-dimensional data processing rather than in the linear problem. In more detail, we first incorporate appropriate kernel functions into the kernel backward-greedy algorithm for analysis pursuit and into the analysis dictionary learning stage, and then design the final analysis kernel KSVD algorithm to meet the challenges posed by nonlinear and high-dimensional data. It is worth noting that the introduction of kernel tricks alleviates the curse of dimensionality and provides higher adaptability for solving nonlinear signal problems than the original analysis KSVD algorithm. Benchmark experimental results on three well-known datasets, USPS, MNIST and E-YaleB, show that the proposed method outperforms some related linear algorithms and other existing nonlinear dictionary learning algorithms. Moreover, its robustness is further demonstrated when the data is corrupted by noise or when some pixels are missing, which makes up for the lack of robustness in the kernel transformation method.
Therefore, the superiority of the proposed nonlinear kernel dictionary learning algorithms on these datasets validates their effectiveness. Meanwhile, this superiority also verifies, to some extent, their theoretical advantages owing to the analysis sparse model's merits of the equality of all atoms and the much smaller dimensionality required for signal representation.

ACKNOWLEDGMENT
We thank all the referees and the editorial board members for their insightful comments and suggestions, which improved this article significantly.