Linear Representation-Based Methods for Image Classification: A Survey

In recent years, linear representation-based methods have been widely researched and applied in the image classification field. Generally speaking, linear representation-based classification (LRC) algorithms consist of three steps. The first step is coding, which represents the test sample as a linear combination of all training samples. The second step is subspace approximation, in which the residuals between the test sample and the class-specific linear combinations are calculated. The third step is classification, which assigns the test sample the class label associated with the minimum class-specific residual. We classify the LRC methods into six categories: 1) linear representation-based classification methods with norm minimizations, 2) linear representation-based classification methods with constraints, 3) linear representation-based classification methods with feature spaces, 4) linear representation-based classification methods with structural information, 5) linear representation with subspace learning, and 6) linear representation in semi-supervised learning and unsupervised learning. The purpose of this paper is to: 1) give an accurate and clear definition of the linear representation-based method, 2) provide a categorization and a comprehensive survey of the existing linear representation-based classification methods for image classification, 3) summarize the main applications of linear representation-based methods, and 4) provide extensive classification results and a discussion of the linear representation-based methods. In particular, we performed extensive experiments to compare thirteen linear representation-based classification methods on seven image classification datasets.


I. INTRODUCTION
Image classification has been extensively studied in recent years, driven by the increasingly active development of computer vision and pattern recognition. The classification problem is to identify the category an instance belongs to, based on given observations (training data) and their category membership [1]. Visual applications such as remote sensing [2], face recognition [3], [4], object recognition [5], and biometrics [6]-[8] widely use the models and algorithms of image classification. The linear representation-based classification method is an active research area in the image classification field [9]-[11]. Up to now, many LRC methods have been proposed and developed for better and more robust image classification, such as the sparse representation-based classifier (SRC) [9], the collaborative representation-based classifier (CRC) [10], and the non-negative representation-based classifier (NRC) [12]. To the best of our knowledge, there is no related literature that provides a clear definition of the linear representation-based method, and accordingly no article that comprehensively surveys this group of methods. In this paper, we first give an accurate and clear definition of a linear representation-based method to establish the concept. Then, based on the definition, we propose six categories that summarize the various linear representation-based methods from different perspectives, presenting both an overview and a detailed interpretation. Afterwards, we review the main applications that widely apply linear representation-based methods. Lastly, we provide extensive experimental results and a discussion.
The associate editor coordinating the review of this manuscript and approving it for publication was Ramakrishnan Srinivasan.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The linear representation-based classification (LRC) method is closely related to the nearest subspace classification (NSC) [13] method: both assign the test sample the class associated with the optimal class subspace. As in the NSC algorithm, all training samples in the LRC algorithm represent the test sample in a linear combination. Therefore, LRC performs classification in a linear system in order to organize the subspaces for classification. Unlike NSC, LRC is composed of three main components: the reconstruction term, the regularization term, and the constraints. The reconstruction term ensures that the linear combination of the training samples is as close as possible to the test sample. The regularization term imposes different assumptions on the coefficients of the linear combination to obtain a robust representation that overcomes the variations existing within the data. For example, SRC for face recognition [14] utilizes $l_1$ minimization to make the coefficient vector sparse, assuming that the test sample can be represented by samples from the same class only. For the same purpose, TPTSR [15] performs $l_2$ minimization twice to obtain a sparse coefficient vector. CRC [10] argues that the collaborative representation formed by all training samples can perform robust classification. NRC [12] holds that non-negative constraints on the coefficients enhance the representation ability of the linear combination. Moreover, with different regularizations, LRC methods can be interpreted in different ways. For instance, both SRC and CRC have a geometric interpretation that supports the rationality of the proposed model and assumptions. ProCRC [16] utilizes probability theory to interpret the newly added term, which enforces the representation to be as close as possible to the class-specific subspaces. Generally speaking, there are three procedures in the LRC algorithm: coding, subspace approximation, and classification. The first procedure is coding, in which we calculate the coefficients of the linear combination from the test sample and the training samples. The second procedure, subspace approximation, obtains the subspace representation of each class using the already-calculated coefficients. The last procedure is classification, which computes the residuals between the test sample and each class and then assigns the test sample the class label associated with the minimum residual.
There are six categories of linear representation-based classifiers: 1) norm minimization, 2) constraints, 3) feature spaces, 4) structural information, 5) subspace learning, and 6) semi-supervised learning and unsupervised learning. This categorization covers all components and procedures involved in the linear representation-based method. The relationship between these six categories and the linear representation-based method is shown in Figure 2: each category corresponds to a component or procedure of the linear representation-based method (more details about this figure can be found in section III). For norm minimization, there are many LRC algorithms and corresponding fusion extensions. SRC [14] and CRC [10] are representative methods that use $l_1$ and $l_2$ minimization, respectively, according to their assumptions about the data. To fully exploit the properties of the data, many constraints have been imposed on the basic LRC model to achieve a better recognition rate. For example, sparsity [17], [18] and locality [19]-[21] are exploited to better represent the test sample in the linear representation. Methods with the non-negative constraint [12], [22] hold that non-negativity makes the representation focus more on the homogeneous data and makes it easier to interpret. Several works are devoted to creating an optimal projection or an ideal feature subspace for image classification [23]-[25]. In the new feature space, the data are more discriminative than in the original space. There are currently three ways to create a new feature space: the kernel-based representation [26], the mapping method, and the use of deep features. The structural information within the dataset is critical for image classification [27], [28]. 
The low-rank representation [27] attempts to recover the subspace structure by minimizing the rank of the coefficient matrix for robust image classification. The convolutional SRC [29], [30] used the dictionary filter to make convolutional operations on the coefficients, which involves the information of the entire image. Some LRC-based methods consider the neighbor information [31], [32] in the decision-making procedure. Besides fully supervised learning, there are existing methods proposed for semi-supervised learning and unsupervised learning [33]- [36], showing that the LRC-based method can be applied widely in the image classification domain.
Linear representation-based methods for image classification have numerous applications. Hyperspectral images, which contain a wide range of electromagnetic spectrum information, are frequently used in remote sensing. LRC algorithms are utilized to exploit the shared information among the spectral signals inside the image. Since the pixels of a hyperspectral image lie in a high-dimensional space, it is helpful if an approximation can be provided from the low-dimensional subspace of the sample class [37]. LRC methods are also widely applied in medical biometrics and multimodal biometrics. For example, ProCRC was applied to detecting diabetes mellitus and achieved promising accuracy [38], and group SRC was used to fuse multimodal biometric data. Another popular application is face recognition. There are many LRC methods specifically designed for face recognition [39]-[41]. For instance, the conventional SRC [14] was originally proposed for robust face recognition.
To this end, in this paper we conduct a survey of the linear representation-based methods for image classification. This paper makes contributions in the following areas:
1) Definition. We present an accurate definition of the linear representation-based method for image classification by reviewing influential works in this research area. With this definition, readers can clearly distinguish this kind of method from other methods.
2) Categorization. We categorize the LRC methods for image classification into six classes and discuss the different methods proposed within them. Readers interested in a certain area can refer to the corresponding category for details.
3) Application. We review the applications of LRC methods in the four most frequently used areas: remote sensing, face recognition, medical biometrics, and multimodal biometrics. This allows readers to learn how LRC methods are used in a specific application scenario, and it also benefits readers who are seeking a solution in a related application.
4) Experiments and discussion. We performed extensive experiments on seven datasets to show the performance of the different LRC methods. Based on the experimental results, we provide a discussion of the LRC methods and point out the challenges and possible points of interest in linear representation for image classification. Readers can gain insight into the classification ability, properties, and potential future directions of the linear representation-based methods.
The overview of the organization of this paper is shown in Figure 1. The remainder of this paper is organized as follows: In section II, we introduce the notations used in this paper, the preliminary background knowledge of the linear representation-based method, and some basic definitions of the linear representation-based method. In section III, six groups of linear representation-based classifiers are presented and discussed in detail. Different types of applications using linear representation-based methods are reviewed in section IV. We performed extensive comparison experiments on seven image datasets and show the results in section V. Finally, a discussion is presented, and a conclusion summarizes this paper.

II. BACKGROUND AND DEFINITIONS
In this section, we first summarize the notations used in the rest of this paper. Afterwards, we briefly review nearest subspace classification [13] and compare it with LRC, since the two methods are highly correlated. Finally, an accurate definition of linear representation-based classification is provided and discussed.

A. NOTATIONS
In this paper, we denote by $y \in \mathbb{R}^m$ the test sample that needs to be assigned a label. $X \in \mathbb{R}^{m \times n}$ represents the $n$ training samples, and $X_i \in \mathbb{R}^{m \times r}$ ($r < n$) denotes the training samples from the $i$-th class. $\alpha \in \mathbb{R}^n$ is the coefficient vector of the linear combination obtained in the coding procedure of the LRC method, and $\alpha_i$ is the vector that contains only the coefficients from the $i$-th class. $\lambda \in \mathbb{R}$ is the scaling factor. $I \in \mathbb{Z}^+$ is the output label of the LRC method. $k$ represents the number of classes in the dataset. $H$ represents the number of training samples in the task. $\odot$ is the operator performing the element-wise product between two vectors. The notations used in this paper are summarized in Table 1.

B. NEAREST SUBSPACE CLASSIFICATION
In the nearest subspace classifier [13], [42], classification is founded on the assumption that samples from the same class lie in the same subspace. We can describe the assumption as follows:
Assumption 1: Given a collection of images $C = \{C_1, C_2, \ldots, C_l\} \in \mathbb{R}^{m \times l}$, the samples from the same class $X_i = \{I_p, I_{p+1}, \ldots, I_q\}$ lie on the same linear subspace $S_{X_i} \in \mathbb{R}^{k \times l_{X_i}}$.
According to Assumption 1, the primary classification idea of NSC is to assign the test sample the class label associated with the closest class-specific subspace, as formulated in Eq. 1. The metric is usually the $l_2$-norm distance (Eq. 2).
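The nearest-subspace rule described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the formulation of [13]: it assumes that each class subspace is the span of that class's training samples (columns of `X`) and that the distance is the $l_2$ norm of the least-squares residual. The function name `nsc_predict` and the `labels` array are hypothetical.

```python
import numpy as np

def nsc_predict(y, X, labels):
    """Assign y to the class whose training-sample span is closest in l2 distance.

    Assumption: the class subspace is span(X[:, labels == c]).
    """
    best_label, best_dist = None, np.inf
    for c in np.unique(labels):
        Xc = X[:, labels == c]                       # m x r block of class-c samples
        # Least-squares projection of y onto the span of the class-c samples
        coef, *_ = np.linalg.lstsq(Xc, y, rcond=None)
        dist = np.linalg.norm(y - Xc @ coef)         # l2 distance to the class subspace
        if dist < best_dist:
            best_label, best_dist = c, dist
    return best_label

# Toy data: class 0 spans the first two axes of R^4, class 1 the last two.
X = np.eye(4)
labels = np.array([0, 0, 1, 1])
y = np.array([0.9, 0.4, 0.1, 0.0])
print(nsc_predict(y, X, labels))
```

In the toy example the test sample lies almost entirely in the span of the class-0 samples, so its residual to that subspace is the smaller one.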

C. LINEAR REPRESENTATION-BASED CLASSIFICATION
The linear representation-based classification method [10], [12], [14] inherits the assumption and theorem of NSC. However, compared with NSC, LRC emphasizes the characteristics of the coefficients in the linear combination, with the aim of performing robust image classification. In addition, LRC assumes that each sample in the dataset can be represented as a linear combination of other samples, which is stated in Theorem 1:
Theorem 1: Given a collection of images $C = \{C_1, C_2, C_3, \ldots, C_l\} \in \mathbb{R}^{m \times l}$, a sample $C_i$ in $C$ can be represented as a linear combination of the other samples $C_j \in C$ ($j \neq i$).
Based on Theorem 1, a test sample $y$ in the image collection $C$ can be represented as a linear combination of the samples (Eq. 3), and Eq. 3 can be rewritten as Eq. 4. Basically, the LRC method solves the minimization problem in Eq. 5, where $p$ denotes the norm being minimized, which depends on the specific method (e.g., when $p = 1$, it becomes the sparse representation-based classification model [14]), and $\lambda$ is the regularization parameter. The first constraint represents a linear combination. The other constraints are specified by each method individually according to its assumptions.
Generally speaking, the LRC method comprises three procedures: coding, subspace approximation, and classification. The coding procedure seeks the optimal coefficients of the linear combination under different regularizations and constraints.
In the first step, the coding procedure solves the general problem in Eq. 6. After the coefficient vector $\alpha$ is obtained using Eq. 6, the second step is subspace approximation, which first organizes all class-specific subspaces (Eq. 7) and then calculates the residuals between the test sample and all class-specific subspaces (Eq. 8). Finally, in the classification procedure, the class label with the minimum residual is assigned to the test sample (Eq. 9). As noted in section I, although there are plenty of algorithms based on the LRC method, no uniform definition of the LRC method has been given. Here, we define it as follows:
Definition 1: Linear representation-based classification: A method that classifies a given test sample using all training samples by following three steps:
1) Coding using Eq.6.
2) Subspace approximation using Eq.7 and Eq.8.
3) Classification using Eq.9.
We summarize the LRC in Algorithm 1.
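For orientation, the three steps of Definition 1 are commonly written as below, using the notation of Section II-A. This is a sketch under stated assumptions, not a restatement of Eqs. 6-9: the squared $l_2$ reconstruction term, the penalized (rather than constrained) form of the coding problem, and the hat on $\hat{\alpha}$ are choices made here for illustration.

```latex
\begin{align*}
\hat{\alpha} &= \arg\min_{\alpha} \; \|y - X\alpha\|_2^2 + \lambda \|\alpha\|_p
  && \text{(coding)} \\
R_i &= \|y - X_i \hat{\alpha}_i\|_2, \quad i = 1, \dots, k
  && \text{(subspace approximation)} \\
I &= \arg\min_{i} \; R_i
  && \text{(classification)}
\end{align*}
```

Individual methods then differ mainly in the choice of $p$ and in the additional constraints placed on $\alpha$.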

III. MODELS AND ALGORITHMS
Based on the basic LRC method described in section II, various extended classifiers have been proposed. In this section, we categorize them into six directions and introduce each in detail. Figure 2 illustrates the relationship between the six categories and the linear representation-based method: the norm-minimization category focuses on the coding procedure; the constraints category concentrates on the properties of the coefficients; the feature space category contains methods that perform linear representation in different feature spaces; the structural information category exploits the structural information in the training samples; the subspace learning category is devoted to learning the best subspace for the subspace approximation; and the semi-supervised learning and unsupervised learning category consists of methods that perform classification in a semi-supervised or unsupervised way.
A. LINEAR REPRESENTATION-BASED CLASSIFICATION WITH NORM MINIMIZATION
1) l 1 MINIMIZATION
By using $l_1$ minimization in Eq. 5, the coefficient vector of the linear combination becomes "sparse". In the linear representation, "sparse" means that the majority of the elements in the coefficient vector are zero [43]. Therefore, the linear representation-based classification with $l_1$ minimization is called "sparse representation". Wright et al. proposed the sparse representation-based classification (SRC) method [14] to perform robust face recognition. Based on Assumption 1, SRC intends to represent the test sample using only the training samples from the same class as the test sample.
In order to make the coefficient vector sparse, the coding procedure can be formulated as an $l_0$ minimization problem (Eq. 10). However, $l_0$ minimization is NP-hard, which means that no polynomial-time algorithm is known for solving it exactly. Therefore, when the coefficient vector $\alpha$ is sparse enough, Eq. 10 can be replaced by $l_1$ minimization. There are four groups of optimization methods for solving the problem in Eq. 12: 1) greedy strategy approximation, 2) constrained optimization, 3) proximity algorithm-based optimization, and 4) homotopy algorithm-based sparse representation [44]. Here we introduce the two most frequently used algorithms for solving the $l_1$ minimization problem in Eq. 10: the orthogonal matching pursuit (OMP) [45] algorithm and the fast iterative shrinkage thresholding (FISTA) [46] algorithm.
The OMP algorithm belongs to the greedy strategy approximation group: it makes the locally optimal choice at each step in order to approximate the global solution. The core idea of OMP is to select, in each iteration, the training sample that contributes most to approximating the test sample. The algorithm stops when the reconstruction $X\alpha$ is close enough to the test sample. In each iteration $d$, the contribution of the $i$-th training sample $x_i$ is evaluated by the inner product between $x_i$ and the residual vector $r_{d-1}$ of the previous step. The residual vector $r_{d-1}$ is orthogonal to the samples already selected in the reconstruction matrix of step $d-1$. Then, the index of the most contributing sample is added to the index set, and the reconstruction matrix is extended with that sample. Next, the coefficient vector $\alpha$ is updated according to the reconstruction matrix. As the final step of an iteration, the residual vector is recalculated. To stop the algorithm when the reconstruction is close enough to the test sample, a stopping criterion is set.
The fast iterative shrinkage thresholding (FISTA) algorithm utilizes the proximity algorithm to solve Eq. 13. Here we reformulate Eq. 13 in terms of the two functions $u(\alpha) = \frac{1}{2}\|X\alpha - y\|_2^2$ and $v(\alpha) = \|\alpha\|_1$. In each iteration of the FISTA algorithm, the update formulation depends on $L$, the Lipschitz constant [47].
Weighted LRC methods with $l_1$ minimization have been proposed to enhance the stability of the original classifier [48], [49]. The weight can be imposed in two places in LRC: on the training samples and on the coefficients. To impose weights on the training samples, Eq. 13 can be extended with a weight matrix $W$. In [49], Fan et al. evaluate the weights of the training samples based on the Gaussian kernel distances between the test sample and the training samples, $\mathrm{dist}(x_i, y) = \exp(-\|x_i - y\|^2 / 2\sigma^2)$. Therefore, the weight matrix is $\mathrm{diag}(W) = [\mathrm{dist}(x_1, y), \mathrm{dist}(x_2, y), \ldots, \mathrm{dist}(x_n, y)]$. The Gaussian kernel distance not only measures the similarity of samples but also captures the nonlinear information in the dataset. The other way of weighting is to impose weights on the coefficients, which can be described as an extension of Eq. 12. In [48], the Euclidean distance between the test sample and each training sample, $\mathrm{dist}(x_i, y) = \|y - x_i\|_2$, is used as the weight of each training sample, and the weight matrix is built from these distances. This weighting strategy takes the similarity between the test sample and its neighbors into account when representing the test sample.
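The OMP iteration described earlier in this subsection (select the most contributing sample, re-fit the coefficients, update the residual, check the stopping criterion) can be sketched as follows. This is a minimal illustration under assumptions that are not taken from [45]: the columns of `X` are taken to have unit norm, the stopping rule is a residual-norm threshold `tol` combined with a cap `max_atoms` on the number of selected samples, and all names are hypothetical.

```python
import numpy as np

def omp(X, y, max_atoms, tol=1e-6):
    """Greedy sparse coding; returns alpha with at most max_atoms non-zeros.

    Assumption: the columns of X have unit l2 norm.
    """
    alpha = np.zeros(X.shape[1])
    residual = y.astype(float).copy()
    support = []                                   # index set of selected samples
    coef = np.zeros(0)
    for _ in range(max_atoms):
        if np.linalg.norm(residual) <= tol:        # stopping criterion
            break
        # Contribution of each sample: |inner product with the current residual|
        scores = np.abs(X.T @ residual)
        if support:
            scores[support] = -np.inf              # never select an index twice
        support.append(int(np.argmax(scores)))
        # Re-fit the coefficients on the selected columns by least squares
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        # New residual, orthogonal to every selected column
        residual = y - X[:, support] @ coef
    if support:
        alpha[support] = coef
    return alpha

# Toy dictionary: four axis vectors plus one diagonal atom.
s = 2 ** -0.5
X = np.array([[1, 0, 0, 0, s],
              [0, 1, 0, 0, s],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 1, 0]], dtype=float)
y = np.array([-1.0, 0.0, 3.0, 0.0])
alpha = omp(X, y, max_atoms=2)
```

The least-squares re-fit over the whole support is what distinguishes OMP from plain matching pursuit, which only updates the newest coefficient.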
The SRC (linear representation-based classification with $l_1$ minimization) algorithm is summarized in Algorithm 2.

Algorithm 2 Sparse Representation-Based Classification
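Require: Test sample $y$, Training set $X$, Number of classes $k$, Training label set $X_L$.

Coding:
Solve the $l_1$-minimization problem to get coefficient vector $\alpha$;
Subspace approximation:
Construct the subspace set S using Eq.7; Calculate the residuals $R_i$ using Eq.8;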

Classification:
Assign the class label I to sample y using Eq.9; return class label I ;
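The steps of Algorithm 2 can be sketched with FISTA as the $l_1$ solver. The code below is an illustrative sketch under assumptions, not the implementation of [14] or [46]: it assumes the penalized objective $u(\alpha) + \lambda v(\alpha)$ with $u$ and $v$ as defined above, takes $L$ to be the largest eigenvalue of $X^T X$, runs a fixed number of iterations, and uses hypothetical names throughout.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (element-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista_l1(X, y, lam, n_iter=500):
    """Minimize 0.5 * ||X a - y||_2^2 + lam * ||a||_1 with FISTA."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient of u
    alpha = np.zeros(X.shape[1])
    z, t = alpha.copy(), 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ z - y)                          # gradient of u at z
        alpha_new = soft_threshold(z - grad / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        # Momentum step that distinguishes FISTA from plain ISTA
        z = alpha_new + ((t - 1.0) / t_new) * (alpha_new - alpha)
        alpha, t = alpha_new, t_new
    return alpha

def src_predict(y, X, labels, lam=0.01):
    """Coding, class-specific residuals, then minimum-residual label."""
    alpha = fista_l1(X, y, lam)
    classes = np.unique(labels)
    residuals = [np.linalg.norm(y - X[:, labels == c] @ alpha[labels == c])
                 for c in classes]
    return classes[int(np.argmin(residuals))], alpha

X = np.eye(4)
labels = np.array([0, 0, 1, 1])
label, alpha = src_predict(np.array([0.9, 0.4, 0.1, 0.0]), X, labels)
```

With an orthonormal dictionary, as in the toy example, the $l_1$ solution is simply the soft-thresholded test sample, which makes the shrinkage effect easy to inspect.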

2) l 2 MINIMIZATION
The linear representation-based classification with $l_2$ minimization uses all training samples collaboratively to represent the test sample. The coding problem is formulated with an $l_2$-norm regularizer on the coefficient vector. Unlike the $l_1$ minimization in SRC, $l_2$ minimization makes the coefficient vector "dense", which means that the majority of its elements are non-zero. This type of representation makes all training samples in the dataset collaborate to represent the test sample, rather than using only the training samples from the same class as the test sample. Zhang et al. used $l_2$ minimization within the LRC method to perform image classification and called the resulting method collaborative representation-based classification (CRC) [10].
CRC assumes that samples from different classes share similarities, so that all training samples together represent the test sample better. CRC achieves classification results competitive with SRC and, moreover, costs less computation time than SRC, since the $l_2$ minimization has a closed-form solution. CRC has an elegant geometric interpretation that supports its rationality, shown in Figure 3(a). We can observe that CRC involves the training samples of all classes in representing the test sample, whereas SRC utilizes only the samples from a single class. We take the representation $\tau_{1,1}$ as an example to explain the robustness brought by CRC. Since the representation $\xi_{1,1}$ is parallel to $\bar{y} - \tau_{1,1}$, the law of sines yields Eq. 29. Since $\|\bar{y} - \tau_1\|_2$ is the residual $e_1$ between the test sample and the representation $\tau_1$, Eq. 29 can be rewritten as Eq. 30. According to Eq. 30, CRC not only enforces the test sample to be as close as possible to the representation $\tau_1$, but also makes $\tau_{1,1}$ as far as possible from the representation composed of the other samples, $\xi_1 = \sum_{i,j \neq 1} \tau_{i,j}$. This "double check" mechanism [10] ensures robust classification.
The weighted version of LRC with $l_2$ minimization (termed weighted CRC, WCRC) was proposed in [50] to improve the classification performance. In the formulation of weighted CRC, the weight matrix is a diagonal matrix whose non-zero elements are estimated by the squared residuals between the test sample and the representation obtained from the training samples, and $\Upsilon$ is the Tikhonov matrix, usually $\Upsilon = \varepsilon I$, which avoids ill-posed problems. In [51], the adaptive WCRC (AWCRC) was proposed, where $\Upsilon$ is replaced by a diagonal matrix whose non-zero elements are the distances between the training samples and the test sample.
In [16], Cai et al. proposed the probabilistic collaborative representation-based classification (ProCRC) by maximizing the test sample's class-specific likelihood. It assumes that the coefficient vector in the linear representation determines the confidence that a test sample belongs to each class. In the probability that a test sample belongs to a specific class, $\delta_X$ is the label set of the classes in the training set $X$ and $\delta(y)$ is the label of the test sample $y$. ProCRC tries to construct a probabilistic collaborative subspace that ensures the class-specific likelihood is maximal. Therefore, the objective function of ProCRC maximizes the joint probability of the test sample (Eq. 33). Based on Eq. 33, we can obtain the coefficients by solving the optimization problem in Eq. 34, where the term $\|X\alpha - X_i\alpha_i\|_2^2$ enforces the representation to be as close as possible to the class-specific collaborative subspace. Eq. 34 has a closed-form solution [16].
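To illustrate how a ProCRC-style objective admits a closed-form solution, the sketch below assumes the objective $\|y - X\alpha\|_2^2 + \lambda\|\alpha\|_2^2 + \frac{\gamma}{k}\sum_i \|X\alpha - X_i\alpha_i\|_2^2$. The weights `lam` and `gamma` and the $1/k$ scaling are assumptions made for illustration and may differ from Eq. 34; `procrc_code` is a hypothetical name. Setting the gradient of this quadratic to zero gives a single linear system.

```python
import numpy as np

def procrc_code(y, X, labels, lam=1e-2, gamma=1e-1):
    """Closed-form minimizer of the assumed ProCRC-style objective."""
    n = X.shape[1]
    classes = np.unique(labels)
    A = X.T @ X + lam * np.eye(n)
    for c in classes:
        # X_bar keeps every column except those of class c, so that
        # X_bar @ alpha equals X @ alpha - X_c @ alpha_c
        X_bar = X.copy()
        X_bar[:, labels == c] = 0.0
        A += (gamma / len(classes)) * (X_bar.T @ X_bar)
    # Normal equations of the quadratic objective
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
labels = np.array([0, 0, 1, 1])
y = rng.standard_normal(6)
alpha = procrc_code(y, X, labels, lam=0.1, gamma=0.5)
```

Because the system matrix does not depend on the test sample, it can be factorized once and reused for every test sample.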
The collaborative representation-based classification method and probabilistic collaborative representation-based classification method are summarized in the Algorithm 3 and Algorithm 4.

Algorithm 3 Collaborative Representation-Based Classification
Require: Test sample $y$, Training set $X$, Number of classes $k$, Training label set $X_L$.

Coding:
Solve the $l_2$-minimization problem to get coefficient vector $\alpha$: $\alpha = (X^T X + \lambda I)^{-1} X^T y$
Subspace approximation:
Construct the subspace set S using Eq.7; Calculate the residuals $R_i$ using Eq.8;

Classification:
Assign the class label I to sample y using Eq.9; return class label I ;
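Algorithm 3 can be sketched directly from the closed-form coding step. The example below is a minimal illustration with hypothetical names (`crc_fit`, `crc_predict`); it uses the plain class-specific residual $\|y - X_i\alpha_i\|_2$ for classification, which is an assumption about Eq. 8. The point of the split into two functions is that the projection matrix depends only on the training set.

```python
import numpy as np

def crc_fit(X, lam=1e-2):
    """Precompute P = (X^T X + lam I)^{-1} X^T, independent of the test sample."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T)

def crc_predict(y, X, labels, P):
    alpha = P @ y                                   # coding in closed form
    classes = np.unique(labels)
    residuals = [np.linalg.norm(y - X[:, labels == c] @ alpha[labels == c])
                 for c in classes]                  # class-specific residuals
    return classes[int(np.argmin(residuals))]       # label with minimum residual

X = np.eye(4)
labels = np.array([0, 0, 1, 1])
P = crc_fit(X)
print(crc_predict(np.array([0.9, 0.4, 0.1, 0.0]), X, labels, P))
print(crc_predict(np.array([0.0, 0.1, 0.2, 0.8]), X, labels, P))
```

Coding each new test sample then costs one matrix-vector product, which is the source of the speed advantage over iterative $l_1$ solvers.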

Algorithm 4 Probabilistic Collaborative Representation-Based Classification
Require: Test sample $y$, Training set $X$, Number of classes $k$, Training label set $X_L$.

Coding:
Solve the problem described in Eq. 34 to get coefficient vector $\alpha$;
Subspace approximation:
Construct the subspace set S using Eq.7; Calculate the residuals $R_i$ using Eq.8;

Classification:
Assign the class label I to sample y using Eq.9; return class label I ;

3) l 2,1 MINIMIZATION
The LRC with $l_{2,1}$ minimization assumes that sparsity should be imposed on the training samples of the incorrect classes as groups, instead of on individual samples. In other words, only the training samples from the correct class have non-zero coefficients in the linear representation. The basic formulation of LRC with $l_{2,1}$ minimization is given in Eq. 35. This LRC with $l_{2,1}$ minimization is called group sparse classification (GSC) [52]. In GSC, the group sparsity is imposed on the training samples. The difference between group sparsity and sparsity is illustrated in Figure 4. The problem in Eq. 35 can be solved by the SPGL1 algorithm [53].
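The effect of group sparsity can be seen from the $l_{2,1}$ norm itself and from its proximal operator, which shrinks whole class blocks of coefficients at once. The sketch below is an illustration only: it assumes the groups are the classes given by `labels`, it is not the SPGL1 algorithm of [53], and the function names are hypothetical.

```python
import numpy as np

def l21_norm(alpha, labels):
    """Sum over classes of the l2 norm of each class block of alpha."""
    return sum(np.linalg.norm(alpha[labels == c]) for c in np.unique(labels))

def group_soft_threshold(alpha, labels, tau):
    """Proximal operator of tau * l21_norm: shrink each class block as a unit."""
    out = np.zeros_like(alpha, dtype=float)
    for c in np.unique(labels):
        mask = labels == c
        norm = np.linalg.norm(alpha[mask])
        if norm > tau:                       # blocks with small norm are zeroed entirely
            out[mask] = (1.0 - tau / norm) * alpha[mask]
    return out

alpha = np.array([3.0, 4.0, 0.1, 0.1])
labels = np.array([0, 0, 1, 1])
shrunk = group_soft_threshold(alpha, labels, tau=1.0)
```

In the toy example the class-1 block has norm below the threshold and is set to zero as a whole, while the class-0 block is only scaled down, which is the group-level counterpart of element-wise soft thresholding.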

B. LINEAR REPRESENTATION-BASED CLASSIFICATION WITH CONSTRAINTS
The constraints imposed on the linear representation enforce the coefficient vector to be discriminative, which boosts LRC's classification ability. Several properties of the coefficients have been exploited, such as non-negativity [12], sparsity [17], and locality [20].

1) NON-NEGATIVE REPRESENTATION-BASED CLASSIFICATION METHOD
The non-negativity in the linear representation means that all elements in the coefficient vector are non-negative. Since only an additive linear combination has a clear visual meaning [54], the non-negative representation is suitable for image classification in real-world applications, offering both physical interpretation and mathematical feasibility. Figure 5 shows the non-negativity of the coefficients in the linear representation.
The non-negative representation-based classification (NRC) [12] is an extension of the LRC method that imposes a non-negativity constraint on the coefficient vector, enforcing all of its elements to be non-negative. With the non-negative representation, samples from the homogeneous class are enhanced in the linear representation, while samples from the heterogeneous classes are suppressed. The NRC is described by Eq. 36, where $\rho$ is an auxiliary variable in a problem with linear equality constraints.
Eq. 36 is the non-negativity constrained least squares model [55], which can be solved using the ADMM method [56] applied to the augmented Lagrangian function. ADMM alternately optimizes one variable while fixing the others until the stopping requirement is met. The algorithm of NRC is summarized in Algorithm 5.
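The alternating updates can be sketched for a non-negativity constrained least squares problem in scaled-form ADMM. This is an illustration under assumptions rather than the update rules of [12] or [56]: the objective is taken to be $\frac{1}{2}\|y - X\alpha\|_2^2$ subject to $\alpha = z$, $z \geq 0$, the penalty parameter is called `mu`, `z` plays the role of the auxiliary variable, `u` is the scaled dual variable, and `nrc_code` is a hypothetical name.

```python
import numpy as np

def nrc_code(y, X, mu=1.0, n_iter=500, tol=1e-8):
    """Non-negative least squares via scaled-form ADMM (illustrative sketch)."""
    n = X.shape[1]
    A = X.T @ X + mu * np.eye(n)      # system matrix, identical in every iteration
    Xty = X.T @ y
    z = np.zeros(n)                   # auxiliary non-negative copy of alpha
    u = np.zeros(n)                   # scaled dual variable
    for _ in range(n_iter):
        # alpha-update: ridge-like least squares with z and u fixed
        alpha = np.linalg.solve(A, Xty + mu * (z - u))
        # z-update: projection onto the non-negative orthant
        z_new = np.maximum(alpha + u, 0.0)
        # dual update on the constraint alpha = z
        u = u + alpha - z_new
        converged = (np.linalg.norm(z_new - z) <= tol
                     and np.linalg.norm(alpha - z_new) <= tol)
        z = z_new
        if converged:
            break
    return z

X = np.eye(3)
y = np.array([1.0, -2.0, 0.5])
alpha = nrc_code(y, X)
```

Returning `z` rather than `alpha` guarantees that the output is exactly non-negative even if the iteration stops before the two variables fully agree.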

Algorithm 5 Non-Negative Representation-Based Classification
Require: Test sample $y$, Training set $X$, Number of classes $k$, Training label set $X_L$, Number of iterations $T$, Tolerance.

Classification:
Assign the class label I to sample y using Eq.9; return class label I ;

2) LOCALITY CONSTRAINT
Locality refers to the information that the neighbors of a sample carry about it. Usually, in the classification scenario, the farther a sample is from the test sample, the less likely it is to belong to the same class. Therefore, locality benefits classification by drawing on information from the neighbors. Figure 6 shows the locality in the LRC. The locality constraint in the linear representation enforces the coefficients to take the neighboring samples into account. By considering information from the neighbors, the LRC reconstructs the test sample and generates a discriminative representation for classification. In addition, by applying the locality constraint, the coefficient vector naturally becomes sparse [21], since the sample's neighbors are only a portion of the whole data. In [21], the locality constraint was first implemented for feature coding. In the basic form of the locality constraint in the LRC method, $\mathrm{dist}_i$ represents the distance vector whose elements are the distances between the $i$-th test sample and all training samples, and $\odot$ represents the element-wise product between two vectors. The constraint $\mathbf{1}^T \alpha = 1$ is a shift-invariant constraint. Here, the locality information of a test sample is quantified by measuring its similarities with all training samples. Coefficients whose values fall below a set threshold are reset to zero. As shown in Figure 6, the samples close to the test sample have larger weights (bold lines), and the samples far from the test sample have low weights (dashed lines).
Inspired by the locality constraint in locality-constrained linear coding (LLC) [21], the locality-sensitive dictionary learning (LCDL) [19] was proposed, which utilizes the locality constraint to construct a representative dictionary. When constructing the dictionary, all training samples are involved, and the distances between each test sample and all training samples are taken as the locality information. Rather than using the distances to all training samples, the locality-constrained collaborative representation (LCCR) [20] uses the sum of distances between each test sample and its neighbors as the locality constraint. The WSRC [48], [49] uses different similarity measurements (e.g., the Gaussian kernel distance) as the locality information.
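A locality-constrained coding step of the kind discussed in this subsection can be sketched in closed form. The sketch assumes the objective $\|y - X\alpha\|_2^2 + \lambda\|\mathrm{dist} \odot \alpha\|_2^2$ subject to $\mathbf{1}^T\alpha = 1$, with plain Euclidean distances as the locality weights; both choices are assumptions made for illustration and may differ from the original formulation. It also assumes that the test sample does not coincide with a training sample, so that the linear system is non-singular. The name `locality_code` is hypothetical.

```python
import numpy as np

def locality_code(y, X, lam=1e-2):
    """Closed-form locality-constrained coding (illustrative sketch)."""
    n = X.shape[1]
    Z = X - y[:, None]                       # training samples shifted by the test sample
    d = np.linalg.norm(Z, axis=0)            # distance from y to each training sample
    # With sum(alpha) = 1, ||y - X alpha||^2 equals alpha^T C alpha
    C = Z.T @ Z
    w = np.linalg.solve(C + lam * np.diag(d ** 2), np.ones(n))
    return w / w.sum()                       # rescale to satisfy 1^T alpha = 1

# Two nearby samples and one distant sample in R^2.
X = np.array([[1.0, 0.0, 10.0],
              [0.0, 1.0, 10.0]])
y = np.array([0.6, 0.4])
alpha = locality_code(y, X)
```

In the toy example the distant third sample receives a coefficient close to zero, while the two nearby samples carry almost the whole representation.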

3) SPARSITY CONSTRAINT
Sparsity means that some coefficients in the linear representation are zero. A sparse coefficient vector tends to produce a representation using samples from the correct class [57]. Therefore, the sparsity constraint leads to a better class-specific residual error for classification. The sparse representation (SR) [14] fully exploits the sparsity in LRC, using only the training samples from the correct class to represent the test sample. Unlike the SR, some LRC extensions impose the sparsity constraint to enhance the performance rather than using sparsity directly for classification.
The general formulation of the sparsity constraint on the linear representation can be described as follows:

$$\min_{\alpha} \|y - X\alpha\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_0 \le \iota \tag{42}$$

where $\iota$ is the number of training samples used in the linear representation. In Eq. 42, the number of training samples is restricted to a small number $\iota$, which enforces the coefficients to be sparse. The sparsity augmented collaborative representation (SACR) is proposed by imposing the sparsity constraint of Eq. 42 on the LRC with $l_2$ minimization (dense representation). In this method, the coefficient vector of the representation is the fusion of two components: 1) the coefficient vector of LRC with sparsity, and 2) the coefficient vector of the dense representation. The coefficient vector of SACR can be described as follows:

$$\acute{\alpha} = \frac{\hat{\alpha} + \tilde{\alpha}}{\|\hat{\alpha} + \tilde{\alpha}\|_2}$$

where $\hat{\alpha}$ is the coefficient vector of the dense representation, and $\tilde{\alpha}$ is the coefficient vector of the linear representation with a sparsity constraint. As the name suggests, the coefficients under sparsity augment the coefficients of the dense representation by enlarging the values of the coefficients from the correct class. This augmentation makes the correct class's coefficients discriminative since it enlarges the gap between the coefficient values of the correct class and those of the other classes, which is proven to be the advantage brought by the sparsity [58], [59], since the gap enlargement can be viewed as sparsity enhancement. Similarly, to enhance the sparsity, Tian et al. proposed the FFT Consolidated Sparse and Collaborative Representation [18], which fuses the representations of SRC and CRC in the frequency domain by using the FFT. This method shows a more robust performance than FFT and CRC. One way to implement the sparsity constraint is a representation strategy named sample removing. In [15], the two-phase test sample representation (TPTSR) is proposed, where the test sample is represented in two phases. In the first phase, the representation of the test sample is produced.
Then, c training samples will be removed from the representation. The removing strategy discards the c training samples with the largest residuals of the representation $\{r_j = \|y - X_j\alpha_j\|_2 \mid j = 1, 2, \ldots, H\}$, i.e., the samples that contribute least to representing the test sample. In the second phase, a new representation organized by the remaining $H - c$ training samples is produced for the classification. Figure 7 shows the coefficient distribution of the two phases in TPTSR. This representation strategy follows the idea of the sparsity constraint described in Eq. 42. The removing operation in the first phase does the same thing as the sparsity constraint: removing samples is equivalent to setting the coefficients of these samples to zero, which makes the coefficient vector sparse. Similarly, based on this strategy, samples can be removed in a heuristic way [40]. In this method, the removing operation is performed repeatedly until the stopping criterion is met. In each iteration, the training samples whose coefficients have the minimum absolute value are removed.
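The two-phase sample removing strategy can be sketched as follows. This is an illustrative sketch only: it assumes $l_2$-regularized coding in both phases, and the names `ridge_code`, `two_phase_classify`, `n_keep` (the number of retained samples, $H - c$) and `lam` are hypothetical.

```python
import numpy as np

def ridge_code(X, y, lam=0.01):
    """Closed-form l2-regularized coding: (X^T X + lam I)^-1 X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def two_phase_classify(X, labels, y, n_keep, lam=0.01):
    """X: (n_features, n_samples) with samples as columns; labels: (n_samples,)."""
    labels = np.asarray(labels)
    # Phase 1: code y over all samples and score each sample by its own residual.
    a = ridge_code(X, y, lam)
    dev = np.linalg.norm(y[:, None] - X * a, axis=0)
    keep = np.argsort(dev)[:n_keep]   # removed samples implicitly get zero coefficients
    # Phase 2: recode over the kept samples and use class-specific residuals.
    Xk, lk = X[:, keep], labels[keep]
    b = ridge_code(Xk, y, lam)
    classes = np.unique(lk)
    res = [np.linalg.norm(y - Xk[:, lk == c] @ b[lk == c]) for c in classes]
    return classes[int(np.argmin(res))]
```

Dropping a sample before the second phase has the same effect as forcing its coefficient to zero, which is why this strategy acts as a sparsity constraint.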
Besides the sparsity for each element in the linear representation, group sparsity is proposed, which imposes the sparsity on classes rather than on the elements of the coefficient vector. Group sparsity can also be realized using the sample removing strategy. The coarse-to-fine face recognition (CFFR) [60] method removes all samples of the c classes with the largest residuals between the test sample and the class-wise representation. In [17], the class-wise sparse representation (CSR) is proposed, which focuses on the sparsity between classes. The problem of CSR can be described as follows: where $\alpha_i$ represents the coefficient vector of the $i$th class samples. The second term is the class-wise sparsity measurement, and $g$ is the scaling factor. The constraint below assigns the $l_2$ norm of samples from each class to the variable $\rho$, which enforces the $l_0$ regularizer to focus on the sparsity between classes.

C. LINEAR REPRESENTATION-BASED CLASSIFICATION WITH FEATURE SPACES
Data in different feature spaces have different discriminative properties. The kernel methods [61] map the data used in the linear representation to the kernel feature space, enabling the LRC to classify data that is not linearly separable. Different mapping methods [25] aim to map the data to distinct feature spaces that fit the classification mechanism of the LRC method. The deep features extracted from deep learning architectures [62] form the deep feature space, which is able to improve the performance of LRC methods for image classification.

1) KERNEL-BASED REPRESENTATION
The kernel trick was first applied extensively in SVMs [63]. As linear classifiers, SVMs perform well on linearly separable data. When the data is not linearly separable, the original SVMs are extended to non-linear classifiers by using the kernel trick.
The kernel trick maps the low-dimensional data to a kernel space, making it linearly separable in this high-dimensional space. The kernel methods usually use the Mercer kernel, which can be described as follows:

$$k(x_1, x_2) = \phi(x_1)^T\phi(x_2)$$

where $x_1$ and $x_2$ are two data samples in the space, $k(\cdot, \cdot)$ is the kernel function, and $\phi(\cdot)$ is the mapping function. Since the distribution varies when the data changes, the mapping function $\phi$ is undetermined in different scenarios. Three types of kernel functions are frequently used based on three different data assumptions. They are the linear kernel, the polynomial kernel, and the Gaussian radial basis function kernel:

$$k(x_1, x_2) = x_1^Tx_2, \qquad k(x_1, x_2) = (x_1^Tx_2 + c)^d, \qquad k(x_1, x_2) = \exp\left(-\frac{\|x_1 - x_2\|_2^2}{2\sigma^2}\right)$$

where $c$, $d$ and $\sigma$ are kernel parameters. When the kernel trick is applied in the LRC methods, it extends them to nonlinear classifiers. However, they still perform the linear representation and classification as linear representation-based methods in the kernel feature space. All three kernel-based LRC methods meet the definition of the linear representation-based classification. The Kernel sparse representation-based classifier (KSRC) [64], [65] was proposed to kernelize the SRC method, which can be described as follows:

$$\min_{\alpha} \|\alpha\|_1 \quad \text{s.t.} \quad \phi(y) = \Phi\alpha \tag{49}$$

where $\Phi = [\phi(x_1), \phi(x_2), \ldots, \phi(x_H)]$ is the matrix of mapped training samples, and $\phi(y)$ is represented by a linear combination of mapped training samples in the kernel feature space.
Since the mapping function $\phi(\cdot)$ is unknown, Eq. 49 above should be rewritten in the following form:

$$\min_{\alpha} \|\alpha\|_1 \quad \text{s.t.} \quad \Phi^T\phi(y) = K\alpha$$

where $K = \Phi^T\Phi \in R^{H \times H}$ is the Gram matrix of the mapped training samples $\Phi$, and the entries of $\Phi^T\phi(y)$ are the kernel values $k(x_j, y)$. After the coefficient $\alpha$ is obtained, the class-specific residuals are calculated:

$$r_i = \|\phi(y) - \Phi_i\alpha_i\|_2^2 = k(y, y) - 2\alpha_i^T\Phi_i^T\phi(y) + \alpha_i^T\Phi_i^T\Phi_i\alpha_i$$

where $\Phi_i$ contains the mapped training samples of the $i$th class and $\alpha_i$ represents the coefficient vector of the $i$th class; every inner product on the right-hand side is evaluated through the kernel function. Like the conventional LRC method, the final classification output is the index of the class associated with the minimum class-specific residual. Rather than using the $l_1$ minimization, the kernelized LRC with $l_2$ minimization is proposed in [66], called Kernel collaborative representation-based classification (KCRC). The formulation of KCRC is shown as follows:

$$\min_{\alpha} \|\phi(y) - \Phi\alpha\|_2^2 + \lambda\|\alpha\|_2^2 \tag{52}$$

The solution of the above Eq. 52 is:

$$\alpha = (K + \lambda I)^{-1}\Phi^T\phi(y)$$

The work in [26] proposed a weighted kernel representation-based method (WKRBM) to impose a weight on the kernel-based representation of LRC for better performance.
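The kernelized $l_2$ coding and the kernel-space residuals discussed above can be sketched with a Gaussian RBF kernel. This is a minimal sketch under assumptions, not the reference implementation of KCRC: the function names, the `gamma` parameterization of the kernel (equivalent to $1/(2\sigma^2)$), and `lam` are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel between the columns of A and the columns of B."""
    sq = (A ** 2).sum(0)[:, None] + (B ** 2).sum(0)[None, :] - 2.0 * A.T @ B
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kcrc_classify(X, labels, y, lam=0.01, gamma=1.0):
    """X: (n_features, H) training samples as columns; y: (n_features,)."""
    labels = np.asarray(labels)
    K = rbf_kernel(X, X, gamma)                    # Gram matrix, (H, H)
    k_y = rbf_kernel(X, y[:, None], gamma)[:, 0]   # k(x_j, y) for every sample
    alpha = np.linalg.solve(K + lam * np.eye(K.shape[0]), k_y)
    classes = np.unique(labels)
    res = []
    for c in classes:
        m = labels == c
        a = alpha[m]
        # ||phi(y) - Phi_c a||^2 expanded with kernels; k(y, y) = 1 for the RBF kernel.
        res.append(1.0 - 2.0 * a @ k_y[m] + a @ K[np.ix_(m, m)] @ a)
    return classes[int(np.argmin(res))]
```

The mapping function never appears explicitly: both the coding step and the residuals only require kernel evaluations between samples.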

2) DEEP LEARNING FEATURES
The deep learning architectures [62] show a powerful capability to learn the discriminative feature representation for the image classification. Several convolutional neural networks proposed recently have achieved state-of-the-art performance in the visual classification task [67]- [69]. Based on the deep neural networks already trained on the large-scale image dataset, the concept of transfer learning [70] is proposed, which means mapping the original data from the raw feature space to the deep feature space where data representation is discriminative by using the pre-trained deep learning architecture.
Attempts to apply deep learning features in LRC methods have been successful [10], [71]-[73], obtaining competitive performance. Some LRC methods directly use the data from deep feature spaces and improve the performance compared with the results based on the raw feature space [10], [12], [73]. In [71], a test sample is represented in parallel by two groups of linear representations with $l_2$ minimization: a linear representation with data from the raw feature space and a linear representation with data from the deep feature space. Thus, two coefficient vectors are obtained simultaneously:

$$\alpha = \arg\min_{\alpha} \|y - X\alpha\|_2^2 + \lambda\|\alpha\|_2^2, \qquad \alpha_{deep} = \arg\min_{\alpha_{deep}} \|y_{deep} - X_{deep}\alpha_{deep}\|_2^2 + \lambda\|\alpha_{deep}\|_2^2$$

where $X_{deep}$ represents the training samples from the deep feature space, $y_{deep}$ is the test sample in the deep feature space, and $\alpha_{deep}$ is the coefficient vector obtained based on $X_{deep}$.
Then, two groups of class-specific residuals are calculated and fused by element-wise multiplication:

$$r_{fused} = r \odot r_{deep}$$

where $r$ and $r_{deep}$ are the vectors of class-specific residuals computed in the raw and deep feature spaces, respectively. Since the class-specific residuals indicate the probability of the test sample belonging to a specific class, the smaller the residual of a class is, the higher the probability that the test sample belongs to this class. Therefore, the fusion between the two class-specific residuals can be viewed as a 'weighting' operation that imposes weights obtained from the deep feature space on the residuals of the raw feature space. Figure 8 shows the effect of the 'weighting' operation. Besides applying the deep learning feature to the residual in the LRC method, Cheng et al. [72] used the well-trained deep learning feature to achieve state-of-the-art performance in face recognition:

$$\min_{\alpha} \|\alpha\|_1 \quad \text{s.t.} \quad f(y) = f(X)\alpha$$

where $f(\cdot)$ represents the deep feature space mapping. In this method, each image is fed into a specially designed 5-layer CNN for deep feature extraction. Then, the SRC is applied to the data from the deep feature space to output the classification result.
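The residual fusion between the raw and deep feature spaces can be sketched as follows. The sketch assumes that deep features have already been extracted by some pre-trained network (not shown), that both representations use $l_2$-regularized coding, and that the names `class_residuals`, `fused_classify` and `lam` are illustrative.

```python
import numpy as np

def class_residuals(X, labels, y, lam=0.01):
    """l2-regularized coding followed by one residual per class (sorted class order)."""
    labels = np.asarray(labels)
    a = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return np.array([np.linalg.norm(y - X[:, labels == c] @ a[labels == c])
                     for c in np.unique(labels)])

def fused_classify(X_raw, X_deep, labels, y_raw, y_deep, lam=0.01):
    """Fuse raw-space and deep-space residuals by element-wise multiplication."""
    r_raw = class_residuals(X_raw, labels, y_raw, lam)
    r_deep = class_residuals(X_deep, labels, y_deep, lam)
    fused = r_raw * r_deep              # deep residuals act as per-class weights
    return np.unique(labels)[int(np.argmin(fused))], fused
```

When the raw residuals of two classes are close, a confident deep-space residual can tip the decision, which is the 'weighting' effect described above.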

3) OTHER FEATURE SPACES
There are other methods that map the data to different spaces and then apply the LRC methods to perform classification.
In [25], the Euler sparse representation-based classification (Euler SRC) maps the data samples to an Euler space before inputting them to SRC in order to boost the robustness of the classifier. Correspondingly, in the complex space, the conventional l 2 -norm distance metric used for calculating residual in SRC is replaced by the cosine distance [74]. With the cosine distance, the margin between data samples of two classes will be larger than using the Euclidean distance.
In [24], the data samples are transformed to a latent space where samples from the same class can be represented by one point, to overcome the pose variation in the face identification problem. Then, the SRC is applied to classify the data in the latent space. In order to map the data to a space that makes it discriminative and fits the SRC method, the SRC-DP [75] is proposed. In the SRC-DP, a projection matrix is proposed to map the data samples to a space where the between-class residual is maximized and the within-class residual is minimized. In [76], a log-Euclidean space is learned for the SRC method, where the data from the same class lie on a subspace with a discriminative structure. In [77], the proposed projection representation-based classification (PRC) method constructed an ideal representation that maps the data samples from each class to a hyperplane with the nearest projected test sample. The work in [23] proposed an algorithm to generate approximately symmetrical images to recover flaws existing in the raw images. All data processed by this algorithm is fed to an SRC classifier to predict the label. This method can be viewed as a refinement of the feature space, which, to some extent, improves the classification performance of the SRC method. In [78], Liu et al. proposed a discriminative sparse embedding (DSE) that projects data from the high-dimensional space to a low-dimensional feature space for classification by integrating SRC and a graph-based method to capture the local information of the noisy data. In [79], a discriminative feature extraction based on sparse and low-rank representation (DFE) was proposed to map the data to a feature space that embeds both local information and global information from the raw feature space.

D. LINEAR REPRESENTATION-BASED CLASSIFICATION WITH STRUCTURAL INFORMATION
The conventional LRC methods usually consider each sample separately in the representation, which ignores the structural information inside the data. The structural information is the relationship among samples in the dataset or the relationships among pixels of a sample. For example, the SRC performs the $l_1$ minimization at the image level, and no emphasis is placed on the correlation at the pixel level. To deal with this problem, the structural information is modeled in different LRC methods. In [80], Wang et al. proposed the adaptive sparse representation-based classification (ASRC) by considering the correlation structure in the SRC model. Since the correlation structure describes the relationship between the samples, it can be a variable to control the attention of the LRC classifier. For example, if the samples are highly correlated, the classifier will pay more attention to the representation correlation. When the samples have a low correlation, the classifier focuses more on the sparsity. The LRC can be adaptive between $l_1$ and $l_2$ norm minimization by using the trace norm, which also captures the correlation in the data. The objective function of the ASRC is shown as follows:

$$\min_{\alpha} \|y - X\alpha\|_2^2 + \lambda\|X\,\mathrm{diag}(\alpha)\|_*$$

where $\|\cdot\|_*$ is the nuclear norm. The term $\|X\,\mathrm{diag}(\alpha)\|_*$ is called the correlation regularizer, which takes the training samples $X$ into consideration in order to exploit the correlation structure inside the training samples. The correlation regularizer can be decomposed into the following forms. When $X^TX = I$:

$$\|X\,\mathrm{diag}(\alpha)\|_* = \|\alpha\|_1$$

and when $X = x_1\mathbf{1}^T$:

$$\|X\,\mathrm{diag}(\alpha)\|_* = \|x_1\alpha^T\|_* = \|x_1\|_2\|\alpha\|_2$$

which equals $\|\alpha\|_2$ for a unit-norm $x_1$. The condition $X^TX = I$ means that the training samples are orthogonal to each other, indicating a low correlation in the data. The condition $X = x_1\mathbf{1}^T$ means that each training sample $x_i$ is the same as $x_1$, indicating a high correlation in the data. In the two extreme cases above, we can observe that the trace norm takes both sparsity and correlation into account when representing the test sample.
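The two extreme cases of the correlation regularizer can be checked numerically. The snippet below is a small illustration with randomly generated data; the helper name `correlation_regularizer` and the matrix sizes are arbitrary choices.

```python
import numpy as np

def correlation_regularizer(X, alpha):
    """Nuclear norm of X diag(alpha); X * alpha scales column j of X by alpha[j]."""
    return np.linalg.norm(X * alpha, ord="nuc")

rng = np.random.default_rng(0)
alpha = rng.normal(size=5)

# Case 1: orthonormal columns (X^T X = I), the regularizer reduces to the l1 norm.
Q, _ = np.linalg.qr(rng.normal(size=(8, 5)))
l1_case = correlation_regularizer(Q, alpha)

# Case 2: identical unit-norm columns (X = x1 1^T), it reduces to the l2 norm.
x1 = rng.normal(size=8)
x1 /= np.linalg.norm(x1)
X_same = np.tile(x1[:, None], (1, 5))
l2_case = correlation_regularizer(X_same, alpha)
```

Here `l1_case` matches `np.abs(alpha).sum()` and `l2_case` matches `np.linalg.norm(alpha)` up to floating-point error, so the regularizer interpolates between sparse and dense behavior according to the correlation of the training samples.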
Besides the correlation structure information in the data, the structural error is also considered in the LRC model in [81]. The proposed matrix-based representation in [81] is described as follows: where the coefficient matrix, of size $n_{test} \times n_{training}$, is composed of the coefficient vectors corresponding to each sample, and $E$ is the structural error matrix of all test samples. In Eq. 60, the representation coefficient matrix and the structural error $E$ are optimized simultaneously. The first term ensures that the structural error in the representation is minimized, and the second makes the representation of each test sample sparse. Since the structural error is modeled, the method becomes more robust.
The patch-based representation [41], [82], [83] represents the test sample using small patches from the whole image. A patch is a fixed-scale small partition of a whole image. By using patches to represent patches, the local structure in the dataset is fully utilized. The patch-based collaborative representation-based classification (PCRC) [41] method applies $N_p$ CRC classifiers to separately classify $N_p$ patches. For each image in the dataset, $N_p$ patches are cropped with a fixed size at the same locations. When representing the test sample, $N_p$ coding procedures are performed in parallel. The objective function of the PCRC is shown as follows:

$$\hat{\alpha}_i = \arg\min_{\alpha_i} \|y_i - P_i\alpha_i\|_2^2 + \lambda\|\alpha_i\|_2^2, \quad i = 1, 2, \ldots, N_p$$

where $\alpha_i$ represents the coefficient vector of the $i$th patch, $y_i$ represents the $i$th patch of the test sample, and $P_i = [P_i^1, P_i^2, \ldots, P_i^H]$ represents the image set containing the $i$th patch of all training samples. After the coefficient vector $\alpha_i$ of each patch is obtained, the following subspace approximation and classification procedures are performed as normal. Finally, this algorithm produces $N_p$ classification outputs. To ensemble these $N_p$ classification outputs, the class-specific weights $\hat{w}_i$ are calculated using the constrained $l_1$-regularized optimization: where $e$ is the vector whose elements are all 1, and $D$ is the decision matrix whose element $d_{ij}$ corresponds to the $j$th patch of the $i$th image. If the classification result of a certain patch is equal to the label of its image, $d_{ij} = 1$. Otherwise, $d_{ij} = 0$. After the class-specific weight vector $\hat{w}$ is calculated, the weights from all patches are summed up to generate an overall class-specific weight vector. The class label with the highest weight is selected as the output. The pipeline of PCRC is illustrated in Figure 9. Besides using the ensemble learning technique to combine the outputs from different patches, Gao et al. [83] proposed the regularized patch-based representation (RPR), which established a uniform model to classify the patches.
This model can be described as follows: where $E = [E_1, E_2, \ldots, E_{N_p}]$ with $E_i$ representing the error of the $i$th patch, $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_{N_p}]$ with $\alpha_i$ representing the sparse coefficient vector of the $i$th patch, $D_i$ represents the intra-class variance matrix proposed in the ESRC [34] for the $i$th patch, and $\beta = [\beta_1, \beta_2, \ldots, \beta_{N_p}]$ with $\beta_i$ representing the intra-class variance coefficient vector of the $i$th patch. This method imposes group sparsity ($l_{2,1}$ minimization) on the representation of the intra-class variance and sparsity on the representation of the training samples. The sparsity ensures that the correct samples are selected for representation, and the group sparsity ensures that the correct class variance is selected for representation. The intra-class variance describes the degree of difference among the samples of the same class, which belongs to the structural information between samples.
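The patch-wise coding used by PCRC can be sketched as follows. Note the simplification: the learned class-specific patch weights of PCRC are replaced here by plain majority voting over the patch decisions, and the patches are assumed to be non-overlapping. The function names and the parameters `size` and `lam` are illustrative.

```python
import numpy as np

def crc_label(P, labels, y, lam=0.01):
    """CRC on one patch location: l2 coding, then minimum class-specific residual."""
    a = np.linalg.solve(P.T @ P + lam * np.eye(P.shape[1]), P.T @ y)
    classes = np.unique(labels)
    res = [np.linalg.norm(y - P[:, labels == c] @ a[labels == c]) for c in classes]
    return classes[int(np.argmin(res))]

def split_patches(img, size):
    """Non-overlapping size x size patches, each flattened to a vector."""
    h, w = img.shape
    return [img[r:r + size, c:c + size].ravel()
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]

def patch_vote_classify(train_imgs, labels, test_img, size, lam=0.01):
    labels = np.asarray(labels)
    train_patches = [split_patches(im, size) for im in train_imgs]
    votes = []
    for i, y_i in enumerate(split_patches(test_img, size)):
        # P_i stacks the i-th patch of every training image as columns.
        P_i = np.stack([tp[i] for tp in train_patches], axis=1)
        votes.append(crc_label(P_i, labels, y_i, lam))
    classes, counts = np.unique(votes, return_counts=True)
    return classes[int(np.argmax(counts))]     # majority vote over patch decisions
```

Each patch location is coded independently, so the $N_p$ coding problems could be solved in parallel, as noted above.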

E. LINEAR REPRESENTATION-BASED CLASSIFICATION WITH SUBSPACE LEARNING
In the subspace approximation procedure of the LRC methods, the composition of each class-specific subspace is critical to the final accuracy. In [84], the collaborative representation optimized classifier (CROC) is proposed by seeking a trade-off between the nearest subspace classifier (NSC) and the collaborative representation-based classifier (CRC). The strategy of CROC can be described as follows: where $r_i$ represents the final residual of the $i$th class, and $\alpha_{CRC}$ and $\alpha_{NR}$ are the representation coefficients of CRC and NSC, respectively. $\mu$ is the weight that balances the significance of CRC and NSC, which can be determined by cross-validation [85].
In [86], the PLRC is proposed to classify a set of images by calculating two coefficient vectors. The figure illustrates the construction strategy of the two subspaces and the calculation of the two coefficient vectors. One coefficient vector contains the joint coefficients between the related subspace and the test space; the other coefficient vector contains the joint coefficients between the test sample and the unrelated subspace. These two types of coefficient vectors construct a pair of metrics: the related metric and the unrelated metric. The two metrics are combined as follows: where $dist_f$ represents the combined metric, $dist_r$ represents the related metric, and $dist_u$ represents the unrelated metric. This combination strategy maximizes the related metric while minimizing the unrelated metric.
In [31], k subspaces of each class are constructed for further representation in LRC methods, which is called two-stage LSCL. In the first stage of LSCL, the c samples nearest to the test sample in each class are selected to form a subspace of the $i$th class, then the average sample of each subspace is calculated: where $\bar{X}_i$ is the average sample of the $i$th class, and $x^k_{ij}$ is the $j$th sample of the $i$th class. In the second stage, the subspaces are fed into the LSRC classifier [87]: where $W$ is the weighted diagonal matrix whose elements on the main diagonal are the distances between the test sample and the samples in each subspace.
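The trade-off between NSC and CRC residuals can be sketched as below. The combination rule is an assumption made for illustration, namely a weighted sum of the two class-specific residuals with weight `mu`; the function name `croc_residuals` and the parameter `lam` are also hypothetical.

```python
import numpy as np

def croc_residuals(X, labels, y, mu=1.0, lam=0.01):
    """Return the classes and, for each class, the NSC residual plus mu times the CRC residual."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # CRC: one joint l2-regularized code over the samples of all classes.
    a = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    out = []
    for c in classes:
        Xc = X[:, labels == c]
        r_crc = np.linalg.norm(y - Xc @ a[labels == c])
        # NSC: least-squares projection of y onto the span of class c alone.
        b, *_ = np.linalg.lstsq(Xc, y, rcond=None)
        r_nsc = np.linalg.norm(y - Xc @ b)
        out.append(r_nsc + mu * r_crc)         # assumed combination rule
    return classes, np.array(out)
```

Setting `mu=0` recovers the NSC decision, while a large `mu` approaches the CRC decision; in practice the weight would be selected by cross-validation, as stated above.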

F. LINEAR REPRESENTATION-BASED CLASSIFICATION IN SEMI-SUPERVISED LEARNING AND UNSUPERVISED LEARNING
Semi-supervised learning means that the machine learns from a dataset containing both labeled data and unlabeled data [88]. When only a few data are labeled, the variations in each class are hard to capture. Likewise, in the LRC method, the lack of labeled training samples will heavily influence the performance [89]-[91]. The S$^3$RC was proposed in [33] to perform semi-supervised classification with the LRC method to address this problem. In S$^3$RC, the linear variations are first eliminated from the raw samples, and all samples are normalized to fit the zero-mean Gaussian distribution.
Since the dataset contains both labeled and unlabeled data, the Gaussian Mixture Model (GMM) [92] is applied to estimate the prototype sample of each class. Next, to estimate the parameters in the GMM, S$^3$RC utilizes the EM algorithm. Finally, the prototypes of all classes are organized to construct a new training set for further classification using the ESRC [34] method. The basic model of S$^3$RC can be described as follows: where each column of the variation dictionary is the difference between a training sample and the extended prototype of its class, and $\alpha$, $\beta$ are the sparse coefficients of the linear representation over the training samples and over the atoms in the variation dictionary, respectively. The prototype of the $i$th class is estimated using the GMM: where $\pi_i$ represents the prior probability of the $i$th class, $\{x^{norm}_{l_1}, \ldots, x^{norm}_{l_n}, x^{norm}_{ul_1}, x^{norm}_{ul_2}, \ldots, x^{norm}_{ul_n}\}$ represents the training set in which each element is normalized and variation-eliminated to fit the non-zero gaussian distribution, $\Sigma_i$ represents the covariance matrix, $\hat{y}$ represents the image set containing both the labeled and the unlabeled normalized samples after variation elimination, and $u$ is the label of the unlabeled samples. The parameters in Eq. 69 can be estimated by the EM algorithm. Finally, the output label is decided by the ESRC [34] method, whose decision rule is formulated as: where the starred training set is the newly estimated training set, and $\alpha^*_i$ and $\beta^*$ are the sparse coefficient vectors newly calculated using ESRC. In [93], an active learning paradigm [94] is imposed on the TPTSR [15], which uses two-phase coding to represent the test sample. Two groups of samples separately represent the test sample in each phase: the group of labeled data and the group of both labeled and unlabeled data.
Based on the sample removing strategy introduced in section III-B3, the residual that decides the final result in the second phase is shown as follows: where $H_l$ is the number of labeled samples, and $x^l_i\alpha^l_i$ is the $i$th representation of the labeled samples. This decision function seeks a trade-off between the labeled samples and all samples to perform semi-supervised classification. In unsupervised learning for image classification, an optimal projection is first learned using a linear representation method. Then, the clustered data is fed to an LRC method for classification. The adaptive weighted nonnegative low-rank representation (AWNLRR) [35] is a typical method that performs image classification with unsupervised learning. Firstly, a low-rank projection is learned: where $P$ is the weighted matrix, $Q$ is the affinity graph that captures the intrinsic features of the data, $B$ is a matrix in which each element $B_{ij}$ is the distance between the $i$th sample and the $j$th sample, $\|\cdot\|_*$ denotes the nuclear norm, and $\|\cdot\|_F$ denotes the Frobenius norm. The first term is the reconstruction term, the second term and the constraint $S^T\mathbf{1} = \mathbf{1}$ ensure that the weighted matrix is in a reasonable range, the third term makes the matrix $Q$ low-rank such that the global structure is preserved, and the last term enforces the affinity graph $Q$ to learn the local information in the data. The non-negative constraints $P \ge 0$, $Q \ge 0$ ensure that the learned weighted and projection matrices have good interpretability. After the affinity graph $Q$ is learned, the data is first clustered by the Normalized cut (Ncut) algorithm. Then, CRC performs classification on the data with clustered labels. Besides AWNLRR, the low-rank preserving projection via graph regularized reconstruction (LRPP_GRR) [36] constructs the graph in the reconstruction term before classification.
The Double Low-Rank Representation (DLRR) [95] learns two low-rank matrices that simultaneously capture the global intrinsic information in the row space and the column space, and the LatLRR [96] learns a pair of low-rank matrices to capture the intrinsic and salient features for image classification.

IV. APPLICATIONS
A. REMOTE SENSING
In the remote sensing research area, hyperspectral imagery (HSI) [97], [98] is widely used for different applications [99]-[101]. In HSI, each pixel of the image contains information on a wide range of wavelength channels, which makes the whole image informative and high-dimensional. Hyperspectral image classification aims to assign a class label to each pixel of the image. Since pixels in the same class lie in the same subspace, the LRC methods classify the hyperspectral test pixel with a pixel subspace constructed using a linear combination of the hyperspectral pixels in the training set, which makes full use of the informative high-dimensional image in the representation and alleviates the computational cost of the classification. Figure 10 shows a sample from the Indian Pine Site 3 AVIRIS hyperspectral dataset [102]. In [103], Chen et al. proposed two strategies using the sparse representation of the pixels in the training set to represent the given test pixel. The first strategy considers the contextual information when performing classification, where four neighboring pixels in the spatial domain are utilized to represent the test pixel sparsely. The sparse coefficient vector is obtained from the linear combination composed of these four neighbors. The second strategy takes the interpixel correlation into account during classification. The joint sparsity model is implemented here to calculate the shared sparse coefficients among the N neighbors of the test pixel. The classification output is determined by the residual between the class-specific representation and the test sample, which can be viewed as the SRC scheme. Based on the second strategy, the joint sparsity model was extended to a kernel version in [104] to improve the classification performance. The joint sparsity model can also be extended to a multi-task model, which is the multi-task joint sparse representation (MJSR) proposed in [105].
In MJSR, image sets of different bands were clustered into B band sets, with t tasks established by selecting one band in each set. Next, a coefficient matrix in which each row is the sparse coefficient vector of a task was learned for classification. The MJSR calculated the joint sparse coefficients while preserving the correlations in the spectral field. In [106], an adaptive neighborhood system was constructed by introducing self-paced learning (SPL) [107]. The proposed self-paced joint sparse representation (SPJSR) learned the weight of each neighbor approximation and the sparse coefficients in a self-paced scheme.
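The joint sparsity model, in which a test pixel and its neighbors share the same set of selected training pixels, can be sketched with a greedy solver. The choice of simultaneous orthogonal matching pursuit (SOMP) here is an assumption made for illustration, and the names `somp`, `joint_sparse_classify` and `n_atoms` are hypothetical.

```python
import numpy as np

def somp(X, Y, n_atoms):
    """Simultaneous OMP: all columns of Y share one support of n_atoms columns of X."""
    residual = Y.astype(float).copy()
    support = []
    coef = np.zeros((0, Y.shape[1]))
    for _ in range(n_atoms):
        corr = np.linalg.norm(X.T @ residual, axis=1)   # joint correlation per atom
        if support:
            corr[support] = -np.inf                     # never pick an atom twice
        support.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(X[:, support], Y, rcond=None)
        residual = Y - X[:, support] @ coef
    A = np.zeros((X.shape[1], Y.shape[1]))
    A[support] = coef                                   # row-sparse coefficient matrix
    return A

def joint_sparse_classify(X, labels, Y, n_atoms):
    """X: training pixels as columns; Y: the test pixel and its neighbors as columns."""
    labels = np.asarray(labels)
    A = somp(X, Y, n_atoms)
    classes = np.unique(labels)
    res = [np.linalg.norm(Y - X[:, labels == c] @ A[labels == c]) for c in classes]
    return classes[int(np.argmin(res))], A
```

The coefficient matrix is row-sparse: every pixel in the neighborhood uses the same training pixels but with its own coefficient values, and the label is decided by the class-specific residual over the whole neighborhood, following the SRC scheme.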
In [108], the multiscale adaptive sparse representation (MASR) was proposed to exploit the spatial information using multiscale test pixels. The different scales of the test sample provided different spatial structures and properties. This method jointly optimized the representations at different scales to produce a shared sparse coefficient vector across multiple scales. As for the output, the class label was determined by the residuals, similar to SRC. MASR showed better classification performance than the above-mentioned joint sparsity model using a single scale [103].
To deal with the instability brought by the sparse representation, the manifold-based sparse representation algorithm was proposed in [109]. Two regularization terms were imposed on the conventional SRC method. The first regularization term is the locally linear embedding regularization, and the extension of SRC regularized by this term is called LLESR. In order to enhance the robustness of the sparse representation, the LLESR considers the local structure by minimizing the distance between the test sample and the representation composed of its neighbors. The second regularization term is the Laplacian eigenmap regularization, and the extension of SRC with this regularization term is called LESRC. The LESRC considers the local structure by minimizing the overall distances between each pair of neighbors of the test sample. Since the correlation between some classes is high in HSI classification, the conventional sparse representation, which imposes sparsity individually on each sample, is not suitable. The class-dependent sparse representation classifier (cdSRC) was proposed to make the SRC robust in HSI classification. The cdSRC combines the SRC and the KNN [110] algorithm in the coding procedure and the subspace approximation procedure, respectively. The class-dependent sparse representation imposes sparsity on a class rather than on a sample to make the class-specific residual of the test sample more discriminative, as samples from other classes will not represent the test sample. The class-dependent KNN produces a class-specific distance between the test sample and the average of the samples in its neighborhood from each class, preserving the locality information of the test sample.
In [111], collaborative representation (CR) was applied to hyperspectral imagery classification. Based on the idea of CR, the joint collaborative representation (JCR) model was built to perform classification in a competitive and efficient way. The JCR calculates the shared coefficient vector of the test pixel and its neighbors using the training samples. Furthermore, a nonlocal joint-signal matrix is constructed from the top $n_{corr}$ neighbors according to the degree of correlation, to filter out the pixels that are not similar to the test pixel. Similarly, Jia et al. [112] applied the 3-D Gabor feature to the collaborative representation, termed 3GCR, to boost the robustness of the classification performance on hyperspectral images. The 3-D Gabor feature provides an informative feature space that contains a large number of feature dimensions to ensure the robustness of the CR. The CR is an efficient representation method since it has a closed-form solution. Combining them produces a classification method that is both effective and efficient. Inheriting the idea of fusing the Gabor feature with the CR, the Gabor cube selection based multitask joint sparse representation-based classification (GS-MTJSRC) was proposed in [113]. The 3-D Gabor transformation was applied to the data samples to generate the Gabor cubes based on three directions. Then, a filter removed the Gabor cubes with a low Fisher discriminative score from the representation. Next, the multitask sparse representation represented the test pixel using the filtered Gabor cubes in the training set and finally output the classification result. This method outperformed the aforementioned 3GCR. Since the Gabor feature is beneficial to HSI classification, a multi-feature learning strategy that utilizes CR to process four types of features (global feature, local feature, shape feature, and spectral feature) was proposed in [114].
In this strategy, the CR generates a coefficient vector for each feature, and an overall coefficient vector is obtained by summing the differences between each coefficient vector and the mean coefficient vector.

B. FACE RECOGNITION
LRC methods have been extensively applied in face recognition applications due to the critical assumption that: the face images captured under different conditions (e.g., lighting, corruption, expression) lie on a low-dimensional subspace. According to Assumption 1, LRC methods are able to construct the face subspace for each class of face images. Therefore, by using LRC methods, there exists an effective and robust classification ability to perform face recognition. The SRC [14] method was originally designed for robust face recognition, which assumes the test sample can be represented by only the training samples from the same class. Therefore, the subspace constructed by the sparse representation is the face subspace of the test sample. Although the coefficients are sparsely distributed, the coefficients of the correct class are densely distributed. This phenomenon explains why the SRC is able to perform robust face recognition: the subspace built by samples from the correct class is more discriminative than the subspace constructed by the other samples. In [10], the CRC performed competitive and efficient face recognition by using collaborative representation, which represents the test sample using samples across different classes. The collaborative representation is closer to the test sample since all possible training samples are utilized, where the constructed class-specific subspace is nearer to the test sample as much as possible. It is argued that the locality brought by the CRC is more critical than the sparsity brought by the SRC. To produce the sparsity based on the collaborative representation, [15], [40], [60], [73] used a two-stage strategy and achieved acceptable classification performances. The patchbased representation [17], [41] divided the face image into several non-overlapped patches and integrated their outputs to make the final decision. 
For large-scale face recognition, the two-stage non-negative sparse representation [115] was proposed, which reduces the scale of the dataset in the first stage and performs an efficient non-negative sparse representation in the second stage. In [80], an adaptive SRC was proposed based on the trace norm, which maintained a balance between $l_1$ and $l_2$ minimization according to the correlation of the data. For multiview face recognition, the joint sparse representation-based classification (JSRC) [116] was proposed, which constructs a shared sparse coefficient for different views of an individual's face image. To overcome the pose variation of the face, synthesized face images were generated [23], [24] to produce a coefficient vector that is invariant to pose.
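
To make the collaborative representation pipeline discussed in this subsection concrete, the following is a minimal Python sketch, not the implementation of [10]. It assumes ridge-regularized (l2) coding with a closed-form solution and uses the plain class-specific residual from the three-step description of LRC (coding, subspace approximation, classification). The function name `crc_classify`, the default value of `lam`, and the toy data are illustrative assumptions.

```python
import numpy as np

def crc_classify(X, labels, y, lam=0.01):
    """Classify test vector y with a collaborative-representation sketch.

    X      : (d, n) matrix whose columns are training samples
    labels : (n,) class label of each column
    y      : (d,) test sample
    lam    : ridge regularization weight (illustrative default)
    """
    n = X.shape[1]
    # Step 1 (coding): ridge-regularized least squares over ALL training samples.
    alpha = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
    # Step 2 (subspace approximation): residual of each class-specific reconstruction.
    classes = np.unique(labels)
    residuals = np.array([
        np.linalg.norm(y - X[:, labels == c] @ alpha[labels == c])
        for c in classes
    ])
    # Step 3 (classification): the class with the smallest residual wins.
    return classes[np.argmin(residuals)], residuals

# Toy data (assumption): two classes, each lying in its own 2-D subspace of R^20.
rng = np.random.default_rng(0)
basis_a = rng.normal(size=(20, 2))
basis_b = rng.normal(size=(20, 2))
X = np.hstack([basis_a @ rng.normal(size=(2, 5)),
               basis_b @ rng.normal(size=(2, 5))])
labels = np.array([0] * 5 + [1] * 5)
y = basis_b @ np.array([0.7, -1.2])      # test sample drawn from class 1's subspace
pred, res = crc_classify(X, labels, y, lam=1e-3)
print("predicted class:", pred, "residuals:", res)
```

Because the test sample is coded over all classes at once, only the residual computation is class-specific; replacing the l2 penalty by an l1 penalty would turn the same skeleton into a sparse-representation variant, at the cost of an iterative solver.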

C. MEDICAL AND MULTIMODAL BIOMETRICS
1) MEDICAL BIOMETRICS
Medical biometrics is a research field that monitors an individual's health condition based on the characteristics of a certain disease [117]. LRC has been frequently used in this domain to detect diseased individuals according to the appearance of body surface features (e.g., regions of the face and tongue). In [118], [119], the SRC was used for microaneurysm detection by performing classification on the extracted retinal blood vessel image, which is a binary image containing the outline of the retinal blood vessels. In [8], the SRC was applied to diabetes mellitus detection based on the color features extracted from human facial block images. The detection result was promising, reaching 97.54% in accuracy. Based on the color features of different combinations of the facial blocks, Shu et al. utilized the probabilistic CRC (ProCRC) [16] to perform disease detection and achieved an impressive accuracy of 99.88% [38]. Using the same strategy, a high accuracy was achieved in [120] for heart disease detection by applying ProCRC, again to facial block features. Besides facial images, tongue images were also involved, which are regarded as another view in medical biometrics. In [121], a joint discriminative collaborative representation (JDCR) method was proposed as a multimodal method that simultaneously processes the facial blocks and tongue blocks as multiple views, and color with texture as multiple features, for detecting liver disease.

2) MULTIMODAL BIOMETRICS
In the biometrics field, there are usually different biometric information sources, which require multimodal techniques to fuse the information for better results. As an LRC method, the group sparse representation based classification (GSRC) [7] method (an extension of the SRC method) considered multimodal information when representing the test sample. In this method, the test sample was the concatenation of $N_m$ modal samples, and the $N_m$ sparse coefficient vectors corresponding to the modalities were concatenated to construct a sparse coefficient matrix. All coefficient vectors were learned simultaneously. For processing multimodal data in a high-dimensional space, KGSRC [122] was proposed as an extension of the GSRC method. In [123], the joint deep convolutional feature representation (JDFR) was proposed to perform hyperspectral palmprint recognition. In JDFR, a specially designed CNN with 16 layers was used to extract the deep feature of each band. Therefore, to use the information from all bands in the hyperspectral palmprint dataset, a CNN stack whose basic element is a 16-layer CNN was constructed. Following the CNN stack feature extraction, CRC was applied to perform classification on the concatenation of the deep features from the CNNs corresponding to each band. The JDFR-CRC architecture outperformed other state-of-the-art methods in hyperspectral palmprint recognition.

A. DATASET DESCRIPTION
In this subsection, we briefly describe each dataset used in the experiments:
GT. The Georgia Tech face database contains 750 face images from 50 people. Each image is a JPEG image with an average size of 150 × 150 pixels. There are several forms of variation for each subject, such as different facial expressions and lighting conditions. In the experiment, we resize each image to 40 × 30 pixels. Figure 11 (a) shows the samples in the GT face database.
ORL. The ORL database of faces contains 400 images from 40 classes. Each image is a PGM image with a size of 92 × 112 pixels. Since the images were taken at different times, some images have different lighting conditions, facial expressions, and details. In the experiment, we resize each image to 32 × 32 pixels. Figure 11 (b) shows the samples in the ORL face database.
AR. The AR face database contains 4000 color images from 126 people. Each image is an RGB RAW file with a size of 768 × 576 pixels. For each subject (person), there are two sessions taken on two different days. Therefore, each subject contains images with different lighting conditions, facial expressions, and other intra-class variations. In the experiment, we resize each image to 40 × 32 pixels. Figure 11 (c) shows the samples in the AR face database.
COIL20. The Columbia Object Image Library contains two sets of images. The first set has 720 raw images of 10 objects, and the second set has 1440 images of 20 objects. In the experiment, we select the second set. Each image in the dataset is a PGM image with a size of 128 × 128 pixels. For each subject (object), there are 72 images captured by a CCD camera over a 360-degree rotation around the object, so each image differs by 5 degrees from the previous and the next one. We resize each image to 32 × 32 pixels for the experiment. Figure 11 (d) shows the samples in the COIL20 dataset.
FEI. The FEI face database contains 2800 images from 200 people. Each image is a JPEG image with a size of 640 × 480 pixels. In the experiment, we resize each image to 24 × 96 pixels. For each subject (person), there are 14 images with different face angles (ranging from 0 to 180 degrees) and lighting conditions. Figure 11 (e) shows the samples in the FEI dataset.
Yale B. The Yale face database B contains 5760 images from 10 people. Each image is a PGM image with a size of 640 × 480 pixels. In the experiment, we resize each image to 320 × 240 pixels. For each subject (person), there are 576 images with 9 different poses and 64 lighting conditions. Figure 11 (f) shows the samples in the Yale B database.
Flavia. Flavia is a leaf recognition system [130]. Here, we refer to the leaf image dataset used for the system as the Flavia image dataset. The Flavia image dataset contains 1907 images from 32 species. Each image is a JPEG image with a size of 1600 × 1200 pixels. In the experiment, we resize each image to 30 × 40 pixels. The samples of the Flavia dataset are shown in Figure 11 (g).

B. EXPERIMENT SETTING
In the experiments, we applied the 13 LRC methods to the 7 image datasets. We ran all the experiments on a PC with an Intel Core i7-6700 CPU and 16 GB of RAM. The software platform was MATLAB 2018a. There are two parts to the experiments: parameter analysis and results. We first perform the parameter analysis to ensure that the results shown in section V-C2 are optimal for each classifier. Then, for each dataset, we show the results of the different methods as the number of training samples from each class increases. Each result in section V-C2 was run 10 times to calculate the standard deviation.
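
The repeated-run protocol described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the MATLAB code used in the experiments: we assume that each run draws a fresh random selection of training samples per class and tests on the remainder, which the text does not specify. The helper names (`split_per_class`, `repeated_evaluation`, `nearest_mean`) and the toy data are hypothetical, and the nearest-class-mean classifier is only a placeholder for any LRC method.

```python
import numpy as np

def split_per_class(labels, n_train, rng):
    """Pick n_train random training indices from every class; the rest are test."""
    train_idx = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        train_idx.extend(rng.choice(idx, size=n_train, replace=False))
    train_mask = np.zeros(labels.size, dtype=bool)
    train_mask[train_idx] = True
    return train_mask

def repeated_evaluation(X, labels, classify, n_train, n_runs=10, seed=0):
    """Mean and standard deviation of accuracy over n_runs random splits.

    X is (d, n) with samples as columns; classify(X_tr, y_tr, x) returns a label.
    """
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_runs):
        tr = split_per_class(labels, n_train, rng)
        preds = np.array([classify(X[:, tr], labels[tr], x) for x in X[:, ~tr].T])
        accs.append(np.mean(preds == labels[~tr]))
    return float(np.mean(accs)), float(np.std(accs))

def nearest_mean(X_tr, y_tr, x):
    """Placeholder classifier (NOT an LRC method): nearest class mean."""
    classes = np.unique(y_tr)
    dists = [np.linalg.norm(x - X_tr[:, y_tr == c].mean(axis=1)) for c in classes]
    return classes[int(np.argmin(dists))]

# Toy data (assumption): 3 well-separated classes, 12 samples each, 8 features.
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(3), 12)
centers = rng.normal(scale=5.0, size=(8, 3))
X = centers[:, labels] + rng.normal(size=(8, 36))
mean_acc, std_acc = repeated_evaluation(X, labels, nearest_mean, n_train=7)
print(f"accuracy over 10 runs: mean={mean_acc:.4f}, std={std_acc:.4f}")
```

Passing the classifier as a callable keeps the protocol independent of the method being evaluated, so the same loop can report the mean and standard deviation for each of the compared classifiers.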

1) PARAMETER ANALYSIS
We first adjusted the regularization parameter λ used in the coding procedure of the LRC methods (refer to Eq. 6).
The parameter analysis is performed on seven datasets: GT, ORL, AR, FEI, COIL20, Yale B, and Flavia. We set the number of training samples in each class as 10 (GT), 7 (ORL), 16 Figure 12 shows the performances of each classifier under the different settings of the regularization parameter λ. The NRC is not included in the analysis since its λ is set to zero (see Eq. 36). It can be clearly seen that the accuracies of SRC and SCRC drop sharply when the value of λ approaches 1. Some classifiers are not stable across datasets. For example, the accuracy of KWCRC decreased rapidly on the ORL and FEI datasets (see Figure 12 (b) and (d)), and SARC achieved a low accuracy on the COIL20 dataset. For the face databases, the best accuracies obtained by the different classifiers are relatively close to each other. The standard deviations of the best accuracies for the GT and ORL datasets are 4.08% and 3.07%, respectively. For the object datasets, the differences between the best accuracies attained by the different classifiers are relatively large. The standard deviations of the best accuracies on the COIL20 and Flavia datasets are 13.62% and 6.23%, respectively.
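
As a small illustration of what sweeping the regularization parameter does during coding, the Python sketch below codes one test vector under an l2 (ridge) penalty for a grid of λ values. It is a sketch under stated assumptions and does not reproduce Figure 12: the ridge penalty stands in for the method-specific coding objective, and the grid, the data, and the name `ridge_code` are hypothetical. It only shows how the penalty weight trades reconstruction fidelity against coefficient size.

```python
import numpy as np

def ridge_code(X, y, lam):
    """Coefficients of y over the columns of X under an l2 (ridge) penalty."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 12))
X /= np.linalg.norm(X, axis=0)      # unit-norm columns (an assumption of this sketch)
y = X @ rng.normal(size=12) + 0.05 * rng.normal(size=30)

lams = np.logspace(-4, 0, 9)        # hypothetical grid from 1e-4 up to 1
coef_norms, residuals = [], []
for lam in lams:
    alpha = ridge_code(X, y, lam)
    coef_norms.append(np.linalg.norm(alpha))
    residuals.append(np.linalg.norm(y - X @ alpha))
    print(f"lambda={lam:.0e}  ||alpha||={coef_norms[-1]:.3f}  "
          f"residual={residuals[-1]:.3f}")
```

For ridge coding, the coefficient norm can only shrink and the reconstruction residual can only grow as λ increases, so a large λ forces every class-specific reconstruction away from the test sample. How this translates into accuracy depends on the classifier and the dataset, which is why the sweep has to be run per method.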
Besides the regularization parameter λ, we adjusted the number of candidates selected from the whole training set for the TPTSR and CFFR methods. For TPTSR, the candidates are samples, and for CFFR, the candidates are classes. We changed the ratio of candidates to the whole training set to obtain the optimal performance of these two methods. We selected the optimal λ based on the results of Figure 12. The performances of these two methods under the different candidate ratios are shown in Figure 13. We can observe that for different datasets, better performances are achieved by different methods. TPTSR showed a stable performance on all seven datasets. CFFR had larger fluctuations on the GT and Flavia datasets (see Figure 13 (a), (g)). For TPTSR, the best ratio of candidates was between 10% and 20%, while for CFFR, the best ratio of candidates was between 40% and 70%.
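
The role of the candidate ratio can be illustrated with a simplified two-phase sketch in the spirit of TPTSR, written in Python. It is not the authors' implementation. We assume ridge coding in both phases, and we assume that candidates are ranked by the deviation between the test sample and each training sample's individual contribution; both choices, as well as the names `ridge_code` and `two_phase_classify` and the toy data, are illustrative assumptions.

```python
import numpy as np

def ridge_code(X, y, lam):
    """Coefficients of y over the columns of X under an l2 (ridge) penalty."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def two_phase_classify(X, labels, y, ratio=0.2, lam=0.01):
    """Two-phase sketch: keep a fraction `ratio` of samples, then classify."""
    n = X.shape[1]
    # Phase 1: code y over all samples and score each sample by how far its
    # individual contribution alpha_i * x_i is from y (smaller is better).
    alpha = ridge_code(X, y, lam)
    deviation = np.linalg.norm(y[:, None] - X * alpha, axis=0)
    m = max(1, int(round(ratio * n)))
    keep = np.argsort(deviation)[:m]
    # Phase 2: re-code y over the kept candidates only and compare the
    # class-specific residuals of the classes that survived phase 1.
    Xc, lc = X[:, keep], labels[keep]
    beta = ridge_code(Xc, y, lam)
    classes = np.unique(lc)
    residuals = [np.linalg.norm(y - Xc[:, lc == c] @ beta[lc == c]) for c in classes]
    return classes[int(np.argmin(residuals))], keep

# Toy check (purely illustrative): three classes, each along its own coordinate axis.
scales = np.array([1.0, 2.0, 3.0])
X = np.hstack([np.outer(np.eye(4)[:, c], scales) for c in range(3)])
labels = np.repeat(np.arange(3), 3)
y = 2.0 * np.eye(4)[:, 1]           # test sample along the axis of class 1
for ratio in (0.34, 0.67, 1.0):
    pred, keep = two_phase_classify(X, labels, y, ratio=ratio)
    print(f"ratio={ratio:.2f}  kept={keep.size}  predicted class={pred}")
```

A small ratio discards most samples before the second coding step, which makes the final representation sparse over the training set; a ratio of 1 keeps everything and reduces the procedure to a single collaborative coding step. A class-level variant, as used by CFFR, would rank and discard whole classes instead of individual samples.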

2) RESULTS
Now that the optimal parameters have been selected for the LRC methods, we next show their recognition rates on the 7 datasets with an increasing number of training samples. Table 2 presents the experimental results, and Figure 14 shows the standard deviations of the LRC methods on the different datasets. We can observe that the LRC methods showed diverse standard deviations for these datasets, indicating that the stability of a certain method depends on the data being classified. The highest value for each dataset and each number of training samples per class is marked in bold font. Generally speaking, no classifier achieved the highest accuracy on all datasets. For the GT database, CFFR achieved the best accuracy of 77.61% by using 10 training samples. For the AR database, ProCRC is the best classifier with an accuracy of 91.94% using 20 samples. In the ORL database, both ProCRC and KCRC were the best classifiers with the same highest accuracy of 96.25% using 8 samples per class. However, the standard deviation of KCRC (0.02) is lower than that of ProCRC (0.86), so KCRC is considered the better classifier due to its stronger classification stability.
For the COIL20 dataset, TPTSR achieved the highest accuracy of 74% when using 20 training samples per class. In the FEI dataset, KWCRC was the best classifier with an accuracy of 83.5% when using 8 training samples per class. As for the Yale B database, the highest accuracy was obtained by NRC with an accuracy of 81.77% using 35 samples. KSRC produced the highest accuracy of 71.41% by using 35 samples per class on the Flavia database. It should be pointed out that some classifiers were able to achieve the highest accuracy on one dataset for every number of training samples, showing their superiority over the other classifiers. For example, according to Table 2, NRC achieved the highest accuracies on the Yale B database when the number of training samples per class ranged from 15 to 35. KWCRC obtained the highest accuracies on the FEI dataset when the number of training samples per class ranged from 6 to 8. TPTSR achieved the highest accuracies on the COIL20 dataset when the number of training samples per class ranged from 10 to 20. Among all classifiers, NRC obtained the most highest accuracies, 7 in total, across the different datasets. Also, we can observe that the weighted strategy effectively enhances the classifier's performance. For example, AWCRC held 6 of the best accuracies. Similarly, the kernel versions of the LRC methods also showed their effectiveness. KSRC and KCRC achieved 2 and 5 of the highest accuracies in the comparison, respectively. For the object recognition task (COIL20 and Flavia), the gap between the highest accuracy and the lowest accuracy was relatively large. The standard deviations of the accuracies of the different classifiers on COIL20 and Flavia were 12.98% and 13.27%, respectively.

D. DISCUSSION
Based on the experiments in the previous sections, we discuss the following points:
The regularization parameter of the LRC methods influences the accuracy of image classification. On the face datasets, it caused a change in accuracy of 3.07%-4.08% (except for the extreme case), and on the object datasets, it caused a change of 6.23%-13.62% (except for the extreme case). Besides, different classifiers are influenced to different extents. The accuracy of SRC decreases rapidly when the regularization parameter approaches 1. SARC is unstable when the regularization parameter changes. For the other classifiers, however, the parameter affects the performance less.
No single LRC method dominates the other methods in image classification. The highest accuracies achieved in Table 2 are distributed over different cases. The LRC method with the largest number of highest accuracies is NRC. Furthermore, few LRC methods achieved the best accuracy for every number of training samples per class (NRC on the Yale B database being an exception), indicating that a method cannot ensure the best classification ability when given insufficient information/samples, even though it achieved the highest accuracy on that dataset. Since different classifiers make different assumptions about the image space, their recognition abilities will be more apparent when a specific assumption is met in the real classification scenario.
The weighting strategy and the kernel extension of the LRC methods can enhance the performance. AWCRC, as an extension of the CRC using a weighting strategy, outperforms CRC in almost all cases (except for the AR dataset using 20 training samples per class). Moreover, AWCRC shows its superiority in classifying with insufficient information/samples on the GT, AR, and Flavia datasets. The enhancement of the weighting strategy comes from the additional attention paid to the critical samples in the representation. For the kernel extension, both KSRC and KCRC achieved a better classification performance than their original versions in the majority of cases. This is because the kernel extension has the ability to process data in a non-linear space. However, KWCRC, which is the kernel extension of Weighted CRC, does not outperform KCRC in most cases, indicating that there is no accumulated enhancement when imposing both the kernel extension and a weighting strategy on the LRC methods.
The sample removing strategy also shows its effectiveness according to the experiments. As typical LRC methods based on the sample removing strategy, TPTSR and CFFR showed better performance in some cases. TPTSR had a better classification result on object recognition, where it achieved 4 of the highest accuracies (COIL20 (15, 20, 25), Flavia (25)). CFFR had a better performance on the face recognition task, where it achieved 2 of the highest accuracies (GT (10) and AR (16)). However, both of these methods are influenced heavily by the number of candidates. According to Figure 13, in the Flavia dataset using 40 training samples per class, the accuracy of CFFR changed from 52.15% (70% of all classes) to 64.59% (10% of all classes). Similarly, TPTSR had its largest gap between the highest accuracy and the lowest accuracy, 6%, on the GT database using 7 training samples per class. These phenomena are related to the sparsity in the representation: the number of candidates should be set so that the representation is sufficiently sparse to guarantee its classification ability.
LRC shows different properties in different image classification tasks. For the face recognition tasks, different methods achieved relatively similar performances with lower standard deviations. In contrast, in the object recognition task, the performance gap is larger. This may be because one object in the image set usually has many views, which makes the object subspace more complicated than the face subspace in face recognition.
The LRC methods still suffer from a lack of sufficient information/samples when performing image classification. In Table 2, there are still large accuracy gaps between the largest and the smallest numbers of training samples per class, implying that the number of training samples per class significantly impacts the performance of LRC methods. For example, in the FEI dataset, SRC achieved an accuracy of 69.33% when using 8 samples per class, while it only achieved 46.94% accuracy when using 5 training samples per class. The difference of 3 samples per class reduced the accuracy by 22.39 percentage points. A possible interpretation is that insufficient samples have a higher probability of failing to construct a fine class-specific subspace.
Based on previous works and the above discussion, we point out the challenges and potential future research directions of linear representation for image classification:
1) The LRC methods still require sufficient data in each class when performing classification.
2) The LRC methods will achieve a poor performance when the dataset is highly imbalanced.
3) The parameters in the LRC methods still have a large impact on their performance.
4) It is necessary to develop an efficient algorithm for LRC methods to process high-dimensional data.
5) It is necessary to investigate the fusion among properties of the coefficients, such as sparsity, collaboration, and non-negativity.

VI. CONCLUSION
This survey reviewed the linear representation-based classification methods for image classification, termed LRC methods. We provided a clear definition of the LRC methods and summarized them in a specific algorithm. The various LRC methods can be categorized into 6 classes: 1) linear representation-based classification methods with norm minimizations, 2) linear representation-based classification methods with constraints, 3) linear representation-based classification methods with feature spaces, 4) linear representation-based classification methods with structural information, 5) linear representation with subspace learning, and 6) linear representation in semi-supervised learning and unsupervised learning. Moreover, we discussed three application areas in image classification that extensively apply the LRC methods. Finally, we performed comprehensive experiments on 7 image datasets to analyze and show the performances of different LRC methods.