A Manifold Laplacian Regularized Semi-Supervised Sparse Image Classification Method With a Variant Trace Lasso Norm

As the cost of labeling data keeps rising, we aim to exploit the large amount of available unlabeled data and to improve image classification by adding unlabeled samples to the training set. In addition, we aim to handle two tasks in a unified way, namely the clustering of the unlabeled data and the recognition of the query image. We achieve this by designing a novel sparse model based on the manifold assumption, which has been shown to work well in many tasks. Based on the assumption that images of the same class lie on a sub-manifold, and that an image can be approximately represented as a linear combination of its neighboring data owing to the local linearity of the manifold, we propose a sparse representation model on the manifold. Specifically, the model has two regularization terms: a variant Trace Lasso norm and the manifold Laplacian regularization. The first term makes the representation coefficients sparse between groups and dense within a group. The second term allows labels to be accurately propagated from labeled data to unlabeled data. The Augmented Lagrange Multiplier (ALM) scheme and the Gauss-Seidel Alternating Direction Method of Multipliers (GS-ADMM) are used to solve the problem numerically. We conduct experiments on three human face databases and compare the proposed method with several state-of-the-art methods. For each subject, some labeled face images are randomly chosen as training data for the supervised methods, and a small number of unlabeled images are added to form the training set of the proposed approach. In all experiments, our method achieves better classification results owing to the addition of unlabeled samples.


I. INTRODUCTION
Image classification is one of the most active applications in image processing, computer vision and machine learning, and numerous image classification and representation methods have been proposed. Wright et al. proposed the Sparse Representation Coding (SRC) method by applying $\ell_1$-norm based sparse representation to Face Recognition (FR) [1]. SRC has shown interesting results in image classification and recognition and has been widely used and extended, as evidenced by its many follow-up papers. Some later works, on the other hand, began to investigate the role of sparsity in image representation [2]-[5]. Yang et al. [4] gave an insight into SRC and provided some theoretical support for its effectiveness. They argued that it is the $\ell_1$ constraint rather than the $\ell_0$ constraint that makes SRC effective. Lei Zhang et al. indicated that most of the literature overemphasized the role of $\ell_1$-norm sparsity in image classification. They demonstrated that it is actually the Collaborative Representation (CR), i.e., using the training samples from all classes to represent the query sample, and not the $\ell_1$-norm, that plays the essential role in SRC. Therefore, they proposed CR based classification with regularized least squares (CRC_RLS) [5], which has significantly lower complexity than SRC but yields very competitive classification results. Zhang et al. [6] extended CRC to a robust version, robust collaborative representation classification (RCRC), using the Laplacian estimator to deal with severe random pixel noise and illumination changes. Grave et al. proposed the Trace Lasso (TL) norm [7]. TL is proved to interpolate between the $\ell_2$-norm and the $\ell_1$-norm, and its behaviour adapts to the correlation of the training data.

The associate editor coordinating the review of this manuscript and approving it for publication was Yudong Zhang.
For a recognition task, if the labeling information is available, by integrating the labeling information with TL, Jian Lai et al. proposed a method named Supervised Trace Lasso (STL) [8]. This method can cluster the samples from the same subject but with different variation information together, which conforms to the goal of identification.
All the above works assume that the query sample lies in the linear space spanned by the training samples from the same class. However, in practice, taking face images as an example, when the reflectance is non-Lambertian and the pose of the subject varies, the data do not necessarily conform to linear subspace models. Moreover, these methods are supervised, that is, they require the labeling information of all training samples. They are practical only in the small-sample-size case, because as the sample size grows their computing burden becomes heavy and, in particular, the cost of labeling data may become unaffordable. In recent years, many methods based on deep learning [9]-[13] have been proposed. These methods have obtained very competitive results on image classification and recognition tasks. Even when only a small number of labeled samples are available, deep learning can still be applied through semi-supervised networks, as long as the total quantity of samples is sufficient. However, in some practical applications, especially in niche research fields or institutions with limited resources, public data sets may not meet the requirements and data acquisition is also very difficult. Such applications have to face the small-sample-size situation, which is the focus of our work. We will show that adding a small number of unlabeled samples to the training set can effectively improve the classification result.
Manifold regularized semi-supervised learning (MRSSL) is one of the most successful methods in computational imaging. A set of N by M images may be better modeled by a manifold embedded in an NM-dimensional Euclidean space, called an image manifold [14]. MRSSL exploits the local structure of data distribution including both labeled and unlabeled samples to leverage the generalization ability of a learning model. There are many representative works in MRSSL, in which the most prominent is Laplacian regularization which determines the underlying manifold by using the graph Laplacian [15], [16]. With the merits of simple calculation and promising performance, Laplacian regularization based semi-supervised learning has received extensive attention and many algorithms have been developed, including Laplacian regularized support vector machines, Laplacian regularized kernel least squares [15], [17], and Laplacian regularized nonnegative matrix factorization [18]. In addition, P-Laplace regularization is proposed in [19] to preserve the local geometry and is applied to support vector machines and kernel least squares. Reference [20] presented a hyper graph P-Laplacian regularization for remotely sensed image recognition. P-Laplacian is a natural generalization of the standard graph Laplacian. Besides Laplace regularization, [21] presented Hessian regularized multiset canonical correlations for multiview dimension reduction.
In this paper, a semi-supervised sparse image classification model on a manifold is presented as follows:

$$\arg\min_{x,\,Z}\ \|y - Ax\|_1 + \lambda\,\|Z\,\mathrm{Diag}(x)\|_* + \frac{\upsilon}{2}\sum_{i,j=1}^{n}\|z_i - z_j\|_2^2\,S_{ij} + \sum_{j=1}^{n} U_{jj}\,\|z_j - g_j\|_2^2 \tag{1}$$

The query image $y \in \mathbb{R}^m$ is first collaboratively represented by all training samples $A = [a_1, a_2, \cdots, a_n] \in \mathbb{R}^{m \times n}$, whether they are labeled or unlabeled, and $x$ is the representation coefficient vector. We assume the images of one class lie on a sub-manifold. Since a manifold is locally a linear space, the query image can be approximately represented as a linear combination of its neighbouring data, that is, only the coefficients associated with these data are nonzero in the collaborative representation.
Assume that, among all $n$ training samples, only the samples whose indices belong to $S$ have identity information. Let $z_j \in \mathbb{R}^c$ $(1 \le j \le n)$ be the label vector of sample $a_j$, where $c$ is the number of classes. When $c$ is unknown, the dimension of $z_j$ only needs to be set larger than $c$ and at most equal to $\min\{m, n\}$. For simplicity, we still use $c$ to denote the dimension of $z_j$. The label matrix is $Z = [z_1, \cdots, z_n] \in \mathbb{R}^{c \times n}$. Let $g_j = [g_{ij}] \in \mathbb{R}^c$ $(j \in S)$ be the label vector of the labeled sample $a_j$, with $g_{ij} = 1$ when $a_j$ is in the $i$th class and $g_{ij} = 0$ otherwise. If $a_j$ already has a label, namely $j \in S$, then $z_j = g_j$ should hold. This is enforced by the last term in (1), in which $U \in \mathbb{R}^{n \times n}$ is a diagonal matrix whose entry $U_{jj}$ takes a large value when $a_j$ is labeled and $U_{jj} = 0$ otherwise. If $a_j$ is not labeled, we set $g_j$ to the zero vector and $z_j$ needs to be solved for.
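As a minimal sketch of how the label targets described above could be assembled, the following NumPy snippet builds $G$ and $U$ from a map of labeled sample indices to class indices. The function name `build_label_targets`, the dictionary input, and the weight value `1e6` are illustrative assumptions; the text only states that $U_{jj}$ takes "a large value" for labeled samples.

```python
import numpy as np

def build_label_targets(n, c, labeled, weight=1e6):
    """Build G (c x n) and diagonal U (n x n) from a {sample index: class index} map.

    `weight` is a hypothetical choice for the "large value" of U_jj.
    """
    G = np.zeros((c, n))
    U = np.zeros((n, n))
    for j, cls in labeled.items():
        G[cls, j] = 1.0      # g_ij = 1 when a_j belongs to class i
        U[j, j] = weight     # large weight ties z_j to g_j for labeled samples
    return G, U

# Hypothetical toy setting: 5 samples, 2 classes, samples 0 and 3 are labeled.
G, U = build_label_targets(n=5, c=2, labeled={0: 0, 3: 1})
```

Columns of `G` for unlabeled samples stay zero and the matching diagonal entries of `U` are zero, so those label vectors are left free to be determined by the model.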
We present two regularization terms. The first, a variant Trace Lasso norm $\|Z\,\mathrm{Diag}(x)\|_*$, enforces group sparsity instead of sample sparsity, which means the query image is represented by a small number of groups and, within each group, the training samples are fully used. This sparsity between groups and density within a group suits the aim of image classification. The second is the manifold Laplacian regularization $\sum_{i,j=1}^{n}\|z_i - z_j\|_2^2\,S_{ij}$, with which distance is measured along the manifold of samples and, for each class, the label can be appropriately propagated from labeled samples to unlabeled samples, so that accurate classification can be reached. $\lambda$ and $\upsilon$ are two parameters used to balance the roles of the two regularization terms.
The rest of the paper is organized as follows. Works related to our method are analysed in Section II. In Section III, we present the manifold Laplacian regularized semi-supervised sparse image classification model and its two regularization terms. The first is a variant Trace Lasso regularization that combines semi-supervised samples with the Trace Lasso norm. The second is the manifold Laplacian regularization. The numerical methods, i.e., the Augmented Lagrange Multiplier (ALM) scheme and the Gauss-Seidel Alternating Direction Method of Multipliers (GS-ADMM), are also given in Section III. In Section IV, through experiments on several commonly used databases, we compare the performance of the proposed method with several state-of-the-art methods. Concluding remarks and discussion are given in Section V. To the best of our knowledge, this is the first time that unlabeled samples are used in sparse representation based image classification methods. Compared with methods based on deep learning, our method is a knowledge-based modeling approach, which has a clear structure and better theoretical interpretability.
Some notations used in this work are defined as follows.
For a vector $x$, $\|x\|_1 = \sum_i |x_i|$ is the $\ell_1$ norm and $\|x\|_0$ is the $\ell_0$ norm, which counts the number of nonzero elements of $x$. $\mathrm{Diag}(x)$ is the diagonal matrix whose diagonal entries are the entries of $x$. For a matrix $X = [X_{ij}]$, $x_j$ and $x^j$ denote the $j$th column and the $j$th row of $X$, respectively; $\|X\|_F = \sqrt{\sum_{i,j} X_{ij}^2}$ denotes the Frobenius norm; and $\|X\|_* = \sum_i \sigma_i(X)$ is the nuclear norm, where $\sigma_i(X)$ is the $i$th singular value of $X$. $X^T$ is the transpose of $X$. $\mathrm{Diag}(X)$ is the diagonal matrix with diagonal entries $X_{ii}$, and $\mathrm{diag}(X)$ is the vector with entries $X_{ii}$. $\mathrm{tr}(X)$ is the trace of the square matrix $X$.

II. RELATED WORKS
A. SRC
Wright et al. considered the following model

$$\arg\min_{x}\ \|x\|_0 \quad \text{s.t.} \quad y = Ax, \tag{2}$$

where the query sample is represented as a linear combination of all labeled training samples. The training samples from the same subject are assumed to lie in a single subspace. The $\ell_0$-norm therefore forces the query sample to be sparsely represented, in the hope that the training samples contributing significantly to the representation come from the same subspace as the query. However, (2) is an NP-hard problem. It has been proved that, if $x$ is sparse enough, the solution of the $\ell_0$ minimization problem (2) is equivalent to the solution of the following $\ell_1$ minimization problem, called SRC:

$$\arg\min_{x}\ \|x\|_1 \quad \text{s.t.} \quad y = Ax. \tag{3}$$

Many efficient methods have been proposed to solve this problem [22]-[24]. Further analysis showed that if the training data are highly correlated, SRC may, in order to achieve sparsity, randomly select a single sample. This randomness can make SRC unstable and lead to misclassification when a sample from the wrong subject is selected.

B. CRC_RLS
Lei Zhang et al. indicated that it is the CR, and not the $\ell_1$-norm sparsity, that plays the essential role for classification in SRC. They therefore proposed to use the $\ell_2$-norm, which gives similar classification results with significantly lower complexity:

$$\arg\min_{x}\ \|y - Ax\|_2^2 + \lambda\,\|x\|_2^2. \tag{4}$$

However, when the columns of $A$ are orthogonal to each other, many samples are needed to faithfully represent $y$, and the discrimination ability of (4) becomes weaker.
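The low complexity of the $\ell_2$-regularized representation comes from its closed-form solution. The sketch below assumes the standard regularized least-squares objective $\|y - Ax\|_2^2 + \lambda\|x\|_2^2$ and solves its normal equations with NumPy; the function name, the random data, and the value of `lam` are hypothetical and serve only as an illustration.

```python
import numpy as np

def crc_rls_coefficients(A, y, lam):
    """Solve min_x ||y - A x||_2^2 + lam * ||x||_2^2 in closed form."""
    n = A.shape[1]
    # Normal equations of the ridge problem: (A^T A + lam I) x = A^T y.
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 6))     # hypothetical dictionary: 6 samples of dimension 20
y = A @ rng.standard_normal(6)       # hypothetical query in the span of A
lam = 1e-3
x = crc_rls_coefficients(A, y, lam)

# Half of the objective's gradient; it vanishes at the minimiser.
grad = A.T @ (A @ x - y) + lam * x
```

Because the system matrix depends only on $A$ and $\lambda$, it could be factorized once and reused for every query image, which is one reason this formulation is cheap compared with iterative $\ell_1$ solvers.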

C. TL
Grave et al. proposed a new norm named the Trace Lasso (TL) norm, $\|A\,\mathrm{Diag}(x)\|_*$. It is shown that TL interpolates between the $\ell_2$-norm and the $\ell_1$-norm, and its behaviour adapts to the correlation of the training data. When all columns of $A$ are the same, TL equals the $\ell_2$-norm of $x$. When all columns of $A$ are orthogonal to each other, TL equals the $\ell_1$-norm of $x$. The TL norm thus shares the advantages of the $\ell_1$-norm and the $\ell_2$-norm. Using it as the regularization term, the following problem is obtained:

$$\arg\min_{x}\ \|y - Ax\|_2^2 + \lambda\,\|A\,\mathrm{Diag}(x)\|_*. \tag{5}$$

TL naturally clusters highly correlated samples in $A$. However, it is well known that face images carry much information, such as identity and variations (e.g., illumination and expression). In an uncontrolled environment, variation information can be more significant than identity, and the correlation then depends more on variations than on identity. Face images from different subjects with similar variations can therefore have a higher correlation than those from the same subject with different variations. As a result, TL naturally clusters samples with similar variations together. This outcome contradicts the goal of identification, which is to cluster the samples according to their identities.
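The two limiting cases of the Trace Lasso norm can be checked numerically. The sketch below assumes unit-norm columns, as is standard for this norm, and uses hypothetical toy matrices: with identical columns the value equals $\|x\|_2$, and with orthonormal columns it equals $\|x\|_1$.

```python
import numpy as np

def trace_lasso(A, x):
    """Trace Lasso norm ||A Diag(x)||_*: sum of singular values of A Diag(x)."""
    # A * x scales column j of A by x[j], i.e. it equals A @ np.diag(x).
    return np.linalg.norm(A * x, ord="nuc")

x = np.array([3.0, -4.0, 12.0])

# Case 1: identical unit-norm columns -> behaves like the l2 norm of x.
a = np.array([1.0, 2.0, 2.0, 0.0]) / 3.0
A_same = np.tile(a[:, None], (1, 3))
tl_same = trace_lasso(A_same, x)

# Case 2: orthonormal columns -> behaves like the l1 norm of x.
A_orth = np.eye(4)[:, :3]
tl_orth = trace_lasso(A_orth, x)
```

Here `tl_same` matches `np.linalg.norm(x, 2)` and `tl_orth` matches `np.linalg.norm(x, 1)`; for partially correlated columns the value lies between the two.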

D. STL
For a recognition task in which labeling information is available, Jian Lai et al. integrated the labeling information with TL and proposed a method named STL. Their model is

$$\arg\min_{x}\ \|y - Ax\|_1 + \lambda\,\|G\,\mathrm{Diag}(x)\|_*, \tag{6}$$

in which a class-dependent matrix $G \in \mathbb{R}^{m \times n}$ is introduced in the trace lasso term, where $G = [G_1, G_2, \cdots, G_c]$ and $G_i \in \mathbb{R}^{m \times n_i}$. Here $n_i$ is the number of training samples in the $i$th class and $c$ is the number of training classes. In $G_i$, all elements in the $i$th row are one and those in the other rows are zero. With $G$, the correlation of column vectors within a class is one and that between classes is zero. This method can cluster the samples from the same subject but with different variation information together, which conforms to the goal of image recognition. In all the above methods, CR is used, that is, the query image is considered as a linear combination of all training samples. The query image is assumed to lie in the linear subspace spanned by the training data from the same subject. When this subspace has sufficient samples and is spanned by them, namely it is complete, the query sample can be faithfully represented and the representation error approaches zero. Unfortunately, image classification is often a typical small-sample-size problem, in which the number of samples may not even meet the completeness requirement, let alone the requirement of labeling all training samples. When only a small number of labeled images are available, these methods may produce wrong classification results.

III. PROPOSED APPROACH
A. THE PROPOSED MODEL
Nowadays, it is easy to collect unlabeled samples because of the convenience supplied by Internet. MRSSL successfully exploits the local structure of data distribution including both labeled and unlabeled samples. With the unlabeled samples, which are from various different subjects, the number of samples from the same class with the query is firstly increased and therefore the representation ability is improved. Besides, the unlabeled images can be automatically labeled using the proposed model rather than manual participation.
A manifold is a space that locally has the properties of Euclidean space. We assume all samples lie on a low-dimensional manifold embedded in a high-dimensional Euclidean space. Images of the same class have the same label and lie on a sub-manifold. Since a manifold can locally be approximated by a linear space, any point on it can be approximated by a linear combination of its neighboring points. Consider the query sample $y \in \mathbb{R}^m$ as a collaborative representation of all training samples $A = [a_1, a_2, \cdots, a_n] \in \mathbb{R}^{m \times n}$. Then only the representation coefficients associated with data that lie on the same sub-manifold as $y$ and in the neighborhood of $y$ are nonzero, and all other coefficients are zero. This is equivalent to finding a sparse representation of $y$ over all training samples. Fig. 1(a) is a practical example with two classes of data in which only two samples (one per class) are labeled. These two labeled samples are marked with a blue circle and an orange cross, respectively. All other points are unlabeled, so their labels need to be found. This is a very difficult clustering task because of the lack of labeled data. For convenience of illustration, Fig. 1(a) is shown in simplified form as Fig. 1(b). The data in Class one are shown as small triangles and the data in Class two as black dots. The two curves represent the two sub-manifolds associated with the two classes. In the fully supervised case, only labeled samples can be used, so the point marked with a red star can only be represented by the blue circle point and the orange cross point. Since these three points lie in the same linear space (the black straight line through the three points), the red star point can be represented as a linear combination of the blue circle point and the orange cross point. Both representation coefficients may be nonzero, which implies the red star point can be classified to either Class one or Class two. This may lead to a wrong classification result. When the unlabeled points are added, under the image manifold assumption, the red star point can be approximately represented as a linear combination of the points on the tangent plane of the sub-manifold (the blue straight line), and all these points are from Class two. This is the classification result we expect.
We measure the reconstruction error with the $\ell_1$-norm, which is much more robust than the $\ell_2$-norm to real-world contamination:

$$\arg\min_{x}\ \|y - Ax\|_1. \tag{7}$$

There are two regularization terms in our model.
The first term is the manifold Laplacian regularization

$$\frac{\upsilon}{2}\sum_{i,j=1}^{n}\|z_i - z_j\|_2^2\,S_{ij}, \tag{8}$$

where $\upsilon$ is the regularization parameter used to adjust the smoothness of the manifold. Let $S = [S_{ij}]_{n \times n}$ be the matrix whose element $S_{ij}$ is the similarity between the two samples $a_i$ and $a_j$. The similarity matrix is used to obtain the labels of the unlabeled training samples. Let

$$S_{ij} = \begin{cases} \exp\!\left(-\|a_i - a_j\|_2^2 / \sigma^2\right), & a_i \in N_k(a_j)\ \text{or}\ a_j \in N_k(a_i), \\ 0, & \text{otherwise}, \end{cases} \tag{9}$$

where $N_k(a)$ denotes the set of the $k$ most similar samples to $a$. To reduce the number of parameters, the similarity $S_{ij}$ here only takes the value 0 or 1, and $S$ simplifies to

$$S_{ij} = \begin{cases} 1, & a_i \in N_k(a_j)\ \text{or}\ a_j \in N_k(a_i), \\ 0, & \text{otherwise}, \end{cases} \tag{10}$$

which is the limiting case of (9) as $\sigma \to \infty$. Equation (8) means that when the similarity between $a_i$ and $a_j$ is 1, their labels should be as close as possible. Taking all training samples $a_j$ $(j = 1, 2, \cdots, n)$ as nodes, $a_i$ and $a_j$ are connected when $S_{ij} = 1$ and not connected when $S_{ij} = 0$, which yields a graph over all samples. Since the most similar sample to each sample is expected to come from the same class, under an appropriate threshold $k$ (a small $k$ is sufficient in (10)) each node connects with at least one node on the same sub-manifold. Through the manifold Laplacian regularization (8), the label can then be propagated from the labeled nodes to the unlabeled nodes along the connections, and the connected nodes (a cluster of samples) share the same label. This is illustrated in Fig. 2, where there are two classes of data and only two samples are labeled (one per class, tagged with a blue circle and an orange cross, respectively). All other points are unlabeled. For each class the label can be properly propagated to the unlabeled data along the connections because of the manifold Laplacian regularization.
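The graph construction and the regularizer above can be sketched as follows. The snippet assumes a binary $k$-nearest-neighbour similarity under Euclidean distance, symmetrized so that two samples are connected when either is among the other's neighbours; the function names and the random data are hypothetical. It also checks numerically that the pairwise sum $\frac{1}{2}\sum_{i,j} S_{ij}\|z_i - z_j\|_2^2$ equals $\mathrm{tr}(Z L Z^T)$ with $L = D - S$.

```python
import numpy as np

def knn_similarity(A, k):
    """Binary symmetric k-NN similarity over the columns of A (S_ij in {0, 1})."""
    n = A.shape[1]
    sq = np.sum(A**2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (A.T @ A)   # pairwise squared distances
    np.fill_diagonal(dist2, np.inf)                       # exclude self-matches
    nearest = np.argsort(dist2, axis=1)[:, :k]            # k nearest neighbours per sample
    S = np.zeros((n, n))
    S[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(S, S.T)                             # connect if either is a neighbour

def graph_laplacian(S):
    """L = D - S with D_jj = sum_i S_ij."""
    return np.diag(S.sum(axis=0)) - S

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 8))    # 8 hypothetical samples of dimension 5
Z = rng.standard_normal((3, 8))    # hypothetical label matrix with c = 3
S = knn_similarity(A, k=2)
L = graph_laplacian(S)

pairwise = 0.5 * sum(S[i, j] * np.sum((Z[:, i] - Z[:, j])**2)
                     for i in range(8) for j in range(8))
trace_form = np.trace(Z @ L @ Z.T)
```

The identity relies on $S$ being symmetric, which is why the sketch symmetrizes the neighbour relation explicitly.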
The second regularization, a variant Trace Lasso norm, is proposed as follows:

$$\lambda\,\|Z\,\mathrm{Diag}(x)\|_*. \tag{11}$$

The TL term $\|Z\,\mathrm{Diag}(x)\|_*$ can be considered as an approximation to the rank of $Z\,\mathrm{Diag}(x)$. Let $Z_i = [z_{i_1}, z_{i_2}, \cdots, z_{i_{n_i}}] \in \mathbb{R}^{c \times n_i}$ be composed of the label vectors of all samples from the $i$th class. Since the label can be accurately propagated from labeled data to unlabeled data among the samples on the same sub-manifold, $Z_i$ will have the structure that all elements in the $i$th row are one and those in the other rows are zero. Therefore, (11) automatically seeks sparsity in the number of classes, which means the query image is represented by a small number of groups. Once one class is selected, the term favors using more samples from the same class, as illustrated in [8]. The second regularization thus enforces group sparsity while, within each group, the training samples are fully used. This sparsity between groups and density within a group suits the aim of image classification.
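In summary, the complete model we propose is

$$\arg\min_{x,\,Z}\ \|y - Ax\|_1 + \lambda\,\|Z\,\mathrm{Diag}(x)\|_* + \frac{\upsilon}{2}\sum_{i,j=1}^{n}\|z_i - z_j\|_2^2\,S_{ij} + \sum_{j=1}^{n} U_{jj}\,\|z_j - g_j\|_2^2. \tag{12}$$

This model is a generalization of STL. If all training samples are labeled, namely $Z$ is known, the third and fourth terms disappear automatically and (12) coincides with STL. If all or some of the training samples are unlabeled, (12) yields the unknown labels at the same time as it identifies the query image. Let $G = [g_j]_{j=1}^{n} \in \mathbb{R}^{c \times n}$, let $D$ be the diagonal matrix with $D_{jj} = \sum_{i=1}^{n} S_{ij}$, and let $L = D - S$ be the graph Laplacian matrix. Then (12) can be reformulated as

$$\arg\min_{x,\,Z}\ \|y - Ax\|_1 + \lambda\,\|Z\,\mathrm{Diag}(x)\|_* + \upsilon\,\mathrm{tr}\!\left(Z L Z^T\right) + \mathrm{tr}\!\left((Z - G)\,U\,(Z - G)^T\right). \tag{13}$$

Since the first two terms of (13) are not differentiable, gradient-based methods such as gradient descent cannot be applied directly. The original problem is converted to the following equivalent constrained problem:

$$\arg\min_{e,\,J,\,Z,\,x}\ \|e\|_1 + \lambda\,\|J\|_* + \upsilon\,\mathrm{tr}\!\left(Z L Z^T\right) + \mathrm{tr}\!\left((Z - G)\,U\,(Z - G)^T\right) \quad \text{s.t.} \quad e = y - Ax,\ \ J = Z\,\mathrm{Diag}(x). \tag{14}$$

We use the ALM scheme to derive the following unconstrained optimization problem:

$$\arg\min_{e,\,J,\,Z,\,x}\ \mathcal{L}(e, J, Z, x) = \|e\|_1 + \lambda\,\|J\|_* + \upsilon\,\mathrm{tr}\!\left(Z L Z^T\right) + \mathrm{tr}\!\left((Z - G)\,U\,(Z - G)^T\right) + \theta^T (y - Ax - e) + \mathrm{tr}\!\left(Y^T (J - Z\,\mathrm{Diag}(x))\right) + \frac{\mu}{2}\left(\|y - Ax - e\|_2^2 + \|J - Z\,\mathrm{Diag}(x)\|_F^2\right), \tag{15}$$

where $Y \in \mathbb{R}^{c \times n}$ and $\theta \in \mathbb{R}^m$ are the Lagrangian multipliers and $\mu > 0$ is the penalty parameter. Instead of optimizing all arguments simultaneously, we solve for them individually and iteratively using GS-ADMM. Fixing $J$, $Z$ and $x$, we optimize $e$ through the subproblem

$$\arg\min_{e}\ \|e\|_1 + \frac{\mu}{2}\left\|e - \left(y - Ax + \frac{\theta}{\mu}\right)\right\|_2^2. \tag{16}$$

The solution of (16) can be obtained via soft-thresholding.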
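Soft-thresholding is the proximal operator of the $\ell_1$-norm. As a minimal sketch, an $e$-subproblem of the form $\min_e \|e\|_1 + \frac{\mu}{2}\|e - v\|_2^2$ is solved elementwise by shrinking $v$ toward zero with threshold $1/\mu$. The vector `v` below is a hypothetical stand-in for the quantity that collects the residual $y - Ax$ and the scaled multiplier.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1, applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

mu = 1.0
v = np.array([2.5, -0.3, 0.0, -1.2])   # hypothetical input vector
e = soft_threshold(v, 1.0 / mu)        # entries with |v_i| <= 1/mu become zero
```

Entries whose magnitude is below the threshold are set exactly to zero, which is what makes the recovered error term sparse.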
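To update $J$, the following subproblem is solved:

$$\arg\min_{J}\ \lambda\,\|J\|_* + \frac{\mu}{2}\left\|J - \left(Z\,\mathrm{Diag}(x) - \frac{Y}{\mu}\right)\right\|_F^2. \tag{17}$$

Problem (17) can be solved by the singular value thresholding operator.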
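The singular value thresholding operator is the proximal operator of the nuclear norm: it soft-thresholds the singular values of its argument. The sketch below illustrates it on a hypothetical diagonal matrix; in a $J$-subproblem of the form $\min_J \lambda\|J\|_* + \frac{\mu}{2}\|J - M\|_F^2$ the threshold would be $\lambda/\mu$.

```python
import numpy as np

def singular_value_threshold(M, tau):
    """Proximal operator of tau * ||.||_*: shrink the singular values of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Scaling the columns of U by the shrunk singular values rebuilds the matrix.
    return (U * np.maximum(s - tau, 0.0)) @ Vt

M = np.diag([5.0, 2.0, 0.5])            # hypothetical input with singular values 5, 2, 0.5
J = singular_value_threshold(M, 1.0)    # singular values become 4, 1, 0
```

Singular values below the threshold are removed entirely, so the operator lowers the rank of its argument as well as its nuclear norm.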
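The optimal $x$ is obtained from

$$\arg\min_{x}\ \left\|y - Ax - e + \frac{\theta}{\mu}\right\|_2^2 + \left\|J - Z\,\mathrm{Diag}(x) + \frac{Y}{\mu}\right\|_F^2. \tag{18}$$

This problem can be solved through the following linear system:

$$\left(A^T A + \mathrm{Diag}(Z^T Z)\right) x = A^T\!\left(y - e + \frac{\theta}{\mu}\right) + \mathrm{diag}\!\left(Z^T\!\left(J + \frac{Y}{\mu}\right)\right). \tag{19}$$

As the coefficient matrix $A^T A + \mathrm{Diag}(Z^T Z)$ is invertible, $x$ can be solved for directly.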
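A minimal sketch of this update is given below. It assumes the $x$-subproblem has the least-squares form $\min_x \|b - Ax\|_2^2 + \|M - Z\,\mathrm{Diag}(x)\|_F^2$, where `b` and `M` are hypothetical names for the terms that collect the residual, $J$ and the scaled multipliers. Setting the gradient to zero gives a linear system whose coefficient matrix is $A^T A + \mathrm{Diag}(Z^T Z)$, as stated in the text.

```python
import numpy as np

def update_x(A, Z, b, M):
    """Minimise ||b - A x||_2^2 + ||M - Z Diag(x)||_F^2 over x."""
    col_energy = np.sum(Z * Z, axis=0)        # diagonal of Z^T Z
    lhs = A.T @ A + np.diag(col_energy)
    rhs = A.T @ b + np.sum(Z * M, axis=0)     # A^T b + diag(Z^T M)
    return np.linalg.solve(lhs, rhs)

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 4))   # hypothetical data, for illustration only
Z = rng.standard_normal((3, 4))
b = rng.standard_normal(10)
M = rng.standard_normal((3, 4))
x = update_x(A, Z, b, M)

# Half of the objective's gradient; it vanishes at the solution.
grad = A.T @ (A @ x - b) + np.sum(Z * (Z * x - M), axis=0)
```

Using `np.linalg.solve` rather than forming an explicit inverse is the usual numerically safer choice for a system of this kind.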
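Fixing $e$, $J$ and $x$, we optimize $Z$ through the subproblem

$$\arg\min_{Z}\ \upsilon\,\mathrm{tr}\!\left(Z L Z^T\right) + \mathrm{tr}\!\left((Z - G)\,U\,(Z - G)^T\right) + \frac{\mu}{2}\left\|J - Z\,\mathrm{Diag}(x) + \frac{Y}{\mu}\right\|_F^2. \tag{20}$$

It can be solved using the following equation:

$$Z\left(2\upsilon L + 2U + \mu\,\mathrm{Diag}(x)^2\right) = 2\,G\,U + (\mu J + Y)\,\mathrm{Diag}(x). \tag{21}$$

The Lagrangian multipliers are updated as

$$\theta \leftarrow \theta + \mu\,(y - Ax - e), \tag{22}$$

$$Y \leftarrow Y + \mu\,(J - Z\,\mathrm{Diag}(x)). \tag{23}$$

The steps (16), (17), (19), (21), (22) and (23) are repeated until the convergence conditions are met. Algorithm 1 summarizes the procedure for solving the optimization problem. The numerical experiments in the next section confirm the convergence of this algorithm.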
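The $Z$-update can be sketched in the same spirit. The snippet assumes a $Z$-subproblem of the form $\upsilon\,\mathrm{tr}(Z L Z^T) + \mathrm{tr}((Z - G)U(Z - G)^T) + \frac{\mu}{2}\|M - Z\,\mathrm{Diag}(x)\|_F^2$, where `M` is a hypothetical name for $J$ plus the scaled multiplier; this form and the resulting linear equation are assumptions derived from the terms described in the text, not a verified transcription of the authors' equations. The graph, labels and data are toy values.

```python
import numpy as np

def update_Z(Lap, U, G, M, x, upsilon, mu):
    """Solve Z (2*upsilon*Lap + 2*U + mu*Diag(x)^2) = 2*G*U + mu*M*Diag(x) for Z."""
    lhs = 2.0 * upsilon * Lap + 2.0 * U + mu * np.diag(x**2)   # symmetric n x n
    rhs = 2.0 * G @ U + mu * M * x                             # M * x == M @ Diag(x)
    return np.linalg.solve(lhs, rhs.T).T                       # uses symmetry of lhs

n, c = 5, 2
S = np.diag(np.ones(n - 1), 1)
S = S + S.T                                    # hypothetical chain graph
Lap = np.diag(S.sum(axis=0)) - S
U = np.diag([100.0, 0.0, 0.0, 0.0, 100.0])     # first and last samples are labeled
G = np.zeros((c, n))
G[0, 0] = 1.0
G[1, -1] = 1.0
rng = np.random.default_rng(3)
M = rng.standard_normal((c, n))
x = rng.standard_normal(n)
upsilon, mu = 1.0, 1.0
Z = update_Z(Lap, U, G, M, x, upsilon, mu)

# Gradient of the assumed subproblem; it vanishes at the solution.
grad = 2.0 * upsilon * Z @ Lap + 2.0 * (Z - G) @ U - mu * (M - Z * x) * x
```

The system matrix is positive definite here because the graph is connected and at least one sample is labeled, so the solve is well posed in this toy setting.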

C. CLASSIFICATION
Once the matrix $Z$ is obtained, for the unlabeled data the element $Z_{ij}$ describes the probability that the $j$th sample belongs to the $i$th class. If $Z_{ij}$ is the element with the largest absolute value in the vector $z_j$, we set $Z_{ij} = 1$ and set all other elements of $z_j$ to zero.
We classify the query sample according to the representation coefficient vector $x$. The $\ell_1$-norm is again used to measure the reconstruction error, to be consistent with the first term of (13). The reconstruction error of each class is

$$r(i) = \left\|y - A\left(Z^i\,\mathrm{Diag}(x)\right)^T\right\|_1. \tag{24}$$

Here, the product $Z^i\,\mathrm{Diag}(x)$, with $Z^i$ the $i$th row of the binarized $Z$, extracts the representation coefficients associated with the $i$th class, and $r(i)$ is the representation error obtained when all training samples from the $i$th class are used to represent $y$. Finally, the query sample is assigned to the class with the minimum residual:

$$\mathrm{identity}(y) = \arg\min_{i}\ r(i). \tag{25}$$
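The classification rule can be sketched as follows, under the reading that each training sample is assigned to the class of the largest-magnitude entry of its label vector and that the class residual uses only the coefficients of that class. The function name and the toy matrices are hypothetical.

```python
import numpy as np

def classify(A, y, x, Z):
    """Return the class with the smallest l1 reconstruction residual, and all residuals."""
    labels = np.argmax(np.abs(Z), axis=0)            # binarise Z: one class per sample
    residuals = np.empty(Z.shape[0])
    for i in range(Z.shape[0]):
        x_i = np.where(labels == i, x, 0.0)          # keep only coefficients of class i
        residuals[i] = np.sum(np.abs(y - A @ x_i))   # l1 reconstruction error
    return int(np.argmin(residuals)), residuals

A = np.array([[1.0, 0.9, 0.0, 0.1],
              [0.0, 0.1, 1.0, 0.9]])                 # 4 toy training samples of dimension 2
Z = np.array([[0.9, 0.7, 0.1, -0.2],
              [0.1, 0.2, 0.8, 0.6]])                 # samples 0, 1 -> class 0; 2, 3 -> class 1
x = np.array([0.1, 0.0, 0.6, 0.4])
y = np.array([0.1, 1.0])
pred, residuals = classify(A, y, x, Z)
```

In this toy case the coefficients concentrate on the second class, so its residual is much smaller and the query is assigned to it.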

IV. EXPERIMENTAL RESULTS
As an important application of image classification, face recognition is the main focus of this section. Our method can also be applied to other data classification tasks as long as the data distribution conforms to the manifold assumption. The proposed method is compared with state-of-the-art approaches including SRC, RCRC, TL, STL, nuclear norm based matrix regression (NMR) classification [25], the weighted group sparse classifier (WGSC) [26], and iterative re-constrained group sparse classification (IRGSC) [27]. We use three popular face databases: the Extended Yale B database [28], the AR Face Database [29] and ORL [30]. For the first two databases and the methods SRC, TL and STL, we use settings similar to those in [8] and directly cite some results reported in [8]. We compute the recognition accuracy (RA) as

$$\text{Recognition Accuracy} = \frac{\text{Number of correctly recognized testing data}}{\text{Total number of testing data}}. \tag{26}$$

The average RA (ARA) of each method is computed over 10 runs across the testing images of every subject. We directly use the grey levels as features in all experimental scenarios for all approaches. The best results are shown in bold in all the tables below. Three parameters need to be tuned in our method: $\lambda$, $\upsilon$ and $k$, where $\lambda$ and $\upsilon$ balance the roles of the two regularization terms and $k$ is used to choose the most relevant samples for each sample in (10). Because the label can be accurately propagated from labeled data to unlabeled data by the manifold Laplacian regularization, $k$ can take a small value. In the following experimental scenarios, $k = 2$ achieves good results. Fig. 3 shows the variation of ARA with the parameters $\lambda$ and $\upsilon$. Here, for each subject from the Extended Yale B database, we randomly choose 13 images, 8 labeled and 5 unlabeled, as the training set. Another 32 images are used for testing. We then run our algorithm 10 times and calculate the ARA. With $\upsilon$ fixed at 1, the highest ARA is obtained when $\lambda$ is taken from the interval $[1, 10]$, and the ARA decreases rapidly for $\lambda < 1$. In the same way, with $\lambda = 1$ fixed, the best choice for $\upsilon$ is the interval $[1, 22]$. The ARA is more sensitive to small values of $\lambda$ and $\upsilon$. Experiments show that these choices also achieve the best results in all the following experimental settings. The parameters of each of the other methods are also finely tuned to achieve its best result.
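For concreteness, the accuracy measure and its average over runs can be computed as in the sketch below. The label arrays are made-up illustrative values, not experimental data.

```python
import numpy as np

def recognition_accuracy(predicted, truth):
    """Fraction of testing samples whose predicted class matches the true class."""
    predicted = np.asarray(predicted)
    truth = np.asarray(truth)
    return float(np.mean(predicted == truth))

# Two hypothetical runs on four testing samples each.
runs = [
    recognition_accuracy([0, 1, 1, 2], [0, 1, 2, 2]),   # 3 of 4 correct
    recognition_accuracy([0, 1, 2, 2], [0, 1, 2, 2]),   # 4 of 4 correct
]
ara = float(np.mean(runs))   # average recognition accuracy over the runs
```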

A. EXTENDED YALE B DATABASE
The Extended Yale B database contains 38 subjects. Each subject has about 64 face images captured under different illuminations. All images are down-sampled to 48 × 42. For each subject, we randomly select t = 8 images for training for the methods SRC, RCRC, TL, IRGSC, NMR, WGSC and STL. These are all fully supervised methods, which means they require all training samples to be labeled. On top of the t labeled images, unlabeled samples are added to evaluate the recognition performance of our method. The number of unlabeled training images is denoted by s, so our method uses t + s training images in total. For all eight methods, 32 images are used for testing. The ARA values are reported in Table 2. With s = 24, our method achieves a better result than the other seven methods. Table 3 shows the influence of s on the ARA of our method. When s = 0, our model degenerates to STL and the ARA of our method equals that of STL. As unlabeled samples are added, the ARA of our method increases with the number of unlabeled images.

B. AR DATABASE
The AR database includes 126 subjects. For each subject, 26 face images were taken in two separate sessions. Each session contains expression, illumination and disguise variations. In this paper, a subset of 100 subjects is used, with 14 images selected per subject that contain only expression or illumination changes. All images are down-sampled to 50 × 40. For each subject, t = 4 labeled face images from Session 1 are used for training for the methods SRC, RCRC, TL, IRGSC, NMR, WGSC and STL. A further s = 3 unlabeled images from Session 1 are added to form the training set of our method. All samples from Session 2 are used for testing. Table 4 shows the results of all methods. With the addition of the unlabeled samples, our method achieves better results than the other seven methods.

C. ORL DATABASE
The ORL data set consists of face images of 40 distinct subjects, each subject having 10 face images under varying lighting conditions, with different facial expressions and facial details. In our experiment each image is down-sampled from 112×92 to 32×32. For each subject, t = 3 labeled face images are used for training for the methods SRC, RCRC, TL, IRGSC, NMR, WGSC and STL, and s = 2 unlabeled images are added for training in our method. 5 images are used for testing. Table 5 gives the ARA of different methods. We can observe that our method can get better classification results than other methods due to the addition of the unlabeled samples, which further confirms the role of the unlabeled data.

V. CONCLUSION AND DISCUSSION
A semi-supervised sparse image classification technique is proposed for the small-sample-size case, especially when only a small number of labeled images are available, which also makes use of unlabeled samples. The query image is collaboratively represented by all training data, whether they are labeled or unlabeled. Based on the assumption that images of the same class lie on a sub-manifold and on the local linearity of the manifold, an image can be approximately represented as a linear combination of its neighbouring data. There are two regularization terms. A generalized trace lasso regularization term is proposed by combining semi-supervised samples with a variant trace lasso norm. This term seeks sparsity in the number of classes instead of the number of training samples, which directly coincides with the objective of data classification. With the manifold Laplacian regularization, the labels of labeled images can be propagated to unlabeled images within a class along the manifold of samples. Image recognition and the discovery of the unknown identities of samples are thus achieved simultaneously. The ALM scheme and GS-ADMM are applied to solve the whole model.
A much-debated question in computational imaging today is whether it is time to discard the classical methods and fully replace them with deep learning based methods. On the one hand, a prerequisite for deep learning based methods is a huge number of samples. However, there are situations in which only a small number of samples are available, and knowledge-based modeling methods are then more suitable. On the other hand, classical methods have a clear structure and theoretical guarantees. They are based on knowledge of the problem being solved rather than on seeking the best performance by intuitively choosing architectures or by trial and error. In future work, it may be better to integrate the classical knowledge-based approaches into deep learning architectures, so that the algorithm enjoys both the flexibility of deep learning based methods and the clear structure of the classical approaches. For example, the result of our algorithm depends on the choice of the similarity matrix S, and if S is not properly chosen the label cannot be accurately propagated. We will try to address this problem, together with the selection of all the parameters that need to be determined, by designing a deep network.