Prototype Relaxation With Robust Principal Component Analysis for Zero Shot Learning

Zero Shot Learning (ZSL) has been attracting increasing attention due to its ability to recognize objects of unseen classes. As one type of ZSL method, the low rank based strategy has achieved remarkable success. However, traditional low rank based methods assume that the various visual features of a single class can all be projected onto a single attribute vector, ignoring background information and other noisy interference in the visual features. This assumption is unreasonable and often leads to poor performance when there is large intra-class variance. In this paper, a novel method called Prototype Relaxation with Robust Principal Component Analysis (RPCA) is proposed to relax this assumption by adding a sparse noise constraint. In addition, to avoid confusion between similar classes, an orthogonal constraint is employed to disperse all the class prototypes, including both seen and unseen classes, in a latent space. Furthermore, to alleviate the domain shift problem, vectors from the latent space are exploited to reconstruct the visual features and semantic attributes respectively. The hubness problem is also mitigated by applying a max probability model over all three spaces. Extensive experiments are conducted on four popular datasets and the results demonstrate the superiority of this method.


I. INTRODUCTION
Due to the rapid development of deep learning, image classification has achieved remarkable progress [1]. For example, the Deep Residual Network (ResNet) [2] achieves over 95% top-5 accuracy on the 1000-category ImageNet dataset [3], which has been shown to surpass the recognition ability of human beings. However, these image classification methods only work in the closed-set setting: when a sample of a new category that did not appear in the training set arrives, they will inevitably produce a wrong result.
To solve the open set image classification problem, Lampert et al. proposed a novel method called Zero Shot Learning (ZSL) [4], which can recognize samples of unseen classes by transferring knowledge learned from seen classes via a semantic attribute bridge. For example, a person can recognize a new category of objects he has never seen before by being told how they are similar to other known objects [5]. Similarly, ZSL transfers this knowledge, called semantic embeddings or attributes, from seen classes to unseen ones [6]-[9]. (The associate editor coordinating the review of this manuscript and approving it for publication was Krishna Kant Singh.)
So far, an increasing number of researchers have devoted themselves to ZSL, and many methods have been proposed. These methods can be roughly divided into two classes by how they handle new categories. The first is the unseen feature synthesis based approach, which first trains a generative model such as a Generative Adversarial Net (GAN) [10] to synthesize samples of unseen classes, and then trains a fully supervised model. The second is the direct approach, which trains a classification model with only the samples of seen classes and then directly classifies unseen data with this model. In the second category, low rank based methods achieve state-of-the-art performance: they constrain the projection matrix from visual space to attribute space to be low rank, and add this constraint to other loss functions. For example, Ding et al. attached it to an attribute dictionary constraint and utilized an ensemble learning strategy for optimization [11]. Liu et al. embedded it into a Semantic Autoencoder (SAE) [12], and employed a trace optimization method to solve it [13]. Niu et al. used Internet-browsed images as auxiliary weakly supervised information, and combined the low rank constraint to train a webly supervised model [14]. However, all these methods rest on the unreasonable assumption that the various visual features of one category can simultaneously be projected onto a single attribute vector; they ignore the fact that visual features still contain background information and other noisy factors, which prevent them from being projected onto a single vector. The low rank constraint is illustrated in Fig. 1(A).
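As a toy illustration of the low rank idea (not the authors' code): when all features of a class collapse onto one prototype direction, the target matrix is rank-1, and the nuclear norm (sum of singular values) is the convex surrogate for rank that low rank based methods minimize. The prototype values below are hypothetical.

```python
import numpy as np

def nuclear_norm(M):
    """Sum of singular values: the convex surrogate for rank(M)
    minimized by low rank based ZSL methods."""
    return float(np.linalg.svd(M, compute_uv=False).sum())

# Toy rank-1 case: every column is a multiple of a single prototype.
proto = np.array([[1.0], [2.0], [2.0]])      # hypothetical prototype direction
M = proto @ np.array([[1.0, 0.5, 2.0]])      # three "samples" of one class
```

For a rank-1 matrix the nuclear norm coincides with the Frobenius norm, since only one singular value is nonzero.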
To solve this problem, we relax this assumption by adding a sparse constraint that absorbs the noise information; the resulting method is named Prototype Relaxation with Robust Principal Component Analysis (RPCA) based ZSL and is illustrated in Fig. 1(B). This relaxation models the projection more faithfully and handles the interference from noise. In addition, because some attributes of unseen classes are very similar to those of others, misclassification easily occurs, especially in the more realistic and challenging GZSL setting. For example, ''the humpback whale'' has attributes very similar to those of ''the blue whale''; only a few entries, such as color, differ among all 85 dimensions. Therefore, to disperse class prototypes, we build an intermediate latent space, where the prototypes projected from different attributes are constrained to be normalized and orthogonal to each other. Furthermore, the projection from visual space to attribute space or latent space often suffers from the domain shift problem [12], so we borrow the idea of SAE [12] to reconstruct the visual features and semantic attributes respectively. In the prediction stage, to alleviate the hubness problem caused by using only the attribute space or the latent space, we combine them with the visual space and take the class with the maximum probability as the final classification result. As a supplement to inductive ZSL, we also extend this method to the transductive setting by reusing the learned parameters of the inductive model.
The contributions of our work can be summarized as follows: 1) To avoid the unreasonable projection assumption in traditional low rank based ZSL methods, we propose a prototype relaxation with RPCA based approach that relaxes it by adding a sparse term to absorb the noisy redundant information; 2) An orthogonal constraint in latent space is employed to disperse all class prototypes by making them normalized and orthogonal to each other; 3) Self-reconstruction is applied to alleviate the domain shift problem, and probabilistic prediction in all three spaces is utilized to further mitigate the hubness problem; 4) Experiments for both the inductive and the transductive settings are conducted on four popular datasets, and the results on both ZSL and GZSL demonstrate the superiority of the proposed method.
The rest of this paper is organized as follows: Section II briefly introduces existing methods for ZSL and GZSL. Section III describes the proposed method in detail. Section IV gives the experimental comparison with existing methods on several metrics. Finally, Section V concludes this paper.

II. RELATED WORKS
A. ZERO SHOT LEARNING
Zero Shot Learning (ZSL) models try to classify unseen samples by transferring the knowledge learned from seen classes through a semantic embedding bridge. So far, many researchers have devoted themselves to this research domain. The earliest efforts, such as Direct Attribute Prediction (DAP) [4], estimate the labels by learning probabilistic attribute classifiers. In Attribute Label Embedding (ALE) [15] and SJE [16], Akata et al. projected visual features into the semantic space via a bilinear compatibility constraint. CONvex combination of Semantic Embeddings (CONSE) [17] and Semantic Similarity Embedding (SSE) [18] try to build unseen attributes automatically from the instances of seen categories to reduce the requirement of manual attributes. Furthermore, some researchers, such as Kodirov et al. [12], introduced the concept of the Auto-Encoder and directly used the Euclidean distance to constrain the similarity of projected vectors in both the visual and attribute spaces. In addition, Long et al. [19] proposed to use the attributes of unseen classes to synthesize unseen visual features, and then train a supervised model with seen and synthesized unseen visual features. Thereafter, due to the powerful ability of sample synthesis, an increasing number of Generative Adversarial Net (GAN) [10] based ZSL methods have been proposed [20]-[22]. However, these generative methods all suffer from the same problem as closed-set classification: when a totally new category appears, the model must be retrained after adding synthesized samples of the new class. The most relevant works to ours are the low rank based methods [11], [13], [14], [23]. Ding et al. [11] assumed that the projection matrix from visual space to attribute space should be low rank, and exploited a constraint on singular values to solve the problem. Liu et al.
in [13] made the same assumption and applied the concept of SAE to constrain the projection matrix, replacing the low rank constraint with a nuclear norm. Meng et al. adopted a subspace learning strategy for ZSL, using Low Rank Representation (LRR) for visual features and utilizing locally linear subspaces to approximate the nonlinear manifold [23]. In addition, Niu et al. tried to exploit web images to improve fine-grained classification performance in the zero-shot setting [14].

B. GENERALIZED ZERO SHOT LEARNING
Different from conventional ZSL, which assumes that all the test samples come only from unseen categories, Generalized ZSL (GZSL), first proposed by Chao et al. [24], enlarges the search scope to both seen and unseen classes. Since in most scenarios we cannot know beforehand whether the test data belongs only to the unseen classes, GZSL is a more realistic and challenging task. Besides, it is noteworthy that Xian et al. [25] in 2017 put forward a new split of several popular datasets for GZSL testing, and released a benchmark of some recent ZSL methods, which has greatly promoted the development of ZSL research. Since then, many methods have been proposed for this more realistic setting. For example, Zhang et al. proposed a probabilistic approach to solve the problem within the NNS strategy [26]. Liu et al. designed a Deep Calibration Network (DCN) to enable simultaneous calibration of deep networks on the confidence of source classes and the uncertainty of target classes [27]. A pseudo distribution of seen samples on unseen classes has also been employed to address the domain shift problem in GZSL [9]. Besides, there are many other methods developed for this more realistic setting [28], [29].

III. THE PROPOSED METHOD
A. PROBLEM FORMULATION
Let A^s = {a^s_1, . . . , a^s_p} ∈ R^{d_a×p} and A^u = {a^u_1, . . . , a^u_q} ∈ R^{d_a×q} represent the corresponding seen and unseen class-level semantic representations, such as attributes or word embeddings, where d_a is the dimension of an attribute vector, and p and q are the numbers of seen and unseen classes.
Given a set of labeled training data of seen classes X^s = {x^s_1, . . . , x^s_i, . . . , x^s_{N_s}} ∈ R^{d_x×N_s}, where d_x is the dimensionality of a single feature vector x^s_i and N_s is the number of training data, each feature x^s_i is simultaneously associated with a label y_i ∈ S and its corresponding attribute. X^u = {x^u_1, . . . , x^u_{N_u}} ∈ R^{d_x×N_u} represents a set of test data, which is assigned neither labels nor semantic representations, where N_u is the number of test data. The objective of ZSL is to predict the labels of the test data X^u by learning a classifier F : X^u → U with the training data X^s and the whole attribute set A^s ∪ A^u.

B. PROTOTYPE RELAXATION WITH RPCA BASED ZSL
In this subsection, we will present our novel prototype relaxation with RPCA based ZSL method, followed by an effective solution, and the whole framework of the method is shown in Fig. 2. We first define a d dimensional latent space, which is hoped to be discriminative for both seen and unseen classes.
Suppose K ∈ R^{d×(p+q)} is the prototype matrix in latent space, whose columns are the prototypes of both seen and unseen classes. The projected vectors of the visual features X^s should satisfy the following constraint,

W_1^T X^s = L + E,   (1)

where W_1 ∈ R^{d_x×d} is the projection matrix from visual space to latent space, L ∈ R^{d×N_s} collects the prototype components of the samples (ideally L = KB, where B ∈ {0,1}^{(p+q)×N_s} is the one-hot label matrix for the visual features X^s), and E ∈ R^{d×N_s} is the sparse noise around the class prototypes. B is low rank because many samples belong to a single category, which implies that L is also low rank. Besides, E is expected to be sparse noise around the prototypes, so we assume it is ℓ1-sparse and define the following constraint function,

min_{L,E} ||L||_* + λ_1 ||E||_1   s.t.   W_1^T X^s = L + E,   (2)

where ||·||_* is the nuclear norm, which encourages a low rank solution, and λ_1 is a balancing coefficient.
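The decomposition in Eq. 2 can be sketched numerically with a generic inexact-ALM style RPCA iteration (a minimal sketch, not the paper's exact solver; the fixed penalty `mu` and iteration count are illustrative assumptions):

```python
import numpy as np

def soft_threshold(X, tau):
    """Proximal operator of the l1 norm (element-wise shrinkage)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def rpca(M, lam, mu=1.0, iters=300):
    """Sketch of min ||L||_* + lam*||E||_1  s.t.  M = L + E,
    solved by alternating singular value / soft thresholding (ADMM style)."""
    L = np.zeros_like(M)
    E = np.zeros_like(M)
    Y = np.zeros_like(M)                     # Lagrange multiplier
    for _ in range(iters):
        # singular value thresholding step for the low rank part L
        U, s, Vt = np.linalg.svd(M - E + Y / mu, full_matrices=False)
        L = U @ np.diag(soft_threshold(s, 1.0 / mu)) @ Vt
        # element-wise soft thresholding step for the sparse part E
        E = soft_threshold(M - L + Y / mu, lam / mu)
        Y += mu * (M - L - E)
    return L, E
```

On a synthetic low-rank-plus-sparse matrix, L and E separate the prototype part from the sparse noise, which is exactly the role they play in Eq. 2.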
Since semantic preservation can alleviate the domain shift problem [12], we apply the autoencoder in latent space to both the visual features and the semantic attributes A = [A^s, A^u] ∈ R^{d_a×(p+q)},

X^s ≈ W_1 W_1^T X^s,   A ≈ W_2 W_2^T A,   (3)

where W_2 ∈ R^{d_a×d} is the projection matrix from attribute space to latent space. Because W_1^T X^s = L + E is a hard constraint, we use a relaxed manner to define the loss function for this autoencoder,

L_AE = λ_2 ||X^s − W_1(L + E)||_F^2 + λ_3 ||A − W_2 K||_F^2,   (4)

where λ_2 and λ_3 are balancing coefficients, and ||·||_F is the Frobenius norm.
The latent space is hoped to be discriminative, but its dimension is limited, so it is impossible to scatter all the prototypes far away from each other. Here, we use an alternative strategy: all the class prototypes are constrained to be normalized and orthogonal to each other. Therefore, we define the loss function,

L_OC = λ_4 ||K^T K − I||_F^2,   (5)

where λ_4 is also a balancing coefficient. Combining all the terms in Eq. 2, Eq. 4 and Eq. 5, we obtain the final loss function,

min_{L,E,K,W_1,W_2} ||L||_* + λ_1 ||E||_1 + λ_2 ||X^s − W_1(L+E)||_F^2 + λ_3 ||A − W_2 K||_F^2 + λ_4 ||K^T K − I||_F^2 + λ_5 ||L − KB||_F^2   s.t.   W_1^T X^s = L + E,   (6)

where λ_5 is a balancing coefficient as well, and it controls the importance of the last term.
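The orthogonality penalty of Eq. 5 (with λ_4 omitted) can be sketched as follows; the helper name is hypothetical:

```python
import numpy as np

def ortho_penalty(K):
    """||K^T K - I||_F^2: zero exactly when the prototype columns of K
    are unit norm and mutually orthogonal, i.e. maximally dispersed."""
    G = K.T @ K
    return float(np.linalg.norm(G - np.eye(G.shape[0])) ** 2)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((10, 4)))   # orthonormal prototypes
K_bad = np.tile(Q[:, :1], (1, 4))                   # four identical prototypes
```

Orthonormal prototypes give zero penalty, while duplicated (confusable) prototypes are penalized, which is what drives the dispersion in latent space.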

C. SOLUTION
Since Eq. 6 is not jointly convex over all variables, there is no closed-form solution for all of them simultaneously. Thus, we propose an iterative optimization strategy that updates one variable at a time. Because proper initialization can not only improve the model performance but also increase the convergence speed, we further split the solution into two sub-problems, i.e., initialization of W_1, W_2 and K, and iterative optimization of all the variables.

1) INITIALIZATION
Conventional iterative optimization usually exploits random initialization for some variables. However, random initialization often leads to slow convergence and poor performance when there are many variables; in Eq. 6 we would need to randomly initialize four variables, which makes convergence hard. Here, we first utilize a truncated version of Eq. 6, obtained by removing L and E, to initialize W_1 and W_2, which can be represented as,

min_{W_1,W_2,K} ||W_1^T X^s − KB||_F^2 + ||X^s − W_1 KB||_F^2 + λ_3 ||A − W_2 K||_F^2 + λ_4 ||K^T K − I||_F^2,   (7)

where λ_2 is simply set to 1 and therefore omitted. In Eq. 7 there are only three variables, and W_1 and W_2 are independent, so only K requires random initialization. First, we fix W_2 and K; then Eq. 7 can be truncated as,

min_{W_1} ||W_1^T X^s − KB||_F^2 + ||X^s − W_1 KB||_F^2.   (8)

By taking the derivative of Eq. 8 with respect to W_1 and setting it to zero, we obtain,

X^s (X^s)^T W_1 + W_1 KBB^T K^T = 2 X^s B^T K^T.   (9)

By setting Â = X^s(X^s)^T, B̂ = KBB^TK^T and Ĉ = 2X^sB^TK^T, Eq. 9 is converted to the well-known Sylvester equation ÂW_1 + W_1B̂ = Ĉ, which can be solved efficiently by the Bartels-Stewart algorithm [30] and implemented with a single line of code, W_1 = sylvester(Â, B̂, Ĉ), in MATLAB. W_2 can be solved with a Sylvester equation of the same form. Similarly to W_1 and W_2, we take the derivative of Eq. 7 with respect to K and obtain,

−(W_1^T X^s − KB)B^T − W_1^T(X^s − W_1 KB)B^T − λ_3 W_2^T(A − W_2 K) + 2λ_4 K(K^T K − I) = 0.   (10)

Because Eq. 10 contains a third order term in K, it is hard to obtain a closed-form solution directly. Here, we adopt the Stochastic Gradient Descent (SGD) strategy,

K ← K − α ∇_K,   (11)

where α is the learning rate and ∇_K is the left-hand side of Eq. 10.
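The same Sylvester step is available outside MATLAB via SciPy's Bartels-Stewart solver. A sketch with toy shapes (all sizes and the random data are illustrative assumptions):

```python
import numpy as np
from scipy.linalg import solve_sylvester

# Hypothetical small shapes: d_x=6 visual dims, d=4 latent dims,
# N_s=50 samples, p+q=5 classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 50))                 # visual features
K = rng.standard_normal((4, 5))                  # latent prototypes
B = np.eye(5)[:, rng.integers(0, 5, 50)]         # one-hot label matrix

A_hat = X @ X.T                                  # \hat{A} = X X^T
B_hat = K @ B @ B.T @ K.T                        # \hat{B} = K B B^T K^T
C_hat = 2 * X @ B.T @ K.T                        # \hat{C} = 2 X B^T K^T
W1 = solve_sylvester(A_hat, B_hat, C_hat)        # solves A W + W B = C
```

`solve_sylvester` returns the W_1 satisfying ÂW_1 + W_1B̂ = Ĉ, mirroring the MATLAB `sylvester` call in the text.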

2) OPTIMIZATION
Instead of adopting the traditional Singular Value Thresholding (SVT) [31] to solve the nuclear norm minimization in Eq. 6, we exploit a regularization term that preserves the low rank characteristic of the optimized L. Mathematically, we have,

||L||_* = Tr((LL^T)^{1/2}) = Tr(HLL^T),   (12)

where H = (LL^T)^{−1/2}. Besides, due to the existence of the hard constraint, we apply the Augmented Lagrangian Multiplier (ALM) method, and Eq. 6 can be converted to,

L_ALM = Tr(HLL^T) + λ_1||E||_1 + λ_2||X^s − W_1(L+E)||_F^2 + λ_3||A − W_2K||_F^2 + λ_4||K^TK − I||_F^2 + λ_5||L − KB||_F^2 + ⟨R, W_1^TX^s − L − E⟩ + (µ/2)||W_1^TX^s − L − E||_F^2,   (13)

where µ is a penalty parameter and R is a Lagrangian multiplier. After the initialization, W_1, W_2 and K are close to suitable values, so we use them as the starting point and optimize all five variables and the Lagrangian multiplier R until the termination criterion is met. Updating R and µ is trivial and can be found in the supplementary material. In the following, we describe how to update E, L, K, W_1 and W_2 one by one.
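The identity behind this substitution, Tr(HLL^T) = ||L||_* with H = (LL^T)^{−1/2}, can be checked numerically. A sketch (the `eps` clamp for near-singular LL^T is an assumption, not from the paper):

```python
import numpy as np

def inv_sqrt(M, eps=1e-10):
    """H = M^{-1/2} via eigendecomposition of the symmetric PSD matrix M."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

rng = np.random.default_rng(1)
L = rng.standard_normal((4, 12))          # full row rank, so L L^T is invertible
H = inv_sqrt(L @ L.T)
trace_term = float(np.trace(H @ L @ L.T)) # regularizer used in place of SVT
nuc = float(np.linalg.svd(L, compute_uv=False).sum())
```

Since LL^T = U S^2 U^T gives H·LL^T = U S U^T, the trace equals the sum of singular values of L, i.e. its nuclear norm.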
Update E: The subproblem of Eq. 13 w.r.t. E is an ℓ1-regularized least squares problem, whose solution is obtained via soft thresholding (Eq. 14).
Update L: The subproblem of Eq. 13 w.r.t. L is,

L_L = Tr(HLL^T) + λ_2||X^s − W_1(L+E)||_F^2 + λ_5||L − KB||_F^2 + ⟨R, W_1^TX^s − L − E⟩ + (µ/2)||W_1^TX^s − L − E||_F^2.   (15)

By taking the derivative of L_L w.r.t. L and setting it to zero, we obtain,

(2H + 2λ_2W_1^TW_1 + (2λ_5 + µ)I)L = 2λ_2W_1^T(X^s − W_1E) + 2λ_5KB + µ(W_1^TX^s − E) + R.   (16)

Then, the solution of L is,

L = (2H + 2λ_2W_1^TW_1 + (2λ_5 + µ)I)^{−1}(2λ_2W_1^T(X^s − W_1E) + 2λ_5KB + µ(W_1^TX^s − E) + R).   (17)

Update K: The subproblem of Eq. 13 w.r.t. K is,

L_K = λ_3||A − W_2K||_F^2 + λ_4||K^TK − I||_F^2 + λ_5||L − KB||_F^2.   (18)

Its derivative is,

∇_K = −2λ_3W_2^T(A − W_2K) + 4λ_4K(K^TK − I) − 2λ_5(L − KB)B^T,   (19)

and it can be minimized with the same SGD method as Eq. 11.
Update W_1 and W_2: The subproblem of Eq. 13 w.r.t. W_1 is,

L_{W_1} = λ_2||X^s − W_1(L+E)||_F^2 + ⟨R, W_1^TX^s − L − E⟩ + (µ/2)||W_1^TX^s − L − E||_F^2.   (20)

By taking the derivative of L_{W_1} w.r.t. W_1 and setting it to zero, we again obtain a Sylvester equation; if we set Ã = (2λ_2 + µ)··· as in the initialization, it can be solved with the Bartels-Stewart algorithm, and W_2 is updated analogously. The whole procedure is summarized in Algorithm 1.

Algorithm 1 The Detailed Algorithm of Inductive RPCA Based ZSL
1: Randomly initialize K;
2: for all k = 1 → iter_1 do
3: Update W_1 and W_2 by solving the Sylvester equations;
4: Update K with Eq. 11;
5: end for
6: Randomly initialize R;
7: Initialize L with L = W_1^TX^s and H with H = (LL^T)^{−1/2};
8: for all k = 1 → iter_2 do
9: Update E with Eq. 14;
10: Update L with Eq. 17;
11: Update H with H = (LL^T)^{−1/2};
12: Update K with Eq. 19;
13: Update W_1 and W_2 with the Sylvester equations;
14: Update R with R = R + µ(W_1^TX^s − L − E);
15: Update the parameter µ by µ = ρµ;
16: end for
17: Return the learned W_1 and W_2.

D. TRANSDUCTIVE SETTING
Since W_1 and W_2 are learned only with seen data, they cannot fully tackle the domain shift problem. The most effective remedy is to include the unlabeled unseen data in the training phase, which is called the transductive setting and was first proposed by Fu et al. [33]. The ideal projection matrices for the unseen classes differ from W_1 and W_2 learned on the seen classes due to the domain shift problem, but they are very similar because the same semantic attributes are used. Therefore, we modify the inductive formulation with the unlabeled unseen data to convert it to the transductive setting, which can be represented as,

min ||L_u||_* + β_1||E_u||_1 + β_2||X^u − W_{1u}(L_u+E_u)||_F^2 + β_3||A^u − W_{2u}K_u||_F^2 + β_4(||W_1 − W_{1u}||_F^2 + ||W_2 − W_{2u}||_F^2) + β_5||L_u − K_uB_u||_F^2   s.t.   W_{1u}^TX^u = L_u + E_u, B_u ∈ {0,1}^{q×N_u}, 1^TB_u = 1^T,   (22)

where β_1, β_2, β_3, β_4 and β_5 are balancing coefficients and 1 is the all-one vector. By replacing the nuclear norm with the trace form, and converting the hard constraint into an Augmented Lagrangian Multiplier term, Eq. 22 can be represented as,

L_t = Tr(H_uL_uL_u^T) + β_1||E_u||_1 + β_2||X^u − W_{1u}(L_u+E_u)||_F^2 + β_3||A^u − W_{2u}K_u||_F^2 + β_4(||W_1 − W_{1u}||_F^2 + ||W_2 − W_{2u}||_F^2) + β_5||L_u − K_uB_u||_F^2 + ⟨R_u, W_{1u}^TX^u − L_u − E_u⟩ + (µ/2)||W_{1u}^TX^u − L_u − E_u||_F^2,   (23)

where H_u = (L_uL_u^T)^{−1/2}. Besides, we remove the orthogonal constraint from Eq. 23 because only the unseen classes need to be processed, and the term ||W_2 − W_{2u}||_F^2 already encourages the prototypes to maintain the characteristic of orthogonality.
Eq. 23 is not jointly convex in all the variables, so we use the same iterative optimization strategy as in the inductive setting to solve it.
Update B_u: The subproblem of Eq. 23 w.r.t. B_u is,

min_{B_u} ||L_u − K_uB_u||_F^2   s.t.   B_u ∈ {0,1}^{q×N_u}, 1^TB_u = 1^T.   (24)

Due to the discrete constraint, Eq. 24 cannot be solved directly. Expanding it in trace form,

min_{B_u} Tr(B_u^TK_u^TK_uB_u) − 2Tr(L_u^TK_uB_u),   (25)

and, according to the discrete constraint, each column j of B_u is solved independently as,

(B_u)_{ij} = 1 if i = argmin_{i'} ||(L_u)_{:,j} − (K_u)_{:,i'}||_2^2, and 0 otherwise.   (26)

Update E_u: The subproblem of Eq. 23 w.r.t. E_u is an ℓ1-regularized least squares problem solved by soft thresholding (Eq. 27), where L_u is initialized with L_u = W_1^TX^u.
Update L_u: The subproblem of Eq. 23 w.r.t. L_u is,

L_{L_u} = Tr(H_uL_uL_u^T) + β_2||X^u − W_{1u}(L_u+E_u)||_F^2 + β_5||L_u − K_uB_u||_F^2 + ⟨R_u, W_{1u}^TX^u − L_u − E_u⟩ + (µ/2)||W_{1u}^TX^u − L_u − E_u||_F^2.   (28)

By taking the derivative of L_{L_u} w.r.t. L_u and setting it to zero, we obtain,

(2H_u + 2β_2W_{1u}^TW_{1u} + (2β_5 + µ)I)L_u = 2β_2W_{1u}^T(X^u − W_{1u}E_u) + 2β_5K_uB_u + µ(W_{1u}^TX^u − E_u) + R_u,   (29)

and the solution of L_u is,

L_u = (2H_u + 2β_2W_{1u}^TW_{1u} + (2β_5 + µ)I)^{−1}(2β_2W_{1u}^T(X^u − W_{1u}E_u) + 2β_5K_uB_u + µ(W_{1u}^TX^u − E_u) + R_u).   (30)

Update K_u: The subproblem of Eq. 23 w.r.t. K_u is,

L_{K_u} = β_3||A^u − W_{2u}K_u||_F^2 + β_5||L_u − K_uB_u||_F^2.   (31)

Setting the derivative of Eq. 31 w.r.t. K_u to zero, we obtain,

β_3W_{2u}^TW_{2u}K_u + β_5K_uB_uB_u^T = β_3W_{2u}^TA^u + β_5L_uB_u^T,   (32)

which is a Sylvester equation and can be solved as before.
Update W_{1u} and W_{2u}: The subproblem of Eq. 23 w.r.t. W_{1u} is,

L_{W_{1u}} = β_2||X^u − W_{1u}(L_u+E_u)||_F^2 + β_4||W_1 − W_{1u}||_F^2 + ⟨R_u, W_{1u}^TX^u − L_u − E_u⟩ + (µ/2)||W_{1u}^TX^u − L_u − E_u||_F^2.   (33)

By taking the derivative of L_{W_{1u}} w.r.t. W_{1u} and setting it to zero, we again obtain a Sylvester equation, and W_{2u} can be obtained with a Sylvester equation of the same form. The whole algorithm can be found in Alg. 2.
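The discrete column-wise B_u update amounts to assigning each unseen latent vector to its nearest prototype. A sketch with toy shapes (the helper name and data are hypothetical):

```python
import numpy as np

def update_Bu(Lu, Ku):
    """Column j of B_u becomes the one-hot indicator of the prototype in
    K_u closest (in squared Euclidean distance) to latent vector L_u[:, j]."""
    # pairwise squared distances between prototypes and latent vectors: (q, N_u)
    d2 = ((Lu[:, None, :] - Ku[:, :, None]) ** 2).sum(axis=0)
    Bu = np.zeros_like(d2)
    Bu[d2.argmin(axis=0), np.arange(d2.shape[1])] = 1.0
    return Bu

Ku = np.eye(3)                                         # three toy prototypes
Lu = np.array([[0.9, 0.0], [0.1, 0.1], [0.0, 0.95]])   # two latent vectors
```

Each column of the returned matrix sums to one, satisfying the constraint 1^T B_u = 1^T.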

Algorithm 2
The Detailed Algorithm of Transductive Setting for RPCA Based ZSL
Input: The unseen data X^u, unseen class-level attributes A^u, and the initial projection matrices W_1 and W_2 learned in the inductive setting; the hyper-parameters β_1, β_2, β_3, β_4, β_5, µ, and ρ; the iteration number iter.
Output: The learned label matrix B_u of the unseen classes.
1: Initialize W_{1u} = W_1, W_{2u} = W_2, L_u = W_{1u}^TX^u and H_u = (L_uL_u^T)^{−1/2}; randomly initialize R_u;
2: for all k = 1 → iter do
3: Update B_u with Eq. 26;
4: Update E_u with Eq. 27;
5: Update L_u with Eq. 30;
6: Update H_u with H_u = (L_uL_u^T)^{−1/2};
7: Update K_u with Eq. 32;
8: Update W_{1u} and W_{2u} with the Sylvester equations;
9: Update R_u with R_u = R_u + µ(W_{1u}^TX^u − L_u − E_u);
10: Update the parameter µ by µ = ρµ;
11: end for
12: Return the learned B_u.

E. ZERO SHOT IMAGE CLASSIFICATION
Since the hubness problem often appears when predicting in low-dimensional spaces [34] such as the attribute space or the latent space, it is necessary to move the classification into the visual space, or better, to use all three spaces. With the projection matrices W_1 and W_2 obtained from the above optimization, the embeddings of a visual feature x_i in the latent and visual spaces can be represented as W_1^Tx_i and W_2W_1^Tx_i respectively. Similarly, the embeddings of a semantic attribute a_j in the latent and visual spaces can be represented as W_2^Ta_j and W_1W_2^Ta_j respectively. Therefore, we can classify in all three spaces. First, we define the classification probability in each space,

p_lat(j|x_i) ∝ exp(−d(W_1^Tx_i, W_2^Ta^u_j)),  p_att(j|x_i) ∝ exp(−d(W_2W_1^Tx_i, a^u_j)),  p_vis(j|x_i) ∝ exp(−d(x_i, W_1W_2^Ta^u_j)),   (35)

where d(·, ·) is the Euclidean distance and each probability is normalized over all candidate classes. We select the label with the highest probability across the three spaces of Eq. 35 as the final result,

y_i = argmax_j max(p_lat(j|x_i), p_att(j|x_i), p_vis(j|x_i)).   (36)

Eq. 35 and Eq. 36 are defined for conventional ZSL and transductive ZSL, and they can be extended to GZSL by replacing a^u_j with a_j ∈ A^s ∪ A^u.
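The three-space max-probability rule can be sketched as follows (a sketch assuming softmax-normalized negative distances; the exact probability form and helper names are assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def predict(x, A, W1, W2):
    """Probabilistic prediction over the latent, attribute and visual
    spaces; the class holding the single highest probability wins."""
    lat_x, lat_a = W1.T @ x, W2.T @ A                 # latent-space embeddings
    att_x = W2 @ W1.T @ x                             # x mapped to attribute space
    vis_a = W1 @ W2.T @ A                             # attributes mapped to visual space
    p = np.stack([
        softmax(-np.linalg.norm(lat_a - lat_x[:, None], axis=0)),
        softmax(-np.linalg.norm(A - att_x[:, None], axis=0)),
        softmax(-np.linalg.norm(vis_a - x[:, None], axis=0)),
    ])
    # argmax over (space, class); return only the class index
    return int(np.unravel_index(p.argmax(), p.shape)[1])
```

With identity projections and one-hot attributes this degenerates to nearest-neighbour search, which makes the behaviour easy to verify.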

IV. EXPERIMENTS
A. DATASETS
In our experiments, we employ four popular benchmark datasets, i.e., SUN (SUN Attribute) [35], CUB (Caltech-UCSD-Birds 200-2011) [36], AWA (Animals with Attributes) [4] and aPY (Attribute Pascal and Yahoo) [37]. SUN is a fine-grained dataset containing many different visual scenes. CUB is also fine-grained and consists of 200 bird species. AWA is a coarse-grained dataset of 50 classes of animals. aPY has 20 classes from Pascal VOC [38] for training and 12 classes from Yahoo [37] for test. Further details of these datasets can be found in Tab. 1, where 'SS' refers to the number of seen samples used in training, 'TS' is the number of test samples from unseen classes, and 'TR' is the corresponding number for seen classes. In addition, we adopt the split strategy proposed in [25].

B. EXPERIMENTAL SETTING
We employ deep features extracted with ResNet [2] for all four datasets as our input, and the attributes and class split strategy are the same as in [25]. Since there are many hyper-parameters in our method, we adopt cross validation to choose them. Here, we emphasize how ZSL cross-validation differs from that of conventional machine learning approaches: instead of inner-splits of the training samples within each class, the ZSL problem requires inter-class splits, in turn regarding part of the seen classes as unseen. In addition, when initializing W_1 and W_2 in the first step, the ratio of the optimal parameters λ_2, λ_3 and λ_4 is kept fixed for the next step, so the cross validation for the second step only involves three hyper-parameters: λ_1, λ_5 and the ratio of these terms to those of the first step. In the transductive setting, the ratio of β_1, β_2, β_3 and β_5 is set to be the same as that of λ_1, λ_2, λ_3 and λ_5, which greatly reduces the search time of cross validation. Besides, the initial value of µ is 0.1, and ρ is set to 0.1 as well. The three iteration counts are all set to 80. The core part of the source code and more experimental results can be found in the supplementary material.
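The inter-class validation split described above can be sketched as follows (a hypothetical helper, not the authors' code):

```python
import numpy as np

def zsl_cv_folds(classes, n_folds=3, seed=0):
    """Inter-class splits for ZSL cross-validation: each fold holds out a
    subset of *seen classes* as pseudo-unseen validation classes, instead
    of splitting samples within each class."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(classes)
    folds = np.array_split(order, n_folds)
    return [(np.setdiff1d(order, f), f) for f in folds]
```

Each returned pair is (pseudo-seen classes, pseudo-unseen classes); samples of the held-out classes never appear in the training side of that fold, mimicking the real seen/unseen gap.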

C. RESULTS ON ZSL
In this subsection, we test our method under the conventional ZSL setting on all four datasets listed in Tab. 1, and report the results in Tab. 2. Concretely, besides the baseline results recorded in [25], we also provide the results of four other methods, namely TVN [46], VZSL [47], LESAE [13] and LESD [11], among which LESAE and LESD are low rank based methods and the most related to ours. From Tab. 2, it can be clearly observed that our method obtains the best results on three datasets; in particular, on aPY it exceeds the best competing method, LESAE, by 6.4%. The accuracy of our method on CUB is only 1.5% lower than that of the best method, SYNC, ranking third among all the listed methods. For a broader comparison, we also compute the average over all four datasets and report it in the last column of Tab. 2; the results show that our method surpasses the best baseline by 2.6%. In addition, since our method has been extended to the transductive setting, we also compare it with three transductive methods, QFSL-T [48], GFZSL-T [45] and VZSL-T [47], and record the results in the bottom part of Tab. 2. GFZSL-T achieves the best performance on SUN, while our method wins on the other three datasets. To be specific, our method exceeds the best competing methods by 0.5%, 0.9% and 11.1% on CUB, AWA and aPY respectively, and is 1.9% lower than QFSL-T on SUN. However, QFSL-T is a nonlinear method while ours is linear. In addition, the average accuracy over all four datasets shows that our method exceeds QFSL-T by 6.3%.

D. RESULTS ON GZSL
Since GZSL is more realistic and challenging, we add more state-of-the-art methods to the comparison, such as PRESERVE [49], CDL [50], LAGO [51], PSEUDO [52] and KERNEL [53]. Besides, because the transductive setting assumes that the test data belongs to the unseen classes, which conflicts with the definition of GZSL, we do not conduct transductive experiments here. The results of our method and the baselines are recorded in Tab. 3. From this table, it can be seen that our method outperforms all the other methods on the metrics ts and H; in particular, on AWA and aPY it gains 8.2% and 10.3% on H respectively over the best baseline, CDL. All the baseline methods score high on tr but perform badly on ts, which leads to poor H; this means these methods overfit to the seen classes, while our method strikes a balance between seen and unseen classes. The main reason for this is the adoption of the orthogonal constraint, which processes the attributes of both the seen and the unseen classes simultaneously.

E. ABLATION STUDY
1) THE EFFECTS OF RPCA AND RELAXATION
In this experiment, we discuss two questions: first, whether the prototype relaxation strategy adopted in our method really improves performance, and second, whether the sparse constraint on E is effective. For the first, we remove ||L||_* + λ_1||E||_1 from Eq. 6, and optimize the remaining part with the same parameters as those used in Tab. 2 and Tab. 3. The results of three metrics on the four datasets are illustrated in Fig. 3, from which it can be clearly seen that the accuracies with the RPCA constraint are significantly higher than those without it. This reveals that the RPCA constraint plays an important role in improving performance. For the second, we only remove the term λ_1||E||_1 from Eq. 6 to check whether the sparsity assumption made in the first section is reasonable. The results of ts and H on the four datasets are shown in Fig. 4. We find that the performance with the sparse constraint is slightly better than without it on both ZSL and GZSL. Combined with Fig. 3, the performance without E lies between that with and without the full RPCA constraint, which shows that the low rank constraint on L is effective and the sparse constraint further improves performance.

2) THE EFFECTS OF AUTO-ENCODER (AE) AND ORTHOGONAL CONSTRAINT (OC)
The results are shown in Fig. 5, where the X-axis stands for the dimension of the latent space, represented as a multiple of the number of categories, and the Y-axis denotes ts on ZSL in Fig. 5(a) and H on GZSL in Fig. 5(b). It can be clearly observed that the four datasets share a common property: both ZSL and GZSL reach the best performance when the dimension is about twice the number of categories, and performance drops when the dimension is increased or decreased. Since the dimension must be at least the number of categories for all class prototypes to be mutually orthogonal, the dimension should be larger than the category number; on the other hand, too high a dimension may bring redundant information and lead to performance degradation.

G. CONVERGENCE ANALYSIS
Since the proposed method is optimized with an iterative strategy, it is necessary to show whether it can converge.
In this subsection, we illustrate the convergence curves on the four datasets in Fig. 6. From this figure, it can be clearly found that the proposed method converges when the iteration number exceeds 80. To be specific, the convergence speed on SUN is a little slower than on the other three datasets, which is caused by the fact that SUN has many more categories than the others, so it needs more iterations to find a feasible projection model that disperses all the classes.

H. DISTRIBUTION IN LATENT SPACE
The objective of the orthogonal constraint in latent space is to disperse all classes and make them more discriminative, and it is the reason why our method can outperform the baselines, especially on GZSL. For a more intuitive understanding, we employ t-SNE [54] on AWA to illustrate the distributions of the samples of unseen classes in this space. Meanwhile, we also show the distributions of the original visual features, the embedded vectors in attribute space produced by LESD, and the projected embeddings in attribute space produced by LESAE. The distribution maps are illustrated in Fig. 7. Our method obtains the best-separated distribution, and LESD performs the worst, because our method can disperse all the unseen classes while the other two methods cannot. LESAE is better than LESD because LESAE applies SAE to reconstruct visual features, which encourages the embeddings to preserve the characteristics of the visual features and makes their distribution similar to that of the original features.

I. ZERO SHOT RETRIEVAL RESULTS
In the zero shot retrieval task, we use the semantic attributes of each category as the query vector, and compute the mean Average Precision (mAP) of the returned images. For ease of comparison, we employ the standard split of the four datasets, which can be found in [25], and the results are shown in Tab. 5. The values of the baseline methods are cited directly from [55]. The results show that our method outperforms the baselines on all four datasets, especially on the fine-grained dataset CUB, which reveals that our method makes the prototypes in latent space more discriminative. Furthermore, we randomly select five class attributes from AWA [25] as query vectors, and the returned top-5 images for each class are illustrated in Fig. 8. From this figure, we can clearly see that all the returned images are correct, which also verifies the effectiveness of the proposed method.
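For reference, the retrieval metric can be sketched with the standard definition of average precision; mAP then averages this value over all query classes (the helper is illustrative, not the paper's evaluation script):

```python
import numpy as np

def average_precision(relevant, ranked):
    """AP for one query: the mean of precision@k taken at every rank k
    where a relevant item appears in the returned list."""
    hits, precisions = 0, []
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / k)
    return float(np.mean(precisions)) if precisions else 0.0
```

For example, with relevant items {1, 3} and ranking [1, 2, 3, 4], the precisions at the hits are 1/1 and 2/3, giving AP = 5/6.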

V. CONCLUSION
In this paper, we have proposed a novel prototype relaxation with RPCA based ZSL method. To fix the hard assumption of traditional low rank based methods, a sparse constraint is added to tolerate noise. An orthogonal constraint is employed to disperse all the class prototypes, including both seen and unseen classes, in latent space to avoid confusion between similar classes. Furthermore, vectors from the latent space are exploited to reconstruct the visual features and semantic attributes respectively to alleviate the domain shift problem. In addition, the hubness problem is also mitigated by applying the max probability model over the three combined spaces. Extensive experiments on four datasets are conducted and the results show the superiority of the proposed method.

WENBO WANG is currently pursuing the master's degree with the School of Information Engineering, Nanjing Audit University. His main research interests include image recognition and robotics.