Generalized Zero Shot Learning via Synthesis Pseudo Features

Compared with conventional zero-shot learning (ZSL), generalized ZSL (GZSL) is more challenging because the test instances may come from both seen and unseen classes. Most existing GZSL methods learn a visual-semantic mapping function that transfers knowledge from seen to unseen classes by using semantic information and labeled training data. However, these methods often suffer from severe performance degradation because they ignore the similarity structure between different classes. To solve this problem, we propose a GZSL method that transforms GZSL problems into conventional supervised learning ones by synthesizing pseudo features for unseen classes. The method has two key aspects. The first is the synthesis strategy: in contrast to current synthesis-based methods, which synthesize pseudo instances, the proposed strategy directly synthesizes the pseudo features of unseen classes. Our method regards a combination of the features of N instances as the pseudo features. These N instances belong to N different seen classes that are similar to the unseen one. This synthesis strategy is in line with the cognitive style of human beings. The second key aspect is that we preserve the similarity structure between seen and unseen classes. Inspired by the center loss method, we assign each semantic vector as the center of the deep features of its class in the training stage, which preserves the similarity structure between classes and thereby improves classification accuracy. The experimental results on four benchmark datasets demonstrate that our model outperforms state-of-the-art methods for GZSL. The source code is available at https://github.com/guizilaile23/SPF-GZSL.


I. INTRODUCTION
Supervised learning methods have achieved significant successes in many areas with sufficient labeled training data provided for each class [1], [2]. However, the collection and annotation of the large amounts of training data for growing classes are time-consuming and expensive. Consequently, certain classes only have a small quantity or even no training data, resulting in the failure of the conventional supervised method [3].
Unlike traditional supervised learning methods, zero-shot learning (ZSL) [4] aims to recognize instances of classes for which no training data are available during training. ZSL methods usually learn knowledge from training sets that belong to seen classes, wherein sufficient labeled instances are provided [4]-[7]. Moreover, with the help of auxiliary information that contains descriptions of seen and unseen classes, ZSL methods can generate predictions for instances that belong to unseen classes even though the seen and unseen classes are disjoint.
The associate editor coordinating the review of this manuscript and approving it for publication was Zijian Zhang.
The ZSL models are usually evaluated in a restricted setting where the training and test classes are disjoint. The test examples come only from unseen classes, and the search space is also limited to the unseen classes. However, the ability to classify instances from both seen and unseen classes also matters in practice. The challenging setting where the test instances may come from seen and unseen classes is known as generalized ZSL (GZSL).
GZSL and ZSL require auxiliary information to achieve knowledge transfer [6]. Such auxiliary information usually describes the seen and unseen classes at a uniform level, acting as side information from which the latter can be inferred rationally. In existing work, the strategy of involving auxiliary information is inspired by the way human beings recognize the world. Humans can perform ZSL using semantic background knowledge. A human can recognize a zebra from the description "a zebra has the same shape as a horse, but the color of a zebra is black-and-white stripes" even without having seen one before, as long as they know the features of a horse and the "stripe" pattern [8]. In this way, auxiliary information, such as attributes [9], [5] or word vectors [10], [11], is involved in the existing ZSL methods. Attributes are obtained through manual annotation or automatic learning, while word vectors are obtained by applying language processing technology to a large text corpus.
With the help of semantic information, mapping-based methods learn a mapping function between visual and semantic spaces. However, these methods generally encounter the hubness problem due to information loss during projection. A few related approaches, such as generation-based and synthesis-based methods, take different perspectives. Generation-based methods use the semantic information and visual instances belonging to seen classes to learn a generation model. Then, pseudo instances are generated on the basis of the semantic information of unseen classes, so that a classifier can be trained easily using the generated instances. However, such techniques are sensitive to the quality of semantic information, and obtaining a well-trained generation model is usually difficult. Meanwhile, synthesis-based methods use visual instances from seen classes to synthesize pseudo instances for unseen classes. The classifier can then be trained in the same way as in generation-based methods. Naturally, different synthesis strategies lead to varied results. Existing synthesis-based methods all adopt complex synthesis strategies and ignore the role of inter-category structure.
In this work, we propose an effective method for GZSL. Our method transforms the GZSL and ZSL problems into conventional supervised learning ones by synthesizing pseudo features for unseen classes. In contrast to existing methods [11]-[13], we first calculate the distances between each unseen class and the seen classes based on the semantic information and treat these distances as similarity scores. The smaller the distance, the more similar the two classes are. For each unseen class, we then randomly choose one instance from each of its N most similar seen classes. Subsequently, the deep features of these similar instances are combined using a linear combination model. The combination is treated as a pseudo feature belonging to the unseen class. Given the features for each unseen class, we can train a classifier directly, as in conventional supervised learning. Our method also adopts the idea of the center loss method during the training stage to learn discriminative features. However, we assign the semantic vectors directly as the centers of the classes instead of learning the centers of the deep features. Accordingly, discriminative feature representations are learned and the similarity structure among classes is preserved. In this way, our method can avoid the potential domain shift problem. Our method is introduced in detail in Section III.

II. RELATED WORK
In this section, we group the existing ZSL and GZSL approaches by the strategy they adopt. Most current approaches fall into three types: mapping-based, generation-based, and synthesis-based methods.

A. MAPPING-BASED METHODS
The most widely used methods are mapping-based ones, such as [14], [15]. Mapping-based methods learn a mapping function from the image feature space to the semantic space or vice versa. Some of these methods instead map the image feature and semantic spaces to a latent space. The label is then predicted by nearest-neighbor search.
Among the mapping methods, SOC [5] maps the image features into the semantic space and then searches for the nearest class embedding vector. DeViSe [16] also learns a linear mapping between the image and semantic spaces by using an efficient ranking loss formulation, and is evaluated on the large-scale ImageNet dataset. DEM [17] uses the visual space as the embedding space and projects semantic vectors into it, instead of using the widely adopted semantic space as the embedding space. SAE [18] proposes a semantic autoencoder that regularizes the model by requiring that the image features projected to the semantic space can be reconstructed. TVN [19] projects the image and semantic features into a latent space where the seen classes are orthogonal to each other and then adds three constraints to guarantee fast convergence and limit information loss. RN [20] adopts the architecture of a pseudo-Siamese network and learns to distinguish whether an image and an attribute vector belong to the same class. DCN [21] maps the visual features of images and the semantic vectors of class prototypes to a common embedding space. During the test stage of GZSL, it predicts whether a test instance comes from a seen or an unseen class, and different parameters are adopted accordingly to improve compatibility with the GZSL setting. ZSKL [22] applies well-established kernel methods to learn a nonlinear mapping between the feature and attribute spaces, in contrast to existing methods that learn a linear mapping function. NIWT [23] uses training instances and the corresponding semantic information to learn a mapping between class-specific semantics and the importance of individual neurons within a deep network. The learned mapping can predict neuron importance from knowledge regarding unseen classes, and the classification weights are then optimized so that the resulting network aligns with the predicted importance.
Although many methods adopt mapping-based strategies, their classification ability is limited under the GZSL setting, mainly because of their weak discriminative power.

B. GENERATE-BASED METHODS
Generating pseudo instances with a generative model is also an effective approach. Most of these methods first train a generator and a discriminator using the seen class instances and semantic information; some methods instead adopt an autoencoder architecture. Instances are then generated from the semantic information of the unseen classes, and a classifier is trained on the generated instances.
F-CLSWGAN [24] generates the visual features of unseen classes by optimizing the Wasserstein distance regularized through a classification loss. This is achieved by encouraging the generator to construct features that can be classified correctly by a discriminative classifier trained on the input data. TFGNSCS [25] also integrates a Wasserstein generative adversarial network with classification and transfer losses to generate sufficient convolutional neural network (CNN) features. This method not only considers the semantic structural relationship between seen and unseen classes but also learns the differences among the generated features. SE-GZSL [26] develops a generative model using a VAE-based architecture composed of a probabilistic encoder and a conditional decoder. The learned model can generate pseudo instances given their respective class attributes. LESAE [27] uses the encoder-decoder paradigm to jointly seek a low-rank mapping that links visual features with their semantic representations. The encoder aims to learn a low-rank mapping from the visual feature space to the semantic space, whereas the decoder reconstructs the original data with the learned mapping.
The results of generation-based methods are better than those of mapping-based techniques. However, the generated instances may contain considerable bias from the real ones when the distribution between seen and unseen domains differs. As a result, the learned ZSL classifiers for unseen classes cannot work well.

C. SYNTHESIS-BASED METHODS
The synthesis-based methods transform ZSL and GZSL into traditional classification tasks by synthesizing pseudo instances for unseen classes. Recent work showed that such methods achieve better results than the mapping-based ones.
CSSD [13] first maps the semantic information of seen classes into a latent space to simultaneously learn a class-specific encoding matrix for each class and a dictionary matrix for reconstructing the visual features within a dictionary learning framework. The pseudo instances of unseen classes are then synthesized with the semantic information of affinity seen classes and their corresponding encoding matrices. ABS-NET [12] first learns credible feature representations for each attribute by utilizing a prepared dataset of seen classes. These attribute representations are then summarized into one according to the specified attribute descriptions of each unseen (ZSL) or seen class. UVDS [28] first projects semantic features to a high-dimensional visual feature space and then synthesizes the pseudo instances with the learned projection function. The domain shift problem is addressed by preserving the local structure of the latent embedding space. Reference [29] proposes a one-step ZSL framework that synthesizes pseudo instances for unseen classes; it treats the instance selection as a quadratic optimization problem and proposes an efficient solution on the basis of the augmented Lagrange multiplier framework to guarantee the diversity of the synthetic pseudo instances.
In these methods, different synthesis strategies will lead to varied results. Techniques that adopt complex synthesis strategies are highly susceptible to the noise in semantic information.

A. PRELIMINARY
In this section, we first introduce the notations and definitions used in this work. $S = \{c^s_i \mid i = 1, \ldots, N_s\}$ is the set of seen classes, where each $c^s_i$ is a seen class. $U = \{c^u_i \mid i = 1, \ldots, N_u\}$ is the set of unseen classes, where each $c^u_i$ is an unseen class. $X^s = \{x^s_i\}$ is the labeled training dataset belonging to the seen classes, and $Y^s = \{y^s_i\}$ represents the corresponding labels, where $y^s_i$ is the class label of instance $x^s_i$.
$X^u = \{x^u_i\}$ is the set of testing instances, where $x^u_i$ is the $i$th testing instance, which has no label, and $Y^u = \{y^u_i\}$ is the set of corresponding class labels for $X^u$, which are to be predicted.
Each class in $Y^u \cup Y^s$ has a corresponding vector representation in the semantic space, which is referred to as the class prototype of this class; $A^s$ and $A^u$ denote the prototypes of the seen and unseen classes, respectively. ZSL aims to use the training dataset $D^s$ (the instances $X^s$ with their labels $Y^s$) and the semantic information $A^s \cup A^u$ to learn a classifier $f^u(\cdot)$ that can categorize the testing instances $X^u$ belonging to the unseen classes $U$. The goal of GZSL is the same as that of ZSL, but the testing instances can come from both $X^u$ and $X^s$.
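The difference between the ZSL and GZSL settings can be illustrated with a minimal sketch of the label search space. The class names and scores below are hypothetical and serve only as an illustration; the function `predict` is not part of the released code.

```python
def predict(scores, candidate_classes):
    """Return the candidate class with the highest classifier score.

    scores: dict mapping class name -> score for one test instance.
    candidate_classes: the label search space of the evaluation setting.
    """
    return max(candidate_classes, key=lambda c: scores[c])


# Hypothetical class names and scores, for illustration only.
seen_classes = ["horse", "tiger"]
unseen_classes = ["zebra", "whale"]
scores = {"horse": 0.55, "tiger": 0.05, "zebra": 0.35, "whale": 0.05}

# Conventional ZSL: the search space is restricted to the unseen classes.
zsl_prediction = predict(scores, unseen_classes)

# GZSL: the search space covers seen and unseen classes together.
gzsl_prediction = predict(scores, seen_classes + unseen_classes)

print(zsl_prediction, gzsl_prediction)
```

In this toy case the same scores give a correct prediction under ZSL but a seen-class prediction under GZSL, which is the bias toward seen classes that makes GZSL harder.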

B. NETWORK ARCHITECTURE
In GZSL, the instances of unseen classes are unavailable, so an effective approach is to synthesize pseudo instances for the unseen classes. Supposing that we already have these pseudo instances, we can transform GZSL and ZSL into conventional supervised classification problems. The synthesis method is described in detail later.
VOLUME 7, 2019
Our model follows the conventional classification neural network. The embedding part, denoted as $E(\cdot)$, is used to extract features. The classification part, denoted as $C(\cdot)$, is used to classify the features into classes.
In the conventional supervised classification method, we can learn a classification function $f$ by minimizing the following loss function given the training instances $x \in X$ and corresponding labels $y \in Y$:
$$L_s(\omega) = \sum_{i} \ell\big(f(x_i; \omega), y_i\big) + \|\omega\|^2 \quad (1)$$
where $x$ denotes the input instances, $y$ denotes the corresponding labels, $\omega$ denotes the parameters of the classification function $f$, $\ell(\cdot)$ is a distance metric function that measures the distance between the predictions of $f$ and the true labels $y$, and $\|\omega\|^2$ is the regularization term. A decrease in the loss function $L_s(\omega)$ implies that the classifier makes more correct predictions.
Reference [30] proposes an auxiliary loss function called center loss to improve the discriminative power of the deeply learned features. The center loss function is formulated in (2):
$$L_c = \frac{1}{2} \sum_{i=1}^{m} \| v_i - c_{y_i} \|_2^2 \quad (2)$$
where $v_i$ denotes the deep features of the $i$th training instance $x_i$ extracted from a hidden layer of a CNN ($v_i = E(x^s_i)$ in our method), $c_{y_i}$ denotes the center of the deep features of class $y_i$ in this layer, and the sum runs over the $m$ training instances. The formulation characterizes the intra-class variations effectively. Fundamentally, $c_{y_i}$ should be updated as the deep features change. At the training stage, the conventional and center losses are optimized jointly to learn discriminative features. The loss function is formulated as follows:
$$L = L_s + \lambda L_c \quad (3)$$
The center loss $L_c$ minimizes the intra-class distances of the deep features, while the conventional loss $L_s$ keeps the features of the different classes separable. A scalar $\lambda$ is used to balance the two loss functions. The conventional loss can be regarded as a special case of this joint supervision if $\lambda$ is set to zero.
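The center loss and the joint objective can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the toy features, centers, and the value of the classification loss are invented, and the function names are hypothetical rather than taken from [30] or from the released code.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def center_loss(features, labels, centers):
    """Half the summed squared distance of each feature to its class center."""
    return 0.5 * sum(
        squared_distance(v, centers[y]) for v, y in zip(features, labels)
    )


def joint_loss(classification_loss, features, labels, centers, lam):
    """Conventional loss plus the center loss weighted by lam."""
    return classification_loss + lam * center_loss(features, labels, centers)


# Toy deep features for three instances from two classes (illustrative values).
features = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
labels = [0, 0, 1]
centers = {0: [0.5, 0.5], 1: [3.0, 2.0]}

l_c = center_loss(features, labels, centers)
# The classification loss value 0.7 is an assumed placeholder.
l_joint = joint_loss(0.7, features, labels, centers, lam=0.1)
l_plain = joint_loss(0.7, features, labels, centers, lam=0.0)
print(l_c, l_joint, l_plain)
```

Setting `lam` to zero recovers the conventional loss, which matches the special case described in the text.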
With the two key learning objectives, we can train a robust CNN to obtain discriminative deep features. Here, we present the results of the MNIST [31] classification example used in [30] to illustrate the effect of the center loss method. The authors modified LeNet-5 [31] to a deeper and wider network but reduced the output number of the last hidden layer to two (i.e., the dimension of the deep features is two). Accordingly, the features can be plotted on a 2D surface for visualization. The resulting 2D deep features are plotted in Figure 3.
The deep features of different classes are distinguished by decision boundaries because the last fully connected layer acts as a linear classifier. Figure 3(a) illustrates that under the supervision of the softmax loss used in [30], the deeply learned features are separable but insufficiently discriminative because they still show significant intra-class variations. Figure 3(b) shows that under the joint supervision of softmax and center losses, the deep features are gathered, the intra-class distances are quite small, and the discriminative power of the deep features is enhanced significantly. Therefore, joint supervision benefits the discriminative power of the deeply learned features.
A large number of ZSL studies [13], [32]-[34] have shown that the probabilities with which an instance is classified into different categories indicate its relations to their class prototypes. For example, for a well-trained classifier, the probability that a zebra is recognized as a horse is much higher than the probability that it is recognized as a ship, because a zebra is more like the former than the latter.
We further assume that if two class prototypes are similar, then their corresponding feature centers should also be close. Semantic information, a kind of class prototype description, has a one-to-one correspondence to the class prototype, which means that $a_i = a_j$ only if $i = j$. The semantic information of different classes in the semantic space reflects the similarity between classes, which plays a key role in building the knowledge transfer in ZSL. In the semantic space, two semantic vectors are close if their class prototypes are similar. For example, the semantic vectors of "rat" and "mouse" should be close in the semantic space, whereas the distance between "mouse" and "car" will be large.
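We directly assign the semantic vector of each class as the center of deep features instead of learning the center of deep features itself. Thus, our optimization goal becomes:
$$L = L_s + \frac{\lambda}{2} \sum_{i=1}^{m} \| v_i - a_{y_i} \|_2^2 \quad (4)$$
where $a_{y_i}$ denotes the semantic vector of class $y_i$ and the sum runs over the $m$ training instances. This objective can be easily optimized by using popular optimization methods, such as Adam [26] and SGD [31].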
This strategy can not only skip the procedure of updating the centers, but also preserve the inter-class similar structure, which is useful in the ZSL problems.
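A per-instance version of this objective can be sketched as follows. The sketch assumes a softmax cross-entropy as the conventional loss and uses toy two-dimensional semantic vectors; the function names and values are hypothetical. The centers are a fixed lookup table of semantic vectors, so no center update step appears.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one instance, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def semantic_center_term(feature, semantic_vector):
    """Half the squared distance between a deep feature and its class semantic vector."""
    return 0.5 * sum((f - a) ** 2 for f, a in zip(feature, semantic_vector))


def instance_loss(logits, feature, label, semantic_vectors, lam):
    """Classification loss plus the semantic-center term weighted by lam."""
    center = semantic_vectors[label]  # fixed center, never updated
    return cross_entropy(logits, label) + lam * semantic_center_term(feature, center)


# Toy semantic vectors for two classes (illustrative values only).
semantic_vectors = [[1.0, 0.0], [0.0, 1.0]]
logits = [0.0, 0.0]

# A feature that coincides with its class semantic vector adds no penalty.
loss_on_center = instance_loss(logits, [1.0, 0.0], 0, semantic_vectors, lam=0.1)
# A feature far from its class semantic vector is penalized.
loss_off_center = instance_loss(logits, [0.0, 1.0], 0, semantic_vectors, lam=0.1)
print(loss_on_center, loss_off_center)
```

Note that the deep feature must have the same dimension as the semantic vector for the distance term to be defined.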

C. SIMILAR SAMPLE SELECTION AND PSEUDO FEATURE SYNTHESIS
We now describe how to select samples and synthesize the pseudo instances. Consider the zebra example mentioned previously. Humans can recognize a zebra from the description "a zebra has the same shape as a horse, but the color of a zebra is black-and-white stripes" even without having seen any zebra before, as long as they know the features of a horse and the "stripe" pattern [8].
FIGURE 2. Illustration of our proposed method for GZSL. We first compute the similarity score on the basis of the semantic information, which is pre-acquired. An embedding neural network is used to extract features. Then, instances A, B, and C, which belong to the seen classes, are taken as input. The pseudo features of the unseen class D are synthesized using the extracted features of the seen classes (that is, A, B, and C). The classification neural network can be trained using the features of all classes (that is, A, B, C, and D) during the training stage. We compute the MSE loss between the extracted deep features and the semantic vectors. We also calculate the cross-entropy loss between the prediction and the true label. After the training stage, the labels of the test data can be predicted directly.
In our method, we imitate this learning process by directly using the combination of similar instances as a pseudo instance. For example, the pseudo instance of a zebra could be the combination of a horse and stripes. We formulate our assumption as follows:
$$x^u_{zebra} = \theta_{horse}\, x^s_{horse} + \theta_{stripe}\, x^s_{stripe} \quad (5)$$
where $x^s_{horse}$ and $x^s_{stripe}$ denote the instances of horse and stripe, which are similar to those of a zebra, and $\theta_{horse}$ and $\theta_{stripe}$ denote the shape and color feature selectors, respectively.
During this process, two key problems need to be solved. The first problem is choosing the similar instances, and the second one is combining them.

1) SIMILAR SAMPLE SELECTION
Suitable similar instances benefit the synthesis. For example, if we want to synthesize a motorcycle, then choosing a bicycle and a car is more suitable than choosing a cat and a dolphin.
Existing studies [32], [33], [37]-[39] have shown that semantic similarities among classes are consistent with visual similarities. This notion indicates that if the semantic vectors of two different classes are similar, then their visual instances should also be similar to each other. We can therefore choose similar instances based on the semantic vector distances between different classes.
The similarities (distances) between different classes can be evaluated in many possible ways, including cosine and Euclidean distances. Because the cosine distance focuses on the difference in direction between two vectors, it is more suitable for measuring the similarity of semantic information. Therefore, we choose the cosine distance to measure the similarities between different classes in our method. The similarity score $\mu_{ij}$ is computed from $s_{ij}$ and $s_{max}$, where $a^u_i$ denotes the semantic vector of the $i$th unseen class, $a^s_j$ denotes the semantic vector of the $j$th seen class, $s_{ij}$ denotes the cosine distance between unseen class $i$ and seen class $j$, and $s_{max}$ denotes the maximum distance. We compute the distance between each unseen class and every seen class and choose the $N$ nearest seen classes as the similar ones for each unseen class.
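The selection step can be sketched as follows. The cosine distance and the choice of the N nearest seen classes follow the description above. The rule that turns distances into scores, (s_max - s_ij) normalized to sum to one over the selected classes, is an assumption made only for this illustration and may differ from the formula used in our method; the class names and attribute values are likewise hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance, 1 - cos(u, v), between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def select_similar_classes(unseen_vector, seen_vectors, n):
    """Return [(seen class, score)] for the n seen classes nearest to an unseen class."""
    distances = {
        name: cosine_distance(unseen_vector, vec)
        for name, vec in seen_vectors.items()
    }
    nearest = sorted(distances, key=distances.get)[:n]
    s_max = max(distances.values())
    # ASSUMPTION: scores are (s_max - s_ij), normalized over the selected classes.
    raw = {name: s_max - distances[name] for name in nearest}
    total = sum(raw.values())
    if total == 0:
        # All distances equal: fall back to uniform scores.
        return [(name, 1.0 / len(nearest)) for name in nearest]
    return [(name, raw[name] / total) for name in nearest]


# Hypothetical three-dimensional attribute vectors, for illustration only.
seen_vectors = {
    "bicycle": [1.0, 0.8, 0.0],
    "car": [0.9, 1.0, 0.1],
    "cat": [0.0, 0.1, 1.0],
    "dolphin": [0.0, 0.0, 1.0],
}
motorcycle = [1.0, 1.0, 0.0]

similar = select_similar_classes(motorcycle, seen_vectors, n=2)
print(similar)
```

With these toy vectors the two selected classes are the bicycle and the car, in line with the motorcycle example above.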

2) SYNTHESIS METHOD
We formulate our assumption for the pseudo feature synthesis method as follows:
$$x^u_i = \sum_{j=1}^{N} \theta_j\, x^{s_i}_j \quad (8)$$
where $x^{s_i}_j$ denotes the $j$th similar instance, which is similar to unseen class $i$, and $\theta_j$ is the feature selector. These feature selectors are determined using the semantic information between different classes. However, current semantic information usually contains only the description of the class prototype itself. Obtaining the description between different classes is difficult, which makes the feature selectors difficult to calculate [37], [40]. Given this limitation, we adopt an approximation: we directly take the summation of the features of the similar instances, weighted by the similarity scores. Our synthesis method can be formulated as follows:
$$v^u_i = \sum_{j=1}^{N} \mu_{ij}\, v^{s_i}_j \quad (9)$$
where $v^{s_i}_j = E(x^{s_i}_j)$ denotes the deep features of the $j$th similar instance. The embedding net of our method can be treated as a regression from the instances to the semantic vectors. Thus, the pseudo features we synthesize should also correspond to the semantic vectors of the unseen classes. If our embedding part has a strong regression ability, then $v^s_i \approx a^s_i$ and $v^u_i \approx a^u_i$. Based on (9), the pseudo feature $v^u_i$ is then approximately $\sum_{j=1}^{N} \mu_{ij} a^{s_i}_j$, which leads to a bias between $a^u_i$ and $\sum_{j=1}^{N} \mu_{ij} a^{s_i}_j$. We present an example to show that this bias does not affect the performance.
We visualized the inter-class similarity scores of the Caltech-UCSD Birds-200-2011 (CUB) [41] dataset: for each semantic vector, we measured the cosine distances between it and all the others. A small distance indicates that two vectors are close. Figure 4 shows the resulting similarity matrix.
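The weighted summation of the features of similar instances can be sketched as below. The feature values and similarity scores are toy numbers chosen for illustration, and the function name is hypothetical.

```python
def synthesize_pseudo_feature(similar_features, scores):
    """Weighted sum of the deep features of N similar seen-class instances.

    similar_features: list of N feature vectors of equal length.
    scores: list of N similarity scores, one per similar instance.
    """
    dim = len(similar_features[0])
    return [
        sum(w * f[k] for w, f in zip(scores, similar_features))
        for k in range(dim)
    ]


# Toy deep features of two similar seen-class instances (illustrative values).
similar_features = [[1.0, 0.0], [0.0, 1.0]]
scores = [0.75, 0.25]

pseudo_feature = synthesize_pseudo_feature(similar_features, scores)
print(pseudo_feature)
```

Because the similar instances are drawn at random during training, a new pseudo feature can be produced on the fly for every unseen-class label in a batch.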

IV. EXPERIMENTS
We evaluate our approach on four benchmark datasets to illustrate the effectiveness and superiority of our proposed method. These datasets are widely used in ZSL and GZSL. The experiments are implemented based on PyTorch [42].

A. DATASETS AND SETTING
The first dataset is Animals with Attributes 1 (AWA1) [9]. This dataset is a coarse-grained one with a total of 30,475 images for 50 animal classes. Forty classes with 24,295 images are used for training, and the remaining 10 classes with 6,180 images are used for testing. The dataset provides an 85-dimensional class-level attribute vector.
The second dataset is Animals with Attributes 2 (AWA2) [7]. This dataset is a revised version of AWA1, in which many unseen classes in the conventional splits [44] overlap severely with ImageNet. Like AWA1, AWA2 contains 50 animal categories, with 37,322 images collected. This dataset uses the new train-test split suggested by [7]. Forty classes are used for training and 10 others for testing in the AWA2 dataset. Eighty-five associated class-level attributes are also provided.
The third dataset is CUB [41]. The CUB dataset contains 200 kinds of birds with 11,788 images in total. This dataset is a fine-grained one with respect to the number of images and classes. One hundred and fifty classes are used as the seen classes and 50 classes are used as the unseen ones. CUB also provides an instance-level attribute vector. However, we only use the 312-dimensional class level attribute vector in this work.
The last one is SUN Attribute (SUN) [43]. SUN contains 717 kinds of different common scenes with 14,340 images. Each class is annotated with a 102-dimensional attribute vector. This dataset is also a fine-grained one. In accordance with the split suggestion, 645 classes are used for training and 72 for testing.
We evaluated our method under the new split setting provided by [7] to allow comparison with the other methods. More details on the settings can be found in [7].
As in other work, we strictly use the 2048-dimensional features of each image extracted from the pre-trained ResNet-101 [45] and provided by [7].
Only the attribute vectors provided by each dataset are used. We use the continuous 85-dimensional class-level attribute vector for AWA1 and AWA2, which has been used in recent studies. A continuous 312-dimensional class-level attribute vector is used for CUB. A binary 102-dimensional class-level attribute vector is used for SUN. Table 1 summarizes the details of the four datasets.

B. IMPLEMENTATION DETAILS
Our model follows the common CNN architecture because it transforms ZSL to a conventional supervised learning problem.
Because the features extracted from ResNet-101 [45] are used as input, the embedding neural network consists of three fully connected layers that regress the attribute vector. Each layer except the last one is followed by a ReLU [46] layer. The number of input units is 2048. The number of output units follows the attribute vector dimension provided by each dataset. The hidden units for CUB are 1024 and 512. The hidden units for AWA1, AWA2, and SUN are 1000 and 512.
The classification part contains one fully connected layer, which will be utilized in making prediction. The numbers of input and output units follow the dimension of attribute vector and the number of classes provided by each dataset.
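The layer sizes described above can be collected in a small configuration helper. The numbers are those stated in this section and in the dataset descriptions; the dictionary layout and the function name `layer_shapes` are hypothetical and only meant as a sketch.

```python
# Attribute dimension, total number of classes, and hidden sizes per dataset,
# as stated in the text.
DATASETS = {
    "AWA1": {"attr_dim": 85, "num_classes": 50, "hidden": (1000, 512)},
    "AWA2": {"attr_dim": 85, "num_classes": 50, "hidden": (1000, 512)},
    "CUB": {"attr_dim": 312, "num_classes": 200, "hidden": (1024, 512)},
    "SUN": {"attr_dim": 102, "num_classes": 717, "hidden": (1000, 512)},
}
INPUT_DIM = 2048  # dimension of the ResNet-101 features


def layer_shapes(dataset):
    """Return the (in, out) sizes of the embedding layers and of the classifier."""
    cfg = DATASETS[dataset]
    sizes = [INPUT_DIM, *cfg["hidden"], cfg["attr_dim"]]
    embedding = list(zip(sizes[:-1], sizes[1:]))  # three fully connected layers
    classifier = (cfg["attr_dim"], cfg["num_classes"])  # one fully connected layer
    return embedding, classifier


embedding, classifier = layer_shapes("CUB")
print(embedding, classifier)
```

The output of the embedding part has the attribute dimension, which is what allows the semantic vectors to act as centers of the deep features.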
We add weight decay (L2 regularization) in the embedding and classification parts. The entire model is optimized by Adam with a mini batch size of 1500.
Before the training stage, we calculate the similarity score of each unseen class based on the semantic vectors and select N instances belonging to N similar classes. The N is set to five for AWA1, AWA2, and SUN and three for CUB.
At the training stage of GZSL, we randomly generate a batch of training labels belonging to $Y^u \cup Y^s$. If a label in the batch comes from a seen class, then we randomly select an instance belonging to this class and feed it into the net. If the label belongs to an unseen class, then we let the similar instances go through the embedding net and synthesize the output features based on (8). In our method, similar classes are selected based on (7). We also randomly select the similar instances in these similar classes during training. Therefore, we do not need to store the synthesized features of the unseen classes.
The synthesized feature is then fed into the classification net. We calculate the MSE loss between the attribute vector and the output of the embedding net, and the cross-entropy loss between the predictions and the true labels, similar to the center loss method.
During the label generation stage, we set up a new parameter η to control the proportion between seen class label number and unseen classes in a batch. For example, η = 0.4 means that 40% of the training labels belong to the seen classes. We set η = 0.5 for AWA1 and AWA2 and η = 0.3 for CUB and SUN.
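The label generation step can be sketched as follows. The sketch assumes that exactly round(η × batch size) labels are drawn from the seen classes and the rest from the unseen classes; whether the proportion is enforced exactly or only in expectation is an assumption of this illustration. The class names and the function name are hypothetical.

```python
import random


def generate_batch_labels(seen_labels, unseen_labels, batch_size, eta, rng):
    """Draw a batch of labels with a fraction eta coming from the seen classes."""
    num_seen = round(eta * batch_size)
    labels = [rng.choice(seen_labels) for _ in range(num_seen)]
    labels += [rng.choice(unseen_labels) for _ in range(batch_size - num_seen)]
    rng.shuffle(labels)
    return labels


# Hypothetical class names; a fixed seed keeps the example reproducible.
seen_labels = ["horse", "tiger", "dog"]
unseen_labels = ["zebra", "whale"]
batch = generate_batch_labels(
    seen_labels, unseen_labels, batch_size=10, eta=0.4, rng=random.Random(0)
)
num_seen_in_batch = sum(label in seen_labels for label in batch)
print(batch, num_seen_in_batch)
```

For each seen-class label a real instance would then be sampled, and for each unseen-class label a pseudo feature would be synthesized from randomly chosen similar instances.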
We follow the setting in [7], wherein we select randomly 20% of the instances from each seen class as the seen class test set for the GZSL setting.

C. COMPARATIVE RESULTS OF GZSL
In the GZSL setting, the instances for evaluation may come from seen and unseen classes. Our aim is to have high accuracy on both. Thus, we choose the harmonic mean as our main evaluation indicator instead of the arithmetic mean, because with the latter a considerably high accuracy on one group of classes dominates the overall result. The harmonic mean can be computed by the following function [7]:
$$H = \frac{2 \times acc_{tr} \times acc_{ts}}{acc_{tr} + acc_{ts}}$$
where $acc_{tr}$ and $acc_{ts}$ are the average per-class top-1 (T1) accuracies of the test images from seen and unseen classes, respectively. The average per-class T1 accuracy is measured as follows:
$$acc = \frac{1}{|C|} \sum_{c \in C} \frac{\#\,\text{correct predictions in } c}{\#\,\text{samples in } c}$$
where $C$ is the set of classes being evaluated.
We select ten recent approaches for comparison because earlier methods, such as DAP [4], ConSE [47], DeViSe [16], SYNC [48], and SJE [15], were published several years ago and their performance is comparatively low. The methods we select were published in the past two years (2018 and 2019), except for DEM [17], which was published in 2017.
The results of NIWT [23], RN [20], and DEM [17] are obtained from the code released on the authors' GitHub pages. The remaining results are cited directly from the published papers. Table 2 summarizes the best results of the different models. Table 2 shows that our method achieves the highest H-mean and "ts" on AWA1 and AWA2, improving the H-mean by 4% and 11% over the second-best method and leading the second-best "ts" by 15.7% and 10.1%, respectively. On CUB, we achieve the second-best H-mean and "ts". On SUN, we achieve the best result in all three evaluation indicators, with increments of 6.7%, 20.7%, and 11.4% over the existing best results on "ts", "tr", and H-mean, respectively. These results show that the proposed method is effective.
Among the baseline methods, those with a large "tr", such as RN [20], generally have small "ts" and H-mean values, which reflects a model bias toward the seen classes and poor generalization to new classes. In our method, "tr" and "ts" are close on AWA1 and AWA2. We can also bring "ts" and "tr" closer on CUB and SUN by adjusting η, which will be illustrated later. These results demonstrate that our model can significantly mitigate the bias toward seen classes in GZSL.
TABLE 2. GZSL classification accuracy of different approaches. ts = $acc_{ts}$ (T1 per-class accuracy on $Y^u$), tr = $acc_{tr}$ (T1 per-class accuracy on $Y^s$), H = harmonic mean. We measure T1 accuracy in %. The best results are marked in bold, and the second-best ones are underlined.
FIGURE 6. Accuracy improvement after assigning the semantic vector as the center of deep features. The blue bars denote the "without-assignment" setting and the pink ones denote the "with-assignment" setting, which assigns each semantic vector as the center of deep features.
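The two evaluation quantities can be computed with a short sketch. The labels and accuracy values below are toy data chosen for illustration, and the function names are hypothetical.

```python
def per_class_accuracy(true_labels, predicted_labels):
    """Average of the top-1 accuracies computed separately for each class."""
    correct, total = {}, {}
    for t, p in zip(true_labels, predicted_labels):
        total[t] = total.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (1 if t == p else 0)
    return sum(correct[c] / total[c] for c in total) / len(total)


def harmonic_mean(acc_tr, acc_ts):
    """Harmonic mean of the seen-class and unseen-class accuracies."""
    if acc_tr + acc_ts == 0:
        return 0.0
    return 2 * acc_tr * acc_ts / (acc_tr + acc_ts)


# Toy predictions: class "a" has 2 of 3 correct, class "b" has 1 of 1 correct.
acc = per_class_accuracy(["a", "a", "a", "b"], ["a", "a", "b", "b"])

# Illustrative accuracies: high on seen classes, low on unseen classes.
h = harmonic_mean(0.8, 0.4)
arithmetic = (0.8 + 0.4) / 2
print(acc, h, arithmetic)
```

In the toy case the harmonic mean is lower than the arithmetic mean, which shows how it penalizes an imbalance between the two accuracies.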

D. FURTHER ANALYSIS
Two strategies in our method are particularly effective in improving the accuracy of GZSL. We further analyze their impact through two experiments.

1) INFLUENCE OF η
As mentioned previously, our method adopts a hyperparameter η during the training stage to control the proportion between seen-class and unseen-class training samples in a batch. Different values of η lead to varying classification abilities. We further explore the influence of η on the three evaluation indicators on the four datasets. Figure 5 illustrates the results.
The curves in Figure 5 show that the H-mean remains high over a wide range of η. However, "ts" drops and "tr" increases as η increases. We achieve the best results for AWA1 and AWA2 with η values of 0.5 and 0.6, whereas 0.2 or 0.3 works well for CUB and SUN. Thus, η provides a good trade-off between "ts" and "tr".

2) IMPACT OF ASSIGNING THE CENTER OF DEEP FEATURES
Our method also benefits from assigning the semantic vector as the center of deep features, which preserves the structure between different classes. To reveal the necessity of this setting, we set λ = 0 in (4). In this case, the deep features are no longer gathered around the semantic vectors, and the model becomes a typical deep neural network. We refer to this configuration as "without-assignment" and compare its results with those of the original configuration reported previously, denoted as "with-assignment". Figure 6 illustrates the results.
Under the "without-assignment" setting, the three evaluation indicators decreased on CUB and SUN. On AWA1 and AWA2, "tr" increased, whereas the two others decreased. The accuracy on the seen classes was only slightly affected because they had labeled training instances. Meanwhile, the accuracy on the unseen classes, which rely on the pseudo instances, decreased considerably. This phenomenon can be attributed to the deep features learned under "without-assignment", which did not contain any similarity structure between the seen and unseen classes. Thus, the synthesized pseudo instances deviated substantially from the true features. Given that the semantic information describes the classes at a unified level, the similarity structure between the seen and unseen classes obtained from the semantic information is the key to avoiding the domain shift problem.

V. CONCLUSION
In this work, we proposed a novel GZSL method that adopts a simple idea inspired by the way humans perceive the world. We synthesized the pseudo features for unseen classes to transform GZSL to conventional supervised learning problems. In the synthesis method, we used the combination of N features from N similar classes as pseudo features for the unseen classes. Our method also adopted the idea of the center loss method. We assigned the semantic vectors directly as the center of deep features instead of learning the center of deep features to preserve the local structure between different classes. The experimental results showed that our strategy notably improved the classification ability. We compared our method with 10 up-to-date methods on four benchmark ZSL datasets. Our technique exhibited remarkable results in mitigating the GZSL problem because it demonstrated enhanced accuracy results and good generalization ability to unseen classes.