GT-GAN: A General Transductive Zero-Shot Learning Method Based on GAN

Most zero-shot learning (ZSL) methods based on generative adversarial networks (GANs) utilize random noise and semantic descriptions to synthesize image features of unseen classes, which alleviates the problem of training data imbalance between seen and unseen classes. However, these methods usually only learn the distributions of seen classes in the training stage, ignoring the unseen ones. Due to the different distributions of seen and unseen samples, i.e., image features, these methods cannot generate unseen features of sufficient quality, so the performances are also limited, especially for the generalized zero-shot learning (GZSL) setting. In this article, we propose a general transductive method based on GANs, called GT-GAN, which can improve the quality of generated unseen image features and therefore benefit the classification. A new loss function is introduced to make the relative positions between each unseen image and its $k$ nearest neighbors in the feature space as consistent as possible with their relative positions in the semantic space; this loss function may be easily applied in most existing GAN-based models. Experimental results on five benchmark datasets show a significant improvement in accuracy compared with that of original models, especially in the GZSL setting.


I. INTRODUCTION
Supervised learning methods have achieved great success in the domain of machine learning. They are widely applied in various fields, e.g., image classification, and their performance is often no worse than that of humans. However, supervised learning requires large amounts of labeled data for training, and the learned classifier can only recognize samples of classes that appear in the training stage, which is not suitable for certain real-world applications. Among the reasons for this deficiency are an insufficient number of training samples for some classes, high labeling costs for some samples, and changes in the target classes over time. To solve these problems, zero-shot learning [1]–[4] (ZSL, trained with seen samples and tested on unseen ones) has been proposed.
Given that the key point of ZSL is to address the lack of labeled unseen samples and that supervised classification models based on labeled data are relatively mature, a natural idea is to generate the corresponding data so that a conventional ZSL task can be converted to a supervised classification problem. Fortunately, with the development of generative adversarial networks (GANs) [5], researchers have proposed novel approaches that can use noise vectors and semantic information to generate fake samples. Fig. 1 shows the main idea of conventional ZSL methods [6]–[8] with GANs. First, the GANs are trained. A GAN [5] is a framework used for capturing the distributions of real seen samples (image features) to generate fake samples from the same distributions. It consists of two distinct subnetworks, a generator network (G) and a discriminator network (D). The job of G is to leverage random noises and semantic representations of seen classes to generate fake seen samples that resemble the real ones, while the job of D is to determine whether the inputs are real or fake. These two models are trained in an adversarial way to synthesize better features. In addition, these methods also use other losses or regularizations to achieve better training effects, which are not illustrated in the figure. Second, the trained G is used to generate fake unseen samples with noise and unseen-class semantic descriptions and train a classifier. Third, in the testing stage, the trained classifier is leveraged to classify the real unseen samples and output the classification results. However, these methods have a common problem, i.e., their GANs can only learn the distributions of seen samples in the training stage, ignoring the unseen ones.

(The associate editor coordinating the review of this manuscript and approving it for publication was Donato Impedovo.)
Due to the different distributions of seen and unseen samples, the generator cannot synthesize unseen features of sufficient quality to train a discriminative classifier. Thus, the classification performance is also limited. Specifically, for the generalized zero-shot learning (GZSL, trained with seen samples and tested by both seen and unseen ones) setting, because the test samples come from both seen and unseen categories, the classifier may bias the predictions toward the seen classes, which may lead to a more significant drop in accuracy. To alleviate these problems, we would ideally implement a GAN that can also learn some information on the unseen classes at the training stage to generate better unseen samples that are more similar to the real ones without considering their real categories.
In this work, we propose a novel transductive GAN-based method called GT-GAN, which can learn from unlabeled unseen features, to alleviate the inequality between seen and unseen classes in the training phase. First, in addition to the conventional training dataset consisting of seen categories, we also introduce unlabeled unseen image features and use a softmax classifier, which is trained with seen features, to classify them. We regard the outputs, which are rendered as probability distributions, as the similarities between these features and the seen classes. Then, we compute a weighted semantic description for each unseen feature from these similarities and the semantic information of the seen classes, and we select the $k$ closest unseen-class semantic descriptions to generate fake features. We introduce the average samples of these fake features and regularize them such that each is close to the original feature to a degree determined by the distances between the semantic descriptions; e.g., if one unseen semantic vector is closer to a weighted one, the features generated with it should be closer to the original feature. A new loss function is introduced to achieve this goal and make the relative positions between each unseen image and its k-nearest neighbors in the feature space as consistent as possible with their relative positions in the semantic space. In addition, to the best of our knowledge, most GAN-based methods do not use unlabeled unseen samples for training or only implement transductive versions of their own models. Our proposed method can be used in various existing GAN-based models to achieve better classification results than the original methods. We summarize our contributions as follows.
(1) We propose a general transductive ZSL method based on GANs, called GT-GAN, which introduces unlabeled unseen samples for training, to improve the performance of the generator.
(2) A new contrastive loss function is proposed to make the relative positions between each unseen image and its k-nearest neighbors in the feature space as consistent as possible with their relative positions in the semantic space. This loss function can also help synthesize unseen image features closer to the real features. It can also alleviate the effect of the misclassification of each unseen feature because we select k unseen-class semantic descriptions instead of one for each unseen feature for the first classification to generate fake features. Furthermore, this function can be easily applied to various existing GAN-based ZSL methods.
(3) Experimental results on five benchmark datasets prove that the existing models improved by our method can achieve significantly higher accuracy than the original ones, especially in the GZSL setting.

II. RELATED WORKS

A. GENERATIVE ADVERSARIAL NETWORK
A generative adversarial network (GAN) [5] is a framework designed to capture the distribution of its inputs, such as images, and generate new data from the same distribution. However, it suffers from problems such as unstable training and mode collapse. Hence, improved models have been proposed to mitigate these problems. WGAN [9] introduces the Wasserstein distance instead of the Jensen-Shannon divergence to measure the deviation between the synthetic and real data distributions. LSGAN [10] replaces the cross-entropy loss with a least-squares loss to tackle the vanishing gradient problem at the training stage. In more recent studies, the proposed models can generate higher-quality samples than the previous methods as well as solve these problems. SaliencyGAN [11] demonstrates a better ability to capture the data distributions without mode collapse. TPSDicyc [12] achieves better alignment between the source and target data while maintaining superior image quality. DAGAN [13] can stabilize the generator and provide superior generation with preserved perceptual image details. MS-GAN [14] achieves more stabilized and efficient training and improves the perceptual quality of the super-resolved results. These advanced methods may become potential targets of our proposed method.

B. OVERVIEW OF ZERO-SHOT LEARNING METHODS
Zero-shot learning (ZSL) was proposed to recognize unseen samples, whose classes are disjoint from those of the training samples. ZSL methods are usually grouped into two categories: classifier-based methods and instance-based methods [15]. The former aims to directly learn a classifier for unseen classes in distinct ways; e.g., the methods in [16]–[18] learn a correspondence function between the feature space and the semantic space, while the methods in [19], [20] leverage relationships among classes. The latter focuses on obtaining labeled samples of unseen classes. Recently, assisted by data generation models, such as GANs [5] and variational autoencoders (VAEs) [6], [21], many new synthesizing methods have been proposed to directly generate unseen samples from their semantic attributes, which can transform zero-shot learning into a conventional supervised learning problem.

C. GAN-BASED ZSL METHODS
Leveraging GAN to generate features can convert the ZSL problem to a conventional classification problem.
Because GANs suffer from unstable training and mode collapse, most recent models adopt improved GANs, e.g., WGAN [9], to achieve a stable training process and avoid mode collapse. f-CLSWGAN [7] first uses the conditional WGAN to directly generate image features to train a classifier and introduces a classification loss to make the generated features more discriminative. GAZSL [8] utilizes an ACGAN augmented with a visual pivot regularizer (VPG) to ensure that the generated samples of each class are close to the average of the corresponding real features. LisGAN [22] introduces soul samples (average samples) of each class, which can represent the most semantically meaningful aspects of each image feature in the same class, to improve the generated features of the conditional WGAN. In our study, we mainly describe our method as applied to models based on the conditional WGAN because it is more widely used in existing works. Certainly, the idea of our method can also be used in other GAN-based models.
However, because GAN-based ZSL methods have to train the generator on seen features and apply it to unseen ones, an inevitable issue, named feature confusion, has been highlighted recently. Feature confusion refers to the fact that the synthesized unseen features are prone to seen references and are incapable of reflecting the real aspects of the unseen instances. AFC-GAN [23] leverages a boundary loss that estimates and then maximizes the decision boundary between the seen and unseen features to handle the issue. It also introduces a multimodal cycle-consistent loss to promote diversity and preserve the semantic consistency of the synthesized features, which can also alleviate feature confusion. In our work, we directly introduce unlabeled unseen features to make the generator learn from them at the training stage and then synthesize better unseen samples that are closer to the real samples.
In addition, the different distributions of seen and unseen features lead to a typical domain adaptation problem, which can also influence the effects of the models. [24] and [25] recently addressed this issue. In [24], a novel ATM method is proposed to minimize the interdomain divergence, maximize the intraclass density and address the equilibrium challenge issue in adversarial domain adaptation, while in [25], a novel approach is proposed to jointly exploit feature adaptation with distribution matching and sample adaptation with landmark selection. These studies may also be leveraged in our method to improve the classification results.

III. PROPOSED METHOD

A. DEFINITIONS AND NOTATIONS
We first define $S = \{(x, y, t) \mid x \in X^s, y \in Y^s, t \in T^s\}$ as the training samples of the seen classes, where $x \in \mathbb{R}^d$ represents the image features, $t \in \mathbb{R}^m$ stands for the semantic attributes, and $y$ is the class label of each sample. In addition, we also have $X^u$ and $U = \{(y, t) \mid y \in Y^u, t \in T^u\}$ as auxiliary training data of the unseen classes for the transductive setting. $X^u$ denotes a set of unlabeled visual features, while $U$ consists of semantic vectors and the corresponding labels. Specifically, we have $Y^s \cap Y^u = \varnothing$. With these training examples, the task of ZSL is to learn the function $f_{ZSL}: X^u \rightarrow Y^u$, while GZSL is tasked to learn the function $f_{GZSL}: X^s \cup X^u \rightarrow Y^s \cup Y^u$.

B. MAIN IDEA OF GT-GAN
In this section, a novel transductive GAN-based method is presented to alleviate the inequality between seen and unseen categories in the training stage. It can be decomposed into four steps, as illustrated in Fig. 2.
First, suppose that a seen-class image classifier, i.e., a softmax classifier, has been trained on the real seen-class image features from $X^s$ with their corresponding labels from $Y^s$; we then classify the unlabeled unseen image features from $X^u$ with this classifier. The output can be regarded as a probability distribution matrix $P \in \mathbb{R}^{n \times k_s}$, which represents the probabilities that each sample is assigned to each of the seen classes, where $n$ is the number of input features, and $k_s$ stands for the number of seen classes.
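As a minimal sketch of this first step, the snippet below turns unlabeled unseen features into the probability matrix $P$. The linear weights `W` and `b` are toy stand-ins for the pretrained seen-class softmax classifier, not the paper's actual network:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, k_s = 6, 8, 3              # unseen samples, feature dim, seen classes
W = rng.normal(size=(d, k_s))    # assumed stand-in for the trained classifier
b = np.zeros(k_s)

X_u = rng.normal(size=(n, d))    # unlabeled unseen image features
P = softmax(X_u @ W + b)         # P in R^{n x k_s}

assert P.shape == (n, k_s)
assert np.allclose(P.sum(axis=1), 1.0)   # each row is a probability distribution
```

Each row of `P` is then read as the similarity of one unlabeled unseen sample to every seen class, which is what the next step consumes.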
Second, we regard $P$ as the similarity matrix between each unlabeled unseen sample and the seen classes. In this case, a new matrix $T^s \in \mathbb{R}^{k_s \times m}$, consisting of all the seen-class semantic attributes, is introduced to compute a weighted semantic matrix $T^w \in \mathbb{R}^{n \times m}$ with $P$ as the weights:

$$T^w = P \, T^s.$$

Hence, every unlabeled unseen sample has its own semantic vector $t^w \in \mathbb{R}^m$ from each row of $T^w$.

Third, since each weighted semantic attribute $t^w$ and each unseen-class attribute $t^u$ are both $m$-dimensional vectors, we calculate the $L_2$ distances between each $t^w$ and all $t^u$, where the latter constitute the unseen-class semantic space $T^u$. Then, we select the $k$ vectors $t^u$ with the smallest distances. This is equivalent to mapping each weighted semantic vector into the unseen-class semantic space $T^u$ and selecting its $k$-nearest unseen semantic vectors. The distance between the $i$-th sample and its $j$-th selected unseen semantic vector can be formulated as:

$$d_{ij} = \left\| t^w_i - t^u_{ij} \right\|_2,$$

where $i \in [1, n]$ and $j \in [1, k]$. It should be noted that for traditional ZSL models, selecting the nearest one as the label of the unseen-class image means completing the classification task.

Finally, we utilize the $k$ unseen semantic vectors and random noises to generate $C$ sample features and calculate the soul sample [22], i.e., the average sample, for each $x_i$, which is illustrated as follows:

$$s_{ij} = \frac{1}{C} \sum_{l=1}^{C} x^{(l)}_{ij},$$

where $s_{ij}$ denotes the soul sample corresponding to the $j$-th nearest neighbor of $x_i$, and $x^{(l)}_{ij}$ is the $l$-th generated fake image feature. Then, these $k$ soul samples can be compared with the original image feature, which corresponds to the weighted semantic vector $t^w$, and a contrastive loss weighted by semantic distances can be computed:

$$L_R^u = \sum_{i=1}^{n} \sum_{j=1}^{k} \lambda_{ij} \left\| s_{ij} - x_i \right\|_2^2,$$

where $x_i$ is the $i$-th original unlabeled unseen image feature, and $\lambda_{ij}$ is the weight parameter of the loss between the $j$-th soul sample of the $i$-th weighted vector, i.e., $s_{ij}$, and $x_i$.
The weights $\lambda_{ij}$ decrease with the semantic distance $d_{ij}$, so that the closer $t^u$ is to $t^w$, the closer the generated fake sample should be to the original one. Obviously, $L_R^u$ makes the relative positions between each image and its $k$-nearest neighbors in the feature space as consistent as possible with their relative positions in the semantic space.
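The weighted-semantics, nearest-neighbor, and soul-sample steps above can be sketched end to end on toy data. Two elements are illustrative assumptions rather than the paper's exact choices: the normalized exponential form of the weights `lam` (any weighting that decreases with semantic distance would fit the description) and the linear toy generator `proj`:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k_s, k_u = 4, 5, 3, 6   # samples, attr dim, seen classes, unseen classes
k, C = 2, 10                  # nearest neighbors, fakes per attribute

P   = rng.dirichlet(np.ones(k_s), size=n)   # similarities from the classifier
T_s = rng.normal(size=(k_s, m))             # seen-class attributes
T_u = rng.normal(size=(k_u, m))             # unseen-class attributes
X_u = rng.normal(size=(n, 8))               # real unlabeled unseen features (d=8)

# Step 2: weighted semantic vector per unlabeled sample.
T_w = P @ T_s                               # (n, m)

# Step 3: k nearest unseen attributes (L2 distance) per weighted vector.
D = np.linalg.norm(T_w[:, None, :] - T_u[None, :, :], axis=2)  # (n, k_u)
nn = np.argsort(D, axis=1)[:, :k]                              # k nearest indices

# Step 4: generate C fakes per selected attribute, average them into soul
# samples, and accumulate the distance-weighted contrastive loss.
proj = rng.normal(size=(m, 8))              # toy "generator" weights (assumption)

loss = 0.0
for i in range(n):
    d_sel = D[i, nn[i]]
    lam = np.exp(-d_sel) / np.exp(-d_sel).sum()   # closer attribute -> larger weight
    for j in range(k):
        fakes = T_u[nn[i, j]] @ proj + 0.1 * rng.normal(size=(C, 8))
        s_ij = fakes.mean(axis=0)                 # soul (average) sample
        loss += lam[j] * np.sum((s_ij - X_u[i]) ** 2)

assert np.isfinite(loss) and loss >= 0.0
```

The loop makes the k-vs-one trade-off in the text concrete: even if the single nearest attribute were wrong, the loss still pulls generated features toward the real one through the remaining neighbors.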

C. OVERALL MODEL
In this section, we will introduce the GT-GAN model, which is based on LisGAN [22]. The full procedure is illustrated in Fig. 3, where the two blocks at the top of the image represent the training process and the two blocks below represent the classification.

1) LOSS FUNCTIONS
The conditional WGAN [7] consists of a conditional generator $G$ and a conditional discriminator $D$, where $G$ takes the attribute vector $t$ and noise $z$ as inputs to generate a fake image feature $\tilde{x} = G(z, t)$, and $D$ takes the real image feature $x$, the fake feature $\tilde{x}$ and $t$ as its inputs to discriminate whether a feature of one category is real or fake. The loss function of the conditional WGAN can be defined as

$$L_D = \mathbb{E}[D(\tilde{x}, t)] - \mathbb{E}[D(x, t)] + \beta \, \mathbb{E}\left[ \left( \left\| \nabla_{\hat{x}} D(\hat{x}, t) \right\|_2 - 1 \right)^2 \right], \quad (6)$$

where $\hat{x} = \varepsilon x + (1 - \varepsilon)\tilde{x}$ with $\varepsilon \sim U(0, 1)$, and $\beta$ is a hyperparameter. The first two terms approximate the Wasserstein distance, and the last term is the gradient penalty, which enforces the Lipschitz constraint [26]. In this article, we set $\beta = 10$, which has been found to work well across a variety of architectures and datasets in [26].
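A hedged toy illustration of this critic objective: with a *linear* critic $D(x) = x \cdot w$, the gradient with respect to the input is the constant $w$, so the gradient penalty has a closed form. A real conditional critic would be a network evaluated at the interpolates via autograd, and the conditioning on $t$ is omitted here for brevity:

```python
import numpy as np

def wgan_gp_critic_loss(D_w, x_real, x_fake, beta=10.0, rng=None):
    """WGAN-GP critic loss for a toy linear critic D(x) = x @ D_w."""
    rng = rng or np.random.default_rng(0)
    eps = rng.uniform(size=(x_real.shape[0], 1))
    x_hat = eps * x_real + (1 - eps) * x_fake       # random interpolates
    # For a linear critic, grad_x D(x) = D_w at every point, including x_hat.
    grads = np.tile(D_w, (x_hat.shape[0], 1))
    penalty = beta * ((np.linalg.norm(grads, axis=1) - 1.0) ** 2).mean()
    # E[D(fake)] - E[D(real)] + gradient penalty
    return (x_fake @ D_w).mean() - (x_real @ D_w).mean() + penalty

rng = np.random.default_rng(0)
x_real = rng.normal(size=(16, 4))
x_fake = rng.normal(size=(16, 4))
D_w = np.ones(4) * 0.5            # ||D_w||_2 = 1, so the penalty term is zero
loss = wgan_gp_critic_loss(D_w, x_real, x_fake, rng=rng)
assert np.isfinite(loss)
```

Because `||D_w|| = 1`, this particular critic already satisfies the 1-Lipschitz constraint and pays no penalty; scaling `D_w` up or down makes the penalty term positive.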
To generate better unseen features that are closer to the real features, we introduce $L_R^u$ from Section III-B as a regularization of $L_G$, so the full loss function of $G$ can be formulated as follows:

$$L_f^G = L_G + \gamma_u L_R^u, \quad (7)$$

where $\gamma_u$ is the hyperparameter weighting this regularization, whose settings are described in Section IV-B. Specifically, existing models may use different regularizations in $L_G$, but the ratio of $L_R^u$ to $-\mathbb{E}[D(\tilde{x}, t)]$ can be set as a constant, i.e., $\gamma_u$.

2) TRAINING THE MODEL
As shown in Fig. 3, the top two blocks represent the training process of the generator, where the first represents the training process of the general GAN-based model, e.g., LisGAN, and the second represents the use of $L_R^u$ for training. The specific training process for each epoch can be described as follows. For the first block, given dataset $S$ and random noises $z \sim N(0, 1)$, we can leverage the generator $G$, which utilizes the seen-class attribute vectors $t_s$ and $z$, to generate fake seen image features $\tilde{x}_s = G(z, t_s)$. To ensure that the generated features are as close to the real features as possible, the discriminator $D$ takes the real seen features $x_s$, $\tilde{x}_s$ and $t_s$ as its inputs to determine whether a feature of the specific class is real or fake. In addition, to make the synthesized features more discriminative, a softmax classifier (the seen-class image classifier in Fig. 3), which has been trained with real seen features from $X^s$, is also introduced to classify $\tilde{x}_s$ and output a classification loss. GT-GAN also leverages soul sample regularization to improve feature generation. Both the classification loss and the soul sample regularization are the same as those in LisGAN [22].
In this case, we already have the loss function of $G$ ($L_G$), including a Wasserstein loss $-\mathbb{E}[D(\tilde{x}, t)]$, a classification loss and soul sample regularization. The details of the loss function and the specific settings of the weighting hyperparameters can be found in [22]; we use the same settings here. We also obtain the loss of $D$ ($L_D$), which is the same as in Eq. 6. We first perform backpropagation and update the parameters of $D$ with $L_D$ but do not yet train $G$ because we need to obtain the loss $L_R^u$ in the second block.
As described in Section III-B, we introduce unlabeled unseen features $x_u$ to obtain a contrastive loss $L_R^u$ by leveraging the seen-class image classifier and the semantic attributes of both the seen and unseen categories. Hence, the full loss function of $G$ ($L_f^G$) is composed of $L_R^u$ and $L_G$ as defined in Eq. 7. Then, the parameters of $G$ can be updated, completing the training of one epoch. Obviously, the main idea of GT-GAN is easily applied to other GAN-based models; in the following experiments, we also add $L_R^u$ to f-CLSWGAN [7] to observe the improvement in performance.
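The per-epoch update order described above (D is updated first with $L_D$, then G with the full loss of Eq. 7) can be mimicked with scalar stand-in losses and finite-difference gradients. Every loss function below is a made-up toy; only the alternation schedule reflects the text:

```python
# Toy alternating schedule: update D with L_D, then G with L_G + gamma_u * L_Ru.
gamma_u = 0.1

def L_D(d, g):  return (d - g) ** 2          # stand-in critic loss (toy)
def L_G(d, g):  return (g - 2.0) ** 2        # stand-in generator loss (toy)
def L_Ru(g):    return (g - 1.5) ** 2        # stand-in contrastive loss (toy)

def grad(f, x, h=1e-6):
    """Central finite-difference gradient of a scalar function."""
    return (f(x + h) - f(x - h)) / (2 * h)

d, g, lr = 0.0, 0.0, 0.1
for epoch in range(50):
    d -= lr * grad(lambda v: L_D(v, g), d)             # step 1: update D first
    L_fG = lambda v: L_G(d, v) + gamma_u * L_Ru(v)     # step 2: full G loss (Eq. 7)
    g -= lr * grad(L_fG, g)                            # step 3: update G

# G settles at the gamma_u-weighted compromise between the two loss minima.
assert abs(g - 4.3 / 2.2) < 1e-3
```

The fixed point of `g` sits between 2.0 (the toy $L_G$ minimum) and 1.5 (the toy $L_R^u$ minimum), weighted by `gamma_u`, which is exactly the role the regularization ratio plays in the full model.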

3) CLASSIFICATION
We utilize the same classification process as in the general GAN-based model, i.e., the two lower blocks in Fig. 3. First, the fake unseen image features $\tilde{x}_u$ are synthesized by the trained generator with noise $z$ and unseen semantic descriptions $t_u$. Then, we can train a softmax classifier, i.e., the unseen-class image classifier, on these features to classify the real unseen features $x_u$. In addition, we leverage the confident samples that are considered to be correctly classified to fine-tune the classification results as in [22].
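These classification steps can be sketched with toy stand-ins: a linear map plays the role of the trained generator, and a nearest-class-mean rule stands in for the softmax classifier (both are illustrative assumptions, not the paper's components):

```python
import numpy as np

rng = np.random.default_rng(2)
k_u, m, d, N = 3, 4, 6, 50     # unseen classes, attr dim, feature dim, fakes/class

T_u = rng.normal(size=(k_u, m))
G_w = rng.normal(size=(m, d))  # toy "trained generator": linear map + noise

# 1) Synthesize N fake features per unseen class from attributes + noise.
X_fake = np.concatenate([T_u[c] @ G_w + 0.1 * rng.normal(size=(N, d))
                         for c in range(k_u)])
y_fake = np.repeat(np.arange(k_u), N)

# 2) "Train" a classifier on the fakes: here a nearest-class-mean rule
#    stands in for the softmax unseen-class classifier.
centroids = np.stack([X_fake[y_fake == c].mean(axis=0) for c in range(k_u)])

# 3) Classify real unseen features (simulated from the same toy process).
X_real = np.stack([T_u[c] @ G_w + 0.1 * rng.normal(size=d) for c in range(k_u)])
pred = np.argmin(np.linalg.norm(X_real[:, None] - centroids[None], axis=2), axis=1)
assert pred.shape == (k_u,)
```

The point of the sketch is structural: the classifier never sees a real unseen sample during training, so its quality rests entirely on how close the synthesized features are to the real ones.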

D. FULL ALGORITHM PROCEDURE
In this section, we summarize the training and the classification process of GT-GAN in Algorithm 1 and Algorithm 2, respectively. It should be noted that we omit the generator's parameters θ G and the discriminator's parameters θ D in the loss functions formulated earlier. Actually, if the generator and the discriminator are used in the calculation of a loss function, this loss function should be related to their parameters and the gradient can also be calculated.
Algorithm 1 The Training Process of GT-GAN
...
8: Calculate the loss $L_D$ by Eq. 6 and its gradient.
9: Update the parameters $\theta_D$ by backpropagation.
10: Calculate the loss $L_G$ according to [22].
11: Get a batch of real unseen features $\{x_u\}$.
12: Calculate the loss $L_R^u$ according to Section III-B.
13: Calculate the full loss function $L_f^G$ by Eq. 7 and its gradient.
14: Update the parameters $\theta_G$ by backpropagation.
15: end for
16: end for

Algorithm 2 The Classification Process of GT-GAN
Input: $z$, random noises; $t_u$, unseen-class attribute vectors; $x_u$, real unseen features.
Output: The final classification results of $x_u$.
1: Generate fake unseen features $\tilde{x}_u$ by the trained generator with $t_u$ and $z$.
2: Train a softmax classifier on these synthesized features $\tilde{x}_u$.
3: Classify the real unseen features $x_u$ with the trained classifier.
4: Fine-tune the classification results with the confident samples according to [22].

IV. EXPERIMENTS

A. DATASETS AND EVALUATION PROTOCOL
We evaluate our method on five popular benchmark datasets, i.e., Caltech-UCSD-Birds 200-2011 (CUB) [27], Oxford Flowers (FLO) [28], SUN Attribute (SUN) [29], Animals with Attributes (AWA) [30], and aPascal/aYahoo (aPY) [31]. The statistics of these datasets are listed in Table 1 and are the same as those described in [2] and [32]. To obtain the image features for further training, we use ResNet-101, which is pretrained on ImageNet, to extract 2048-dimensional features. For the semantic attributes, we utilize the default attributes of CUB, SUN, AWA and aPY. Since FLO has no attribute annotations, we use the 1024-dimensional CNN-RNN features from a previous work [32].
At the testing stage, we adopt the average per-class top-1 accuracy as in [7], [22] to evaluate the performance of each method with ZSL. In the GZSL setting, we report the harmonic mean ($H$) to measure the overall performance on the seen and unseen categories, which can be defined as

$$H = \frac{2 \times S \times U}{S + U},$$

where $S$ and $U$ represent the average per-class top-1 accuracy for the seen and unseen classes, respectively.
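As a quick sanity check of this metric, the harmonic mean heavily penalizes an imbalance between $S$ and $U$, which is why it is preferred over the arithmetic mean for GZSL, where a classifier biased toward seen classes can score a high $S$ while $U$ collapses:

```python
def harmonic_mean(S, U):
    """H = 2*S*U / (S + U); zero whenever either accuracy is zero."""
    return 0.0 if S + U == 0 else 2 * S * U / (S + U)

assert abs(harmonic_mean(0.6, 0.6) - 0.6) < 1e-12   # balanced: H equals both
assert harmonic_mean(0.0, 0.9) == 0.0               # one side collapses -> H = 0
assert harmonic_mean(0.9, 0.1) < 0.5 * (0.9 + 0.1)  # far below the arithmetic mean
```

A seen-biased model with $S = 0.9$, $U = 0.1$ gets $H = 0.18$, much lower than its arithmetic mean of $0.5$.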

B. IMPLEMENTATION DETAILS
Because the proposed GT-GAN works on the existing GAN-based ZSL models, the generator it leverages is the one from the corresponding model. The classifier it uses is a standard k s -way softmax classifier, which has been trained with seen samples in advance. Specifically, in this article, we set γ u = 0.04, k = 4 and C = 20 for the ZSL setting.
In the GZSL setting, we set γ u = 0.1, k = 5 and C = 20. Actually, we have run experiments on the different values of the three parameters for both ZSL and GZSL settings, as shown in Fig. 4. According to the results, we select these two different parameter settings for ZSL and GZSL to achieve the best overall effects, respectively.

C. GENERAL RESULTS
In this section, we evaluate GT-GAN and some other inductive and transductive models on five popular benchmark datasets, i.e., CUB, FLO, SUN, AWA and aPY, with the ZSL and GZSL settings, and compare the best effects with those of the classic models and the state-of-the-art models. We also apply the $L_R^u$ loss of GT-GAN to f-CLSWGAN [7] to evaluate the performance, which is represented as GT-GAN f-CLSWGAN in the result tables.

1) ZERO-SHOT LEARNING
We first provide the results of each model for these five datasets in Table 2. For the ZSL setting, the classification result for each sample is restricted to $y \in Y^u$. It can be seen that GT-GAN and GT-GAN f-CLSWGAN perform better than LisGAN and f-CLSWGAN, respectively. For f-CLSWGAN, the improved model achieves 1.0%, 0.3%, 1.1% and 2.4% improvements on CUB, FLO, AWA and aPY and almost the same accuracy on SUN. In addition, GT-GAN achieves 0.7%, 1.7%, 1.0% and 1.9% improvements on CUB, FLO, AWA and aPY and a very close accuracy on SUN with respect to LisGAN. Specifically, it achieves almost state-of-the-art performance. According to these results, our method can actually help recognize unseen samples by making the generator synthesize better unseen features that are more discriminative and closer to the real features.

2) GENERALIZED ZERO-SHOT LEARNING
The GZSL experiment results are demonstrated in Table 3.
In this setting, both seen and unseen samples need to be recognized, i.e., the classification result of each sample is $y \in Y^s \cup Y^u$. Hence, we adopt the harmonic mean ($H$) to measure the overall performance of each method on the seen and unseen categories. From the results, we can see that the GT-GAN f-CLSWGAN method improves the accuracy over the original f-CLSWGAN by 4.9%, 4.5%, 3.1%, 7.5% and 6.9% on CUB, FLO, SUN, AWA and aPY, respectively, and GT-GAN achieves 4.0%, 4.9%, 2.6%, 5.4%, and 5.8% improvements over LisGAN. We can also see that GT-GAN generally outperforms the other models on all five datasets.
Although some models reach a higher accuracy on either the seen or the unseen classes, they are unable to show maximal accuracy on both combined. Hence, our method can make the models more generalized and stronger than their original forms. Comparing the effects of ZSL and GZSL, it can be seen that our proposed GT-GAN can actually be applied to various existing GAN-based ZSL models and improve their performances. We can also observe that our method achieves more substantial gains in GZSL. This is because in the ZSL setting, the final classifier is trained on fake unseen samples and only classifies real unseen samples at test time. Therefore, the generated unseen features in each category during classifier training do not need to be very similar to the real features; they only need to be relatively discriminative to guarantee correct classification, so synthesizing unseen samples that are highly related to the real ones does not obviously improve the recognition accuracy in this setting. In the GZSL setting, we utilize the final classifier trained on real seen samples and generated unseen samples to recognize both seen and unseen real samples. In this case, if the generated unseen features are not of sufficient quality, the classifier may bias the predictions toward the seen classes, which may lead to a drop in accuracy, i.e., the feature confusion problem. Since our method helps generate unseen samples that are close to the real samples, it can alleviate this issue and obtain significantly higher accuracy.

D. MODEL ANALYSIS
In this section, we analyze our model by performing an ablation study, investigating hyperparameter sensitivity, testing stability and convergence, and assessing the effect of the number of synthesized unseen samples based on GT-GAN. In addition, a further analysis is also performed to directly prove that our method actually helps synthesize unseen samples that are close to real samples instead of inferring as such from the improvement in accuracy.

1) ABLATION STUDY
In this paper, we apply our GT-GAN to LisGAN [22]. Since the previous ablation study in [22] was sufficient and the experiments in the current paper illustrated the effect of our proposed method, we will not perform more ablation studies.

2) HYPERPARAMETER SENSITIVITY
We investigate the sensitivities of three hyperparameters, $\gamma_u$, $k$ and $C$, for both the ZSL and GZSL settings. $\gamma_u$ denotes the weighting coefficient of the contrastive loss, whose effect is reported in Fig. 4(a) and Fig. 4(b). $k$ is the number of selected attributes as well as the number of soul samples, and its influence is shown in Fig. 4(c) and Fig. 4(d). Similarly, we also introduce $C$, the number of synthesized features per attribute, which measures how many generated samples are needed to represent the overall effect of the generator; its sensitivity is shown in Fig. 4(e) and Fig. 4(f). From these results, we can see that the weighting coefficient $\gamma_u$ should be relatively small for both ZSL and GZSL; otherwise, the performance on most datasets degrades dramatically as $\gamma_u$ increases. In addition, our model achieves the best result on each dataset with a different $k$ but is not very sensitive to this parameter. Thus, we set $k = 4$ and $k = 5$ for ZSL and GZSL, respectively, to achieve the best overall performance. Similarly, it is easy to observe that our model is not sensitive to $C$, so we set this parameter to 20 to reduce the computational costs.

3) STABILITY AND CONVERGENCE
Since our proposed model is based on a GAN, which needs several training epochs to achieve the balance between the generator and the discriminator, it is essential to test the stability and convergence of our model during training. We report the results for both the ZSL and GZSL settings with increasing training epochs in terms of testing errors in Fig. 5. It can be seen that the training process of our model is generally stable, although there are some small fluctuations. In addition, our model converges to a stable state within 20 epochs.

4) EFFECT OF THE NUMBER OF SYNTHESIZED SAMPLES
We also analyze how the number of generated unseen samples N for classifier training influences the test accuracy, as shown in Fig. 6. We can see that the accuracy on all datasets increases rapidly as N increases from 1 to 10, and the changes decrease as N increases to 50. When N >50, the model is not very sensitive to it. In addition, we can also observe that the accuracy has larger improvements on CUB and SUN than on the other datasets, which may be because they have more unseen classes and need more synthesized samples to help recognize real features among these classes.

5) FURTHER ANALYSIS
We introduce an evaluation to directly prove that, compared with LisGAN, GT-GAN actually synthesizes unseen samples that are close to the real samples, as shown in Fig. 7. As a soul sample can be regarded as the center of a cluster, we calculate the squared distance between the real soul sample $s_c$ and the fake soul sample $\tilde{s}_c$ of each unseen class to evaluate the difference between the overall generation and the real samples:

$$d = \frac{1}{k_u} \sum_{c=1}^{k_u} \left\| s_c - \tilde{s}_c \right\|_2^2,$$

where $k_u$ is the number of unseen categories. In Fig. 7, we can see that GT-GAN generates samples that are closer to the real soul samples at each epoch of classifier training. Thus, we can achieve higher classification accuracy than the original model.
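This evaluation can be sketched directly: compute the per-class centroids (soul samples) of the real and synthesized unseen features and average the squared L2 gaps. The data and function name below are illustrative:

```python
import numpy as np

def soul_sample_gap(X_real, y_real, X_fake, y_fake, k_u):
    """Mean squared L2 distance between real and fake per-class soul samples
    (class centroids), averaged over the k_u unseen classes."""
    total = 0.0
    for c in range(k_u):
        s_real = X_real[y_real == c].mean(axis=0)   # real soul sample of class c
        s_fake = X_fake[y_fake == c].mean(axis=0)   # fake soul sample of class c
        total += np.sum((s_real - s_fake) ** 2)
    return total / k_u

rng = np.random.default_rng(3)
y = np.repeat(np.arange(2), 10)
X = rng.normal(size=(20, 4))
assert soul_sample_gap(X, y, X, y, k_u=2) == 0.0    # identical sets: zero gap
```

A smaller gap means the cluster centers of the generated features track those of the real features, which is the quantity Fig. 7 compares between GT-GAN and LisGAN.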

V. CONCLUSION
In this paper, we propose a general transductive GAN-based method called GT-GAN, which leverages unlabeled unseen samples for training to improve the performance of the original generator. A new loss function is also introduced to help generate unseen features that are close to the real features and make the relative positions between each unseen image and its k-nearest neighbors in the feature space as consistent as possible with their relative positions in the semantic space; this loss function may be used in most existing GAN-based models. Extensive experiments on five benchmark datasets show that our method can be applied to various existing GAN-based (conditional WGAN) models and achieves substantially higher accuracy than the original models, especially in the GZSL setting. The model analysis also demonstrates the sensitivity of each parameter and verifies that our model is able to better generate features for the unseen categories. However, there are several limitations or directions for improvement. First, the synthesized unseen samples cannot represent most aspects of one class because we cannot obtain soul samples of unlabeled features and learn from them at the training stage. Second, we cannot guarantee that the categories represented by the selected $k$ semantic vectors are all highly related to the real category of the original unlabeled unseen feature. In other words, the generator still risks being incorrectly trained. Further work will be required to solve these problems.

JUNHAO

His research interests include natural language processing, machine learning, and computer vision.
HAOYU WANG was born in Chaoyang, Liaoning, China, in 1997. He received the B.S. degree in information engineering from the Beijing University of Posts and Telecommunications, Beijing, China, in 2019, where he is currently pursuing the degree majoring in information and communication engineering.
His research interests include natural language processing and computer vision.