Transfer Feature Generating Networks With Semantic Classes Structure for Zero-Shot Learning

Feature generating networks face an important issue: the distribution of the generated features is inconsistent with that of the real data. This inconsistency degrades the performance of the network model, because in zero-shot learning (ZSL) the training samples from seen classes are disjoint from the testing samples from unseen classes. In generalized zero-shot learning (GZSL), testing samples come from both seen and unseen classes, which is closer to the practical situation. Therefore, most feature generating networks that adversarially learn the distribution of semantic classes have difficulty achieving satisfactory performance on the challenging GZSL task. To alleviate the negative influence of this inconsistency in ZSL and GZSL, we propose transfer feature generating networks with semantic classes structure (TFGNSCS). TFGNSCS not only considers the semantic structure relationship between seen and unseen classes, but also learns the difference of the generated features by transferring classification model information from seen to unseen classes in the networks. The proposed method integrates the transfer loss, the classification loss and the Wasserstein distance loss to generate sufficient CNN features, on which softmax classifiers are trained for ZSL and GZSL. Experiments demonstrate that TFGNSCS outperforms state-of-the-art models on four challenging datasets (CUB, FLO, SUN and AwA) in GZSL.


Introduction
Figure 1: Comparison between a generative feature network method in (a) (for example CLSWGAN [1]) and the proposed method (TFGNSCS) in (b). GAN means generative adversarial network.
Trained on large amounts of labeled data, deep learning can capture the various patterns of data for large-scale recognition problems. However, many practical applications lack annotated data, and manual annotation is time-consuming. Therefore, generating data with labels [2] [3] [4] [5] [1] has become an important way of obtaining enough annotated data. Generative adversarial networks (GAN) [2] can synthesize approximate images of object classes [3][5], but cannot generate sufficiently discriminative images or features without classification information. In particular, because training samples from seen classes are disjoint from testing samples from unseen classes in ZSL or GZSL, the features that a GAN generates for different classes do not accurately match their respective distributions. In other words, there is a data shift between the generated features of unseen classes and their real distribution, since the generative network model is often trained only on samples of seen classes. The existing CLSWGAN [1] considers the classification loss of seen classes to improve the performance of ZSL or GZSL. However, the classification loss of unseen classes is also important for ZSL or GZSL. Therefore, our motivation is to transfer classification information from seen to unseen classes in order to construct the classification loss of unseen classes (called the transfer loss) for learning the generative network model.
ZSL [6][7] [8][9] [10] is a much-debated problem that concerns the extreme case of few samples: some classes (seen classes) have visual samples, while others (unseen classes) have no visual samples during training. In this work, we focus on the transferability of the generative adversarial model, and use transfer information to handle the inconsistency of the generated features that arises from the unbalanced learning between seen and unseen classes (Fig. 1 illustrates this point: the model is learned from seen classes but must generate features for unseen classes) when constructing the learning model for ZSL and GZSL. The features generated for unseen classes by the transfer generative adversarial model are then used in traditional supervised learning to solve ZSL and GZSL. Therefore, the main contribution of our paper is the proposed TFGNSCS, which builds on the existing CLSWGAN [1] and shows the importance of transfer information for handling the unbalanced learning between seen and unseen classes. In particular, we examine the influence of different transfer losses on feature generation in ZSL or GZSL. We mainly discuss two transfer losses. One (defined by equation (3) in section 3.1) considers the structure relationship between seen and unseen classes through structure propagation (detailed in section 2.2), and the other (defined by equation (4) in section 3.1) balances the difference of the generated features between seen and unseen classes using discriminator information. With this motivation, we propose a novel feature generating GAN method, namely TFGNSCS, which is learned with a novel transfer loss and improves over existing GAN models for generating features. (c) Our model generalizes to different ways of transferring information for evaluating the performance of the feature generating model.

Related Works
The work related to the proposed method involves generative adversarial networks (GAN), structure propagation, zero-shot learning (ZSL) and generalized zero-shot learning (GZSL).

Generative Adversarial Networks
GAN [2] learns a generative model that follows an arbitrary distribution, for example fitting the distribution of images, under the adjustment of a discriminative model. The development of GAN theory involves three aspects. The first aspect is improving GAN training with additional information, such as the deep convolutional neural network in DCGAN [11], the style and structure networks in improved DCGAN [12], and the mutual information between the latent variables and the generator distribution in InfoGAN [13]. The second aspect is the conditional GAN, which feeds related information into the networks, for instance class labels [14] and sentence descriptions [3]. The third aspect is stabilizing GAN training with a relevance constraint [15], which can be the Wasserstein distance [16] or a Lipschitz constraint [4]. Recently, CLSWGAN [1] combined the WGAN idea with a classification loss to generate image features, and demonstrated promising results in ZSL and GZSL.
In this paper, we observe that the features generated by the state-of-the-art CLSWGAN [1] (which is trained only on samples of seen classes) do not fit the distribution of unseen classes well enough for learning a classifier. Hence, we present a novel GAN framework that synthesizes CNN features for learning a discriminative classifier for ZSL and GZSL. By integrating the promising CLSWGAN [1] loss with a transfer loss, which transmits information between different classes to generate discriminative features, our proposed GAN framework outperforms CLSWGAN [1] owing to the regularization provided by the transfer loss.

Structure Propagation
To the best of our knowledge, structure propagation was first proposed for image completion, formulated as a global optimization problem that enforces structure and consistency constraints [17]. Structure can be defined as the graph structure among data samples, and it plays a very important role in discriminating visual information. Recent works offer two kinds of notable methods. One is structure information propagation in label space, such as dynamic structure fusion and label propagation for refining the relation of objects in semi-supervised multi-modality classification [18], and an information propagation mechanism from the semantic label space, which can model the interdependencies between seen and unseen class labels [19]. The other is structure information propagation between seen and unseen classes, for instance structure fusion and propagation that iteratively updates the relevance of multi-semantic classes for ZSL [20] [10], structure propagation that constrains the encoder-decoder mechanism of the bidirectional projection for ZSL [21], and absorbing Markov chain process propagation that constructs a semantic class prototype graph for ZSL [22].
Although those papers have shown the information transferability of structure propagation, structure propagation has not been used for generating features in an adversarial mechanism to balance the difference between seen and unseen classes.
In this paper, we construct a novel GAN framework by adding a transfer loss to CLSWGAN [1]. The transfer loss includes two parts. One balances transfer information through the iteration of the generative model, and the other uses structure propagation to accurately compute the classification loss of all classes. In contrast, CLSWGAN [1] only calculates the classification loss of seen classes. Therefore, structure propagation further extends the CLSWGAN [1] model and relieves the inconsistency of the generated features, which arises because a model trained on seen classes must follow the distribution of unseen classes.

Zero-shot Learning
In ZSL, the seen classes used for model learning in training and the unseen classes used for model evaluation in testing are disjoint [23]. According to whether they use a deep network framework, ZSL methods for bridging the gap between seen and unseen classes can be divided into two categories. One involves non-deep network frameworks, such as learning semantic attribute classifiers [24] [25] [6], combining seen class proportions for unseen classes [26] [27] [28] [8], and learning the compatibility between images and classes [29] [30] [31] [32] [33] [9]. The other category uses deep network frameworks, for instance the DeViSE model [34], the latent discriminative feature learning (LDF) model [35] and the quasi-fully supervised learning (QFSL) model, which optimize the visual or semantic model with a deep network; synthesizing examples [36] or preserving semantic relations [37] with an autoencoder architecture; multi-label zero-shot learning (ML-ZSL) [19] or graph convolutional networks for zero-shot learning [38], which benefit from a knowledge graph; and the semantics-preserving adversarial embedding network (SP-AEN) [39] or feature generating networks [40] [1], which are based on the generative adversarial mechanism.
In summary, generative adversarial frameworks demonstrate promising results, and in particular visual feature generation outperforms image generation under the same adversarial frameworks for ZSL. However, these frameworks rarely consider transfer information based on structure propagation for finding more discriminative features in ZSL. Therefore, by considering transfer loss constraints, we construct a novel generative adversarial framework that captures discriminative information for tackling ZSL or GZSL.

Feature generation
In this section, we discuss the feature generation model CLSWGAN [1], which is based on the GAN framework and serves as the basis of the proposed model. The main idea of GAN is to play a game between a generative network $G$ and a discriminative network $D$ so that the generated data follows a specified distribution. In CLSWGAN, $D$ needs to distinguish real features from generated features as well as possible, while $G$ needs to fool the discriminator with generated features that deviate from the real features. Compared with GAN, CLSWGAN adds the classification loss and changes the metric to the Wasserstein distance loss. Inspired by conditional GAN [14][3], we extend CLSWGAN to the proposed TFGNSCS with a conditional transfer transformation applied to both $G$ and $D$. In the following we describe the details of TFGNSCS, the novelty of which is that we introduce transfer information into the conditional GAN to generate more discriminative features for ZSL or GZSL. It is worth noting that the proposed TFGNSCS not only synthesizes high-fidelity features of unseen classes in $S$, but also refines the performance of the model with the generated features of unseen classes.
We extend CLSWGAN [1] to the proposed TFGNSCS by transferring the probability model of the generated features from seen to unseen classes for the adversarial training between the generator and the discriminator. The loss has four parts. The first part is constructed based on the improved WGAN [4] and conditional WGAN [1] with the class embedding $c(y_s)$:
$$L_{WGAN} = E[D(x_s, c(y_s))] - E[D(\tilde{x}_s, c(y_s))] - \lambda E[(\|\nabla_{\hat{x}} D(\hat{x}, c(y_s))\|_2 - 1)^2] \qquad (1)$$
where $\tilde{x}_s = G(z, c(y_s))$ is the generated feature of seen classes, $z \in Z \subset \mathbb{R}^{d_z}$ is random Gaussian noise, $c(y_s) \in C$ is the class embedding of seen classes, $\hat{x} = \alpha x_s + (1 - \alpha)\tilde{x}_s$ with $\alpha \sim U(0, 1)$, and $\lambda$ is the trade-off coefficient. In the loss $L_{WGAN}$, the first two terms compute the Wasserstein distance, and the third term constrains the gradient of $D$ to have unit norm along the straight line between pairs of real and generated points [1].
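As a minimal numerical sketch of the three terms of $L_{WGAN}$, the snippet below evaluates the loss for a toy linear critic $D(x, c) = x^T w_x + c^T w_c$, for which the gradient with respect to the interpolated feature is simply $w_x$. The linear critic, the function name `wgan_gp_loss` and the default `lam=10.0` are assumptions made for illustration; a real critic would be a neural network whose gradient is obtained by automatic differentiation.

```python
import numpy as np

def wgan_gp_loss(real, fake, cond, w_x, w_c, lam=10.0, rng=None):
    """L_WGAN for a toy linear critic D(x, c) = x @ w_x + c @ w_c (illustrative only)."""
    rng = np.random.default_rng(0) if rng is None else rng
    critic = lambda x: x @ w_x + cond @ w_c
    # First two terms: Wasserstein estimate E[D(real)] - E[D(generated)].
    wasserstein = critic(real).mean() - critic(fake).mean()
    # Interpolate between real and generated features with alpha ~ U(0, 1).
    alpha = rng.uniform(size=(real.shape[0], 1))
    x_hat = alpha * real + (1.0 - alpha) * fake
    # For a linear critic the gradient w.r.t. x_hat equals w_x at every point.
    grad = np.broadcast_to(w_x, x_hat.shape)
    # Third term: penalize deviation of the gradient norm from one.
    penalty = ((np.linalg.norm(grad, axis=1) - 1.0) ** 2).mean()
    return wasserstein - lam * penalty
```

With a unit-norm `w_x` the penalty vanishes and the value reduces to the Wasserstein estimate; any other norm is penalized in proportion to `lam`.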
The second part of the loss is intended to generate CNN features that suit a discriminative classifier. In other words, the construction of the classifier constrains the feature generation of $G$ to balance their relationship. Therefore, we maximize the probability of the generated feature $\tilde{x}_s$ under the classifier trained on the real feature $x_s$, which amounts to minimizing the classification loss defined by the negative log likelihood:
$$L_{CLS} = -E_{\tilde{x}_s \sim p_{\tilde{x}_s}}[\log P(y_s|\tilde{x}_s; \theta)] \qquad (2)$$
where $\tilde{x}_s = G(z, c(y_s))$, $y_s$ is the class label of $\tilde{x}_s$ in seen classes, and $P(y_s|\tilde{x}_s; \theta)$ is the probability that $\tilde{x}_s$ has the class label $y_s$. The probability can be modeled by a classification method parameterized by $\theta$, for example the linear softmax classifier or a support vector machine. These classification methods can be learned from the pairs of real features and class labels of seen classes.
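The negative log likelihood can be sketched for the linear softmax case as follows. The function name `softmax_nll` and the array layout (one feature per row, `theta` of shape $d_x \times K$) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax_nll(features, labels, theta):
    """Mean negative log likelihood -E[log P(y | x; theta)] of a linear softmax."""
    logits = features @ theta                              # shape (n, K)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Pick the log probability of the labeled class for each sample.
    return -log_prob[np.arange(len(labels)), labels].mean()
```

When `features` are generated features and `theta` is held fixed, the returned value plays the role of the classification loss that the generator is asked to reduce.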
The first two parts of the loss are the main ideas of CLSWGAN [1]. We propose a novel transfer loss (comprising the third and fourth parts of the loss) on top of CLSWGAN [1] to take transfer information into account. The third part of the loss handles the classification loss of the generated feature $\tilde{x}_u$ of unseen classes. However, we cannot construct a classifier trained on the real feature $x_u$, which is unavailable in ZSL or GZSL. Because semantic information is complete over the concepts of all classes, we draw on the relationship between seen and unseen classes in the semantic embedding to transfer the classifier model $P(\theta)$. The transferred classifier model is denoted $Q(\theta)$ (see section 3.2), and the third part of the loss is the model transfer loss $L_{TRA1}$ of equation (3). The fourth part of the loss is constructed based on the discriminator $D$. We expect to capture the identification information of unseen classes in the discriminator $D$.
Therefore, the loss $L_{TRA2}$ is defined by equation (4), where $\tilde{x}_u = G(z, c(y_u))$ and $c(y_u)$ is the class embedding of the semantic description of unseen classes. In the loss $L_{TRA2}$, we consider the information of unseen classes to balance the bias of the discriminator $D$ toward seen classes. The total loss is built from the above four parts, and the full objective is given by equation (5).

Transferring and Classification
In the proposed model, we need three classifier models to generate discriminative features for classifying unseen classes.
The first classifier model $P(\theta)$ is trained on samples of seen classes to describe $L_{CLS}$ and improve the classification performance of the generated features of seen classes:
$$\min_{\theta} -\frac{1}{N_T} \sum_{(x_s, y_s) \in T} \log P(y_s|x_s; \theta) \qquad (6)$$
where $P(y_s|x_s; \theta) = \frac{\exp(\theta_s^T x_s)}{\sum_{i=1}^{K} \exp(\theta_i^T x_s)}$, $P(\theta) = [P(y_1|x_1; \theta), P(y_2|x_2; \theta), ..., P(y_K|x_K; \theta)]$, $\theta_s$ is a column of $\theta \in \mathbb{R}^{d_x \times K}$, the transformation matrix that maps an image feature $x$ to the probabilities of $K$ classes, $T = S$, and $N_T$ is the number of samples in $T$.
The second classifier model $Q(\theta)$ cannot be learned from samples of unseen classes, which are unavailable in ZSL and GZSL. Therefore, we construct $Q(\theta)$ by $T(P(\theta))$ to represent $L_{TRA1}$ and enhance the discrimination of the synthesized features of unseen classes. Given $P(\theta) = A_s W_{ss}$, $P(\theta)$ is decomposed into the sharing part $A_s$ and the unique part $W_{ss}$ in the probability pattern of seen classes. Likewise, $Q(\theta) = A_u W_{uu}$ means that $Q(\theta)$ is divided into the sharing part $A_u$ and the unique part $W_{uu}$ in the probability pattern of unseen classes. $A_u = A_s W_{su}$ indicates the relationship of the sharing parts between seen and unseen classes. Substituting $A_s = P(\theta) W_{ss}^{-1}$ into these relations, the transfer transformation function $T(P(\theta))$ can be deduced as
$$Q(\theta) = T(P(\theta)) = P(\theta) W_{ss}^{-1} W_{su} W_{uu} \qquad (7)$$
where $W_{ss} \in \mathbb{R}^{K \times K}$ is the similarity matrix among seen classes, $W_{uu} \in \mathbb{R}^{M \times M}$ is the similarity matrix among unseen classes, and $W_{su} \in \mathbb{R}^{K \times M}$ is the similarity matrix between seen and unseen classes.
Each element $W_{ss}(i, j)$ of $W_{ss}$, with $i, j \in \{1, 2, ..., K\}$, relates the semantic embeddings $c(y_s)_i$ and $c(y_s)_j$ of classes $i$ and $j$ in seen classes, and $N_{c(y_s)_j}$ is the neighborhood of $c(y_s)_j$.
Each element $W_{uu}(i, j)$ of $W_{uu}$, with $i, j \in \{1, 2, ..., M\}$, relates the semantic embeddings $c(y_u)_i$ and $c(y_u)_j$ of classes $i$ and $j$ in unseen classes, and $N_{c(y_u)_j}$ is the neighborhood of $c(y_u)_j$.
Each element $W_{su}(i, j)$ of $W_{su}$, with $i \in \{1, 2, ..., K\}$ and $j \in \{1, 2, ..., M\}$, relates the semantic embedding $c(y_s)_i$ of class $i$ in seen classes to the semantic embedding $c(y_u)_j$ of class $j$ in unseen classes, and $N_{c(y_s)_j}$ is the neighborhood of $c(y_s)_j$. In all three cases, the neighborhood size for selecting the related semantic embeddings is usually set to 5.
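Combining $P(\theta) = A_s W_{ss}$, $A_u = A_s W_{su}$ and $Q(\theta) = A_u W_{uu}$ gives $Q(\theta) = P(\theta) W_{ss}^{-1} W_{su} W_{uu}$ whenever $W_{ss}$ is invertible. The sketch below illustrates this structure propagation under stated assumptions: cosine similarity stands in for the similarity measure, each column keeps only its 5 most similar rows as the neighborhood, and a pseudo-inverse replaces the inverse in case $W_{ss}$ is singular. The names `knn_similarity` and `transfer_classifier` are hypothetical.

```python
import numpy as np

def knn_similarity(rows, cols, k=5):
    """Cosine similarity between embeddings, kept only inside each column's k-neighborhood."""
    rows_n = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    cols_n = cols / np.linalg.norm(cols, axis=1, keepdims=True)
    sim = rows_n @ cols_n.T                                  # shape (n_rows, n_cols)
    # Rank of every row within its column, 0 being the most similar.
    ranks = np.argsort(np.argsort(-sim, axis=0), axis=0)
    return np.where(ranks < k, sim, 0.0)

def transfer_classifier(p_seen, c_seen, c_unseen, k=5):
    """Q = P W_ss^+ W_su W_uu, mapping K seen-class probabilities to M unseen classes."""
    w_ss = knn_similarity(c_seen, c_seen, k)                 # K x K
    w_su = knn_similarity(c_seen, c_unseen, k)               # K x M
    w_uu = knn_similarity(c_unseen, c_unseen, k)             # M x M
    return p_seen @ np.linalg.pinv(w_ss) @ w_su @ w_uu
```

The output rows are not renormalized here; whether the transferred scores should be normalized into probabilities is a design choice this sketch leaves open.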
The third classifier model is constructed from the real features of seen classes $s$ and the synthesized features of unseen classes $u$, which transforms ZSL into supervised learning. We learn the model parameter $\phi$ by equation (12), where $\phi$ is regarded as the weight matrix that projects the feature $x$ to $N$ categories in a fully connected layer of a deep network. In ZSL, $T = \tilde{U}$, while $T = S \cup \tilde{U}$ in GZSL, with $\tilde{U} = \{U, \tilde{x}_u\}$. The prediction function $f(x)$ is defined by equation (13), where in ZSL $x \in \{x_u\}$, $x_u$ is the image feature of unseen classes and $y \in Y_u$, while in GZSL $x \in \{x_s, x_u\}$ and $y \in (Y_s \cup Y_u)$.
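To make the final supervised step concrete, the following sketch trains a linear softmax classifier on a mixture of real and synthesized features and predicts by taking the highest class score over a candidate label set. It assumes plain gradient descent on the mean negative log likelihood and no bias term; `train_softmax`, `predict`, the learning rate and the epoch count are illustrative choices rather than the settings used in the paper.

```python
import numpy as np

def train_softmax(x, y, n_classes, lr=0.1, epochs=200):
    """Fit phi (d_x by N) by gradient descent on the mean negative log likelihood."""
    phi = np.zeros((x.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = x @ phi
        logits -= logits.max(axis=1, keepdims=True)          # numerical stability
        prob = np.exp(logits)
        prob /= prob.sum(axis=1, keepdims=True)
        phi -= lr * (x.T @ (prob - onehot)) / len(y)         # softmax NLL gradient
    return phi

def predict(x, phi, candidate_labels):
    """Return, for each row of x, the candidate label with the highest class score."""
    candidate_labels = np.asarray(candidate_labels)
    scores = (x @ phi)[:, candidate_labels]
    return candidate_labels[scores.argmax(axis=1)]
```

In the ZSL setting `candidate_labels` would hold only unseen-class labels, while in GZSL it would hold the labels of both seen and unseen classes.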
The pseudo code of TFGNSCS is shown in Algorithm 1, which has five steps. The first step (line 1) initializes the structure representation of the semantic embedding by equations (8), (9), (10) and (11). The second step (line 2) trains the classifier model of seen classes by equation (6) and transfers the classifier model to unseen classes by equation (7). The third step (line 4) updates the discriminator $D$ with equation (5). The fourth step (from line 5) updates the generator $G$ with equation (5). The fifth step (from line 7) trains the classifier model and estimates the labels of unseen classes by equations (12) and (13).

Experiments
We first explain our experimental configuration, and then present (a) the comparison between the proposed method and the state of the art for ZSL and GZSL on four challenging datasets, (b) our analysis of the base-line methods based on different loss combinations, (c) our extended experiments on different transfer methods and (d) our parameter analysis for image feature generation.

Algorithm 1 The pseudo code of the TFGNSCS algorithm
Input: $S$ and $U$
Output: $\hat{y}$ (the estimate of $y_u$ for ZSL, or the estimate of $y_s$ and $y_u$ for GZSL)
1: Compute the structure representation $W_{ss}$ and $W_{su}$ of the semantic embedding by equations (8), (9), (10) and (11)
2: Train and transfer the classifier model by equations (6) and (7)
3: for each training iteration do
4:   Update the discriminator $D$ by equation (5)
5:   Update the generator $G$ by equation (5)
6: end for
7: Train the classifier model and estimate the label $\hat{y}$ of classes by equations (12) and (13)

Datasets
We implement and evaluate the proposed method TFGNSCS for ZSL and GZSL on four challenging datasets: Animals with Attributes (AwA) [6], CUB-200-2011 Birds (CUB) [41], SUN Attribute (SUN) [42] and Oxford Flower (FLO) [43]. AwA includes 30475 images, 50 categories and 85 attributes, and is a coarse-grained dataset. CUB, SUN and FLO are fine-grained datasets. CUB contains 200 bird classes with 312 attributes, for a total of 11788 images. SUN contains 14340 images from 717 scenes with 102 attributes. FLO has 8189 images from 102 flower classes, which are annotated with visual descriptions [44]. Table 1 shows the statistics of these datasets.

Visual and Semantic Feature
ZSL recognizes the visual samples of unseen classes through the complete semantic relation, so visual features and semantic class embeddings must first be extracted or described. Deep networks show outstanding performance in extracting discriminative features from visual or semantic information. Therefore, we use the same description as in [1]. We represent the entire image as the 2048-dimensional feature from the top layer of the 101-layer ResNet [45] pre-trained on ImageNet 1K, without image pre-processing, network fine-tuning or data augmentation. We use pre-annotated attributes as the semantic class embedding: an 85-dimensional vector for AwA, a 312-dimensional vector for CUB and a 102-dimensional vector for SUN. For FLO, which has no pre-annotated attributes, we extract 1024-dimensional features with the CNN-RNN of fine-grained visual descriptions [44]. Throughout feature extraction, we obey the ZSL rule that the information of $Y_s$ and $Y_u$ does not overlap.

Classification protocols
In ZSL, a test image corresponds to an unseen class label in $Y_u$, while

Comparison with the state-of-the-arts
In this section, because the generative adversarial architecture and structure constraints are the basic ideas behind TFGNSCS, we compare the proposed method with five related state-of-the-art methods. The first method, the dual-verification network (DVN), constructs and verifies the orthogonal projection between features and attributes in a pairwise manner in the respective spaces [46]. The second method, a hybrid model (HM), includes random attribute selection (RAS) and a conditional generative adversarial network (cGAN) for adversarial synthesis of unseen visual features [47]. The third method, visual center learning (VCL), aligns the projected semantic center with the visual cluster center by minimizing the distance between the synthetic and real centers in the visual feature space [48]. The fourth method, the triple verification network (TVN), constructs a unified optimization of regression and compatibility functions to integrate the complementary losses and the mutual regularization [49]. The fifth method, feature generating networks (FGN), pairs a Wasserstein GAN with a classification loss to generate sufficiently discriminative CNN features for training a softmax classifier in ZSL or GZSL [1]. Note that all of the above are inductive methods, which do not use the test data for training the model and so strictly follow the ZSL or GZSL setting.
Tab. 2 shows the comparison between the proposed method (TFGNSCS) and the five state-of-the-art methods (DVN, HM, VCL, TVN and FGN) for ZSL. TFGNSCS achieves the best result on every dataset, improving performance by at least 1.7% on AwA, 0.6% on CUB, 2.4% on SUN and 0.9% on FLO. Tab. 3 shows the comparison between TFGNSCS and the same five methods for GZSL, where TFGNSCS again outperforms all of them on all datasets. The harmonic mean H measures the performance of the different methods on each dataset, and a higher value of H indicates a better result for GZSL. The H of TFGNSCS improves by at least 4.9 on AwA, 3.3 on CUB, 1.2 on SUN and 3.1 on FLO.
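For reference, the harmonic mean can be computed as below, assuming the usual GZSL convention that H combines the accuracy on seen classes and the accuracy on unseen classes. The function name `harmonic_mean` and the zero-accuracy guard are illustrative additions.

```python
def harmonic_mean(acc_seen, acc_unseen):
    """H = 2 * s * u / (s + u); returns 0.0 when both accuracies are zero."""
    total = acc_seen + acc_unseen
    return 0.0 if total == 0 else 2.0 * acc_seen * acc_unseen / total
```

Because H is dominated by the smaller of the two accuracies, a model that is accurate on seen classes but weak on unseen classes receives a low score.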

Comparison with the base-line methods
The proposed method (TFGNSCS) is built on the FGN [1] framework and extends it with two loss terms to form transfer feature generating networks. We carry out an ablation study that compares the proposed method with the base-line methods, which are FGN [1], TFGNSCS-1 and TFGNSCS-2. In FGN, the model only considers the classification loss ($L_{CLS}$) of seen classes within the generative adversarial framework. In TFGNSCS-1, the model integrates the classification loss ($L_{TRA1}$) of unseen classes, based on semantic structure transfer, with the FGN model. In TFGNSCS-2, the model combines the discriminator loss ($L_{TRA2}$) for unseen classes with the FGN model. The TFGNSCS model considers all of these factors for transferring and balancing the information between seen and unseen classes. Tab. 4 shows the experimental results of TFGNSCS and the base-line methods for ZSL. TFGNSCS outperforms the other methods, with an improvement of at least 2.3% on AwA, 2.9% on CUB, 4% on SUN and 0.9% on FLO. Tab. 5 shows that TFGNSCS also outperforms the base-line methods for GZSL: its H improves by at least 1.0 on AwA, 1.2 on CUB, 0.1 on SUN and 1.0 on FLO. For ZSL and GZSL, TFGNSCS-1 performs better than TFGNSCS-2 and FGN, while FGN performs worse than both TFGNSCS-1 and TFGNSCS-2. Considered individually, $L_{TRA1}$ in TFGNSCS-1 raises the performance of the model, and $L_{TRA2}$ in TFGNSCS-2 also improves it. However, $L_{TRA2}$ on its own enhances the difference of the discriminator between seen and unseen classes, whereas the transfer factor $L_{TRA1}$ weakens this imbalance of information between seen and unseen classes. Therefore, combining $L_{TRA1}$ and $L_{TRA2}$ in TFGNSCS lets $L_{TRA2}$ boost the transfer characteristic of $L_{TRA1}$, improving the performance of ZSL and GZSL.

Comparison with the different transfer methods
The transfer method is a key point in constructing the generating network model. In this paper, we focus on the transfer from the classifier model $P(\theta)$ of seen classes to the classifier model $Q(\theta)$ of unseen classes. Equation (7) gives the transformation relationship between $P(\theta)$ and $Q(\theta)$. Besides this approach, if we incorporate the image features of unseen classes into a semantic class prototype graph, ZSL can be regarded as an extended absorbing Markov chain process on this graph [22]. This yields an alternative transfer transformation function $T(P(\theta))$, given by equation (14), where $I \in \mathbb{R}^{K \times K}$. We denote the model learned with this transfer method as TFGNSCS-alt. In Tab. 6 and Tab. 7, we find that the difference between the transfer methods is slight for ZSL or GZSL. The main reason is that both transfer methods can adjust the imbalance of information between seen and unseen classes, improving the discrimination of the generated features through adversarial learning.

Parameter analysis
In TFGNSCS, the number of generated features directly affects the performance of ZSL or GZSL. Therefore we select the number of generated features from

Experimental results analysis
In the experiments, we compare the proposed method with eight methods: five state-of-the-art methods (DVN [46], HM [47], VCL [48], TVN [49] and FGN [1] in section 4.4), three base-line methods (FGN [1], TFGNSCS-1 and TFGNSCS-2 in section 4.5) and an alternative transfer method (TFGNSCS-alt in section 4.6). These methods construct related models to bridge the gap between visual and semantic information for ZSL or GZSL. In contrast, the proposed method focuses on mining the transfer information in the generative adversarial framework to obtain discriminative synthetic features. From these experiments, we make the following observations.
• TFGNSCS performs better than the five state-of-the-art methods (DVN, HM, VCL, TVN and FGN in section 4.4) for both ZSL and GZSL. The main reason is that TFGNSCS tries to balance the difference between seen and unseen classes through the transfer losses. Therefore, in the ZSL setting, its classification accuracy on unseen classes is higher than that of the other methods on all four datasets, and the improvement in harmonic mean is noticeable in the GZSL setting. In particular, the harmonic means of FGN and TFGNSCS significantly exceed those of the other state-of-the-art methods for GZSL on the different datasets.
• The performance improvement of TFGNSCS differs across the three base-line approaches (FGN, TFGNSCS-1 and TFGNSCS-2) for ZSL or GZSL. TFGNSCS improves ZSL on the four datasets, while a better improvement is demonstrated for GZSL on all datasets. Among these, the most notable improvement is in the harmonic mean of GZSL on AwA and FLO. This shows that the transfer method enhances the discrimination of the generated features through the transfer losses in adversarial networks, and further validates that semantic structure transfer is effective for constructing the learning model in ZSL or GZSL.
• The different transfer methods have similar performance for ZSL and GZSL on all datasets. In this paper, both transfer methods are built on a graph of semantic classes, which represents the distribution structure of the classes. They differ in the way the structure is propagated. The structure propagation of equation (7) in TFGNSCS draws on the relationship of the sharing parts in the respective classification models of seen and unseen classes, while the structure propagation of equation (14) in TFGNSCS-alt is based on an extended absorbing Markov chain process. In the learning process of the adversarial networks, this difference in structure propagation has little effect on ZSL and GZSL.
• The two parts of the transfer loss (the unseen-class classification loss $L_{TRA1}$ and the generated-feature discrimination loss $L_{TRA2}$) affect the performance of ZSL and GZSL differently. The unseen-class classification loss $L_{TRA1}$ boosts the performance of ZSL and GZSL, while the generated-feature discrimination loss $L_{TRA2}$ plays a smaller role in this improvement. However, integrating the two losses further improves the performance of ZSL and GZSL. The loss $L_{TRA2}$ is an auxiliary means of raising the performance of TFGNSCS, and its role depends on the quality of the features generated under $L_{TRA1}$.
• The number of generated features influences the performance of ZSL and GZSL. Increasing the number of generated features improves the performance of ZSL, and this effect is more obvious for GZSL. This shows that, for the same number of features, the proposed TFGNSCS synthesizes more discriminative features because the transfer loss contributes to the model construction.

Conclusion
We have proposed the transfer feature generating networks with semantic classes structure (TFGNSCS) method to address the imbalance between seen and unseen classes in ZSL and GZSL. TFGNSCS not only adapts the semantic structure relationship between seen and unseen classes to a uniform feature generating framework, but also models the difference of the generated features by balancing transfer information between seen and unseen classes in the networks. Furthermore, TFGNSCS combines a Wasserstein generative adversarial network with the classification loss and the transfer loss to generate sufficient CNN features for improving ZSL and GZSL. Finally, the optimization of TFGNSCS yields both the transfer feature generating networks and more discriminative features. To evaluate the proposed TFGNSCS, we carry out comparison experiments with the state-of-the-art methods, the base-line methods and the other transfer method, together with a parameter analysis, on the AwA, CUB, SUN and FLO datasets.
Experimental results demonstrate that TFGNSCS obtains promising results in ZSL and GZSL.
Our contributions are threefold. (a) We present a novel adversarial generative model, TFGNSCS, which synthesizes CNN features of classes by optimizing and balancing the related losses, namely the transfer loss, the classification loss and the Wasserstein distance loss. (b) On four challenging datasets of different size and granularity, the proposed TFGNSCS outperforms the state of the art in the GZSL setting.

Existing ZSL methods only use the information of seen classes (label data, image feature data and semantic data) during training, and predict the labels of unseen classes through the latent relations of the semantic space (in a complete semantic space, semantic concepts share a uniform description, from which their distribution relations can be captured). The main idea of the proposed model is to transfer the information of seen classes into the synthesized features of unseen classes by structure propagation, and to iteratively construct the learned model from the real information of seen classes and the generated information of unseen classes. Therefore, the key point of the proposed method is that it not only draws support from the semantic embedding vectors but also models the transfer relation (structure propagation) to generate CNN features without any images of the class. The transfer relation can alleviate the inconsistency of the generated distribution over all categories. Because the synthesized CNN features can be used as samples of unseen classes, ZSL can be converted into supervised learning, and a softmax classifier can be trained to recognize unseen classes.

We define the following notation for describing ZSL and GZSL. $S = \{(x_s, y_s, c(y_s)) \mid x_s \in X, y_s \in Y_s, c(y_s) \in C\}$ is the training set, which contains the seen classes. $x_s \in \mathbb{R}^{d_x}$ is the $d_x$-dimensional CNN feature in the feature set $X$, $y_s$ is the class label of $x_s$ in $Y_s = \{y_s \mid s = 1, 2, ..., K\}$, and $c(y_s) \in \mathbb{R}^{d_c}$ is the semantic embedding of class $y_s$ (the class vector of a semantic description such as attributes) in the set of semantic embeddings $C$. In addition, the information about unseen classes available during training is $U = \{(y_u, c(y_u)) \mid y_u \in Y_u, c(y_u) \in C\}$, where $y_u$, $c(y_u)$ and $Y_u = \{y_u \mid u = 1, 2, ..., M\}$ are respectively the class label, the class embedding and the class label set of the unseen classes, for which no images or features are available. Therefore, the purpose of ZSL is to learn a projection $f: X \times C \to Y_s$ for deciding which class in $Y_u$ an image of an unseen class belongs to, while GZSL learns the same projection for deciding which class in $Y_s \cup Y_u$ an image of a seen or unseen class belongs to, where $Y_s \cap Y_u = \emptyset$.
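The conversion of ZSL into supervised learning described above can be illustrated with a minimal sketch. It assumes a trained generator is available; `toy_generator` below is a hypothetical stand-in (it simply perturbs the class embedding), and the class names and embeddings are made-up toy values. A real generator would output $d_x$-dimensional CNN features rather than vectors of the embedding size.

```python
import random


def toy_generator(z, c):
    """Hypothetical stand-in for a trained generator G(z, c(y_u)).

    A real generator is a learned network; here the class embedding is
    only perturbed by scaled noise so that the example is runnable.
    """
    return [ci + 0.1 * zi for ci, zi in zip(c, z)]


def synthesize_unseen_set(unseen_embeddings, n_per_class, seed=0):
    """Build a labeled synthetic training set for the unseen classes."""
    rng = random.Random(seed)
    features, labels = [], []
    for label, c in unseen_embeddings.items():
        for _ in range(n_per_class):
            # Gaussian noise with the same dimensionality as the embedding.
            z = [rng.gauss(0.0, 1.0) for _ in c]
            features.append(toy_generator(z, c))
            labels.append(label)
    return features, labels


# Toy class embeddings c(y_u) for two unseen classes (illustrative values).
unseen = {"zebra": [1.0, 0.0, 1.0], "whale": [0.0, 1.0, 0.0]}
X_syn, y_syn = synthesize_unseen_set(unseen, n_per_class=5)
```

The pairs in `X_syn` and `y_syn` would then serve as ordinary labeled samples for training a softmax classifier over the unseen classes.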
where $\tilde{x}_u = G(z, c(y_u))$, $c(y_u)$ is the class embedding of the semantic description of an unseen class, and $Q(y_u \mid \tilde{x}_u; \theta)$ is the probability that $\tilde{x}_u$ has the class label $y_u$ under the classifier model $Q(\theta) = T(P(\theta))$. The probability over unseen classes is parameterized by $\theta$. $L_{TRA1}$ is the negative log-likelihood of the transformed model $T(P(\theta))$. Through $L_{TRA1}$, we constrain the discriminability of the generated features of unseen classes by transferring the classification model of seen classes, and thereby alleviate the inconsistency of generated features between seen and unseen classes.
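As a reading aid, the negative log-likelihood described in this paragraph could be written as follows. This is a sketch built only from the definitions given here; the expectation over the noise $z$ and over the unseen class labels is an assumption about how the loss is averaged.

```latex
L_{TRA1} = -\,\mathbb{E}_{z \sim p(z),\; y_u \in Y_u}
  \left[ \log Q\!\left(y_u \mid \tilde{x}_u; \theta\right) \right],
\qquad \tilde{x}_u = G\!\left(z, c(y_u)\right)
```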

Figure 2: The proposed TFGNSCS minimizes the transfer loss, which consists of the unseen classes classification loss $L_{TRA1}$ and the generated features discrimination loss $L_{TRA2}$. $T(P(\theta))$ is the transfer transformation function; $\tilde{x}_s$ and $\tilde{x}_u$ are respectively the generated features of seen and unseen classes.

Figure 3: The architecture of discriminator D. The output of the discriminator is a real value that represents the degree of matching between the feature and the semantic embedding given as input.

Figure 4: The architecture of generator G.
between seen and unseen classes based on the semantic class prototype graph (in this graph, each class corresponds to one node, represented by the semantic embedding of that class, and the weight between two nodes describes the similarity between the classes). These similarity matrices (structure representation) can be measured by the cosine distance $d(a, b)$ between the semantic embeddings $a$ and $b$ of any two classes.
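The similarity matrices of the semantic class prototype graph can be sketched as below. The sketch assumes that the cosine measure is the standard cosine similarity $a \cdot b / (\|a\| \|b\|)$; the function names and the toy embeddings are illustrative and not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Standard cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(rows, cols):
    """Edge weights between every class in `rows` and every class in `cols`."""
    return [[cosine_similarity(r, c) for c in cols] for r in rows]


# Toy semantic embeddings: two seen classes and one unseen class.
seen_embeddings = [[1.0, 0.0], [0.0, 1.0]]
unseen_embeddings = [[1.0, 1.0]]

# Seen-to-unseen block of the graph: one row per seen class.
W_su = similarity_matrix(seen_embeddings, unseen_embeddings)
```

Each row of `W_su` holds the weights from one seen class to all unseen classes; the seen-to-seen and unseen-to-unseen blocks can be obtained with the same function.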
this image can be related to any class label in $Y_s \cup Y_u$ in GZSL. To evaluate the performance of ZSL or GZSL, we compute the average per-class top-1 accuracy by dividing the sum of the per-class accuracies by the number of classes: tr stands for the average per-class top-1 accuracy on seen classes, ts denotes the average per-class top-1 accuracy on unseen classes, and $H = 2 \cdot tr \cdot ts / (tr + ts)$ is their harmonic mean on $Y_s \cup Y_u$ in GZSL. We expect the proposed model to generate discriminative features. For comparison with the state of the art, we keep the generator and discriminator architectures of [1]. Both architectures include a multi-layer perceptron (MLP) with leaky rectified linear units (LReLU), a single layer with 4096 hidden units and an output rectified linear unit (ReLU) layer for learning the top max-pooling units of ResNet-101. The noise z is drawn from a Gaussian distribution with unit variance, and the dimensionality of z equals that of the class embedding. We adopt λ = 10 [4], β = 0.01 [1], γ = 0.01 and η = 1 so that results can be compared conveniently across methods. In practice, however, these hyperparameters can be obtained by cross-validation on the specific dataset.
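The evaluation protocol can be made concrete with a short sketch. The helper names and the toy labels below are illustrative assumptions; only the definitions of the per-class accuracy and of H come from the text.

```python
from collections import defaultdict


def per_class_top1(y_true, y_pred):
    """Average per-class top-1 accuracy: mean of the per-class accuracies."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


def harmonic_mean(tr, ts):
    """H = 2 * tr * ts / (tr + ts), defined as 0 when both accuracies are 0."""
    return 0.0 if tr + ts == 0 else 2 * tr * ts / (tr + ts)


# Toy GZSL predictions; the label space contains seen and unseen classes.
seen_true = ["a", "a", "a", "b"]
seen_pred = ["a", "a", "u", "b"]
unseen_true = ["u", "u", "v", "v"]
unseen_pred = ["u", "a", "v", "v"]

tr = per_class_top1(seen_true, seen_pred)      # accuracy on seen classes
ts = per_class_top1(unseen_true, unseen_pred)  # accuracy on unseen classes
H = harmonic_mean(tr, ts)
```

Averaging per class rather than per sample keeps classes with many test images from dominating the score, and the harmonic mean is high only when both tr and ts are high.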

1, 2, 6, 10, 30, 50, 100, 200, 300 to construct different classification models for comparing TFGNSCS with the baseline methods. Figure 5 shows that TFGNSCS outperforms the baseline methods. In particular, classification accuracy rises significantly as the number of generated features for unseen classes increases from 1 to 50, e.g. from 42.1 to 67.7 on FLO and from 33.3 to 57.0 on CUB for TFGNSCS. Figure 6 likewise shows a significant rise in the harmonic mean as the number of generated features for unseen classes increases from 1 to 50, e.g. from 48.1 to 67.7 on FLO and from 32.1 to 48.6 on CUB for TFGNSCS. This indicates that TFGNSCS adapts better than the other methods.

Figure 5: Impact of the number of generated features on unseen class accuracy for zero-shot learning on FLO and CUB.

Figure 6: Impact of the number of generated features on the harmonic mean for generalized zero-shot learning on FLO and CUB.

Table 1: Dataset statistics: semantic embedding C (att/dimension for per-class attributes or stc/dimension for sentences), number of training classes (Ys, training + validation) and test classes (Yu), and the visual feature X used in the experiments.

Table 2: Comparison of TFGNSCS with state-of-the-art methods for ZSL with semantic features and ResNet visual features.

Table 3: Comparison of TFGNSCS with state-of-the-art methods for GZSL with semantic features and ResNet visual features. tr = average per-class top-1 accuracy (%) on seen classes, ts = average per-class top-1 accuracy (%) on unseen classes and H = harmonic mean for GZSL are reported under the same data configurations for the different dataset splits.
TFGNSCS 60.3 69.5 64.5 47.6 59.7 53.0 44.3 37.5 40.6 61.1 78.6 68.7

Table 4: Comparison of TFGNSCS with the baseline methods for ZSL with semantic features and ResNet visual features.

Table 5: Comparison of TFGNSCS with the baseline methods for GZSL with semantic feature

Table 6: Comparison of TFGNSCS with the alternative transfer method for ZSL with semantic features and ResNet visual features.

Table 7: Comparison of TFGNSCS with the alternative transfer method for GZSL with semantic features and ResNet visual features. tr = average per-class top-1 accuracy (%) on seen classes, ts = average per-class top-1 accuracy (%) on unseen classes and H = harmonic mean for GZSL are reported under the same data configurations for the different dataset splits.