Multi-Modality Adversarial Auto-Encoder for Zero-Shot Learning

Existing generative Zero-Shot Learning (ZSL) methods consider only the unidirectional alignment from the class semantics to the visual features and ignore the alignment from the visual features to the class semantics, and thus fail to construct the visual-semantic interactions well. In this paper, we propose to generate visual features with an auto-encoder framework paired with two adversarial networks, one for the visual modality and one for the semantic modality, to reinforce the visual-semantic interactions with a bidirectional alignment. This ensures that the generated visual features fit the real visual distribution and are highly related to the semantics. The encoder aims at generating real-like visual features, while the decoder forces both the real and the generated visual features to be more related to the class semantics. To further capture the discriminative information of the generated visual features, both the real and the generated visual features are forced to be classified into the correct classes via a classification network. Experimental results on four benchmark datasets show that the proposed approach is particularly competitive on both the traditional ZSL and the generalized ZSL tasks.


I. INTRODUCTION
In recent years, deep learning techniques have achieved remarkable performance in both computer vision and machine learning, constantly pushing the boundaries of what is possible. This progress partly relies on the growing availability of big data. However, in some cases the data are difficult to collect, e.g., fine-grained classification data. To build powerful models in these situations, Zero-Shot Learning (ZSL) [1], [6], [8], [10], [35] has been developed and has proven to be a promising direction for missing-data scenarios. The ZSL task requires classifying unseen classes that have no visual data available for training, which is achieved by transferring knowledge from the seen classes to the unseen ones with semantic information termed class prototypes, e.g., attributes [1] and word vectors [37].
Recently, to address the missing-data issue of unseen classes, some approaches [16], [18], [23], [42], [43] try to synthesize pseudo visual features for unseen classes with generative models. In essence, these approaches take as input the class semantics prototypes and learn a model that narrows the distribution differences between the generated and the real visual features. Once the class semantics prototype of an unseen class is available, the learned model can generate as many corresponding pseudo visual features as needed. However, the existing generative zero-shot approaches mostly focus on capturing the visual distribution information via a unidirectional alignment from the class semantics to the visual features, which may make the generated visual features less semantics-related and less discriminative.
The associate editor coordinating the review of this manuscript and approving it for publication was Mauro Gaggero.
To cope with the above issues, we propose to further explore the visual space from the following two aspects. First, we regularize the generated visual features to be highly related to the class semantics by enforcing that both the real and the generated visual features can be inferred back to the class semantics. Second, we regularize both the real and the generated visual features to be classified into the ground-truth class labels to capture the discriminative information. Specifically, we propose an auto-encoder framework paired with two respective adversarial networks. The encoder, acting as the visual feature generator, aims at capturing the real visual distribution by feeding the generated and the real visual features into an adversarial network. The decoder, acting as the semantics inference network, forces both the real and the generated pseudo visual features to be related to the class semantics by feeding the inferred and the real class semantics into another adversarial network. We also add a classification network to classify both the real and the generated visual features into the correct classes, which encourages the generated visual features to be as discriminative as the real visual features. The whole framework of the proposed model is illustrated in Fig. 1. Consequently, the decoder network and the classification network help the encoder network to improve the feature generation by enforcing the generated visual features to be semantically related and discriminative, respectively. Compared with the existing generative approaches, this architecture imposes a bidirectional visual-semantic alignment constraint, which facilitates the interactions between the visual and the class semantics modalities and captures the discriminative information derived from the visual feature space.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
It is worthwhile to highlight several aspects of the proposed approach here: 1) We formulate the visual generation process and the semantics inference process into an encoder-decoder framework so that they can improve each other in a cyclic fashion. 2) Each process is realized with the combination of an adversarial loss and a least square loss. The adversarial loss serves as a flexible metric to evaluate the consistency of real and generated features, while the least square loss minimizes the differences between the real features and the generated features so that the gradient does not vanish and training converges quickly. Thus, the proposed model is more efficient than the existing GAN-based ZSL approaches. 3) To further capture the discriminative information of the generated visual features, we also design a classification network to regularize the generated visual features. Different from the existing GAN-based ZSL approaches that directly train a classification network with the generated visual features, the proposed model trains a classification model with the class semantics to correctly classify both the real visual features and the generated visual features. Experimental results on four benchmark datasets show that our proposed approach achieves significant improvements on the traditional ZSL task and competitive or better performance on the generalized ZSL [34], [40] task compared with the state-of-the-art approaches.

II. RELATED WORK
From the view of constructing the visual-semantic interactions, the existing ZSL approaches could be divided into two categories: the discriminative models and the generative models.

A. DISCRIMINATIVE MODELS FOR ZSL
A simple approach to building visual-semantic interactions is to project the visual features to the class semantics space with a linear [29] or a non-linear model [3], [39], [41]. Some approaches learn a compatibility matrix to obtain the compatibility scores of the visual features and the class semantics prototypes with different objective functions, e.g., the ranking loss formulation [6], [7], the structural SVM loss [8], and the square loss function [9]. Recently, as one of the efforts most related to ours, SAE [11] employs a linear semantic encoder-decoder framework that regularizes the model by enforcing the encoder parameters and the decoder parameters to be symmetric. [48] proposes kernel methods to learn a non-linear mapping between the feature and attribute spaces. SE [42] is based on a generative model that synthesizes exemplars from both the seen and unseen classes and uses these synthesized exemplars to learn the semantic relationships between the feature and attribute spaces. CDL [42] aims to preserve the structure of the semantic space in the embedding space by utilizing semantic relations between categories. Although the visual samples are represented with deep features, these approaches cannot effectively handle the semantic inconsistency between the visual and the class semantics modalities, and commonly suffer from the information degradation caused by the "heterogeneity gap". In this work, we propose an encoder-decoder framework paired with adversarial networks to jointly capture the distribution information of the generated visual features and reinforce the visual-semantic alignment.

B. GENERATIVE MODELS FOR ZSL
To capture more distribution information from the visual space, recent work focuses on generating pseudo features for unseen classes from class semantics prototypes. A simple approach to generating visual features is to take the class semantics prototype directly as input to a linear [22] or a deep model [14]. Compared with the models that project the visual features to the class semantics space, the reversed projection models have the potential to alleviate the hubness issue, in which some unseen class prototypes ("hubs") tend to appear among the top neighbors of many test instances. We refer readers to [22] for more details. [52] proposes a Conditional Variational AutoEncoder (CVAE) that generates pseudo image features to learn the relationship between the image features and the class embedding. [53] proposes an end-to-end model called Cross-Layer AutoEncoder (CLAE), which integrates different ways of semantic mapping and maintains reconstruction information. Although promising results have been achieved, these approaches struggle to align the visual and class semantics spaces well, since each class has many visual samples in the visual space but only one class semantics prototype in the class semantics space.
In recent years, significant progress in generative approaches suggests that a desired distribution can be produced from a simple input via function approximators. Motivated by this idea, some models generate pseudo samples for unseen classes with adversarial networks [16], [18], [23], [44] and variational auto-encoders [21]. Our work is close to [24], in which an adversarial auto-encoder [17] is applied to generate visual features. Different from [24], which employs an adversarial criterion to constrain the latent codes produced from visual features to fit a prior noise distribution, our model reinforces the visual-semantics alignment by employing two adversarial networks that respectively fit the visual distribution and the class semantics distribution.

III. APPROACH
In this section, we first introduce the problem formulation and then discuss in detail the proposed generative model based on the encoder-decoder framework paired with two respective adversarial networks.

A. PROBLEM FORMULATION
Given a set of $N$ seen samples $\{(x_i, a_i, y_i)\}_{i=1}^{N}$, $x_i \in \mathbb{R}^p$ is the visual feature of the $i$-th sample, $a_i \in \mathbb{R}^q$ is the corresponding class semantics prototype and $y_i \in \mathcal{Y}_s$ is the associated one-hot class label; $\mathcal{Y}_s$ is the label space of seen classes; $p$ and $q$ are the dimensionalities of the visual and the class semantics spaces, respectively. During the test stage, the unseen class semantics prototypes and the class labels $\{a_t, y_t\}$ are provided, where $y_t \in \mathcal{Y}_t$ and $\mathcal{Y}_s \cap \mathcal{Y}_t = \emptyset$. In the traditional ZSL task, the test sample $x_t \in \mathbb{R}^p$ comes from unseen classes and is classified into the pre-defined candidate unseen classes $\mathcal{Y}_t$. In contrast, in the generalized ZSL task, the test sample $x_t$ comes from either seen or unseen classes and is classified into the set composed of both seen and unseen classes.

B. MULTI-MODALITY ADVERSARIAL AUTO-ENCODER (MAAE)
In this work, we attempt to generate semantics-related and discriminative samples for unseen classes to address the sample-missing issue in the ZSL task. To this end, we design a generative approach called Multi-modality Adversarial Auto-Encoder (MAAE) to generate visual features for ZSL, as illustrated in Fig. 1. In MAAE, two adversarial branches are formulated into an encoder-decoder framework, which separately capture the semantics-related and the visual distribution information. In the encoder branch, the class semantics prototype $a$ together with the noise vector $z$ is taken as input to generate the pseudo visual feature $\tilde{x}$ with a generative network, which learns a mapping from the joint space of class semantics and noise into the visual space. In the decoder branch, the visual sample $x$ is decomposed into two independent vectors with an inference network, which learns an inverse mapping from the visual space to the joint space spanned by the class semantics and the noise vector.
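The two mappings described above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy dimensions, the one-hidden-layer networks, the linear output layer, and the helper names (`build`, `generate`, `infer`) are all assumptions made for illustration, and no training is performed.

```python
import random

def init_layer(n_in, n_out, rng, sigma=0.01):
    """Return Gaussian-initialised weights (n_out x n_in) and zero biases."""
    weights = [[rng.gauss(0.0, sigma) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def affine(inputs, layer):
    """Apply one affine layer to a list of floats."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def mlp(inputs, hidden_layer, output_layer):
    """ReLU hidden layer followed by a linear output layer (an assumption)."""
    hidden = [max(0.0, h) for h in affine(inputs, hidden_layer)]
    return affine(hidden, output_layer)

def build(n_in, n_hidden, n_out, rng):
    """Create the two layers of a small network."""
    return init_layer(n_in, n_hidden, rng), init_layer(n_hidden, n_out, rng)

def generate(generator, prototype, noise):
    """Encoder branch: map the concatenation [a; z] to a pseudo visual feature."""
    return mlp(list(prototype) + list(noise), *generator)

def infer(inference_net, feature, q):
    """Decoder branch: map a visual feature back to (semantics, noise)."""
    joint = mlp(feature, *inference_net)
    return joint[:q], joint[q:]

# Toy sizes (hypothetical): q-dim semantics, noise_dim-dim noise, p-dim features.
rng = random.Random(0)
q, noise_dim, p, hidden = 4, 3, 6, 8
G = build(q + noise_dim, hidden, p, rng)
E = build(p, hidden, q + noise_dim, rng)

a = [1.0, 0.0, 1.0, 0.0]
z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
x_tilde = generate(G, a, z)
a_tilde, z_tilde = infer(E, x_tilde, q)
```

The sketch only shows the shapes involved: the generator consumes a `q + noise_dim` vector and emits a `p`-dimensional feature, and the inference network reverses this by splitting its output into a semantics part and a noise part.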
The adversarial generative model has been employed in some previous approaches [18], [23]. Different from these methods, which mostly focus on synthesizing samples to capture the visual distribution via a unidirectional semantic-visual alignment, we propose to generate visual features with a bidirectional alignment, i.e., semantic-visual and visual-semantic alignments, so that the generated visual features capture both the semantics-related and the feature distribution information. First, the generated visual features are forced to be close to the real visual features to fit the real visual feature distribution. Second, both the real and the generated visual features are taken as input to the inference network to infer the corresponding class semantics, which ensures that the generated visual features are highly related to the class semantics. Finally, both the real and the generated visual features are restricted to be classified into the ground-truth classes, which ensures that the generated visual features are discriminative. Inspired by these three points, the objective of the proposed generative approach consists of the terms described in the remainder of this subsection.
For the encoder part, the class semantics and the noise are concatenated into a holistic vector for the generator network to generate pseudo visual features, which are supervised with the real visual features. Here $\tilde{x}_i$ is the pseudo visual feature generated with the corresponding class semantics prototype $a_i$ and a random Gaussian noise vector $z$; $\theta$ is the parameter of the generative network $G$. This term encourages the generated visual features to be similar to the real visual features. As the visual features are high-level representations, typical reconstruction metrics such as the $\ell_p$-norm have difficulty capturing the visual distribution. To this end, we further feed both the real visual features and the generated pseudo visual features into an adversarial learning process, illustrated in Fig. 2, in which the generator tries to approximate the real data distribution while the discriminator tries to distinguish whether the features are drawn from the generator's output or from the real data distribution (Eq. (3)). Here $\phi$ is the parameter of the discriminator $D$; $L_{GP} = (\|\nabla_{\hat{x}} D_\phi(\hat{x})\|_2^2 - 1)^2$ is the gradient penalty that enforces the Lipschitz constraint; $\hat{x}$ is the linear interpolation between the real feature $x$ and the generated feature $\tilde{x}$; and $\gamma$ is a hyperparameter.
For the decoder part, either the real or the generated visual features are fed to the semantics inference network, which decomposes the input into two separate vectors with two respective subnetworks. One is supervised by the real class semantics and the other is constrained into the noise space. Specifically, the inference network is written as $E_\upsilon(x) \to [\tilde{a}; \tilde{z}]$, where $\upsilon$ is the parameter of the inference network $E$. Intuitively, the inferred class semantic vector $\tilde{a}$ should be close to the real class semantic prototype (Eq. (4)). Since the samples from the same class share the same class semantic prototype, minimizing Eq. (4) encourages the generated visual features from the same class to gather together and capture the class semantics. Like the visual features, the class semantics are high-level representations, and the Euclidean distance has difficulty capturing semantic information. Hence, we also adopt adversarial learning for the semantics inference, as illustrated in Fig. 3. Specifically, the decoder network is seen as the generative network of the adversarial process. A discriminator is designed to distinguish whether the input is from the generator's output or the real data distribution. The real data are the concatenation of the class semantic vector and the random Gaussian noise vector, i.e., the input of the encoder network. The adversarial process is formulated similarly to Eq. (3), where $\omega$ is the parameter of the discriminator, $L_{GP}$ is the gradient penalty, $[\cdot; \cdot]$ is the concatenation operator, and $\eta$ is the hyperparameter.
As mentioned above, the whole network forms a closed loop, in which the visual-semantic interaction is reinforced with a bidirectional alignment. However, a robust visual-semantic interaction alone does not provide the discriminative power of the generated visual features, which is vital for classification. To boost the discriminative power of the generated visual features, we design a classification network that takes as input the real and the generated visual features and predicts the corresponding class labels. Its objective consists of $L_\psi(x_i, A)$ and $L_\psi(\tilde{x}_i, A)$, the classification losses of the real and the generated pseudo visual features, respectively, where $\psi$ is the parameter of the classification network. This term encourages the generated visual features to be as discriminative as the real visual features, so that they are classified into the ground-truth classes. Here $A \in \mathbb{R}^{q \times M}$ is the class semantics prototype matrix of both the seen and unseen classes, and $M$ is the number of all classes. The class probability $P(y_j \mid x_i, A)$ is computed over the prototypes $a_k \in A$; $a_j$ is the corresponding class semantics prototype of class $y_j$; $F_\psi$ is the linear function that projects the class semantics into the visual space. The value of $x_i^T F_\psi(a_j)$ is seen as the compatibility score between the visual feature $x_i$ and the $j$-th class semantic prototype $a_j$. If the sample $x_i$ belongs to class $y_j$, their compatibility score should be large; otherwise it should be small. In this way, the separability between any two different classes is enlarged. Besides, the unseen class semantic prototypes are also taken into consideration, which prevents the seen data from being classified into unseen classes. The seen-to-unseen bias issue is thus clearly mitigated.
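The compatibility-score classifier described above can be sketched as follows. The softmax normalisation over all prototypes and the negative log-likelihood loss are assumptions made here for illustration; the text only states that $x_i^T F_\psi(a_j)$ is the compatibility score and that $F_\psi$ is linear. The function names and the toy values are hypothetical.

```python
import math

def project(weights, prototype):
    """Linear map F_psi: class semantics (q-dim) -> visual space (p-dim)."""
    return [sum(w * a for w, a in zip(row, prototype)) for row in weights]

def class_probabilities(feature, prototypes, weights):
    """Softmax over compatibility scores x^T F_psi(a_k) (assumed normalisation)."""
    scores = [sum(x * f for x, f in zip(feature, project(weights, a)))
              for a in prototypes]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classification_loss(feature, label, prototypes, weights):
    """Negative log-probability of the ground-truth class index `label`."""
    return -math.log(class_probabilities(feature, prototypes, weights)[label])

# Toy example: two classes, identity projection, p = q = 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]  # may include seen and unseen classes
real_feature = [2.0, 0.0]
generated_feature = [1.5, 0.2]

# The same loss is applied to real and generated features of class 0.
total_loss = (classification_loss(real_feature, 0, prototypes, identity)
              + classification_loss(generated_feature, 0, prototypes, identity))
```

Because the normalisation runs over every prototype passed in, including those of unseen classes, a seen-class feature is penalised whenever it scores highly against an unseen prototype, which is the bias-mitigation effect the text describes.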
Overall, the objective function of the proposed model combines the above terms with $R(\theta, \upsilon)$, the regularizer on the parameters; $\lambda$ and $\mu$ are two balance scalars.
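For orientation, the individual terms described in this subsection can be written in one plausible form. This is a hedged sketch, not the paper's equations: it assumes squared-error reconstruction terms (the introduction mentions a least square loss) and a Wasserstein-style critic objective (suggested by the gradient penalty and the Lipschitz constraint). The names $L_{rec}^{v}$, $L_{adv}^{v}$, $L_{rec}^{s}$ and $L_{adv}^{s}$, the sign conventions, and the unconditioned discriminators are assumptions.

```latex
% Sketch only: forms assumed from the prose, not copied from the source.
\begin{align}
  \tilde{x}_i &= G_\theta([a_i; z]), \qquad
  [\tilde{a}_i; \tilde{z}_i] = E_\upsilon(x_i) \\
  L_{rec}^{v} &= \sum_i \lVert x_i - \tilde{x}_i \rVert_2^2
      && \text{(supervision of the encoder)} \\
  L_{adv}^{v} &= \mathbb{E}[D_\phi(\tilde{x})] - \mathbb{E}[D_\phi(x)] + \gamma L_{GP}
      && \text{(critic on visual features)} \\
  L_{rec}^{s} &= \sum_i \lVert a_i - \tilde{a}_i \rVert_2^2
      && \text{(supervision of the decoder)} \\
  L_{adv}^{s} &= \mathbb{E}[D_\omega(E_\upsilon(x))] - \mathbb{E}[D_\omega([a; z])] + \eta L_{GP}
      && \text{(critic on the joint semantics-noise space)}
\end{align}
```

In this reading, the decoder terms are evaluated on both real features $x_i$ and generated features $\tilde{x}_i$, as the text states that either may be the input of the inference network. How $\lambda$ and $\mu$ weight the terms in the overall objective is not recoverable from the prose and is deliberately left out of the sketch.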

C. APPLY MAAE FOR ZSL
With the proposed MAAE, pseudo visual features can be generated in the visual space for each unseen class from its class semantic prototype. During the test stage, the similarities between the test instance and the unseen classes are obtained by calculating the distances between the test visual feature and the generated unseen pseudo visual features. In this way, the test instance is classified with the Nearest Neighbor (NN) classifier based on the distances. Furthermore, since each class may obtain many pseudo visual features with different noise inputs, the classification can also be achieved by training a parametric classifier, e.g., softmax or SVM.
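The NN-based test procedure above can be sketched in a few lines. The Euclidean distance, the function name `nearest_class`, and the toy class names and feature values are assumptions for illustration; the text does not specify the distance metric.

```python
import math

def nearest_class(test_feature, generated):
    """Return the class whose generated pseudo feature is closest to the test feature.

    `generated` maps each candidate class label to a list of pseudo visual
    features produced for that class (e.g., with different noise inputs).
    """
    best_label, best_dist = None, math.inf
    for label, features in generated.items():
        for feature in features:
            dist = math.dist(test_feature, feature)  # Euclidean distance (assumed)
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label

# Hypothetical 2-D pseudo features for two unseen classes.
generated = {
    "zebra": [[0.0, 0.0], [0.1, 0.0]],
    "whale": [[1.0, 1.0]],
}
prediction = nearest_class([0.9, 0.8], generated)
```

For the traditional ZSL task `generated` would hold only unseen classes; for the generalized task the candidate set would also contain seen classes, represented by either real or generated features.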

IV. EXPERIMENTS
In this section, we first document the datasets and experimental settings. Then we present the comparison results of the proposed model on both traditional ZSL and generalized ZSL tasks. Finally, we discuss the impacts of both the classifiers and the number of generated visual samples on the proposed generative model.

Datasets.
We conduct experiments on four benchmark datasets: AwA1 [1], AwA2 [2], aPY [32], and SUN [33]. These datasets are all annotated with attributes that are used as the class semantics prototypes. The statistics of the datasets are listed in Table 1.
Features. As the visual representations, we use the features released by [2], which are extracted as the 2048-dim top-layer pooling units of the 101-layer ResNet. The visual features are scaled to [0, 1] with normalization. For the class semantics prototypes, we use the attributes provided by the datasets. Specifically, for the AwA1 and AwA2 datasets we use the class-level attributes directly, while for the aPY and SUN datasets we average the image-level attributes to obtain the class semantics prototypes.

A. RESULTS OF THE TRADITIONAL ZSL
1) EVALUATION PROTOCOL
For the traditional ZSL task that assumes that the test data all come from unseen classes, we use the average per-class Top-1 accuracy T following the majority of the prior work. For the generalized ZSL task, we compute the average per-class Top-1 accuracy s on the seen classes, the average per-class Top-1 accuracy u on the unseen classes, and their harmonic mean, i.e. H = 2 × (s × u)/(s + u).
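The two metrics above can be computed as in the following sketch. The function names and the toy labels are hypothetical; the harmonic mean follows the formula H = 2 × (s × u)/(s + u) given in the text, with a guard for the degenerate case s + u = 0 added here.

```python
from collections import defaultdict

def per_class_top1(true_labels, predicted_labels):
    """Average of the per-class Top-1 accuracies (each class weighted equally)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, predicted in zip(true_labels, predicted_labels):
        total[truth] += 1
        correct[truth] += int(truth == predicted)
    return sum(correct[c] / total[c] for c in total) / len(total)

def harmonic_mean(s, u):
    """H = 2 * s * u / (s + u); returns 0.0 when both accuracies are zero."""
    return 0.0 if s + u == 0 else 2.0 * s * u / (s + u)

# Toy example: class "a" has 2 of 3 correct, class "b" has 1 of 1 correct.
truth = ["a", "a", "a", "b"]
predicted = ["a", "a", "b", "b"]
accuracy = per_class_top1(truth, predicted)
h = harmonic_mean(0.8, 0.4)
```

Averaging per class rather than per sample keeps classes with many test images from dominating the score, and the harmonic mean stays low unless both the seen and the unseen accuracies are high.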

2) IMPLEMENTATION DETAILS
The proposed MAAE has many parameters, including the number of hidden layers, the neuron number of each hidden layer, the hyperparameters, the batch size, and the learning rate. In MAAE, both the encoder and the decoder networks have two layers; each layer is activated with the ReLU function. In practice, we have found that the final performance is robust to the neuron number of the hidden layer once it surpasses 500. Thus we set the neuron number of the hidden layer to 1,024 for both the encoder and decoder networks. The remaining parameters are tuned with a cross-validation procedure in which 20% of the seen classes are held out as the validation set, allowing us to choose the hyperparameters that maximize the accuracy on the validation set. Specifically, we have found that the proposed MAAE works well when the neuron number of the hidden layer of the discriminator is set to 64. The hyperparameters λ and µ are set to 0.01 and 0.001, respectively.
The model parameters are initialized with a Gaussian distribution (σ = 0.01) and optimized with the Adam solver with a cross-validated learning rate of 0.0001, using mini-batches of size 48. The model is implemented with the TensorFlow framework running on a Tesla K40 GPU. Given a set of hyperparameters, the training process takes around 10 minutes for each model on the AwA1 dataset. Our code will be released publicly.
We first conduct experiments on the traditional ZSL task and select sixteen approaches for comparison: SSE [4], LATEM [5], ALE [6], DEVISE [7], SJE [8], ESZSL [9], SAE [11], GFZSL [12], CVAE [52], CAUCHY [48], CDL [42], RELATION NET [15], GAZSL [23], CLSWGAN [18], AML [45] and SRAN [46]. All the competitors use the same features and the same experimental settings as ours. The comparison results are summarized in Table 2. From the results in Table 2, we observe that our MAAE achieves the best performance on all four datasets. Specifically, MAAE obtains 0.4% and 1.6% improvements over the second-best competitors on the AwA2 and aPY datasets, respectively, and ties for the best performance on AwA1 and SUN, demonstrating that the proposed architecture learns a more robust visual-semantic alignment for ZSL. Compared with SAE [11], which uses an encoder-decoder framework similar to our MAAE but with a linear model, MAAE obtains significant improvements on all four datasets, which indicates the effectiveness of the nonlinear and adversarial models. Compared with CLSWGAN + SM [18], which also applies an adversarial network to align the generated distribution and the real visual feature distribution, MAAE also achieves better performance, which indicates that both the semantic inference process and the designed regularizers contribute positively to the accuracy.
Besides, we observe that the generative approaches, i.e., GAZSL [23], CLSWGAN + SM [18], and the proposed MAAE, perform much better than the other competitors, which indicates the effectiveness of the generative strategies. This is because the generated visual features are more tightly centered around the corresponding real visual distribution, which has the potential to alleviate the "hubness" issue.

B. RESULTS OF THE GENERALIZED ZSL
We then conduct experiments on the generalized ZSL task and select sixteen competitors for comparison. From the results in Table 3, we observe that the proposed MAAE model outperforms the competitors under the realistic generalized ZSL task on three datasets. Taking the harmonic mean (H) metric as an example, our MAAE obtains superior results by a large margin against the competitors on the AwA1, AwA2 and aPY datasets, and is only outperformed by CLSWGAN + SM [18] on the SUN dataset. This indicates that the proposed model alleviates the seen-unseen bias under the generalized ZSL scenario better than the other competitors, i.e., it improves the performance on unseen classes while maintaining the performance on seen classes. Besides, we observe that the classification performance on the seen classes is much better than that on the unseen classes, which indicates that the generated pseudo visual features are unlikely to be as good as real visual features.

C. ABLATION STUDIES
To evaluate the impact of each regularization term, we conduct experiments on the AwA1 dataset with and without the different components. From the results in Table 5, we observe that the discrimination-preserving network, the semantics-preserving network and the distribution-preserving network all contribute to the performance improvement, which indicates that the three networks are all important for the model. In particular, when MAAE is trained without the distribution-preserving network, the accuracies T, u and H decrease more sharply than when the other components are removed, which indicates that the distribution-preserving network is a very effective component of the proposed MAAE model. Moreover, the other networks also help the generative network to generate more powerful visual features. Besides, the results indicate that the bidirectional nature of the model indeed improves on the baseline model.

D. T-SNE VISUALIZATION
To evaluate the quality of the generated visual features, we visualize the generated visual features of the AwA1 dataset with t-SNE. From Fig. 4, we can observe that the generated visual features capture the class distribution well and preserve the discriminative information.

E. IMPACTS OF CLASSIFIERS
For the seen classes, the classification may be performed with either the synthesized pseudo visual features or the ground-truth visual features. In this part, we conduct experiments on the four datasets to validate the impacts of different classifiers (i.e., NN and softmax) under different settings on the proposed MAAE. The NN classifier mostly evaluates the discriminative information of the synthesized visual features, while the softmax classifier evaluates both the discriminative information and the distribution information of the synthesized visual features. From the experimental results in Table 4, we have the following observations.
(1) The performances T on the traditional ZSL task with the softmax classifier are slightly better than those with the NN classifier on all datasets, which indicates that the synthesized pseudo visual features are discriminative enough to be classified. (2) The performances s on the seen classes with ground-truth visual features are better than those with the synthesized pseudo visual features, while the performances u on the unseen classes are correspondingly inferior. Besides, the harmonic mean performances H with generated visual features are more robust than those with ground-truth visual features. The bias indicates that the distribution of the synthesized pseudo visual features is still not as good as the real feature distribution. (3) The harmonic mean performances H with the NN classifier are more stable than those with the softmax classifier, which indicates that the distribution information of the synthesized visual features is more likely to cause the seen-unseen bias, and that the discriminative information contributes more to preserving the performances than the distribution information.

F. IMPACTS OF THE GENERATED SAMPLE NUMBER
In this section, we conduct experiments to evaluate the impact of the visual distribution on the classification performance. Specifically, we evaluate the average per-class Top-1 accuracy T of the proposed MAAE model on the traditional ZSL task while varying the number of generated samples for each unseen class. As illustrated in Fig. 5, we observe that the accuracies initially increase, reach their peaks, and then decline as the number of generated visual features per unseen class increases further. This indicates that the performance benefits from the visual distribution within a certain range. As the number of generated visual samples per class increases, the distribution information may attenuate the discriminative information. Besides, we observe that the performance on the SUN dataset is much more sensitive than that on the other three datasets. The reason is that SUN is a fine-grained dataset whose inter-class differences are small, so the discriminative information is more easily affected by the visual distribution.

V. CONCLUSION
In this paper, we have proposed a novel generative approach for ZSL that generates semantics-related and discriminative visual features. It formulates both the visual feature generation and the semantics inference into a cyclic encoder-decoder framework such that these two processes can improve each other. We have also added a classification network to regularize the generated visual features to be discriminative. Extensive experimental results show that the proposed approach achieves state-of-the-art performance on the traditional ZSL task and improves by a large margin on the generalized ZSL task under the harmonic mean metric.
ZHONG JI received the Ph.D. degree in signal and information processing from Tianjin University, Tianjin, China, in 2008. He is currently an Associate Professor with the School of Electrical and Information Engineering, Tianjin University. His current research interests include machine learning, computer vision, multimedia understanding, and video summarization.