No Adversaries to Zero-Shot Learning: Distilling an Ensemble of Gaussian Feature Generators

In zero-shot learning (ZSL), the task of recognizing unseen categories for which no training data are available, state-of-the-art methods generate visual features from semantic auxiliary information (e.g., attributes). In this work, we propose a valid alternative (simpler, yet better scoring) to fulfill the very same task. We observe that, if the first- and second-order statistics of the classes to be recognized were known, sampling from Gaussian distributions would synthesize visual features that are almost identical to the real ones for classification purposes. We propose a novel mathematical framework to estimate first- and second-order statistics, even for unseen classes: our framework builds upon prior compatibility functions for ZSL and does not require additional training. Endowed with such statistics, we take advantage of a pool of class-specific Gaussian distributions to solve the feature generation stage through sampling. We exploit an ensemble mechanism that aggregates a pool of softmax classifiers, each trained in a one-seen-class-out fashion to better balance the performance over seen and unseen classes. Neural distillation is finally applied to fuse the ensemble into a single architecture that can perform inference through a single forward pass. Our method, termed Distilled Ensemble of Gaussian Generators, compares favorably with state-of-the-art works.


I. INTRODUCTION
Zero-shot learning (ZSL) addresses the (image) classification problem of recognizing categories whose visual data are not available [39]. Precisely, while visual observations are typically present for a subset of (seen) classes, visual data of unseen classes are not available at training time: unseen classes are accessible through a textual description only (inductive ZSL setup [39]). The goal is then to take advantage of such "textual descriptions", properly called semantic embeddings. By means of manually-annotated attributes or distributed word embeddings, we can learn how to transfer a classifier trained on the seen classes to the unseen ones as well, ultimately recognizing classes never seen before. At the inference stage, test data can either belong to the unseen classes only ("standard" ZSL, or ZSL in short), or can also involve test instances from the seen classes ("generalized" ZSL, or GZSL in short). Right after its inception, zero-shot classification was first solved by means of a compatibility function Φ, which learns a matching score or a distance to compare visual features and semantic embeddings [39]. There is ample evidence [1], [4], [5], [6], [11], [13], [18], [21], [26], [33], [35], [36], [38] that compatibility functions are capable of matching the visual and semantic spaces, up to some misalignment (or error), for the unseen classes (standard ZSL), while suffering in the generalized setup [39].
As a solution, GZSL is nowadays tackled using a shallow softmax classifier trained on real visual features from the seen classes and synthetic visual features from the unseen ones [40]. The feature synthesis stage is typically accomplished by means of a Generative Adversarial Network (GAN) [2], [9], [12], [16], [19], [24], [34], [41], [48], which is more effective than compatibility functions but much harder to optimize [39]. In this work, we aim to provide a method that achieves a more stable training without losing classification performance. To this end, we propose a novel approach to feature generation, devising a new mathematical framework that casts a generic compatibility function pre-trained for ZSL into a feature generator, at no additional computational cost. Our method builds on the observation that, since a compatibility function Φ can align the visual and semantic spaces, it is also capable of aligning their first- and second-order statistics. Although such statistics would, in principle, require visual data to be computed, we propose a new strategy to sample the semantic space and, by means of Φ, generate surrogate visual features even for the unseen categories; our proposed mathematical framework guarantees that these surrogates are effective in approximating first- and second-order statistics (see the proof in the Supplementary Material, available online). Afterwards, the feature generation stage can be easily solved with plain sampling from a pool of class-specific Gaussian distributions (Fig. 1), and classification is then carried out by training a softmax classifier to predict both seen and unseen classes (Fig. 2).

Fig. 1. The hardest problem in zero-shot learning is to model categories without any corresponding visual data, leveraging only semantic attributes. In this paper, we exploit a compatibility function Φ (not shown here) which aligns the visual and semantic spaces. Through Φ, we show that we can build, in the visual space, class-specific Gaussian distributions (like N_panda) from which synthetic visual features can be generated through sampling. Notably, this is possible also for the unseen classes (marked in red). On top of the generated features, zero-shot learning is solved through an ensemble of weak softmax classifiers, each trained after a seen class has been held out. Neural distillation is finally applied to achieve computational efficiency at test time.
While doing so, we need to tackle a well-known problem in ZSL: hubness [7], [28], [32], [39]. It refers to the fact that, after the alignment by Φ, unseen classes may be wrongly mapped into the region of influence of some of the seen categories. In our case, the problem can be similarly cast: some of the Gaussian distributions of the unseen classes may be "absorbed" by some of the seen ones. To prevent this, we aggregate a pool of weak softmax classifiers, each of them trained using a one-seen-class-out strategy (named OSCO ensemble). In this manner, we force the removal of one of the Gaussian distributions from the feature space. Our experimental evidence suggests that this strategy is capable of re-balancing the performance between seen and unseen classes, improving zero-shot learning.
Given that the OSCO ensemble requires training a number of shallow softmax classifiers that scales linearly with the number of seen classes, several models would need to be evaluated at test time. To remove this computational overhead, we take advantage of neural distillation [15] to create a single softmax classifier subsuming the favorable generalization capabilities of the OSCO ensemble. This also allows inference with a single forward pass, as done in prior state-of-the-art generative ZSL methods [2], [9], [12], [16], [19], [24], [34], [40], [41], [48].
The rest of the paper is organized as follows. In Section II, we sketch relevant related work. In Section III, we present how we accommodate the feature generation stage without using a GAN. In Section IV, we provide a broad experimental validation of our method, including ablation studies and sensitivity analyses, while challenging the state-of-the-art methods. Conclusions of our study are finally drawn in Section V.

II. RELATED WORK
Synthesized Exemplars: Prior work has proposed methods to synthesize exemplars for the missing categories [4], [5], [6], [18], [22]. To do so, latent embeddings became a quite popular solution to combine semantic and visual properties into one representation [4], [5], [6], [18], [22]. In those works, the synthesized exemplars modify the classification boundary in a hand-crafted manner. Differently, in our work we directly exploit the generated visual features to train the classification model responsible for the decision, letting it free to arrange decision boundaries in a data-driven manner. Although some methods can estimate first-order statistics (such as [4] or [5]), they do so for auxiliary fake classes, which only helps towards better generalization. Differently, we approximate first- and second-order statistics for the actual classes to be recognized, no matter whether they are seen or unseen at training time.
Adversarial Feature Generation: Zero-shot classification can be solved through a softmax classifier trained on real visual features from the seen classes and on synthesized descriptors obtained by a GAN conditioned on the semantic embeddings of the unseen classes [40]. This idea has become mainstream, and several approaches have investigated different solutions to improve the class-conditioned GAN from the architectural standpoint: using variational inference to boost the generator [12], [16], [24], [34], [41], [48], cycle-consistency [9], or contrastive learning [19]. Episodic training was adopted to improve generalization [25], whereas confidence smoothing was also used for the same reason [2].
Adversarial training requires handling a saddle-point optimization which is, by definition, harder to solve than the "non"-adversarial case, in which a global/local minimum/maximum is sought. In this respect, our method is more stable (see the experimental validation and Fig. 4 in particular). Also, without any ad-hoc loss or module to double-check on that, a GAN may not be capable of generating good visual features for all categories, resulting in phenomena such as mode collapse [39]. By design, we posit that we totally circumvent this problem, since we allocate a specific Gaussian distribution to generate each single class, either seen or unseen, making sure that none of them is "disregarded". Overall, we deem that our method is a simple, yet valid (see Section IV-D), alternative to GAN-based approaches.

Fig. 2. Flow-chart of our proposed Distilled Ensemble of Gaussian Generators (DEGG) method. Given a pre-trained compatibility function, we propose a formalism to estimate the first- (Section III-B) and second-order statistics (Section III-C) μ̂_i and Σ̂_i referring to both seen (i ∈ S) and unseen (i ∈ U) classes. Our approximation is grounded on a formal mathematical foundation supporting its validity (see Supplementary Material, available online), which is furthermore corroborated by experimental verifications on real data (Section IV-B). Using μ_i and Σ_i, the stage of feature generation can be easily solved through sampling from a class-specific Gaussian distribution. As a result, for the seen/unseen class i, we are able to generate N synthetic visual features z (Section III-D and Algorithm 1). Using z, a softmax classifier is trained to solve zero-shot recognition (Section III-E1). To address the hubness problem [7], [28], [32], [39], for which unseen classes may be wrongly attracted towards the region of influence of the seen ones, we adopt an ensemble mechanism: we aggregate a pool of weak (softmax) classifiers, each of them trained in a One-Seen-Class-Out manner (OSCO Ensemble). In this way, we posit that we can "free" a portion of the visual space to better model the unseen classes which may be projected inside it (Section III-E2). Finally, we take advantage of neural distillation [15] to reduce the computational overhead of the ensemble at test time (Section III-E3). Light blue boxes denote modules dealing with the feature generation stage. The actual classification pipeline (Section III-E), represented in green, is capable of scoring a solid performance for both standard and generalized inductive zero-shot learning (Section IV-D). Best viewed in colors.
Gaussian methods for ZSL: A few prior methods exploit a formalism based on Gaussian distributions to tackle zero-shot classification. In [36], a mixture of Gaussian distributions is fitted on the seen classes, and a regression mechanism is responsible for inferring the mean vectors and covariance matrices of the Gaussian distributions related to the unseen classes. In [29], Gaussian distributions are adopted to retrieve latent embeddings which subsume the visual and the semantic properties of the classes to be recognized, so that inference is eased in this shared space. Note that, albeit exploiting Gaussian distributions, neither of these works [29], [36] adopts the feature generation strategy combined with a softmax classifier, and this translates into inferior performance with respect to more recent approaches (including ours). Also, although [36] provides a regression approach to infer first- and second-order statistics, our formalism is still superior: in fact, we do not need any explicit extra training to perform such an estimation (see Section IV-B and the Supplementary Material, available online). Differently, we will show that a compatibility function can be used as is to estimate such statistics: our formalism is therefore more general, also yielding superior performance (see Section IV-D).
Ensemble methods: Although ensemble mechanisms have previously been investigated for few-shot learning [8], we are among the first works, together with [10], [43], to exploit an ensemble for ZSL. Differently from us, [10] considers the easier (and, unfortunately, not comparable) transductive ZSL variant: unlabelled images from unseen classes are accessible at training time. In [10], a regression network is used to "tag" the visual data with the corresponding attributes. Such a model is paired with a GAN which generates visual features that are passed to an ensemble of multi-modal classifiers, each of them trained on the same chunk of data. Differently, our generation stage is simpler, yet effective (Tables III and IV). Furthermore, our ensemble mechanism is intended to solve a well-known problem in ZSL (hubness [7], [28], [32], [39]), whereas the ensemble method in [10] seems more standard and, thus, less tailored to the problem of interest (see the Supplementary Material, available online).
Neural Distillation: To the best of our knowledge, our work is the first to apply knowledge distillation [15] to zero-shot learning. Although some works do tackle knowledge distillation, they only target vanilla classification, e.g., [30], or few-shot regimes [8] in which visual data can be limited, but are always available, for all the categories. Therefore, in those works, the teacher is always assumed to have access to all (or, at least, some of) the features of the classes to be recognized, while the student only processes a few of them. In our case, instead, not only the student network, but also the teacher, never processes a single real visual descriptor from the unseen classes. Nevertheless, we show the effectiveness of neural distillation in coping with this more challenging setup (see Section IV-D).

III. OUR METHOD: DEGG
In this Section, we present our proposed computational pipeline named Distilled Ensemble of Gaussian Generators (DEGG), and visualized in Fig. 2.

A. Background on Compatibility Functions for ZSL
In zero-shot learning, a compatibility function Φ evaluates the degree of compatibility Φ(x, a) between a visual feature x and a semantic embedding a which is in (known) one-to-one correspondence with a visual category one needs to recognize [39].
The choice of the compatibility function is one degree of freedom left to ZSL practitioners, also depending on which regularization methods are used to better generalize towards unseen classes while training on the seen ones only [39]. The three most common families are

Φ(x, a) = x^⊤ W a (inner product),
Φ(x, a) = ||x − Wa||_2^2 (sem-to-vis mapping),
Φ(x, a) = ||W^⊤ x − a||_2^2 (vis-to-sem mapping), (1)

and the goal of training is to achieve Φ(x, a) = 1 in the inner product case, and Φ(x, a) = 0 in the distance-based sem-to-vis and vis-to-sem cases, whenever x and a refer to the same class.

To make our proposed approach as general as possible, being able to turn a generic compatibility function Φ into a feature generator even if Φ has not been explicitly trained to do so, we use the notation Φ to denote that any of the three cases defined in (1) is covered by our mathematical formalism (presented in the remainder of this Section and supported by a proof in the Supplementary Material, available online). Therefore, our approach is general enough to cover all the cases (inner product, sem-to-vis mapping, and vis-to-sem mapping) usually encompassed by prior compatibility functions proposed in the literature [39].
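As a rough sketch, the three families in (1) can be written down directly; the dimensions (2048-dimensional visual features, 85 attributes) and the random matrix standing in for a trained W are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, nu = 2048, 85                     # assumed visual / semantic dimensions
W = rng.standard_normal((d, nu)) * 0.01  # stand-in for a trained weight matrix
x = rng.standard_normal(d)           # a visual feature
a = rng.standard_normal(nu)          # a semantic embedding (attribute vector)

def phi_inner(x, a, W):
    # inner-product compatibility: trained towards 1 for matching pairs
    return float(x @ W @ a)

def phi_sem2vis(x, a, W):
    # semantic-to-visual mapping: squared distance, trained towards 0
    return float(np.sum((x - W @ a) ** 2))

def phi_vis2sem(x, a, W):
    # visual-to-semantic mapping: squared distance, trained towards 0
    return float(np.sum((W.T @ x - a) ** 2))
```

All three share the single weight matrix W, which is what lets the same formalism cover every case in (1).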

B. First-Order Statistics
In this Section, we will explain how to approximate firstorder statistics from an intuitive and operative standpoint. The complete mathematical derivation of our method is reported in the Supplementary Material, available online.
Intuition: Let us assume that the compatibility function Φ falls into the inner product case of (1), that is, Φ(x, a) = x^⊤ Wa. The remaining semantic-to-visual Φ(x, a) = ||x − Wa||_2^2 and visual-to-semantic Φ(x, a) = ||W^⊤ x − a||_2^2 cases follow a very similar argument. The weights W of Φ are optimized in such a way that Φ(x, a) = 1 when the visual feature x and the semantic embedding a refer to the same class; that is, Φ(x, a) = 1 if and only if a and x belong to the same class.
As a consequence, it is straightforward to exploit the bilinear nature of Φ to elicit that the vector v := Wa in the feature space is aligned with x (i.e., it is a vector with the same direction, but potentially different magnitude). This is indeed true because 1 = Φ(x, a) = x^⊤ Wa = x^⊤ v, and the previous relationship holds for every x of the class encoded by a. Therefore, thanks to the (bi-)linearity of Φ, the vector v must be aligned with μ, the mean of all feature vectors x whose (ground-truth) class label is encoded by the semantic embedding a. Hence, up to a scaling factor, v = Wa is the estimate of the first-order statistics that we are looking for. Details on how to resolve this scale ambiguity are given in Algorithm 1 and Section III-D. Additionally, the mathematical formalization of the previous arguments is available in the Supplementary Material, available online, covering the inner product, the semantic-to-visual mapping, and the visual-to-semantic mapping cases.
A formula for first-order statistics: Following the previous arguments, we propose a formula to estimate the first-order statistics for a generic class represented by the semantic embedding â:

μ̂_â = W â (inner product and sem-to-vis mapping cases), μ̂_â = (W^⊤)^† â (vis-to-sem mapping case). (2)

Please note that, in (2), the inner product and the sem-to-vis mapping cases yield the same estimate for μ̂_â, given (1). We leverage the compatibility function Φ as in (1) through the alignment that Φ is able to achieve between the semantic and visual spaces. Since Φ matches the visual features x (belonging to the class represented by a) with a in the semantic space, we can conjecture that the centroid, i.e., the most prototypical of those feature vectors, will be mapped to a as well. Note that the compatibility function Φ always allows computing the inverse image of a: in fact, using (1), it is always possible to map a generic set of attributes or semantic embeddings into the visual space. Therefore, we can estimate the first-order statistics of a class, whose textual semantics is known, even if not a single visual descriptor is available.
Please refer to Algorithm 1 to see how (2) is used in our proposed feature generation stage.
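To make the argument above concrete, the following toy sketch (random W and synthetic class features; not the paper's code) checks that Wa recovers the direction of the class mean when the features of a class are scattered around Wa:

```python
import numpy as np

rng = np.random.default_rng(1)
d, nu = 32, 8                       # toy visual / semantic dimensions
W = rng.standard_normal((d, nu))    # stand-in for a trained compatibility matrix

def estimate_mean(a, W):
    # mu_hat = W a, valid up to a scale factor (inner-product / sem-to-vis cases of (2))
    return W @ a

a = rng.standard_normal(nu)
# synthetic class: 500 features scattered around the centroid W a
X = W @ a + 0.05 * rng.standard_normal((500, d))
mu_emp = X.mean(axis=0)             # empirical first-order statistics
mu_hat = estimate_mean(a, W)        # statistics estimated without touching X

# cosine similarity: alignment matters, magnitude is fixed later (Section III-D)
cos = float(mu_emp @ mu_hat / (np.linalg.norm(mu_emp) * np.linalg.norm(mu_hat)))
```

The cosine similarity is close to 1, illustrating why only the scale, and not the direction, needs to be corrected afterwards.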

C. Second-Order Statistics
To infer Gaussian distributions, we need not only first-order statistics, but second-order ones as well.
Intuition: Let us assume that we are given x_1, ..., x_N ∈ R^d, i.e., N visual features belonging to the class represented by the semantic embedding a. Then, the standard deviation σ̂ can be approximated with the usual bias-corrected sampling estimator

σ̂ ⊙ σ̂ = (1 / (N − 1)) Σ_{j=1}^{N} (x_j − x̄) ⊙ (x_j − x̄), (3)

where ⊙ is the element-wise Hadamard product and x̄ = (1/N) Σ_{j=1}^{N} x_j is the sample mean. Note that, in (3), we measure second-order statistics by focusing on the component-wise standard deviation only, ignoring the correlation between different components. This assumption is not actually restrictive, since we can always enforce it in practice by using Principal Component Analysis (PCA) to perform decorrelation. As in (3), the estimation of the standard deviation is, on the one hand, possible when the visual features x_j are given while, on the other hand, it does not depend on a. Consequently, we can use (3) only for the seen classes. Here, however, we argue that it is possible to take advantage of the knowledge of a to generate, by means of Φ, surrogate visual features replacing the x_j in (3). The next paragraph explains how we can estimate second-order statistics even if we are totally deprived of any unseen visual features.
A formula for second-order statistics: We capitalize on the fact that, after the optimization of its weights W, the compatibility function Φ aligns the visual and the semantic spaces. As a result, for any semantic embedding a, we can retrieve the visual feature it is matched with in the visual space: we claim that the latter is Wa for both the inner product and sem-to-vis mapping cases, and (W^⊤)^† a for the vis-to-sem mapping case (see the Supplementary Material, available online).
Setting attribute components to zero: a way to estimate standard deviations in the visual space: Let us define a_{ℓ→0} as the ν-dimensional vector obtained from a ∈ R^ν by setting its ℓ-th component to 0, ℓ = 1, ..., ν. Using a pre-trained compatibility function Φ, we can map a_{ℓ→0} from the semantic to the visual space in the following manner:

x^{(ℓ)} = W a_{ℓ→0}, ℓ = 1, ..., ν, (5)

(with (W^⊤)^† a_{ℓ→0} in the vis-to-sem mapping case). We can interpret x^{(ℓ)} as the most prototypical visual descriptor (i.e., the centroid) of the class which corresponds to a_{ℓ→0}. Therefore, we are exploring the semantic space, starting from the class represented by a, and sampling from it new surrogate classes encoded by all the attributes in a but one. For instance, we start from the class "zebra" and we generate a new class corresponding to a "polka-dotted zebra", in which the attribute "striped" has been suppressed (see Fig. 3).
We posit that these surrogate semantic classes contribute to the decision boundary between two arbitrary (seen/unseen) classes in both the visual and semantic spaces. In fact, let us consider two classes i and j, represented by the semantic embeddings a_i and a_j, and let us assume that a_i and a_j differ only by a single attribute, stored in their ℓ-th component. That is, a_{i,ℓ→0} = a_{j,ℓ→0}: zeroing the ℓ-th component yields a surrogate class in the semantic space which should be one of the most mistakable classes for both i and j. Consequently, we propose the following expression to approximate the standard deviation σ̂_a related to the class represented by the semantic embedding â:

σ̂_a ⊙ σ̂_a = (1 / (ν − 1)) Σ_{ℓ=1}^{ν} (x^{(ℓ)} − x̄) ⊙ (x^{(ℓ)} − x̄), (6)

where x̄ = (1/ν) Σ_{ℓ=1}^{ν} x^{(ℓ)} is the mean of the surrogate visual features from (5). To see how (6) can be useful to accomplish feature generation for our computational pipeline, the reader can refer to Algorithm 1. Additional details on setting a single attribute component to zero are available in the Supplementary Material, available online.

Algorithm 1: Pseudocode for Feature Generation.
Scale correction (8)
17: return σ̂_a for all a ∈ A_S ∪ A_U
18: procedure GAUSSIAN SAMPLING(μ̂_a, σ̂_a)
19: for all a ∈ A_U do
20: Sample N visual features from a d-dimensional Gaussian distribution N(μ̂_a, Σ̂_a) with expected value μ̂_a and covariance matrix Σ̂_a = diag(σ̂_a ⊙ σ̂_a)
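A minimal sketch of (5)-(6) follows; the toy dimensions and random W are assumptions, while the attribute suppression and the component-wise bias-corrected estimator are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
d, nu = 32, 8                       # toy visual / semantic dimensions
W = rng.standard_normal((d, nu))    # stand-in for a trained compatibility matrix
a = rng.standard_normal(nu)         # semantic embedding of the class of interest

def surrogate_features(a, W):
    # x^(l) = W a_{l->0}: one surrogate centroid per suppressed attribute, as in (5)
    A = np.tile(a, (a.shape[0], 1)) # nu copies of a, one per row
    np.fill_diagonal(A, 0.0)        # row l holds a_{l->0}
    return A @ W.T                  # shape (nu, d): one surrogate per row

def estimate_std(a, W):
    # component-wise, bias-corrected standard deviation over the surrogates, as in (6)
    return surrogate_features(a, W).std(axis=0, ddof=1)

sigma_hat = estimate_std(a, W)      # second-order statistics without any visual data
```

Note that no visual descriptor of the class enters the computation: the spread is measured entirely over the ν surrogate centroids.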

D. Feature Generation: A Pseudocode
In this Section, we summarize our proposed approach to generate class-conditioned visual features, building upon the tools from Sections III-B and III-C. We formalize it as Algorithm 1, which corresponds to the light blue boxes in Fig. 2.
For the seen classes (a ∈ A_S), we have access to their visual descriptors; therefore, first- and second-order statistics can simply be estimated through (4) and (8), respectively. For the unseen classes (a ∈ A_U), we circumvent the non-availability of visual descriptors through (2) and (6), respectively.
Let us stress that, for the computation of μ̂_a, we only need a multiplication between the weight matrix W and the semantic embedding a (see (2)). The weights W are the output of the pre-training stage of the compatibility function Φ, using one of the off-the-shelf approaches available in the literature [40]. The semantic embedding a is assumed to be known also for the unseen classes (a ∈ A_U), so (2) lets us estimate the first-order statistics of the visual features of a class for which no visual descriptor is available.
To estimate the second-order statistics of the unseen classes (a ∈ A_U), we sample the semantic space: for each semantic embedding a, we generate the surrogate classes corresponding to a_{1→0}, ..., a_{ℓ→0}, ..., a_{ν→0}. We then use Φ to project the semantic embeddings a_{ℓ→0}, ℓ = 1, ..., ν, into the visual space, obtaining the surrogate visual features x^{(ℓ)} as in (5). These surrogate visual descriptors are adopted to estimate the standard deviation using (6). Before sampling from a Gaussian distribution using the inferred first- and second-order statistics, we can make the estimation more robust by averaging the obtained μ̂ and σ̂ across multiple compatibility functions. Additional details and discussions are available in the Supplementary Material, available online.
Since the approximation in (2) of the first-order statistics is valid only up to a scale factor, we adopt a very simple, yet effective, solution. For μ̂_a and σ̂_a with a ∈ A_S, we unit-normalize their norms. For the unseen classes, a ∈ A_U, we compute rescaling factors as similarity-weighted combinations of the norms ||μ̂_b|| and ||σ̂_b|| of the statistics estimated on the seen classes (8). In this way, we adaptively re-scale through a weighted similarity of the norms of the first- and second-order statistics computed from the seen classes, where the weights are defined in terms of the semantic matching estimated by the dot product between a generic a ∈ A_U and all b ∈ A_S.
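The sampling step that closes Algorithm 1 can be sketched as follows; μ̂, σ̂, and the dimensions are illustrative stand-ins, while the diagonal-covariance Gaussian matches what the algorithm states.

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 32, 10000                        # toy feature dimension / features per class
mu_hat = rng.standard_normal(d)         # stand-in for the estimated class mean
sigma_hat = 0.5 + np.abs(rng.standard_normal(d))  # stand-in for the estimated std

def sample_features(mu_hat, sigma_hat, N, rng):
    # N(mu_hat, diag(sigma_hat^2)): a diagonal covariance allows independent
    # per-component sampling, as in the GAUSSIAN SAMPLING procedure
    return mu_hat + sigma_hat * rng.standard_normal((N, mu_hat.shape[0]))

Z = sample_features(mu_hat, sigma_hat, N, rng)  # synthetic visual features
```

The same call serves every class, seen or unseen; only the (μ̂_a, σ̂_a) pair changes, which is why no class can be "disregarded" by the generator.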

E. The Classification Pipeline
We describe here our classification pipeline (Fig. 2, green boxes), which uses the first- and second-order statistics inferred in the previous Section (Fig. 2, light blue boxes). The classification pipeline is composed of three stages, briefly summarized as follows: 1) A softmax classifier is trained over the visual features generated using Algorithm 1, for both seen and unseen classes (Section III-E1). 2) To re-balance the classification confidence and mitigate the hubness problem [39], we adopt an ensemble aggregating weaker classifiers, each of them trained on all seen classes except one (Section III-E2). 3) To get rid of the computational overhead at test time, we use neural distillation to condense our ensemble into a single network, so that inference can be solved by one forward pass only (Section III-E3). The reader can refer to the Supplementary Material, available online, for additional implementation details.

1) Softmax Classifiers for ZSL:
We improve over the mainstream paradigm of [40] in solving zero-shot learning with a shallow softmax classifier, once the unavailability of visual descriptors for the unseen classes is resolved through feature generation.
Actually, we differ from [40] and cognate literature [2], [9], [12], [16], [19], [24], [34], [41], [48] in one technical, but important, detail. Although conditional GAN-based generating schemes are capable of synthesizing features for both seen and unseen classes, the synthetic visual features of the seen classes are never used to train the softmax classifier; instead, for the seen classes, the real features (e.g., extracted from a ResNet101 network [39]) are adopted. We postulate that, in our case, such a mixed scheme would cause training instability, since the classifier would be trained on features sampled from two different source distributions. We provide evidence of the superiority of our design choice in the Supplementary Material, Table 6, available online.
In our work, instead, we train our softmax classifier on generated visual features only. Additionally, we can generate a balanced training set, ensuring that each category is represented by the same number N of visual features. Furthermore, we are sure that the distribution from which we sample our training set is of the same form (i.e., a mixture of Gaussian distributions) across seen and unseen classes. One may question whether the usage of generated features, instead of real ones, can compromise the recognition of seen classes; however, according to our experimental evidence (see Section IV-B), this is not the case.
2) One-Seen-Class-Out Ensemble: Because the strategy to estimate μ̂_a and σ̂_a differs depending on whether a ∈ A_S or a ∈ A_U (see Algorithm 1), some of our (unseen) class-specific Gaussian distributions may not be coherent with the other (seen) ones. More precisely, the region of the feature space pertaining to the Gaussian distribution N_a of the unseen class a ∈ A_U may be "occupied" by some of the Gaussian distributions N_b, N_{b'}, N_{b''}, ... of other seen classes b, b', b'', ... ∈ A_S. This problem is known in the literature as hubness [39].
We propose to mitigate this problem by cyclically removing one of the Gaussian distributions N_b, b ∈ A_S, from the training stage. This corresponds to having a pool of (weak) softmax classifiers, each of them trained on all classes except a single seen one. We conjecture that the removal of the Gaussian distribution N_b, b ∈ A_S, can create "space" for the Gaussian distributions N_a, N_{a'}, ... of the unseen classes a, a' ∈ A_U, which will then no longer be confused with the seen class we remove. In principle, "zebras" can be better recognized if we do not consider "dalmatians", avoiding errors related to the similar colorization; likewise, "blue whales" could be better recognized after we discard "killer whales", avoiding errors due to the similar shape. At the same time, when we remove a single seen class from the training set, we still preserve a sufficiently rich amount of information to remain capable of transferring knowledge from seen to unseen classes.
When we train such a pool of weak softmax classifiers, we still allocate a number of bins in the softmax operator (9) which equals the total number of seen and unseen classes, #A_S + #A_U. Thus, the bin corresponding to the seen class we are removing is not optimized while training the classifier, and this lowers the confidence in predicting the removed seen class. In this way, we can re-balance the confidence of all the other classes as well, especially the unseen ones (see Section IV-C), ensuring that any confusion with the held-out class is eliminated in a hard-coded way. Our design choice also has a favorable property: we can straightforwardly aggregate the predictions of all such weak softmax classifiers with a trivial averaging operation, whereas removing the non-optimized bin would have made combining the different classifiers more difficult.
We term this procedure One-Seen-Class-Out Ensemble, or OSCO Ensemble, in short.
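A toy sketch of the OSCO idea follows. Training is mocked with class-mean templates rather than the paper's cross-entropy SGD, and the class layout and dimensions are invented; what is faithful to the text is the allocated-but-untrained softmax bin for the held-out class and the trivial averaging of the members' predictions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(4)
d, n_seen, n_unseen = 16, 5, 3
C = n_seen + n_unseen                  # total softmax bins, #A_S + #A_U
mus = 5.0 * np.eye(C, d)               # one well-separated centroid per class (toy)
Z = np.concatenate([mu + 0.1 * rng.standard_normal((50, d)) for mu in mus])
y = np.repeat(np.arange(C), 50)        # labels of the generated features

def train_weak_classifier(Z, y, held_out):
    # mock training: class-mean templates; the held-out seen class keeps an
    # allocated but non-optimized (all-zero) row, i.e., a low-confidence bin
    U = np.zeros((C, d))
    for c in range(C):
        if c != held_out:
            U[c] = Z[y == c].mean(axis=0)
    return U

ensemble = [train_weak_classifier(Z, y, b) for b in range(n_seen)]

x_test = mus[C - 1] + 0.1 * rng.standard_normal(d)            # unseen-class instance
p = np.mean([softmax(U @ x_test) for U in ensemble], axis=0)  # trivial averaging
pred = int(np.argmax(p))
```

Because every member keeps all #A_S + #A_U bins, the averaged probability vectors are directly comparable, which is the favorable property noted above.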
3) Distillation: For the OSCO ensemble, we need to train a number of classifiers which equals the number of seen classes, #A_S. Although the training stage of each such softmax classifier can be easily parallelized, our ensemble is still affected by the fact that the inference stage cannot be solved in one (forward) step only, but needs to involve multiple models (here, #A_S).
We propose to solve this issue by means of neural distillation [15], aiming at preserving the beneficial effect of ensembling (see Section IV-C), while also recovering the possibility of doing inference with a single forward operation. To this end, we adopt two different strategies, namely, Linear and Teacher-Student Distillation.
Linear Distillation: Let U_a be the matrix of size (#A_S + #A_U) × d storing the weights of the softmax classifier trained on all classes except the seen class a ∈ A_S. Even if the classifier adopts a softmax non-linearity s, as defined in (9), and uses a cross-entropy loss L for training, inference over a test instance x can be done as arg max U_a x: we compute a max operation over the entries of U_a x, which are in 1-to-1 correspondence with the classes to be recognized (both seen and unseen).
We propose a linear distillation method in which we average U_a x over all a ∈ A_S, computing ū = (1/#A_S) Σ_{a∈A_S} U_a x, and solve the classification as arg max ū. In this manner, every test instance x needs only a single forward step to be classified: this amounts to combining the weights of the softmax classifiers before applying the softmax, as opposed to averaging their outputs.
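A minimal sketch of this weight-level fusion, under the assumption that each classifier is linear up to the softmax (names are illustrative, not from the paper):

```python
import numpy as np

def linear_distillation(weight_mats):
    """Fuse the OSCO ensemble into a single linear classifier by
    averaging the weight matrices U_a before the softmax; inference
    then reduces to an arg max over (mean U) @ x."""
    return np.mean(np.stack(weight_mats), axis=0)

def predict(U_bar, x):
    """Single-forward-pass classification with the fused weights."""
    return int(np.argmax(U_bar @ x))
```

Since the softmax is monotone, arg max over the averaged logits ū equals arg max over s(ū), so the non-linearity can be dropped at inference time.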
Teacher-Student Distillation: As an alternative strategy for neural distillation [8], [15], we propose to train a new softmax classifier, parametrized by the (#A_S + #A_U) × d weight matrix V, by optimizing the following objective

λ L(y, s(Vx)) + (1 − λ) L( (1/#A_S) Σ_{a∈A_S} p_a(x), s(Vx) ).   (11)

In (11), we have a convex combination (controlled by λ) of two cross-entropy losses L. The first loss optimizes the network predictions s(Vx) so as to make them equal to the ground-truth label of x (encoded by the one-hot vector y, as usual). The second cross-entropy loss tries to match the very same predictions s(Vx) with the average prediction (1/#A_S) Σ_{a∈A_S} p_a(x) of the weak softmax classifiers (i.e., the OSCO ensemble). In the case λ = 1, the second addend in (11) is not considered, and when we minimize the objective with respect to V over the pool of synthetic visual features extracted using Algorithm 1, we are training one of the softmax classifiers, as we did in Section III-E1. Differently, when λ < 1, we exploit the ensemble of weak softmax classifiers (Section III-E2) as a teacher to better guide the learning of the student model x → s(Vx) towards a better zero-shot recognition.
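The objective in (11) can be sketched for a single training instance as follows; `teacher_probs` stands for the ensemble average (1/#A_S) Σ_{a∈A_S} p_a(x), and all names are illustrative rather than the authors' code:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax (assumed form of the non-linearity s)."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(y_onehot, teacher_probs, student_logits, lam=0.5):
    """Convex combination of the two cross-entropy terms in (11):
    lam * L(ground truth, student) + (1 - lam) * L(teacher average, student).
    student_logits is Vx; a small epsilon guards the logarithm."""
    s = softmax(student_logits)
    ce_gt = -np.sum(y_onehot * np.log(s + 1e-12))
    ce_teacher = -np.sum(teacher_probs * np.log(s + 1e-12))
    return lam * ce_gt + (1.0 - lam) * ce_teacher
```

With λ = 1 the teacher term vanishes and the loss reduces to plain cross-entropy against the ground truth, matching the degenerate case discussed above.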

A. Datasets and Benchmarks
We run experiments on the following popular image classification benchmarks for ZSL: Animals with Attributes [39] (of which we consider both the first and second releases, AwA1 and AwA2), the 2011 version of Caltech-UCSD Birds 200 (CUB) [37], Scene Understanding (SUN) [42], the PASCAL VOC 2008 benchmark augmented by Yahoo data (aPY) [44], and Oxford Flowers 102 (FLO) [31]. On each benchmark, we follow the recommended splits [39]: the reader can refer to the Supplementary Material, available online, for additional details on that, as well as for the statistics and some exemplar images from the considered datasets. While providing quantitative performance in ZSL or GZSL, we stick to classical error metrics [39]. For ZSL, we consider top-1 classification accuracy over unseen classes. For GZSL, we provide the mean per-class accuracies a_s and a_u over seen and unseen classes, respectively. We also summarize the performance in GZSL with the harmonic mean H = 2·a_s·a_u/(a_s + a_u).
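For reference, the GZSL summary metric can be computed as follows (a trivial sketch, function name ours):

```python
def harmonic_mean(a_s, a_u):
    """GZSL summary metric: H = 2 * a_s * a_u / (a_s + a_u),
    the harmonic mean of seen and unseen per-class accuracies."""
    return 2.0 * a_s * a_u / (a_s + a_u)
```

Unlike the arithmetic mean, H is dragged down by whichever of the two accuracies is lower, which is why it is the standard way to penalize methods that overfit the seen classes.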

B. Validation of the Gaussian Approximation
Assuming that we have an oracle providing first- and second-order statistics for all the classes, we seek to answer the following question: what is the effect of replacing standard visual features with Gaussian-generated ones?
We set up the following experimental validation. To encode images, we consider the classical ResNet-101 features shared by [39]. We then use them as our "oracle" solution to compute first- and second-order statistics for the categories to classify, using (4) for the mean μ and (3) for the standard deviation σ. When seeking to replace "real" ResNet-101 features with Gaussian-generated ones, we consider the Gaussian distribution centered in μ, whose covariance matrix is diagonal and set to the entry-wise squared standard deviation σ ⊙ σ. Afterwards, feature generation is simply accomplished by sampling from a pool of class-specific Gaussian distributions. To provide a reliable plug-in replacement for visual features, we substitute real descriptors with Gaussian-generated ones while making sure that the number of instances per class does not change. The results of this procedure are reported in Fig. 4 and in Table I. In Fig. 4, we provide the learning curves and the classification performance of a softmax classifier always tested on real ResNet-101 features, while trained using three different setups. First, we train only on real ResNet-101 features: we denote this case by p = 1, where p denotes the proportion of real versus generated features. Then, we remove half of the real features, replacing them with Gaussian ones, i.e., p = 0.5. We also consider the more challenging setup in which training is performed on Gaussian features only (p = 0). Fig. 4 clearly shows that there is no factual difference when swapping real training data with Gaussian-generated ones, and this applies to both training and testing performance.

Fig. 4. Learning curves: Accuracy (in green) and loss curves (in orange) are provided for both training and test set (continuous versus dotted lines, respectively) for the aPY dataset (similar trends are registered on other datasets, see Table I). We perform the experiments using the test cases p = 0, p = 0.5 and p = 1 (see Section IV-B) in which, from left to right, we gradually replace (until we fully remove) real with generated features for the sake of training a softmax classifier. Evaluation is still done on real features at test time. No major changes are registered, showing that the Gaussian approximation is effectively sound.
Furthermore, we provide a quantitative evaluation of performance on all the considered benchmark datasets (Table I). In most of the cases (AwA1, AwA2, SUN, aPY and FLO), the classification accuracy tends to remain stable as p varies. Notably, there are cases where training on Gaussian-generated features is even better. A possible explanation is the "cleaner" nature of Gaussian-generated samples in comparison to real descriptors, which remain discriminative thanks to the oracle knowledge of first- and second-order statistics.
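The sampling procedure above reduces to drawing from class-specific diagonal Gaussians; a minimal sketch (the function name and seed are our own choices):

```python
import numpy as np

def generate_features(mu, sigma, n, seed=0):
    """Sample n synthetic visual features for one class from a Gaussian
    centered in mu with diagonal covariance, i.e. entry-wise variance
    sigma**2 (sigma being the per-dimension standard deviation)."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=mu, scale=sigma, size=(n, mu.shape[0]))
```

Repeating this per class, with (mu, sigma) taken either from the oracle or from the compatibility-function estimates, yields the pool of synthetic descriptors used in place of real ResNet-101 features.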
Numerical validation of our approximation: Since, in a practical ZSL scenario, we cannot exploit any oracle to infer the statistics that we need, we demonstrate the effectiveness of our approximation in the following.
We consider a broad pool of approaches to train a compatibility function: the one from the relevant "Embarrassingly Simple Zero-Shot Learning" (ESZSL) by Romera-Paredes et al. [33], the attribute label embedding (ALE) mechanism of Akata et al. [1] and its latent extension latEM by Xian et al. [38], and the synthesized classifiers by Changpinyo et al. [4], considering the three variants Sync_CS, Sync_St and Sync_OVO corresponding to the different max-margin training strategies adopted (please see [4] for additional details). We also consider ZSL by an exponential family of distributions (EFZSL) by Verma et al. [36], the semantic auto-encoder (SAE) proposed by Kodirov and Gong [21], and the coupled dictionary learning (CDL) technique by Jiang et al. [18]. Please note that, despite the linear form of the compatibility function, the training strategy adopted can nevertheless model non-linear patterns.

Fig. 5. Quantitative check on the approximation of the first-order and second-order statistics for the AwA1 dataset. For each class, we concatenate our approximations of first- and second-order statistics, inferred using a compatibility function, with the "oracle" statistics obtained using the sample mean and sample standard deviation of the corresponding ResNet-101 features by [39]. We then plot the pairwise Euclidean distance between the estimate for the i-th class (row) and the oracle for the j-th class (column): dark blue corresponds to zero distance, while yellow corresponds to a significant difference, which is registered, as desired, off the diagonal where different classes are compared. Additional (and bigger) visualizations are in the Supplementary Material, available online.
For each of ESZSL, ALE, Sync, EFZSL, SAE and CDL, we exploit publicly available code to obtain W while using the splits proposed by [39]. Once W is obtained, we infer first- and second-order statistics (using (2) and (6)). For the sake of the evaluation, we then concatenate the estimated μ and σ, comparing them with the "oracle" ones, that is, the ones inferred from ResNet-101 features using (4) and (3) (see Section IV-B). We then exploit the Euclidean distance to compare all possible pairwise combinations between "approximated" and "oracle" vectors. We expect that, across all possible pairings, the Euclidean distance is minimized if and only if we compare statistics referring to the same class. We can easily check this in Fig. 5: when plotting the matrices whose (i, j)-th entry provides the Euclidean distance between the class-i estimate and the class-j oracle, we expect to see a "diagonal pattern" in which the distance is minimized only on the diagonal. Effectively, this is what we observe in Fig. 5, showing that our approach is reliable in inferring class statistics.
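The "diagonal pattern" check behind Fig. 5 can be sketched as follows (a hypothetical helper, not the authors' code):

```python
import numpy as np

def pairwise_check(estimates, oracles):
    """D[i, j] = ||estimate_i - oracle_j||_2 between concatenated
    [mu; sigma] vectors. A 'diagonal pattern' (the row-wise argmin
    falling on the diagonal) indicates that each class estimate is
    closest to its own class oracle."""
    D = np.linalg.norm(estimates[:, None, :] - oracles[None, :, :], axis=-1)
    diagonal_ok = bool(np.all(D.argmin(axis=1) == np.arange(D.shape[0])))
    return D, diagonal_ok
```

Here `estimates` and `oracles` are both (number of classes) × 2d arrays; plotting D as a heatmap reproduces the kind of visualization shown in Fig. 5.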
To sum up, our approximation proves reliable: the statistics it infers closely match the "oracle" ones.

C. Ablation Study
We provide an ablation study to assess the impact on ZSL and GZSL performance of the different compatibility functions exploited to infer the first- and second-order statistics of seen/unseen classes. This step is followed by our proposed Gaussian generation, which provides the visual descriptors used to train a softmax classifier (in this Section, we call this method GG for the sake of our ablation study). Afterwards, we apply our One-Seen-Class-Out (OSCO) ensemble. The main results of this ablation study, on all the datasets from Section IV-A, are reported in Table II, while other complementary results are available in the Supplementary Material, Section 6, available online.

Discussion. In standard ZSL, the usage of a softmax classifier trained on Gaussian-generated features (GG) usually improves upon the compatibility function itself used as a ZSL classifier: despite a few controlled cases in which we almost replicate the performance of our baseline (e.g., ALE on AwA2), and in spite of two single cases in which seen classes are severely over-fitted (Sync_CS on FLO, SAE on CUB), in the sharp majority of the cases GG is better than the baseline compatibility function, often by a margin (e.g., +7% w.r.t. SAE on SUN, +6.5% w.r.t. Sync_OVO on aPY). In Table II of the Supplementary Material, available online, we show that compatibility functions are complementary in boosting the feature generation, since removing even one of them always yields a drop in performance.
In GZSL, feature generation is able to improve the classification scores as well, but this improvement is limited to the seen classes only (GG only increases a_s over the baseline). The role of feature generation is also evident in Table 3 of the Supplementary Material, available online, showing its responsibility for the major improvements in performance we scored. To improve a_u over the unseen classes as well, our idea of creating an ensemble in a one-seen-class-out fashion comes into play. The effect of the OSCO ensemble is to re-balance the performance obtained by GG, so that the performance over unseen classes increases while the one over seen classes slightly decreases: as a matter of fact, we reduce the tendency to overfit them. In all cases reported in Table II, the proposed OSCO ensembling is capable of improving the baseline compatibility functions, frequently by a sharp margin (e.g., +18.7% for H on AwA1 with respect to ALE, and +20.1% for H on FLO with respect to Sync_OVO). The reason for this can be found in the robustness gained by averaging over different classifiers (see the Supplementary Material, Table 4, available online); moreover, removing one class at a time happens to be better than removing more of them concurrently (see the Supplementary Material, Table 5, available online).

D. Comparisons With the State-of-The-Art
Standard Inductive ZSL: We set up an experimental validation using the datasets considered in Section IV-A, on which we compare the performance of our approach against a wide number of prior state-of-the-art works for inductive zero-shot learning (see Table III).
Among the comparative methods, we report a broad range of compatibility functions devised for ZSL (ALE [1], ESZSL [33], SynC [4] and SAE [21]). We also consider latent embedding models (such as latEM [38], CDL [18] and KerZSL [47]). And, of course, we compare with the most relevant class of methods, that is, adversarial approaches (see Section II), denoted by a ‡ symbol in Table III. For this experiment, as for the other state-of-the-art comparisons, all such compatibility functions have been exploited to generate visual descriptors. In order to decide from which compatibility function to sample each descriptor, we took advantage of a prior distribution depending upon the compatibility cost on the validation set (see the Supplementary Material, Section 5, available online, for further details). Our method based on Gaussian generation (GG), by itself, is capable of improving upon a number of GAN-based methods: LisGAN [24] and TCN [19] on AwA1, CV-ZSL [25] on AwA2, BP-ZSL [48] and Cycle-WGAN [9] on CUB, and f-CLSWGAN [40]. With the usage of teacher-student distillation (see Section III-E3), we can further capitalize on the soft labels obtained from the softmax classifier trained on GG features. In this manner, the student architecture is obtained by minimizing the loss function as in (11); we term it DGG. As the results in Table III clearly show, DGG achieves the second best performance on AwA2, with a reduced −0.9% gap from KerZSL [47]. The same happens on FLO: −0.8% with respect to LisGAN [24]. Most notably, however, DGG is able to improve upon the prior best scoring methods on the following datasets: +3.6% on AwA1, +2.1% on CUB, +1.0% on SUN and +8.9% on aPY, with respect to the top-scoring methods.
This favorable result in the simpler standard ZSL setup motivates benchmarking the more challenging generalized ZSL regime, where balancing seen and unseen performance is crucial: this is where our ensembling and linear distillation components become very useful.
Generalized Inductive ZSL (GZSL): We continue the quantitative validation of our proposed approach on inductive GZSL, in which we assess 1) the performance of Gaussian-generated features (GG) fed to a softmax classifier, 2) our OSCO ensemble mechanism (EGG), 3) the effect of linear distillation (EGG+d) and 4) the results of teacher-student distillation as in (11). The last case corresponds to our full computational method, denoted as DEGG in Table IV.
Discussion: In our analysis, we still consider the methods quoted above, while also adding many other approaches benchmarked in the more challenging GZSL setup. In fact, we also consider approaches designed for few-shot learning and cast to the zero-shot setting (such as CRnet [46]). We account for approaches which tackle zero-shot learning by an attention mechanism over the attributes used to describe each category (LFGAA [27], ZSL-OCD [20] and DAZLE [17]). Finally, we also add several other adversarial methods. The two most recent (and effective) ones are those designing a GAN-based algorithm with a computational module to remove redundancy from the generated features (RFF-GZSL [14]), and a generative scheme based on episodic training to better stage the transfer from seen to unseen classes (E-PGN [45]). All these methods, published between 2019 and 2020, improved upon the already solid GAN baselines of f-VAEGAN-D2 [41], Cycle-WGAN [9] and cognate methods. We register a systematic improvement in performance while adding components to our computational pipeline: both seen and unseen accuracies are increased by the softmax classifier trained on Gaussian-generated features (GG), further improved by the ensemble stage (EGG) with the OSCO strategy. Moreover, the unseen accuracy is even further enhanced by our two distillation mechanisms. We evaluated linear distillation for the sake of having a quick and simple strategy to aggregate several models into a single, and thus more efficient, one. Notably, linear distillation improved over the ensemble as well: in fact, the performance of EGG+d is always superior to the one provided by EGG (about +2% of absolute average improvement). With respect to this "vanilla" linear distillation, the more principled teacher-student approach is able to further improve the already solid performance.

TABLE IV: QUANTITATIVE COMPARISON AGAINST THE STATE-OF-THE-ART IN INDUCTIVE GENERALIZED ZERO-SHOT LEARNING (GZSL).
Globally, our full method, termed DEGG, scores a remarkable performance. On AwA2, our method is inferior only to RFF-GZSL [14]; in all other cases, DEGG improves upon prior state-of-the-art methods with the following absolute improvements for H: +0.6% on FLO, +1.4% on CUB, +2.3% on SUN, +3.2% on AwA1, and +9.9% on aPY.
Sensitivity Analysis: We evaluate the sensitivity with respect to the number of generated visual features and the balancing factor λ used in the teacher-student distillation loss (11), using the H metric. The results of our analysis are reported in Fig. 6.

Fig. 6. Sensitivity analysis for our Distilled Ensemble of Gaussian Generators (DEGG). Left: while ablating on the number of generated features per class (100, 300, 1000, 3000), the classification accuracy H remains stable. Right: in terms of H, the optimal value for the distillation parameter λ in (11) is 0.5, confirming what is shown in [15]. In both cases (left and right), the trend is stable across datasets.
As seen in Fig. 6 (left), we have not found any major effect on performance when changing the number of generated visual features from 100 to 300, and also to 1000 or 3000. Our findings are, to some extent, similar to [14], which has already shown some robustness towards this factor; globally, however, our generating approach seems even more robust. In light of this behavior, we set the number of generated visual features to 300 (for each seen/unseen class) and did not change this value while competing for the state-of-the-art in both ZSL (Table III) and GZSL (Table IV); the same value is used in our ablation analysis (Table II). This makes our method extremely convenient from an application standpoint, since it scores an optimal performance while avoiding costly hyper-parameter tuning.
The other parameter we inspected is the distillation parameter λ, balancing the teacher-student distillation loss (11) (Fig. 6, right). In accordance with the related literature [15], we found the best performance is obtained by balancing the teacher and the student evenly. Therefore, we set λ = 0.5 and never changed it throughout our analysis, further corroborating the robustness of our approach.

V. CONCLUSIONS & FUTURE WORKS
In this paper, we show how to turn a compatibility function for ZSL into a feature generator, sampling from a pool of class-specific Gaussian distributions whose first- and second-order statistics are inferred using the compatibility function as it is. To balance the classification performance between seen and unseen classes, we adopt a One-Seen-Class-Out (OSCO) ensemble mechanism to tackle the well-known hubness problem [39]. Since an ensemble requires evaluating several separate models at test time, we recover an efficient inference stage by performing neural distillation. Our approach is also able to improve upon prior state-of-the-art methods for standard and generalized inductive zero-shot learning. In fact, we improve prior art on AwA1 (+3.6%), CUB (+2.1%), SUN (+1.0%), and aPY (+9.9%) in top-1 classification accuracy for standard inductive ZSL, and we do the same for inductive GZSL (+0.6% on FLO, +1.4% on CUB, +2.3% on SUN, +3.2% on AwA1 and +10.6% on aPY). Our method also shows a notable robustness towards hyper-parameter choices (Fig. 6), such that the quantitative results reported always refer to a fixed configuration, with distillation parameter λ = 0.5 and 300 sampled features per class.
Future works essentially encompass two alternative directions. On the one hand, one may wonder whether the pre-training of the weights W can be done end-to-end alongside the training of the softmax classifier: deep learning experience suggests that this variation could be beneficial to performance, leading to stronger baseline methods. On the other hand, one may posit that the distribution of a visual category is more complicated than a Gaussian: exploring richer generative distributions is surely interesting and could be beneficial to performance.