Content-Attribute Disentanglement for Generalized Zero-Shot Learning

Humans can recognize or infer unseen classes of objects using descriptions explaining the characteristics (semantic information) of the classes. However, conventional deep learning models trained in a supervised manner cannot classify classes that were unseen during training. Hence, many studies have been conducted on generalized zero-shot learning (GZSL), which aims to produce systems that can recognize both seen and unseen classes by transferring learned knowledge from seen to unseen classes. Since seen and unseen classes share a common semantic space, extracting appropriate semantic information from images is essential for GZSL. In addition to semantic-related information (attributes), images also contain semantic-unrelated information (contents), which can degrade the classification performance of the model. Therefore, we propose a content-attribute disentanglement architecture which separates the content and attribute information of images. The proposed method comprises three major components: 1) a feature generation module for synthesizing unseen visual features; 2) a content-attribute disentanglement module for discriminating content and attribute codes from images; and 3) an attribute comparator module for measuring the compatibility between the attribute codes and the class prototypes, which act as the ground truth. With extensive experiments, we show that our method achieves state-of-the-art and competitive results on four benchmark datasets in GZSL. Our method also outperforms the existing zero-shot learning methods on all of the datasets. Moreover, our method achieves the best accuracy in a zero-shot retrieval task.


I. INTRODUCTION
To classify images, humans can capture the characteristics of objects, that is, their semantic information, and use it to recognize the class of an object. Thanks to advances in deep learning technology, machines can mimic this ability using supervised learning, given large amounts of data. Humans can recognize the class to which an object belongs using descriptions of the object from encyclopedias, even when they have never seen the class before. However, conventional supervised learning-based models cannot classify classes which are not seen during training. Therefore, if additional classes are added after training, the models must be re-trained from scratch.
To tackle this problem, several studies in a field known as zero-shot learning (ZSL) have been dedicated to enabling models to classify unseen classes [1]-[4]. The goal of conventional ZSL is to classify unseen classes using knowledge learned from seen classes. For generalized ZSL (GZSL), models are required to have the capacity to classify both seen and unseen classes after training only with seen classes. Existing GZSL research [5]-[8] has branched into embedding-based and generative-based methods. Embedding-based methods aim to classify the unseen classes by mapping visual features into semantic vectors. Generative-based methods generate unseen visual features using the unseen semantic vectors and randomly initialized noise vectors. Among recent generative-based methods, CE-GZSL [9] uses contrastive embedding to leverage instance-wise supervision, and AGZSL [10] fuses adaptive and generative mechanisms and supplements image-adaptive attention for GZSL. In this work, we focus on the generative-based method.
In the ZSL task, it is important to transfer knowledge learned from seen classes to unseen classes. Therefore, side information, such as the descriptions mentioned earlier, is required to bridge the gap between seen and unseen classes. Researchers have utilized class prototypes [11]-[13], word embeddings [2], [14], and text descriptions [15], [16] as the side information. Recent work [6], [9], [17] has mainly focused on class prototypes as side information. Class prototypes contain meaningful semantic knowledge describing the corresponding class. As shown in Fig. 1, the class prototypes represent the characteristics of classes. The set of elements of class prototypes is pre-defined, so seen and unseen classes share the same semantic space, with different intensities per class. A class prototype acts as a set of ground-truth class attributes describing the characteristics of the class. Hence, it is crucial to correctly map visual features obtained from deep convolutional neural networks like ResNet-101 [18] into class prototypes. To align visual features with corresponding class prototypes, neural network models have to extract visual features similar to the class prototypes that correspond to the class labels. However, as shown in Fig. 1, visual features also contain information which is not involved in class prototypes and can therefore degrade the performance of zero-shot classification.
To mitigate this problem, models need to disentangle the non-class and class attributes of images. We therefore define 1) features irrelevant to class prototypes as semantic-unrelated features, 2) features involved in class prototypes as semantic-related features, and 3) features extracted from ResNet-101 as visual features. As the prior state-of-the-art, SDGZSL [41] disentangles the semantic-unrelated and semantic-related features using a single encoder. Then, it aligns the semantic-related features and class prototypes. However, we argue that semantic-unrelated and semantic-related features need to be extracted using independent encoders, because the feature spaces of the two groups are not identical. SDGZSL also uses a concatenation operator when reconstructing an original visual feature from the two disentangled features. We further claim that aligning the two features would produce better results than simple concatenation, as discussed in [19]. This contention is motivated by style transfer research, which focuses on disentangling content and style representations. Many style transfer methods [20]-[22] have produced impressive improvements in performance and have shown that content-style disentanglement works. We can interpret styles as class attributes in the ZSL task. Therefore, we can disentangle semantic-unrelated information (contents) and semantic-related information (attributes) from visual features.
In this paper, we propose a novel content-attribute disentanglement architecture for generalized zero-shot learning (CA-GZSL). Our model encodes content and attribute codes from an original visual feature. The model learns content codes by calculating a reconstruction loss, and attribute codes by measuring compatibility scores with class prototypes. During reconstruction, to fuse the content and attribute codes effectively, we use adaptive instance normalization (AdaIN) [19], which aligns the statistics of the two different codes, leading to strong generalizability.
In summary, our main contributions are three-fold:
• We propose a novel content-attribute disentanglement architecture for generalized zero-shot learning (CA-GZSL). It comprises a visual feature generation module, a content-attribute disentanglement module, and an attribute comparator module.
• To the best of our knowledge, this is the first attempt to introduce adaptive instance normalization into a generative-based GZSL method to improve content-attribute disentanglement.
• The proposed method achieves state-of-the-art and competitive results on four datasets in GZSL, CZSL, and zero-shot retrieval tasks. Our approach is the first to obtain an over 80% result on CUB in CZSL and an over 50% result on aPY in GZSL.

A. GENERALIZED ZERO-SHOT LEARNING
The aim of zero-shot learning (ZSL) is to transfer knowledge from seen to unseen classes. ZSL research can be divided into inductive and transductive approaches. In inductive ZSL training [23]-[25], unseen class prototypes are used along with seen data. In transductive ZSL training [26]-[28], unlabeled visual features of unseen classes can be used in addition to unseen class prototypes and seen data. With respect to testing, ZSL is divided into conventional zero-shot learning (CZSL) and generalized zero-shot learning (GZSL). While CZSL predicts classes in unseen data, GZSL categorizes classes in both seen and unseen data. In general, GZSL is considered harder than CZSL, since models tend to be biased toward the seen data, which are the only type used in training. Our method belongs to the inductive GZSL category. GZSL can be achieved in two ways: embedding-based [2], [14], [29]-[32] or generative-based methods [8], [33]-[36]. Embedding-based methods have focused on learning visual features and class prototypes by aligning them in a joint embedding space. As an embedding-based method, DAZLE [5] uses an attention mechanism to highlight important local features. DCEN [6] learns task-independent knowledge via contrastive learning to transfer representations.
Generative-based methods synthesize unseen visual features using generative adversarial networks (GANs) [37] or variational autoencoders (VAEs) [38], which transforms the ZSL problem into a supervised classification problem. CADA-VAE [7] leverages two VAEs to align the latent distributions of different modalities. TF-VAEGAN [35] uses a feedback module to reflect feedback from the decoder to the generator. CE-GZSL [9] is a hybrid framework which integrates generative and embedding-based methods using contrastive learning. Our approach belongs to generative-based GZSL.

B. CONTENT-STYLE DISENTANGLEMENT
Content-style disentanglement separates content and style representations from an image or two different images. This concept has been widely used in style transfer [19], image-to-image translation [39], and style classification [20] tasks. Numerous studies have shown enhanced qualitative results using content and style encoder-decoder architectures. They have used style information to stylize the content of an image. The content and style representations are then combined to reconstruct the original image, to demonstrate that they are complementary and thus well-separated. As one of the operations combining content and style representations, adaptive instance normalization (AdaIN) [19] has been used and has produced significant improvements. AdaIN is an instance normalization which aligns feature statistics between two different feature distributions. It has also demonstrated a strong generalization ability.
MUNIT [39] provides unsupervised image-to-image translation using content codes that are domain-invariant and style codes that contain domain-specific properties. ALADIN [20] learns fine-grained style similarities among digital artworks by leveraging content-style disentanglement. ALADIN shows the remarkable impact of style representations constructed using a set of style codes. In GZSL, some approaches use a disentangling framework. DLFZRL [40] incorporates a hierarchical disentanglement structure for the discrimination of latent features. SDGZSL [41] uses a total correlation penalty for the disentanglement of semantic-related and semantic-unrelated features.
Unlike the existing works, we define the style of an image as a set of attributes. We therefore focus on the impact of the content-attribute disentanglement architecture, using an encoder-decoder network equipped with AdaIN to improve the attribute representation.

A. PROBLEM DEFINITION
For zero-shot learning (ZSL), we use a seen dataset S and an unseen dataset U. Let d_res be the dimensionality of a visual feature extracted using ResNet-101, and d_att the dimensionality of a class prototype. The seen dataset S is defined as S = {(x_s, y_s, a_{y_s}) | x_s ∈ X_s, y_s ∈ Y_s, a_{y_s} ∈ A_s}, where x_s ∈ R^{d_res} is a d_res-dimensional visual feature extracted from ResNet-101, y_s is a label in the seen classes, and a_{y_s} ∈ R^{d_att} is a d_att-dimensional class prototype of the class y_s. The unseen dataset U is defined analogously as U = {(x_u, y_u, a_{y_u}) | x_u ∈ X_u, y_u ∈ Y_u, a_{y_u} ∈ A_u}, where x_u ∈ R^{d_res} is a visual feature, y_u is a label of the unseen classes, and a_{y_u} ∈ R^{d_att} is a d_att-dimensional class prototype of the class y_u. The two sets of classes, seen and unseen, are disjoint: Y_s ∩ Y_u = ∅.

B. MODEL OVERVIEW
Our method is divided into two stages: the first stage is introduced in Fig. 2, and corresponds to Subsections 1) to 4), and the second stage is for final classification, corresponding to Subsection 5).
In Fig. 2, the proposed architecture comprises three modules: (a) a visual feature generation module consisting of a variational encoder Q and a variational decoder P ; (b) a content-attribute disentanglement module consisting of a content encoder E, an attribute encoder H, an adaptive instance normalization (AdaIN), and a decoder D; (c) an attribute comparator module consisting of a comparator T . First, the visual feature generation module synthesizes unseen visual features from unseen class prototypes using a variational autoencoder (VAE). To let the VAE know how to correctly synthesize visual features, we first train it with seen visual features. Then, the content-attribute disentanglement module encodes the content and attribute codes using encoders. We combine the content and attribute codes using AdaIN [19] and reconstruct the original visual features from the combined codes using the decoder. Finally, the attribute comparator module measures the compatibility scores between the attribute codes and the class prototypes, and makes the attribute codes resemble the corresponding class prototypes.

1) Visual Feature Generation Module
There are many generative-based GZSL approaches which use a VAE to synthesize unseen visual features [7], [8], [33], [35], as unseen visual features are not allowed to be used in training. We use a conditional variational autoencoder (CVAE) to generate synthesized visual features x̂ ∈ R^{d_res} conditioned on seen or unseen class prototypes. The CVAE first generates synthesized seen visual features x̂_s ∈ R^{d_res} using the seen dataset S for use in training. The objective function of the CVAE is formulated as:

L_CVAE = −E_{q(z|x_s, a_s)}[log p(x_s|z, a_s)] + KL(q(z|x_s, a_s) ∥ p(z|a_s)),

where the first term is the reconstruction loss and the second term is the Kullback-Leibler (KL) divergence between q(z|x_s, a_s) and p(z|a_s). x_s ∈ R^{d_res} and a_s ∈ R^{d_att} are the seen visual features and the seen class prototypes. The CVAE encoder Q models q(z|x_s, a_s) to produce the latent variables z using seen visual features x_s and seen class prototypes a_s. The CVAE decoder P models p(z|a_s) and p(x_s|z, a_s) to synthesize visual features from the latent variables z and the seen class prototypes a_s. We use seen visual features x_s and synthesized seen visual features x̂_s as inputs to the networks in the content-attribute disentanglement module. As previously mentioned, it is noteworthy that a_s and a_u share a common semantic space. Thus, after training the CVAE, the CVAE decoder can synthesize unseen visual features x̂_u ∈ R^{d_res} using unseen class prototypes a_u ∈ R^{d_att}. Both seen visual features and synthesized unseen visual features are used in training a zero-shot classifier. We use seen visual features extracted by ResNet-101, which was pre-trained on ImageNet [42] or fine-tuned with seen class images.
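To make the sampling step concrete, the two standard VAE building blocks used here, the reparameterization trick and the KL term, can be sketched as follows. This is a minimal framework-agnostic NumPy illustration, not the paper's implementation; for simplicity it assumes a standard normal prior, whereas the paper's prior p(z|a_s) is conditioned on the class prototype.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)); zero iff mu = 0 and logvar = 0."""
    return float(-0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar)))
```

During training, the encoder Q would output mu and logvar, z would be sampled with `reparameterize`, and the KL term would be added to the decoder's reconstruction loss.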

2) Content-Attribute Disentanglement Module
We define semantic-unrelated information as contents and semantic-related information as attributes that represent the styles of classes. The content-attribute disentanglement module comprises four main parts: a content encoder, an attribute encoder, an AdaIN operation, and a decoder. The module divides visual features into the contents and the attributes. The content encoder E : R^{d_res} → R^{d_att} and the attribute encoder H : R^{d_res} → R^{d_att} map a visual feature x into a content code e_c and an attribute code e_a, respectively. The content code and attribute code can be denoted as:

e_c = E(x),  e_a = H(x),

where e_c, e_a ∈ R^{d_att} are d_att-dimensional representations. We use AdaIN, which aligns two different feature statistics, to reconstruct an original visual feature. AdaIN takes as input a content code e_c and an attribute code e_a, and aligns the channel-wise mean and standard deviation of e_c to match those of e_a. AdaIN is defined as follows:

AdaIN(e_c, e_a) = σ(e_a) ((e_c − μ(e_c)) / σ(e_c)) + μ(e_a),

where σ(e_c) and μ(e_c) are the standard deviation and mean of a content code, and σ(e_a) and μ(e_a) are the standard deviation and mean of an attribute code. We first normalize the content code e_c with its mean μ(e_c) and standard deviation σ(e_c), scale the normalized content code with the standard deviation of the attribute code σ(e_a), and shift it with the mean of the attribute code μ(e_a).
The reconstructed visual features obtained from the content and attribute codes should resemble the original visual features. Therefore, we measure the reconstruction loss between the original and reconstructed visual features to learn the content codes. The decoder D : R^{d_att + d_att} → R^{d_res} reconstructs the original visual feature from the aligned codes AdaIN(e_c, e_a). The reconstructed visual feature x̂ and the reconstruction loss function can be formulated as:

x̂ = D(AdaIN(e_c, e_a)),

L_rec = ∥x − x̂∥²_2,

where x is an original visual feature and x̂ is the visual feature reconstructed by AdaIN followed by the decoder D. We measure the mean squared error (MSE) between the seen visual feature x and the reconstructed visual feature x̂ as the reconstruction loss.

3) Attribute Comparator Module
Inspired by [43], we adopt an attribute comparator module to let models learn the compatibilities between attribute codes and class prototypes. As shown in Fig. 3, we concatenate an attribute code e_a and a class prototype a as an input to the comparator T. The comparator T measures the compatibility score between e_a and a, while learning to maximize the score. The compatibility loss function can be formulated as:

L_comp = Σ_i (T(e_a, a_i) − φ(y)_i)²,

where y ∈ Y_s indicates a ground-truth label and φ(y) indicates the one-hot label of y. We calculate the MSE between the compatibility score T(e_a, a_i) and the one-hot label φ(y) to measure the compatibility loss.
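As a minimal sketch of this loss (NumPy, with hypothetical shapes: the comparator's per-class scores are assumed to be pre-computed into one vector), the MSE against the one-hot label might be computed as:

```python
import numpy as np

def compatibility_loss(scores, y):
    """MSE between per-class compatibility scores and the one-hot ground truth.

    scores: array of shape (n_classes,), where scores[i] = T(e_a, a_i)
    y: integer ground-truth class index
    """
    target = np.zeros_like(scores)
    target[y] = 1.0  # phi(y): one-hot label
    return float(np.mean((scores - target) ** 2))
```

The loss is zero exactly when the comparator scores 1 for the ground-truth prototype and 0 for all others.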

4) Total Loss
Consequently, the total loss can be formulated as:

L_total = λ_1 L_CVAE + λ_2 L_rec + λ_3 L_comp,

where λ_1, λ_2, and λ_3 are the factors weighting each loss and controlling its impact. We use the Optuna package [44] to search for the hyperparameters λ_1, λ_2, and λ_3.

5) Generalized Zero-Shot Classification
As unseen visual features are not used in training, we generate unseen visual features for generalized zero-shot classification. The decoder of the CVAE generates unseen visual features from an unseen class prototype a_u and Gaussian noise z. The attribute encoder H encodes attribute codes from the synthesized unseen visual features. Then, we concatenate the seen attribute codes and the synthesized unseen attribute codes. Using these codes, we train a classifier. For the classifier, we use only one linear layer, to make the system consistent with existing work. Because we have a complex architecture with which to effectively extract attributes in the first stage, the classifier in the second stage has a simple architecture. This classifier can be formulated as:

ŷ = arg max_{y ∈ Y_s ∪ Y_u} l(e_a),

where l is a linear layer followed by a softmax, and e_a ∈ R^{d_att} is an attribute code extracted from the attribute encoder H.
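A single-linear-layer classifier over attribute codes can be sketched as below. This is a minimal NumPy illustration with hypothetical weight shapes (W maps a d_att-dimensional code to one logit per class), not the trained classifier itself:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify(W, b, e_a):
    """One linear layer followed by a softmax over all (seen + unseen) classes."""
    probs = softmax(W @ e_a + b)
    return int(np.argmax(probs)), probs
```

At test time, the predicted class is the arg max over the softmax output, matching the formulation above.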

A. DATASETS
We use four popular benchmark datasets, Caltech-UCSD Birds-200-2011 (CUB) [49], Animals with Attributes 2 (AWA2) [50], Oxford Flowers (FLO) [51], and Attribute Pascal and Yahoo (aPY) [12], to measure the CZSL and GZSL performance. For evaluating the GZSL performance, we split each dataset into training-seen, test-seen, and test-unseen sets following the proposed split suggested in [50]. The CUB dataset is a fine-grained bird dataset that contains 11,788 images, including 7,057 training-seen images, 1,764 test-seen images, and 2,967 test-unseen images. The total number of classes is 200, divided into 150 seen classes and 50 unseen classes. CUB has 312 attributes.
The AWA2 dataset is a coarse-grained animal dataset. The total number of AWA2 images is 37,322, composed of 23,527 training-seen images, 5,882 test-seen images, and 7,913 test-unseen images. The total number of classes is 50, divided into 40 seen classes and 10 unseen classes. AWA2 has 85 attributes.

The FLO dataset is a fine-grained flower dataset. The total number of FLO images is 8,189, composed of 1,640 training-seen images, 5,394 test-seen images, and 1,155 test-unseen images. The total number of classes is 102, divided into 82 seen classes and 20 unseen classes. FLO has 1024-dimensional attribute embeddings extracted from a character-based CNN-RNN using fine-grained visual descriptions [15].

The aPY dataset is a coarse-grained dataset. The total number of aPY images is 15,339, composed of 5,932 training-seen images, 1,483 test-seen images, and 7,924 test-unseen images. The total number of classes is 32, divided into 20 seen classes and 12 unseen classes. aPY has 64 attributes.
We also use fine-tuned datasets from [33], where ResNet-101 is fine-tuned on the seen class images of each dataset.

TABLE 2. Results of the GZSL methods. There are three blocks in the table: the first block concerns embedding-based methods, the second block generative-based methods, and the last block our method. U denotes acc_Y_u and S denotes acc_Y_s for simplicity. The best H results are highlighted in bold, since H is the major metric in GZSL. * indicates a fine-tuned dataset, which was fine-tuned using seen class images.

B. IMPLEMENTATION DETAILS
We use three fully-connected layers with 2048 hidden units for the VAE encoder Q and the VAE decoder P, with Leaky ReLU as the activation function. We use the same hyperparameters for the visual feature generation module as SDGZSL for fair comparison. We use a mini-batch size of 64 on all datasets. For the content and attribute encoders E and H and the decoder D, we use two fully-connected layers with d_att hidden units. We optimize all the networks with the Adam optimizer for each module. Two fully-connected layers with 2048 hidden units are used for the comparator module T. We use a single fully-connected layer for the classifier for both CZSL and GZSL. To search hyperparameters, we use the Optuna package [44]. We will release the source code with the detailed hyperparameters. All the models are implemented with the PyTorch framework v1.7.0 [52]. We use a single RTX 2080 Ti 11GB GPU for each training.

C. EVALUATION METRICS
We assess the performance of conventional zero-shot learning (CZSL) and generalized zero-shot learning (GZSL) with the average per-class top-1 accuracy (T1) and the harmonic mean (H) between seen and unseen T1 accuracies, as presented in [50]. T1 is defined as follows to evaluate ZSL:

acc_Y = (1 / ∥Y∥) Σ_{c=1}^{∥Y∥} (Number of correct predictions in c / Total number of samples in c),   (10)

where ∥Y∥ means the number of classes in Y. The equation of H is defined as follows to evaluate GZSL:

H = (2 × acc_Y_s × acc_Y_u) / (acc_Y_s + acc_Y_u),

where acc_Y_s and acc_Y_u mean the average per-class top-1 accuracies for seen and unseen classes, respectively.
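These two metrics can be sketched in a few lines of plain Python (hypothetical integer label encoding, lists of equal length):

```python
def per_class_top1(y_true, y_pred):
    """Average per-class top-1 accuracy (T1): mean of per-class accuracies."""
    accs = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        accs.append(correct / len(idx))
    return sum(accs) / len(accs)

def harmonic_mean(acc_s, acc_u):
    """H between seen and unseen per-class accuracies."""
    if acc_s + acc_u == 0:
        return 0.0
    return 2 * acc_s * acc_u / (acc_s + acc_u)
```

Averaging per class (rather than over all samples) prevents large classes from dominating the score, which matters in GZSL where seen classes usually have far more test samples.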

D. COMPARISON WITH STATE-OF-THE-ARTS
Previous generative-based methods have used visual features extracted from ResNet-101 pre-trained on ImageNet or fine-tuned on the seen class images of each dataset. We measure the CZSL and GZSL performance of our CA-GZSL method. We select the recent state-of-the-art embedding-based and generative-based methods, as listed in Tables 2 and 3.

2) Results of Generalized Zero-Shot Learning
In Table 2, we show the results of the evaluation of the GZSL performance and compare our CA-GZSL with recent GZSL methods. Our method surpasses the baselines by about 2% on CUB, FLO, and aPY, and obtains the second-best H result on AWA2. Notably, on the aPY dataset, our method is the first to achieve an H result over 50% compared with the existing methods. Our intuition for GZSL is in accord with that of SDGZSL, the previous best model among the generative-based methods. Both aim at disentangling semantic-related and semantic-unrelated information from visual features. To do so, SDGZSL applies a total correlation penalty to ensure independence between semantic-related and semantic-unrelated features by dividing a feature generated from a single encoder, and reconstructs the original features by concatenating the two. However, we assume that it is difficult for a single encoder to separate two independent features. For better disentanglement, we introduce two different encoders to encode content and attribute codes from visual features. We use AdaIN when combining content and attribute codes, to improve the generalization ability, and then reconstruct the original features using the combined codes. As shown in Table 2, our method outperforms SDGZSL in H results on all the fine-tuned datasets. This result indicates that our approach is more effective in learning discriminative attribute features by effectively disentangling the contents and attributes from the visual features.

3) Results of Conventional Zero-Shot Learning
We report the CZSL performance in Table 3. Our method achieves state-of-the-art results on all of the datasets. Notably, on CUB, ours is the first work to obtain a performance over 80%. Table 3 shows that we outperform SDGZSL on all fine-tuned datasets. In particular, on AWA2, our model outperforms SDGZSL by about 4.7%. The results listed in Tables 2 and 3 indicate that our method is more generalizable than the alternatives in ZSL tasks.

1) Zero-Shot Retrieval Protocol
We follow the zero-shot retrieval protocol proposed in SDGZSL [41]. First, ResNet-101 extracts the visual features from all unseen images. Then, the attribute encoder encodes the unseen visual features into attribute codes, which act as reference features. Third, the VAE synthesizes N unseen visual features per class, and the attribute encoder encodes them into synthesized unseen attribute codes. The total number of synthesized unseen attribute codes is N × ∥Y_u∥, where ∥Y_u∥ is the number of unseen classes. Fourth, we average the N synthesized unseen attribute codes to produce a representative feature of each class. The ∥Y_u∥ representative features act as query features. Lastly, we measure the cosine similarity between a query feature and the reference features, and rank the reference features by similarity in descending order. Fig. 4 shows a comparison of the zero-shot retrieval performance of CVAE, SDGZSL, and CA-GZSL (ours). The metric we use to evaluate the zero-shot retrieval performance is the mean average precision at k (mAP@k). Our method, CA-GZSL, outperforms the other approaches on all of the datasets except for mAP@25 on AWA2. For CUB and aPY, CA-GZSL has notably better mAP performance than the others. Specifically, on CUB, CA-GZSL outperforms SDGZSL by 14.5%, 23.3%, and 25.2% in mAP@100, 50, and 25, respectively. On aPY, CA-GZSL surpasses SDGZSL by 1.9%, 2.6%, and 5.5% in mAP@100, 50, and 25, respectively. On FLO, CA-GZSL shows better performance than SDGZSL by 1.6%, 1.3%, and 1.9% in mAP@100, 50, and 25. On AWA2, CA-GZSL is better than SDGZSL by 3.4% and 0.6% in mAP@100 and 50, respectively. In contrast, SDGZSL is slightly higher than CA-GZSL, by 0.7%, in mAP@25.
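The ranking step of this protocol can be sketched as follows. This is a minimal NumPy illustration (rows of `references` are assumed to be nonzero feature vectors), covering only the cosine-similarity ranking, not the full mAP@k evaluation:

```python
import numpy as np

def rank_by_cosine(query, references):
    """Return reference indices sorted by cosine similarity to the query, descending."""
    q = query / np.linalg.norm(query)
    r = references / np.linalg.norm(references, axis=1, keepdims=True)
    sims = r @ q                 # cosine similarity of each reference to the query
    return np.argsort(-sims)     # negate so argsort yields descending order
```

In the protocol above, `query` would be one of the ∥Y_u∥ averaged synthesized attribute codes and `references` the attribute codes of all real unseen images; mAP@k is then computed over the top-k ranked references.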
Since our method outperforms SDGZSL in almost all of the evaluations, we argue that the attribute codes extracted using our method carry more discriminative information than those of SDGZSL, which helps to distinguish unseen classes, resulting in performance improvements. Fig. 5 illustrates the unseen images retrieved using CA-GZSL on AWA2. We perform the zero-shot retrieval according to the protocol described in Subsection E.1 of the experiments Section. As shown in Fig. 5, our model tends to be confused between blue whale and dolphin images, although the first false image in the blue whale column was mislabeled as a dolphin; because the correct label is blue whale, the model in fact answered correctly. The model is also likely to be confused between walrus and seal images, which look very similar at first glance. Likewise, the model tends to be confused between giraffe and bobcat images; both classes have spot patterns, so the model latches onto these patterns rather than more discriminative cues. Notably, the classes of the top three false predictions are consistent. We argue that our model can consistently identify meaningful attribute information from various images.

F. MODEL ANALYSIS 1) Ablation Study
In Table 4, we show the results of an ablation study on all four datasets, which measures the impact of each component of our CA-GZSL. We first evaluate the performance of the visual feature generation module as a baseline. Then, we evaluate the performance of the content-attribute disentanglement module and the attribute comparator module by gradually adding the reconstruction loss L_rec and the comparator loss L_comp. When L_rec is added to the baseline, the GZSL performance is enhanced by 25

2) Impact of the Number of Synthesized Features
Generative-based GZSL works synthesize unseen visual features, as only seen visual features are available in training. Thus, in this experiment, we evaluate the impact of the number of synthesized visual features. Fig. 6 shows the seen accuracy (S), unseen accuracy (U), and harmonic mean (H) when the number of synthesized visual features is varied from 5 to 4,000. When the lowest number (5) of synthesized visual features is used, S is the highest and U is the lowest on all datasets. There is a trade-off between S and U: when S goes up, U goes down, and vice versa. On AWA2, our model achieves the best H result (75%) with 2,400 synthesized features. On CUB, we observe the best H performance (77.2%) with 400 and 800. On FLO, it produces the best H result (89.7%) with both 2,800 and 3,200. On aPY, it shows the best H performance (50.5%) with 1,200.

3) Impact of Feature Fusion Operators
We evaluate the impact of several fusion operators which combine content and attribute codes. As listed in Table 5, AdaIN outperforms the 'Concat' and 'Sum' operators. On aPY, AdaIN is better than 'Concat' and 'Sum' by 5.3% and 5%, respectively.

4) t-SNE Visualization
We visualize the attribute codes using the t-distributed stochastic neighbor embedding (t-SNE) algorithm [53], as shown in Fig. 7. The attribute encoder in our architecture encodes the attribute codes from unseen visual features. Most of the clusters are isolated from each other. In contrast, the seal and the walrus clusters are most closely mingled, and are hard to distinguish. The blue whale and dolphin clusters are close to each other. The bat and rat clusters are also close to each other. As shown in Fig. 5, these pairs of classes share similar attributes, leading to difficulties in discrimination. On the CUB data, our method produces discriminative attribute codes, even though CUB has fine-grained classes that require models to catch more subtle differences in attributes than coarse-grained classes. Given this ability, our method outperforms the others by a significant margin in the zero-shot retrieval task, as shown in Fig. 4.

V. DISCUSSION
We investigate the effectiveness of a content-attribute disentanglement architecture for generalized zero-shot learning (GZSL). Through extensive experiments, our method for separating contents and attributes from images is found to outperform most of the existing approaches in GZSL and ZSL. As mentioned in Subsection E.2 of the experiments Section, the image of the first false prediction in the blue whale column was mislabeled as a dolphin, but blue whale is the correct answer. We believe that refining the datasets widely used in GZSL would be valuable future work. Although our method achieves state-of-the-art performance on most datasets, it is limited in classifying classes that have similar class prototypes, as shown in Fig. 5. We argue that it is hard to tackle the problem of subtle differences in the values of class attributes using only the class prototype information. Therefore, auxiliary information such as knowledge graphs or additional text descriptions would help to improve the zero-shot inference abilities of models.

VI. CONCLUSIONS
In this paper, we propose a novel content-attribute disentanglement architecture for generalized zero-shot learning, consisting of a visual feature generation module, a content-attribute disentanglement module, and an attribute comparator module. In addition, we present the first attempt to utilize adaptive instance normalization in this setting to improve disentanglement and generalization ability. Our method recognizes discriminative class attributes from visual features in both zero-shot classification and retrieval tasks, as demonstrated by extensive experiments. We evaluate our approach on four benchmark datasets widely used in ZSL. Compared to existing GZSL approaches, our method achieves state-of-the-art results on the CUB, FLO, and aPY datasets, and competitive results on AWA2. In future work, we will utilize additional information, such as knowledge graphs, in our architecture to discriminate classes with similar attributes.