EAD-GAN: A Generative Adversarial Network for Disentangling Affine Transforms in Images

This article proposes a generative adversarial network called explicit affine disentangled generative adversarial network (EAD-GAN), which explicitly disentangles affine transform in a self-supervised manner. We propose an affine transform regularizer to force the InfoGAN to have explicit properties of affine transform. To facilitate training an affine transform encoder, we decompose the affine matrix into two separate matrices and infer the explicit transform parameters by the least-squares method. Unlike the existing approaches, representations learned by the proposed EAD-GAN have clear physical meaning, where transforms, such as rotation, horizontal and vertical zooms, skews, and translations, are explicitly learned from training data. Thus, we set different values of each transform parameter individually to generate specifically affine transformed data by the learned network. We show that the proposed EAD-GAN successfully disentangles these attributes on the MNIST, CelebA, and dSprites datasets. EAD-GAN achieves higher disentanglement scores with a large margin compared to the state-of-the-art methods on the dSprites dataset. For example, on the dSprites dataset, EAD-GAN achieves the MIG and DCI score of 0.59 and 0.96 respectively, compared to 0.37 and 0.71, respectively, for the state-of-the-art methods.

The concept of disentangled representation has been defined in several ways in the literature [9]-[11]. The necessity of explicit inductive biases, both for learning approaches and for datasets, is discussed in [9]. Inductive bias refers to the set of assumptions that a learner uses to predict outputs for inputs that it has not encountered [12], [13]. For instance, in the dSprites dataset, objects are displayed at different angles and positions. Such prior knowledge helps to detect and classify objects. However, the inductive biases in existing disentangled representation approaches are mostly implicit. The explicit affine disentangled generative adversarial network (EAD-GAN) proposed in this article utilizes the affine transform as an explicit inductive bias, leading to better disentangled representations with clear physical meaning in terms of affine transforms. Fig. 1 shows entangled representations with unclear physical meaning.
We define the physical meaning property as follows: the absence of physical meaning indicates that experts cannot interpret or map the latent dimensions of disentangled representations to physical or intuitive concepts (e.g., rotation angle), which is a common issue for the representations learned by existing methods [2], [7], [8], [10], [14]-[18]. A disentangled representation usually satisfies two conditions: modularity and compactness [10]. In addition, the representations learned by EAD-GAN also achieve the deterministic assignment property for affine transforms. Modularity measures whether a single latent dimension encodes no more than a single data generative factor. Some latent dimensions of an entangled representation may lack a clear physical meaning because they encode a mixture of several data generative factors, which leads to worse modularity. Compactness measures whether each data generative factor is encoded by a single latent dimension. An entangled representation may encode one data generative factor with multiple latent dimensions. By contrast, each latent dimension learned by the proposed EAD-GAN can be mapped one to one to an affine transform, which leads to both better modularity and better compactness for affine transforms.
In a deterministically assigned representation, each latent dimension learns a fixed attribute regardless of the training trial and random seed. For modularity and compactness, the performance of existing approaches could be improved by utilizing techniques such as contrastive learning [16]. However, a deterministic assignment cannot be achieved by those techniques. For example, suppose that we train InfoGAN [8] twice on the MNIST dataset, in trials A and B. In trial A, the first latent dimension may learn the rotation of the digit and the second dimension may learn the thickness of the digit, while in trial B, the first latent dimension may learn the thickness and the second may learn the rotation. In that situation, to know the attribute assigned to a specific latent dimension in each trial, we first need to generate a sequence of images (e.g., ten images) by changing the value of that latent dimension (a procedure known as latent traversal). Then, expert knowledge is required to find the pattern (e.g., rotation of the digit) hidden in the sequence of images. This process could be cumbersome if: i) there are many latent dimensions to observe (e.g., 100 latent dimensions in [7], [14], and [15]) and ii) some sequences of images do not have clear physical meaning. For a disentangled representation with deterministic assignment, the attributes learned by the latent dimensions are fixed. For example, in EAD-GAN, we can predefine the sequence as rotation, horizontal and vertical zooms, and horizontal and vertical translations for latent dimensions 1-5.
Fig. 1. Entangled representations of [8] with unclear physical meaning. Given different values, −1, 0, and 1, of the latent vector $c = (c_1, c_2, c_3)$, it is possible that the generated transforms are highly entangled, and thus, they have no clear physical meaning. For example, $c_1$ may represent both rotation and vertical zoom.
A disentangled representation learned by the proposed EAD-GAN can explicitly make a tradeoff between compactness and expressiveness. For example, the zoom attribute can be decomposed into horizontal and vertical zooms. A compact representation encodes the zoom by one latent dimension, while an expressive representation decomposes it into horizontal and vertical zooms, encoded by two latent dimensions. This tradeoff between compactness and expressiveness is beneficial [10], as different subsequent tasks may benefit from different feature decompositions.
We are motivated by the importance of a disentangled representation, in particular for the affine transform (see Fig. 2), where disentangling object pose is an attractive property of an algorithm in the imaging domain [19]-[21]. Few algorithms have been able to successfully disentangle the affine transform. In [20], an algorithm is introduced that disentangles rotation and translation but not an entire affine transform. VITAE [22] proposes to separate the spatial transforms from the appearance of the input data, but the spatial transforms themselves are highly entangled in terms of rotation, translation, and zoom.
We propose EAD-GAN, a generative adversarial network (GAN) that utilizes the affine regularizer as an inductive bias to explicitly disentangle the affine transform. We assume that every image $X_r$ is formed by the multiplication of an affine matrix $M_r$ that describes its pose and a canonical image base $X_b$. If we purposely transform the image $X_r$ with a predefined affine matrix $M$, we obtain another transformed image $X_t$, where $X_t$ can also be expressed as the multiplication of an affine matrix $M_t$ and the same canonical image base $X_b$. We derive the affine regularizer by decomposing an affine matrix $M$ into two separate transforms $M_r$ and $M_t$ and inferring the transform parameters by the least-squares method. Unlike existing approaches, the representations learned by EAD-GAN are deterministically assigned and have clear physical meaning, where transforms, including rotation, horizontal and vertical zooms, and translations, can be explicitly learned from data and hence can be individually selected to generate specific affine transformed data by the learned network (see Fig. 2).
In the remainder of this article, we first review the related work in Section II and then review InfoGAN and show its limitations in Section III. We introduce EAD-GAN in Section IV, and in Section V, we show numerical results of the disentangled representation learned by EAD-GAN. We further discuss the advantages and weaknesses of EAD-GAN compared to other methods in Section VI.
Our contributions are given as follows.
1) The disentangled representations obtained by EAD-GAN have clear physical meaning in terms of affine transforms in images. To the best of our knowledge, EAD-GAN is the first algorithm that can disentangle an entire affine transform, including rotation, horizontal and vertical zooms, skews, and translations in an unsupervised manner.
2) The disentangled representations obtained by EAD-GAN have the deterministic assignment property. Each attribute is assigned to a unique component of the latent vector regardless of training trials and vice versa, which achieves better disentangled representations for affine transforms.
II. RELATED LITERATURE
Recent approaches to learn disentangled representations are largely based on variational autoencoders (VAEs) [2] and InfoGAN [8]. To promote disentanglement, the VAE encourages the factorization of the posterior $Q(z|X)$. InfoGAN [8] proposes to maximize the mutual information between a subset $c_I$ of the latent representation $z$ and the generated data. Much attention has been paid to regularizers that promote disentanglement. The β-VAE [7] encourages disentanglement by increasing the weight of the KL regularizer, thus promoting the factorization of the posterior $Q(z|X)$. Both FactorVAE [14] and β-TCVAE [15] penalize the total correlation; the former relies on adversarial training, while the latter directly calculates the total correlation through the decomposition of the β-VAE objective function. The HFVAE [18] proposes a two-level hierarchical objective to control the relative degree of statistical independence. In the ChyVAE [23], an inverse-Wishart (IW) prior on the covariance matrix of the latent code is augmented to promote statistical independence. The DIP-VAE [24] penalizes the difference between the aggregated posterior and a factorized prior. In the AnnealedVAE [25], the encoder concentrates on learning individual factors and variations by gradually increasing the bottleneck capacity. ControlVAE [26] adds a nonlinear PI controller to automatically tune the hyperparameter added in the VAE objective. Guided-VAE [27] guides the VAE learning by introducing a lightweight decoder that learns latent geometric transformation and principal components. OOGAN [28] improves disentanglement by introducing an orthogonal regularization term to the loss function. In [29], a regularizer is introduced to punish the disagreement between the extracted feature interactions. The IB-GAN [17] is an extension to InfoGAN rooted in the information bottleneck theory, which includes a mutual information upper bound and forms a mutual information bottleneck. The InfoGAN-CR [16] adds a contrastive regularizer on top of InfoGAN that compares the changes between the image and latent space.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Although the aforementioned methods have achieved better disentanglement performance compared to the baseline established by VAE and InfoGAN, none of them yield disentangled representations with deterministic assignments, nor have they successfully disentangled an entire affine transform, which is a desirable property in the imaging domain.
In [30]-[33], self-supervised regularization is applied, where the difference of images before and after the affine/projective transform is compared. The transform loss is defined as a squared $\ell_2$ norm, $\|\cdot\|_2^2$, where $M \in \mathbb{R}^{3 \times 3}$ is a parameterized matrix. However, those approaches do not achieve disentanglement since no relationship between the data generative factor and the transform is established. By contrast, the proposed EAD-GAN creates a link between the data generative factor and the transform by making a specific definition of each element of the transform matrix and further decomposing it to achieve explicit disentanglement (see Section IV).
As a byproduct of EAD-GAN, the encoder of EAD-GAN can learn the affine transform parameters of a given image and apply the inverse transform to the image to make it invariant to affine transforms. To achieve invariance to affine transforms for image data, the spatial transformer network (STN) [19] can actively transform the input images by embedding the spatial transformer block into a target network or algorithm. Inverse compositional spatial transformer networks (IC-STNs) [34] use a recurrent transform to further improve the alignment ability of the STN. The intuition behind STN and IC-STN is to fulfill the target network's learning objectives, such as classification or object recognition. Different from STN and IC-STN, the encoder of EAD-GAN is trained in a self-supervised manner and does not need the aid of human-annotated learning objectives such as image classification or object recognition.

A. GAN: Generative Adversarial Network
GAN [1] trains a deep generative model via a minimax game. The goal is to learn a generated data distribution $P_G(X)$ close to the training data distribution $P_{\mathrm{data}}(X)$ by training a generator and a discriminator. During training, first, a latent vector $z$ is sampled from a prior distribution $P(z)$. Then, the "fake" data $X_f \sim P_G(X)$ are generated from $z$ through the generator $G$. To train the discriminator $D$, the fake data $X_f$ are fed to the discriminator $D$ with the label "fake," and the real data $X_r$ sampled from the training data are fed to the discriminator $D$ with the label "real." By contrast, to train the generator $G$, the fake data $X_f$ are labeled as "real." The generator $G$ is trained by playing against an adversarial discriminator $D$ that aims to distinguish between samples from the generated data $X_f \sim P_G(X)$ and the observation $X_r \sim P_{\mathrm{data}}(X)$ [1].
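For reference, the minimax game described above is commonly written as the standard objective of [1]. The form below is a sketch in this section's notation; the name $L_{\mathrm{adv}}(D, G)$ is borrowed from the caption of Fig. 6, and the exact expression used by the authors may differ.

```latex
\min_{G} \max_{D} \; L_{\mathrm{adv}}(D, G)
  = \mathbb{E}_{X_r \sim P_{\mathrm{data}}(X)} \big[ \log D(X_r) \big]
  + \mathbb{E}_{z \sim P(z)} \big[ \log \big( 1 - D(G(z)) \big) \big]
```

The first term rewards the discriminator for accepting real samples, and the second rewards it for rejecting generated samples $X_f = G(z)$; the generator minimizes the second term.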

B. InfoGAN: Information Maximization GAN
The GAN uses a simple latent vector $z$ without imposing any constraints on how the generator uses this latent vector, which may lead to a highly entangled mapping between the latent vector $z$ and the generated data $X_f$. This is undesirable since there is no intuitive control, whereas a designer who uses this model would like to generate images with explicit transforms. To overcome this limitation and achieve disentanglement, InfoGAN [8] decomposes the latent vector into two parts: $z$ representing incompressible noise and $c_I$ representing a semantic generative factor (e.g., the number of the generated digit and the rotation of the generated digit in MNIST). In InfoGAN, the mutual information $I(c_I; X_f)$ between the semantic data generative factor $c_I$ and the generated data $X_f$ is maximized to promote mapping between $c_I$ and $X_f$. Thus, the variation of the generated data $X_f$ can be reflected by that of the data generative factor $c_I$. Specifically, InfoGAN maximizes the objective function [8]
$$L_{\mathrm{Info}} = L_{\mathrm{adv}} + \lambda I(c_I; X_f). \tag{2}$$
However, InfoGAN achieves the disentanglement in an implicit way since it only uses the mutual information as an inductive bias. This representation has several limitations: 1) the latent vector (data generative factor) $c_I$ does not necessarily have a clear physical meaning [8], which makes the learned representations difficult to interpret and to apply in downstream tasks, and 2) the modularity and compactness are not optimized and deterministic assignment is not achieved.

IV. PROPOSED EAD-GAN
To mitigate the aforementioned limitations, i.e., lack of clear physical meaning and deterministic assignment, we aim to equip the disentangled representation with clear physical meaning by adding physical priors as inductive biases. Since disentangling the object pose is an attractive property in the imaging domain, we propose to explore the affine transform as an explicit inductive bias to guide the disentanglement process. We propose a network called EAD-GAN that imposes an affine regularizer in conjunction with InfoGAN. For example, a designer who requires an audience foreground could create one by generating many individuals translated and skewed with EAD-GAN. To derive the affine regularizer, we first introduce the matrix construction process in Section IV-A, where an affine matrix $M$ is constructed from a latent vector. In Section IV-B, we describe how to decompose a known affine matrix $M$ into two unknown affine matrices $M_r$ and $M_t$.
Next, we estimate the matrices $\hat{M}_r$ and $\hat{M}_t$ with a neural network and further compute the matrix $\hat{M}$. Thus, we can calculate the affine regularizer with $M$ and $\hat{M}$. Next, to align each affine transform to an individual latent dimension, we need to estimate each affine parameter from an affine matrix. As explained in Section IV-C, since the estimation process is nonlinear and overdetermined, we apply least-squares estimation (LSE) to approximate the optimal solution. Finally, we show the consolidated network structure, algorithm flow, and overall loss function in Section IV-D.

A. Affine Matrix Construction
To build a connection between the latent vector $c_I$ (semantic generative factor) and the affine transform, we propose to construct an affine transform matrix $M$ from a given latent vector. Considering all possible combinations of affine transforms (rotation, horizontal and vertical zooms, skews, and translations), there are many ways to construct the affine matrix from a semantic latent vector. As an illustration, here, we select rotation $\theta$, horizontal and vertical zooms $(p, q)$, and translations $(x, y)$ as the components of the affine matrix (see Appendix A in the Supplementary Material for a construction of the entire affine matrix).
From those parameters, the affine matrix $M$ is constructed as in (4), where $A_{ij}$ denote the elements of $M$. For a 2-D affine transform, a 2 × 2 matrix controls the rotation, zooms, and skews of an image. A 2 × 3 matrix adds control over horizontal and vertical translations. We add [0, 0, 1] as the third row of the matrix for the convenience of inverse matrix calculation.
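As a minimal sketch of this construction, the code below builds a 3 × 3 matrix from $(\theta, p, q, x, y)$ and applies it to a point in homogeneous coordinates. The composition order (zooms first, then rotation, then translation) and the function names `build_affine` and `apply_affine` are assumptions made for illustration; the article's own definition is the one in (4) and may order the factors differently.

```python
import math

def build_affine(theta, p, q, x, y):
    """Return a 3x3 affine matrix as nested lists.

    Assumed composition (not taken from the article): axis-aligned zooms
    (p, q), followed by a rotation by theta, followed by a translation (x, y).
    """
    c, s = math.cos(theta), math.sin(theta)
    return [
        [p * c, -q * s, x],
        [p * s,  q * c, y],
        [0.0,    0.0,   1.0],  # third row [0, 0, 1] makes the 3x3 inverse convenient
    ]

def apply_affine(M, point):
    """Map a 2-D point through M using the homogeneous vector (x, y, 1)."""
    v = (point[0], point[1], 1.0)
    out = [sum(M[i][k] * v[k] for k in range(3)) for i in range(3)]
    return out[0], out[1]

# Quarter-turn rotation, horizontal zoom of 2, and a shift of 0.5 along x.
M = build_affine(theta=math.pi / 2, p=2.0, q=1.0, x=0.5, y=0.0)
print(apply_affine(M, (1.0, 0.0)))  # approximately (0.5, 2.0)
```

Under this assumed form, the upper-left 2 × 2 block carries rotation and zoom, while the third column carries the translation, matching the roles described in the text.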

B. Decomposition of Affine Transform
An affine transform links two images before and after the transform, but an encoder infers the affine transform parameters from a single input image. To let a network learn an affine transform encoder, we propose to decompose an affine transform $M$ into two parts $M_r$ and $M_t$. We represent the spatial coordinates of an image $X$ by the variables $(x, y)$ and define a column vector $\mathbf{x} = (x, y, 1)^T$. Then, an affine transform of an image $X$ by a transform matrix $M$ can be expressed by the matrix multiplication $M\mathbf{x}$. We express an image $X_r$ as the combination of an affine matrix $M_r$ that describes its pose and a canonical image base $X_b$, $\mathbf{x}_r = M_r \mathbf{x}_b$. If we purposely transform the image $X_r$ with a predefined affine matrix $M$, we obtain another transformed image $X_t$ from $\mathbf{x}_t = M\mathbf{x}_r$. Both $X_r$ and $X_t$ can be expressed as different transformed versions of the same image $X_b$, where $\mathbf{x}_r = M_r \mathbf{x}_b$ and $\mathbf{x}_t = M M_r \mathbf{x}_b = M_t \mathbf{x}_b$, so that $M = M_t M_r^{-1}$. Thus, from one image, we generate a pair of images $X_r$ and $X_t$ for training the transform encoder $E$ (which is equivalent to the auxiliary network $Q$ in InfoGAN). To map the transform from image space to latent space, we encode both $X_r$ and $X_t$ to latent vectors $\hat{c}_r$ and $\hat{c}_t$ using a learned encoder $E$. The estimated affine matrices $\hat{M}_r$ and $\hat{M}_t$ are then constructed from $\hat{c}_r$ and $\hat{c}_t$. The estimated affine matrix $\hat{M}$ is eventually obtained by $\hat{M} = \hat{M}_t \hat{M}_r^{-1}$ (see Fig. 4). The base image $X_b$ does not refer to any particular image but rather to a canonical basis of the images from the training dataset (see Fig. 5). It could be the average manifold of all images within the same category. For instance, the digits "0," "1," . . ., "9" in MNIST are different categories. If there are $n$ images of digit "1" with $\alpha_i$ degrees of rotation in the dataset, $X_b$ could be an image of digit "1" with $\sum_{i=1}^{n} (\alpha_i / n)$ degrees of rotation.
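The identity behind this decomposition can be checked numerically. The sketch below uses hand-picked matrices as stand-ins: `M_r` plays the role of the unknown pose and `M` the predefined transform. The helper names `matmul3` and `affine_inverse` are hypothetical, and the learned encoder $E$ is not modeled; only the relation $M = M_t M_r^{-1}$ is illustrated.

```python
def matmul3(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def affine_inverse(M):
    """Invert a 3x3 affine matrix whose last row is [0, 0, 1]."""
    a, b, tx = M[0]
    c, d, ty = M[1]
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("affine matrix is singular")
    # Inverse of the 2x2 linear part.
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    # The inverse translation is minus the inverse linear part applied to (tx, ty).
    return [
        [ia, ib, -(ia * tx + ib * ty)],
        [ic, id_, -(ic * tx + id_ * ty)],
        [0.0, 0.0, 1.0],
    ]

# Stand-in pose of the real image relative to the canonical base (unknown in practice).
M_r = [[0.9, -0.2, 0.1], [0.2, 0.9, -0.3], [0.0, 0.0, 1.0]]
# Predefined transform applied to the real image.
M = [[1.1, 0.0, 0.2], [0.0, 0.8, 0.0], [0.0, 0.0, 1.0]]

M_t = matmul3(M, M_r)                      # pose of the transformed image
M_rec = matmul3(M_t, affine_inverse(M_r))  # recovers M from the two poses
print(M_rec)
```

In EAD-GAN the two poses are not known; they are replaced by the encoder estimates $\hat{M}_r$ and $\hat{M}_t$, and the same product yields $\hat{M}$.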

C. LSE of the Affine Parameter
Although we can minimize the difference between the ground-truth affine transform matrix $M$ and its prediction $\hat{M}$ during training, this does not promote a one-to-one mapping between individual affine transform parameters and the latent representation $c$. Thus, we further decompose the predicted affine matrix $\hat{M}$ into the affine parameters $\hat{\theta}$, $\hat{p}$, $\hat{q}$, $\hat{x}$, and $\hat{y}$.
Equation (4) leads to a system of six nonlinear equations in five unknowns. Hence, there is no closed-form solution since the system is nonlinear and overdetermined.
To resolve this problem, we propose to infer the affine parameters from the affine matrix $\hat{M}$ by least-squares estimation (LSE). To obtain estimates of the affine parameters, we minimize the sum of squared errors between the elements of $\hat{M}$ and their parametric form in (4). The resulting LSEs are given in (6) (see more detail in Appendix B in the Supplementary Material). To compare with the ground-truth latent vector $c$, the estimated affine parameters $\hat{\theta}$, $\hat{p}$, $\hat{q}$, $\hat{x}$, and $\hat{y}$ are converted to a latent vector $\hat{c}$.
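To make the least-squares step concrete, the sketch below recovers $(\theta, p, q, x, y)$ from a matrix under an assumed parametrization, $A_{11} = p\cos\theta$, $A_{12} = -q\sin\theta$, $A_{21} = p\sin\theta$, $A_{22} = q\cos\theta$, $A_{13} = x$, $A_{23} = y$. This parametrization and the resulting formulas are illustrative assumptions, not the article's (4) and (6). For this particular form, setting the gradient of the squared error to zero gives $p$ and $q$ as projections for a fixed $\theta$, and $\theta$ from a single `atan2`. The function names are hypothetical.

```python
import math

def estimate_affine_params(A):
    """Least-squares (theta, p, q, x, y) for the assumed parametrization.

    Assumes |theta| <= pi/2 and positive zooms; (theta + pi, -p, -q) would
    describe the same matrix and is not distinguished here.
    """
    a11, a12, x = A[0]
    a21, a22, y = A[1]
    # Maximizing p^2 + q^2 over theta reduces to one atan2 on the double angle.
    theta = 0.5 * math.atan2(2.0 * (a11 * a21 - a12 * a22),
                             a11 ** 2 + a22 ** 2 - a21 ** 2 - a12 ** 2)
    c, s = math.cos(theta), math.sin(theta)
    p = a11 * c + a21 * s   # optimal p for this theta
    q = a22 * c - a12 * s   # optimal q for this theta
    return theta, p, q, x, y  # translations appear linearly, so they are read off

def residual(A, theta, p, q, x, y):
    """Sum of squared errors over the six parametrized matrix elements."""
    c, s = math.cos(theta), math.sin(theta)
    model = [[p * c, -q * s, x], [p * s, q * c, y]]
    return sum((A[i][j] - model[i][j]) ** 2 for i in range(2) for j in range(3))

true = (0.3, 1.2, 0.8, 0.1, -0.2)
c0, s0 = math.cos(true[0]), math.sin(true[0])
A = [[true[1] * c0, -true[2] * s0, true[3]],
     [true[1] * s0,  true[2] * c0, true[4]],
     [0.0, 0.0, 1.0]]
est = estimate_affine_params(A)
print(est)
```

When the matrix is not exactly of the assumed form, for instance because it comes from a noisy prediction, the same formulas return the parameters with the smallest squared error rather than an exact solution.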

D. Framework of the Proposed EAD-GAN
The main framework of EAD-GAN is shown in Fig. 6, where the affine block is shown in Fig. 4. Algorithm 1 describes the procedure to compute the affine regularization loss $L_{\mathrm{affine}} = \min \|c - \hat{c}\|_2^2$, where $c$ is the sampled latent vector and $\hat{c}$ is the estimated latent vector. The overall loss function of the proposed EAD-GAN, given in (8), combines the GAN loss $L_{\mathrm{adv}}(D, G)$, the mutual information term $I(c_I; X_f)$, and the affine regularization loss $L_{\mathrm{affine}}$, with regularization weights $\alpha$ and $\beta$.
Fig. 6. Main framework of EAD-GAN. $G$ stands for generator, $D$ stands for discriminator, and $E$ stands for encoder. $X_f$ is the generated image, $X_r$ is the image sampled from the training dataset, and $X_t$ is the affine transformed image from $X_r$. $z$ is the latent noise sampled from the normal distribution. $c_I = (c, c')$ is the sampled semantic latent vector. $c$ is a subset of $c_I$ representing the affine transform. $\hat{c}_t$ and $\hat{c}_r$ are the affine parameter predictions of $X_t$ and $X_r$ from the encoder. $\hat{c}$ is the prediction of $c$ from the network. $\hat{c}_{If}$ is the prediction of $X_f$ from the encoder. $I(c_I; X_f)$ is the mutual information loss. $L_{\mathrm{adv}}(D, G)$ is the GAN loss. $L_{\mathrm{affine}}$ is the affine regularization loss. Fig. 4 shows more details about the affine block. S1 stands for Subblock 1. S2 stands for Subblock 2.
The loss function of EAD-GAN only adds one more loss term $L_{\mathrm{affine}}$ to the loss of InfoGAN, which is easy to implement and computationally efficient. To compute the affine regularizer, the encoder of InfoGAN is reutilized. Hence, EAD-GAN has the same trainable parameters as InfoGAN. Unlike EAD-GAN, InfoGAN-CR uses an additional encoder to compute the contrastive loss. IB-GAN uses an additional encoder to compute the mutual information upper bound. The proposed affine regularization achieves the following targets: 1) clear physical meaning is assigned to each component of the latent vector $c$ and 2) each latent dimension in $c$ is deterministically assigned by constructing the affine matrix from $c$ and decomposing the affine matrix with the LSE to obtain the estimated $\hat{c}$. Thus, each latent dimension is motivated to be assigned to a specific affine transform. Besides, by constructing the affine matrix with different combinations of the latent vector, we can flexibly select the desired affine transform. For example, we can construct the horizontal and vertical zooms $(p, q)$ from a single $c_1$ for a more compact representation or construct $p$ from $c_1$ and $q$ from $c_2$ for a more expressive representation.
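The compact and expressive options, together with the squared-error form of $L_{\mathrm{affine}}$, can be sketched as follows. The linear mapping from a latent value in $[-1, 1]$ to a zoom factor, the range constant `MAX_ZOOM_DEV`, and all function names are hypothetical choices made for illustration; the article does not specify them in this section.

```python
# Hypothetical zoom range: a latent value in [-1, 1] maps to a zoom in [0.8, 1.2].
MAX_ZOOM_DEV = 0.2

def zoom_from_latent(c):
    """Map one latent value in [-1, 1] linearly to a zoom factor."""
    return 1.0 + MAX_ZOOM_DEV * c

def zooms_compact(c1):
    """Compact option: a single latent dimension drives both zooms (p = q)."""
    z = zoom_from_latent(c1)
    return z, z

def zooms_expressive(c1, c2):
    """Expressive option: separate latent dimensions drive p and q."""
    return zoom_from_latent(c1), zoom_from_latent(c2)

def affine_loss(c, c_hat):
    """Squared L2 distance between the sampled and the estimated latent vectors."""
    return sum((a - b) ** 2 for a, b in zip(c, c_hat))

p, q = zooms_compact(0.5)              # both zooms follow c1
p2, q2 = zooms_expressive(0.5, -1.0)   # p follows c1, q follows c2
loss = affine_loss((0.5, -1.0), (0.4, -0.7))
print(p, q, p2, q2, loss)
```

The compact option spends one latent dimension on zoom, while the expressive option spends two; in both cases the loss compares the sampled $c$ with the $\hat{c}$ recovered through the affine block.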
Compared to InfoGAN, three new components are integrated into the network.
1) The random semantic latent vector $c$ of EAD-GAN is used to construct the affine transform $M$.
2) The affine-transform-augmented image $X_t$ is introduced. While in InfoGAN $X_r$ is the positive sample fed to the discriminator, in EAD-GAN, we use $X_t$ as the positive sample fed to the discriminator, which guarantees that the affine transform is observed by the network.
3) The affine regularization loss $L_{\mathrm{affine}}$ is added by comparing the ground-truth latent vector $c$ and its prediction $\hat{c}$ from the network. $L_{\mathrm{affine}}$ builds up the correspondence between the representation learned by InfoGAN and the affine transform parameters.

V. NUMERICAL RESULTS
The goal of the experiments in this section is to investigate, both qualitatively and quantitatively, the disentangled representations obtained by EAD-GAN. The datasets evaluated in this section are MNIST, CelebA [35], dSprites [36], and colored dSprites [9]. MNIST contains 60 000 training and 10 000 testing grayscale handwritten digits. CelebA is a more challenging dataset that involves 200 000 RGB celebrity images with large pose variations and background clutter. dSprites is a well-known dataset designed for evaluating the performance of disentangled representations, which contains 737 280 grayscale images with different shapes, scales, orientations, and positions. Colored dSprites adds random RGB color to the object in the dSprites images, where each channel of the object is multiplied by a random scale drawn uniformly between 0.5 and 1. A major difference between dSprites/colored dSprites and the other aforementioned datasets is that dSprites/colored dSprites contain the ground-truth value for all the variations, making it possible to calculate the disentanglement score. Some sample images generated by the proposed EAD-GAN trained on the CelebA [35], MNIST, and dSprites [36] datasets are shown in Figs. 7-12, in Figs. 13-20, and in Figs. 21-22, respectively. For quantitative results, the disentanglement score for EAD-GAN is presented and compared to benchmarks on the dSprites and colored dSprites datasets (see Tables I and II), while the disentanglement scores for the MNIST and CelebA datasets cannot be presented due to the lack of ground truth of the transforms in these datasets. As an alternative, we compare the correspondence between the predefined transform value and the latent vector value predicted by EAD-GAN in Appendix F in the Supplementary Material. We also provide manually transformed images as the ground truth to compare with the latent traversal results (see Figs. …). For all the experiments, we use the Adam optimizer [37] with a learning rate of 0.0002 for the discriminator and 0.0001 for the generator and encoder. The batch size is 128 for MNIST, 128 for dSprites, and 16 for CelebA. The regularization weights $\alpha$ and $\beta$ in (8) are set to 1 by default. Our code is available at https://github.com/letao1991/EAD-GAN.

A. Qualitative Results
As mentioned before, deterministic assignment refers to the property that each attribute corresponds to a specific latent dimension. In the CelebA dataset, typical attributes are azimuth, sunglasses, emotion, and so on. Existing methods [8], [14], [15] have successfully disentangled those attributes (see Appendix G in the Supplementary Material). However, other attributes, such as the roll, width, and length of the face and the relative position of the face in the frame, are rarely tackled. Due to the deterministic assignment property, EAD-GAN can explicitly learn those attributes (see Figs. 7-12). We notice that there are some negligible differences between the ground-truth images and the images generated by EAD-GAN in Figs. 9, 10, and 12. In Figs. 9 and 12, the ground-truth images have artifacts due to the interpolation effect, while the images generated by EAD-GAN do not have such imperfections. In Fig. 10, for the images generated by EAD-GAN, the human faces at the sides tend to gaze at the center of the image frame, while the ground-truth images always gaze at the front. This is because a GAN tends to generate "realistic" images that are close to the training data distribution. For the human face dataset, most of the human faces at the sides in the training data gaze at the center of the frame (this is also observed in StyleGAN [38]). Overall, EAD-GAN generates more "natural" images compared to manually transformed images.
Figs. 13-19 show the disentangled representation generated by EAD-GAN with an entire affine transform, which includes rotation, horizontal and vertical zooms, skews, and translations. To the best of our knowledge, EAD-GAN is the first algorithm that can disentangle an entire affine transform in an unsupervised manner (see Appendix A in the Supplementary Material for the construction of the entire affine matrix). In Figs. 14 and 16, we notice that the images generated by EAD-GAN have a larger transform compared to the ground-truth images, and this is because the transform range of EAD-GAN is the sum of the predefined transform range and the variation of the data distribution. Since the horizontal zoom and skew are the dominant attributes in the MNIST dataset (also observed in InfoGAN), the overall transform range is larger than the predefined transform range.
Figure caption: Ground-truth (row 1) horizontal translation images and latent traversal (row 2) with latent vector $c_6$ on the MNIST dataset.
To disentangle object style on dSprites, we choose to model the latent space by a 4-D continuous latent vector sampled from the uniform distribution on [−1, 1], with $c_1$: rotation, $c_2$: zoom, and $c_3$ and $c_4$: horizontal and vertical translations. We also use a 3-D categorical latent vector $c_{\mathrm{cat}}$ (three classes) sampled from a uniform categorical distribution [8] to model the shape attribute. Since rotation and zoom in dSprites are object centered rather than image-frame centered, and the objects are located at random positions, we break the training into two steps. We first train an EAD-GAN network that only learns horizontal and vertical translations and then train another EAD-GAN network that learns all the transforms. A detailed process is described in Appendix C in the Supplementary Material. To the best of our knowledge, EAD-GAN is the first algorithm that can disentangle the shape attribute by means of a categorical latent variable in the dSprites dataset (see Fig. 21), in contrast to existing methods [2], [8], [10], [14]-[18], [26]-[28].
Besides the affine transform, we show in Appendix E in the Supplementary Material that the RGB color transform can also be explicitly modeled with a methodology similar to the one we propose for the affine transform. To disentangle the object style on colored dSprites, we use a 3-D categorical latent vector $c_{\mathrm{cat}}$ (three classes) and a 7-D continuous latent vector, with $c_1$: rotation, $c_2$: zoom, $c_3$ and $c_4$: horizontal and vertical translations, and $c_5$-$c_7$: red, green, and blue color transforms. Similar to dSprites, we also break the training into two steps (see Appendix C in the Supplementary Material), where we first train an EAD-GAN network that only learns the horizontal and vertical translations and the RGB color transforms, and then train another EAD-GAN network that learns all the transforms. The disentanglement of the color transform for the colored dSprites dataset is shown in Fig. 22.

B. Quantitative Results
Tables I and II show that the proposed EAD-GAN outperforms the state-of-the-art methods for all disentanglement metrics on the dSprites and colored dSprites datasets. Both BetaVAE [7] and FactorVAE [14] measure the correlation between the change of the ground-truth attribute and the change of the latent vector predicted from the encoder (we omit "predicted from the encoder" for brevity in the following description). DCI [11] measures the deviation between the latent vector and the ground-truth attribute. SAP [24] measures the average difference of the prediction error of the two most predictive latent dimensions for each attribute. Modularity [39] measures whether each latent dimension conveys information about at most one attribute. MIG [15] measures the mutual information between the latent vector and the ground-truth attribute. The clear physical meaning assigned to the latent vector links the learned representations and the ground-truth attributes. The one-to-one mapping between individual transform parameters and latent dimensions promotes the independence between latent dimensions and avoids permutation between latent dimensions. The results in Table I suggest that the disentangled representations learned by the proposed EAD-GAN are better aligned with the definition of disentanglement on the dSprites dataset. Both EAD-GAN and InfoGAN-CR [16], which achieves the state-of-the-art disentanglement score, utilize the contrastive learning loss. However, InfoGAN-CR does not explicitly model the affine transforms as the data generative factors.
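As an illustration of how such a score behaves, the sketch below computes a toy version of MIG on discrete codes: for each ground-truth factor, the gap between the two largest mutual information values across latent dimensions, normalized by the factor entropy, then averaged over factors. This is a simplified plug-in estimate on hand-made data, not the evaluation protocol behind Tables I and II; the function names and the toy factors are assumptions.

```python
import math
from collections import Counter

def entropy(xs):
    """Empirical entropy (in nats) of a sequence of discrete values."""
    n = len(xs)
    return -sum((k / n) * math.log(k / n) for k in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for paired discrete samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def mig(latents, factors):
    """Toy mutual information gap; needs at least two latent dimensions."""
    gaps = []
    for v in factors:
        mi = sorted((mutual_information(z, v) for z in latents), reverse=True)
        gaps.append((mi[0] - mi[1]) / entropy(v))
    return sum(gaps) / len(gaps)

# Two independent toy factors observed on a full 3 x 2 grid.
v1 = [0, 0, 1, 1, 2, 2]
v2 = [0, 1, 0, 1, 0, 1]

# Disentangled codes: each latent dimension copies exactly one factor.
score_disentangled = mig([v1, v2], [v1, v2])

# Entangled codes: both latent dimensions encode the pair (v1, v2).
pair = [a * 2 + b for a, b in zip(v1, v2)]
score_entangled = mig([pair, pair], [v1, v2])

print(score_disentangled, score_entangled)
```

A one-to-one assignment between latent dimensions and factors drives the gap toward 1, while codes that mix factors drive it toward 0, which is the behavior the one-to-one mapping discussed in the text is meant to encourage.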

VI. DISCUSSION
In the literature, several methods have been proposed to learn semantic attributes from data. However, the learned representations often do not have a clear physical meaning [2], [6]-[9], [14]-[17]. Moreover, the learned representations of existing methods are sometimes not one-to-one mapped to the interpretable attributes, which makes the learned representations less explainable and inefficient for downstream tasks. Besides, in existing methods, each latent dimension may not learn a fixed attribute in different trials or with different random seeds. To mitigate these problems, the proposed EAD-GAN introduces the affine transform to facilitate the training process, where the physical meaning of the affine transform is transferred and integrated into the network. The qualitative results in Figs. 7-21 show that the proposed EAD-GAN consistently learns the affine transform across different datasets. By comparing the qualitative results on the dSprites dataset between EAD-GAN in Fig. 21 and other methods in Appendix D in the Supplementary Material, we see that EAD-GAN achieves much better disentanglement for the affine transform. EAD-GAN achieves the highest disentanglement scores for all the metrics compared to the benchmarks in Tables I and II, which suggests that EAD-GAN is better aligned with the definition of disentanglement. This is consistent with the purpose of modularity and compactness, where each attribute should be assigned to a unique component of the latent vector and vice versa. Several methods have been proposed to promote the independence between each latent component $c_i$ for better disentanglement [7], [14]-[16]. From the perspective of independence, affine transform parameters are intrinsically independent of each other, which may also explain why EAD-GAN achieves the highest disentanglement scores. The resulting affine parameter estimates are accurate, showing that EAD-GAN does not simply memorize the affine transform, as it can extrapolate beyond the parameter ranges explored during training. However, EAD-GAN still has limitations. The component $c'$ in $c_I = (c, c')$, which is not covered by the affine transform and encodes other information, may lack physical meaning and may not be deterministically assigned.

VII. CONCLUSION
This article proposes an EAD-GAN that explicitly learns disentangled representations by incorporating an affine transform encoder in the generative model. The encoder learns to represent the affine transform of images by an unsupervised learning procedure. In contrast to the earlier approaches to disentanglement where inductive biases are not explicit, the disentangled representations obtained by EAD-GAN are explicit; as a result, they are deterministically assigned and have clear physical meaning. As the proposed affine regularizer is model-based, it can be extended to include other forms of expert knowledge as inductive bias. Besides the affine transform, we show how to explicitly disentangle the color transform on the colored dSprites dataset as an illustration. As a possible extension of the 2-D affine transform, the 3-D transform can be learned by constructing and decomposing the 3-D affine transform matrix. The proposed explicit regularizer provides a task-specific pathway to disentanglement compared to the existing general implicit regularizers.

Fig. 1. Illustration of representations generated by InfoGAN [8] with unclear physical meaning. Given different values, $-1$, 0, and 1, of the latent vector $c = (c_1, c_2, c_3)$, it is possible that the generated transforms are highly entangled, and thus, they have no clear physical meaning. For example, $c_1$ may represent both rotation and vertical zoom.
Given a latent vector $c_I = (c, c')$, we separate $c_I$ into $c$ and another latent vector $c'$. The latent vector $c_I$ in InfoGAN encodes various attributes; the component $c$ of $c_I$ encodes the affine transform. Given a latent vector $c = (c_1, c_2, c_3, c_4, c_5)$ randomly sampled from the uniform distribution $\mathrm{Unif}[-1, 1]$, we first normalize it to the given range of affine parameters. As an illustration, we set the affine transform range as rotation $\theta \in [-\varepsilon_\theta, \varepsilon_\theta]$, horizontal and vertical zooms $p, q \in [1 - \varepsilon_{pq}, 1 + \varepsilon_{pq}]$, and horizontal and vertical translations $x, y \in [-\varepsilon_{xy}, \varepsilon_{xy}]$. The parameter $\varepsilon$ is the multiplier that adjusts the latent vector to a proper affine parameter range. For example, if we want the rotation range to be $[-\pi/10, \pi/10]$, we should set $\varepsilon_\theta = \pi/10$. The affine parameters are computed from the latent vector $c$ as follows:
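A linear scaling consistent with the ranges stated above is sketched here. The assignment of $c_1$ to rotation, $c_2$ and $c_3$ to the zooms, and $c_4$ and $c_5$ to the translations is an assumption that follows the order in which the parameters are listed; the exact form used in the article may differ.

```latex
\begin{aligned}
\theta &= \varepsilon_\theta\, c_1, \\
p &= 1 + \varepsilon_{pq}\, c_2, \qquad q = 1 + \varepsilon_{pq}\, c_3, \\
x &= \varepsilon_{xy}\, c_4, \qquad\quad\;\, y = \varepsilon_{xy}\, c_5 .
\end{aligned}
```

Under this mapping, $c_i = 0$ gives the identity transform ($\theta = 0$, $p = q = 1$, $x = y = 0$), and with $\varepsilon_\theta = \pi/10$ the boundary value $c_1 = 1$ gives $\theta = \pi/10$.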

Fig. 3. Decomposition of the affine transform. The solid line refers to the affine transform from a real image $X_r$ to a transformed image $X_t$. The dashed lines refer to the affine transform from the canonical base image $X_b$ to the real image $X_r$ and to the transformed image $X_t$.
$x_r = M_r x_b$ and $x_t = M_t x_b = M M_r x_b$ (see Fig. 3). The purpose of introducing the canonical base image $X_b$ is to construct the equation $M_t x_b = M M_r x_b$. Once the equation is established, $X_b$ can be removed from both sides of the equation, and we obtain the relative affine transform equation $M_t = M M_r$. The relative affine transform equation is further used to calculate the affine regularizer.
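The relative affine transform equation can be checked numerically with a short sketch. The helper `affine_matrix`, the homogeneous 3x3 representation, the zoom-rotate-translate composition order, and all numeric values are assumptions made for illustration; they are not taken from the article.

```python
import numpy as np

def affine_matrix(theta, p, q, x, y):
    """3x3 homogeneous affine matrix: zoom, then rotate, then translate.

    The composition order is an assumption chosen for this illustration.
    """
    c, s = np.cos(theta), np.sin(theta)
    rotate = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    zoom = np.diag([p, q, 1.0])
    shift = np.array([[1.0, 0.0, x], [0.0, 1.0, y], [0.0, 0.0, 1.0]])
    return shift @ rotate @ zoom

# Canonical base -> real image (unknown in practice, arbitrary values here).
M_r = affine_matrix(0.2, 1.1, 0.9, 0.05, -0.03)
# Real image -> transformed image (the sampled relative transform).
M = affine_matrix(-0.1, 1.0, 1.2, 0.0, 0.1)
# Canonical base -> transformed image.
M_t = M @ M_r

# A point of the canonical base image in homogeneous coordinates.
x_b = np.array([0.3, -0.4, 1.0])
x_r = M_r @ x_b
x_t = M_t @ x_b

# x_t = M x_r holds for this point, and M is recovered without using x_b.
M_recovered = M_t @ np.linalg.inv(M_r)
print(np.allclose(x_t, M @ x_r), np.allclose(M_recovered, M))
```

The point $x_b$ only serves to set up the equation; the recovered $M$ depends solely on $M_t$ and $M_r$.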

Fig. 4. Pipeline of the affine block. Inputs: latent vector $c$ randomly sampled from $\mathrm{Unif}(-1, 1)$ and image $X_r$ sampled from training data. Outputs: transformed image $X_t$ and predicted latent vector $\hat{c}$. The affine regularizer loss is $L_{\text{affine}} = \min \|c - \hat{c}\|_2^2$. E stands for encoder. LSE stands for least-squares estimation. Affine transform refers to the operation $x_t = M x_r$.
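As a rough illustration of the two ingredients named in this caption, the sketch below fits an affine matrix by least squares and evaluates the squared-error regularizer. Fitting from 2-D point correspondences is an assumption made for the example (the quantities the article's LSE step operates on are not specified here), and `estimate_affine` and `affine_regularizer` are hypothetical names.

```python
import numpy as np

def estimate_affine(points_r, points_t):
    """Least-squares 2x3 affine matrix A with points_t ~= A @ [x, y, 1]^T.

    points_r, points_t: arrays of shape (N, 2) holding matched coordinates.
    """
    ones = np.ones((points_r.shape[0], 1))
    design = np.hstack([points_r, ones])                 # (N, 3)
    solution, *_ = np.linalg.lstsq(design, points_t, rcond=None)
    return solution.T                                    # (2, 3)

def affine_regularizer(c, c_hat):
    """Squared L2 distance between sampled and predicted latent vectors."""
    return float(np.sum((np.asarray(c) - np.asarray(c_hat)) ** 2))

# Synthetic, noise-free correspondences generated from a known matrix.
rng = np.random.default_rng(0)
A_true = np.array([[0.9, -0.2, 0.10],
                   [0.3, 1.1, -0.05]])
points_r = rng.uniform(-1.0, 1.0, size=(50, 2))
points_t = points_r @ A_true[:, :2].T + A_true[:, 2]

A_est = estimate_affine(points_r, points_t)
print(np.allclose(A_est, A_true))

# Regularizer for a sampled c and a slightly off prediction c_hat.
print(affine_regularizer([0.5, -0.2], [0.4, -0.1]))
```

With noisy correspondences the same call returns the matrix that minimizes the summed squared residuals, which is why a least-squares fit is a natural choice when the system is overdetermined.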

Fig. 7. Ground-truth rotated images and latent traversal with latent vector $c_1$. Row 1: ground-truth transformed images. The image in the middle is generated by setting $c$ to 0. The images on the sides are obtained by manually transforming (e.g., rotating) the middle image with the boundary values of the predefined affine transform range (e.g., $[-\pi/9, \pi/9]$). Row 2: latent traversal images. Given different values, $-1$, 0, and 1, of a component $c_i$ (e.g., $c_1$: rotation) while fixing all other values of the latent vector $c = (c_1, c_2, c_3, c_4, c_5)$, different versions of images are explicitly generated by the proposed EAD-GAN trained on the CelebA dataset.

Fig. 21. Given different values in $[-1, 1]$ of a component $c_i$ while fixing all other values of the latent vector $c = (c_1, c_2, c_3, c_4)$, different versions of images are explicitly generated by the proposed EAD-GAN trained on the dSprites dataset. $c_1$: rotation, $c_2$: horizontal and vertical zoom, $c_3$: horizontal translation, and $c_4$: vertical translation. The variation by rows is the change of shape controlled by giving different values, 0, 1, and 2, of the categorical latent vector $c_{\text{cat}}$. Rows 1-3: ellipse, heart, and square, respectively.

Fig. 22. Given different values in $[-1, 1]$ of a component $c_i$ while fixing all other values of the latent vector $c = (\ldots, c_5, c_6, c_7)$, different versions of images are explicitly generated by the proposed EAD-GAN trained on the colored dSprites dataset. $c_5$: red, $c_6$: green, and $c_7$: blue. The variation by rows is the change of shape controlled by giving different values, 0, 1, and 2, of the categorical latent vector $c_{\text{cat}}$. Rows 1-3: square, heart, and ellipse, respectively.

TABLE I
DISENTANGLEMENT SCORES ON THE DSPRITES DATASET. FOR VAE APPROACHES, THE REFERENCE VALUES FROM β-VAE TO ANNEALED-VAE APPROACHES ARE THE BEST SCORES OF THE VIOLIN PLOTS FROM [9, TABLE 13], THE REFERENCE VALUE FOR CONTROL-VAE IS FROM [26, TABLE 2], AND THE REFERENCE VALUES FOR GUIDED-VAE AND GUIDED-β-TCVAE ARE FROM [27, TABLE 2]. FOR GAN APPROACHES, THE REFERENCE VALUES FOR GAN, INFOGAN, AND IB-GAN ARE FROM [17, TABLE 1], THE REFERENCE VALUE FOR GAN-VARIATION IS FROM [29, TABLE 2], THE REFERENCE VALUE FOR OOGAN IS FROM [28, TABLE 1], AND THE REFERENCE VALUES FOR INFOGAN-CR ARE FROM [16, TABLE 1]. A PERFECT DISENTANGLEMENT CORRESPONDS TO A SCORE OF 1.0. THE PROPOSED EAD-GAN OUTPERFORMS STATE-OF-THE-ART METHODS FOR ALL DISENTANGLEMENT METRICS ON THE DSPRITES DATASET. THE RESULTS OF THE PROPOSED EAD-GAN ARE THE AVERAGE OF TEN RUNS WITH RANDOM INITIALIZATION.

TABLE II
DISENTANGLEMENT SCORES ON THE COLORED DSPRITES DATASET. FOR VAE APPROACHES, THE REFERENCE VALUES ARE THE BEST SCORES OF THE VIOLIN PLOTS FROM [9, TABLE 13]. FOR GAN APPROACHES, THE REFERENCE VALUES FOR GAN, INFOGAN, AND IB-GAN ARE FROM [17, TABLE 1]. A PERFECT DISENTANGLEMENT CORRESPONDS TO A SCORE OF 1.0. THE PROPOSED EAD-GAN OUTPERFORMS STATE-OF-THE-ART METHODS FOR ALL DISENTANGLEMENT METRICS ON THE COLORED DSPRITES DATASET. THE RESULTS OF THE PROPOSED EAD-GAN ARE THE AVERAGE OF TEN RUNS WITH RANDOM INITIALIZATION.