Dual Encoder-Decoder based Generative Adversarial Networks for Disentangled Facial Representation Learning

To learn disentangled representations of facial images, we present a Dual Encoder-Decoder based Generative Adversarial Network (DED-GAN). In the proposed method, both the generator and discriminator are designed with deep encoder-decoder architectures as their backbones. To be more specific, the encoder-decoder structured generator is used to learn a pose disentangled face representation, and the encoder-decoder structured discriminator is tasked to perform real/fake classification, face reconstruction, identity classification and face pose estimation. We further improve the proposed network architecture by minimising an additional pixel-wise loss defined by the Wasserstein distance at the output of the discriminator so that the adversarial framework can be better trained. Additionally, we consider face pose variation to be continuous, rather than discrete as in the existing literature, to inject richer pose information into our model. The pose estimation task is formulated as a regression problem, which helps to disentangle identity information from pose variations. The proposed network is evaluated on the tasks of pose-invariant face recognition (PIFR) and face synthesis across poses. An extensive quantitative and qualitative evaluation carried out on several controlled and in-the-wild benchmarking datasets demonstrates the superiority of the proposed DED-GAN method over the state-of-the-art approaches.


I. INTRODUCTION
Benefiting from the rapid development of deep learning and the easy access to a large number of annotated face images, face recognition [1]-[4] has advanced significantly in recent years. Although impressive performance has been achieved on several benchmarking databases, pose variation is still one of the crucial bottlenecks for many practical applications [5], [6]. Facial appearance variations caused by poses are even larger than those caused by different identities [7]. To mitigate this difficulty, many approaches have been proposed for pose-invariant face recognition (PIFR). Existing PIFR methods can be divided into three categories. One approach is to remap non-frontal faces to frontal ones, and then extract facial features from frontalised faces for better face representation [8]-[12]. The second one is to learn pose-invariant representations directly from non-frontal faces [13]-[16]. The last category aims to learn disentangled facial representations so that identity-preserving features can be disentangled from pose variation [17], [18]. Our proposed method belongs to the last category.
The associate editor coordinating the review of this manuscript and approving it for publication was Peter Peer.
The consensus regarding desirable properties of good representations of data has recently been established in [19]-[22]. Disentanglement, one of the properties of a good representation, is a kind of distributed feature representation in which disjoint dimensions of a latent code reflect different high-level generative factors of data. Disentanglement is also often described as statistical independence; each independent factor is expected to be semantically well aligned with human intuition regarding the data generative factors. Specifically, a disentangled representation can separate explanatory factors that interact non-linearly in real-world data, such as object shapes, material properties, light sources and so on. A representation distilling each important factor of data into a single independent direction is hard to learn, but it is highly valuable for many downstream tasks like PIFR and face synthesis across views [23]-[26]. Deep generative models facilitate learning disentangled representations. They learn the probability distribution of data and generate new samples according to control codes in a latent space. By learning the appropriate parameters, deep generative models can generate new data mimicking the distribution of the target data. Once a disentangled representation is learned, the disjoint dimensions of the hidden code model the data generative factors separately. These underlying factors have the potential to explain the major variations in the data. When only one factor varies and all others are fixed, the generated sequence of samples shows a change that is interpretable to human beings. For example, when we generate a hand-written digit, a component of the code may be associated with the stroke width. When its value is changed, only the stroke width of the generated digit changes, while other factors of the image (e.g. class, shape, color) stay the same.
In recent years, methods based on the Variational Auto-Encoder (VAE) [27] and Generative Adversarial Networks (GAN) [28], two notable branches of deep generative models, have been used successfully for disentangled representation learning. For instance, β-VAE [29] learns disentangled latent codes by encouraging the latent distribution to be close to the standard normal distribution, in which each random variable is independent. DC-IGN [30] is another VAE-based generative model for disentangled representation learning. However, DC-IGN may not apply to unstructured in-the-wild images, since it achieves disentanglement by providing batch training samples with one attribute being fixed. InfoGAN [31] also uses statistical independence, motivated by the principle of mutual information maximisation. The Disentangled Representation learning GAN (DR-GAN) [18] learns generative and discriminative facial representations, which disentangle the face identity from pose so that it can better handle cross-pose recognition. DR-GAN is also similar to the prior work [10] in which joint representation learning and face rotation are explored with a multi-task CNN. In summary, most of the existing works disentangle the factors by using statistical independence of a prior distribution.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Although DR-GAN has achieved impressive performance in face synthesis across poses and PIFR, it has some problems: 1) The training of DR-GAN is not stable, and even when training remains stable, mode collapse often occurs, producing degenerate images; 2) The pose variations are categorised into several distinct classes by a one-hot vector. Consequently, although it is a strong prior, the pose information is insufficient for disentangled facial representation learning. To improve the training stability of GAN, the encoder-decoder structured discriminator has been successfully used in EBGAN [32] and BEGAN [33], and it is also used as a backbone network in our method. To achieve stable model training, an equilibrium enforcing method was proposed in BEGAN, in which a hyper-parameter is introduced to balance the generator and discriminator during model training. Different from the classical GANs, BEGAN aims to match the auto-encoder loss distributions rather than the sample distributions. We also introduce an equilibrium enforcing strategy in our method. However, in contrast to BEGAN, our method matches not only the distributions of samples, as in typical GANs, but also the distributions of the reconstruction losses of samples, which is conducive to better representation learning. Accordingly, a pixel-wise reconstruction error is used as another loss function, alongside the identity loss and the pose estimation loss, in our GAN model. DR-GAN codes the pose into several classes with a one-hot vector, incurring information loss in the process. Pose changes continuously, non-linearly but smoothly. For this reason, we represent the pose code by a continuous variable rather than in a discrete form. This also allows the pose to be estimated by regression rather than classification.
This paper addresses the problem of learning a generative model for disentangled facial representation extraction. By combining the advanced techniques of GAN-based representation learning methods, we propose to learn disentangled pose-robust features by modeling the complex non-linear transform between face images with different poses through a dual encoder-decoder structured deep neural network in an adversarial way, namely Dual Encoder-Decoder based Generative Adversarial Networks (DED-GAN). The proposed network is evaluated in terms of the quality of face synthesis of different views on the one hand and pose-invariant face recognition (PIFR) on the other hand. Our contributions are summarised as follows:
• A new GAN architecture with fast and stable convergence is proposed for disentangled facial representation learning.
• Our proposed method can generate a face with arbitrary pose variations.
• The proposed method learns identity-preserving features simultaneously.
• To the best of our knowledge, this is the first attempt to use pose regression for disentangled face representation. The proposed continuous pose variation model provides more detailed information about the pose. It is used explicitly to control the manifold of identity-preserving face synthesis.
• Experiments in PIFR and face synthesis across poses demonstrate the advantage of our method on multiple benchmarking databases.
The rest of the paper is organised as follows: We first overview the existing literature related to the proposed method in Section II. Then we present the proposed DED-GAN in Section III and introduce the implementation details in Section IV. An ablation study and experimental results are reported in Section V. Last, the conclusion is drawn in Section VI.

II. RELATED WORK
A. GENERATIVE ADVERSARIAL NETWORK
Recently, the state of the art in deep generative models, especially VAE [27] and GAN [28], has advanced significantly. As one of the most promising deep neural networks, GAN has attracted widespread attention from the computer vision and machine learning communities. It provides a simple, yet powerful way to estimate a data distribution and generate realistic samples through a zero-sum two-player game [34]. By modeling a real sample distribution, a GAN can encourage the generated samples to move towards the true image manifold, and thus generate photo-realistic images with plausible high-frequency details. However, the classical GAN suffers from computational problems, e.g. the inferior performance caused by unbalanced training, in which the generator is updated without comparable attention being given to updating the discriminator. A collapsed generator will lose the capacity to fit the target data distribution. To address this mode collapse issue, some improved GAN architectures have been proposed. For example, Zhao et al. [32] proposed the energy-based GAN (EBGAN), which considers the generator and discriminator as energy functions. Salimans et al. [35] introduced a bag of tricks for GAN training and achieved great performance on semi-supervised learning. Karras et al. [36] used a strategy of progressively growing the generator and discriminator of a GAN for improved image generation quality, stability and variation. Further, Arjovsky et al. [37] presented Wasserstein GAN (WGAN) using the earth mover's distance. They proved that WGAN is able to avoid the mode collapse problem to a certain extent.
Existing GAN models can handle most of the challenging cases, in which the pose, illumination and expression of faces are unconstrained. For example, Radford et al. [38] designed DC-GAN, which evaluates a set of constraints on the architectural topology of convolutional GANs that make the model stable to train. Huang et al. [39] focused on local patches that have some semantic meaning and proposed TP-GAN. Li et al. [40] focused on the missing parts of the face and proposed two novel adversarial losses as well as a semantic parsing loss to complete the faces. He et al. [41] edited face images with desired attributes while preserving other details using an encoder-decoder structured GAN. Both [42] and [43] applied an extension of GAN to a conditional setting and showed its utility in many tasks, including image in-painting [44], super-resolution [45], style transfer [46], face attribute manipulation [47] and even data augmentation for classification models [48], [49]. The VariGAN model was proposed by Zhao et al. [50] to solve the problem of generating multi-view images from a single viewpoint. Tran et al. [51] put forward DR-GAN, which fuses the pose information and can take one or multiple face images with yaw angles as input to achieve pose invariant facial representation learning. Similarly, Antipov et al. [52] concentrated on improving face synthesis in cross-age scenarios. Considering scene structure and context, Yang et al. [53] presented LR-GAN, which learns the background and foreground of the generated image separately and recursively to produce a complete natural or face image.
These successful GANs provide a strong motivation to learn disentangled facial representation and to develop a method for different view synthesis. However, there are several crucial issues with GANs such as training being unstable and a quantitative evaluation proving difficult. The previous work either focuses on the stability of training, the task of synthesising images, or using the features in the discriminator for image recognition. In contrast, we propose an innovative method for constructing the generator for disentangled representation learning, which is stable. The proposed DED-GAN method is also quantitatively evaluated for pose invariant face recognition.

B. POSE INVARIANT REPRESENTATION LEARNING
In conventional face recognition methods, local descriptors [54]- [57] and metric learning [58], [59] are often used to tackle the effect of pose variation. In contrast, deep learning methods handle pose variation through building pose-specific or pose-agnostic models with specific loss functions [60], [61]. For instance, the DeepFace [62] model uses a deep CNN coupled with 3D face alignment. The inception architecture, utilised in FaceNet [15], is used in DeepID2+ [63] and DeepID3 [64] where multi-task learning and metric learning are performed simultaneously. However, such data-driven methods heavily rely on well-annotated data. Collecting labeled data covering all variations is time-consuming and labor-intensive. Our proposed Dual Encoder-Decoder based GAN (DED-GAN) presents an idea similar to Disentangled Representation learning GAN (DR-GAN) [18], which considers both face rotation and representation learning in a unified network. However, our proposed model differs from DR-GAN in the following aspects: 1) we use a continuous pose code for disentangling face representation in DED-GAN, as it provides more detailed information about the pose as a strong prior for training, and 2) DR-GAN suffers from poor generalisation and from optimisation difficulties, which limit its effectiveness in face synthesis and face recognition. In contrast, our DED-GAN overcomes these issues by disentangling the pose utilizing pose regression and adding face reconstruction as a side task.

III. THE PROPOSED APPROACH
Our Dual Encoder-Decoder based GAN (DED-GAN) model learns two tasks simultaneously: synthesis of faces across poses and learning of a disentangled identity representation for PIFR. The encoder-decoder structured generator is used for face rotation and for disentangling the identity from pose variation. The encoder-decoder structured discriminator is used for facial reconstruction, pose estimation, identity classification and real/fake adversarial learning. The architecture of our DED-GAN is shown in Fig. 1d. We also show different architectures of earlier GANs such as Vanilla GAN, Auxiliary Classifier GAN and DR-GAN for comparison in Fig. 1a, Fig. 1b and Fig. 1c. In contrast to DR-GAN, we add a decoder to the discriminator, which is optimised for a pixel-wise loss defined in terms of the Wasserstein distance, to balance the generator and discriminator. We also code the pose using a continuous variable instead of the discrete variable commonly specified by a one-hot vector. As a result, the task of pose disentanglement in the discriminator can be formulated as one of pose regression instead of classification, which further benefits the learning process.
It should be noted that the Encoder-Decoder structured discriminator has also been successfully used in BEGAN [33], to match the pixel-wise loss distributions of reconstructed real and synthesised samples. Our method also incorporates an Encoder-Decoder as the backbone of the discriminator to achieve a balanced learning behavior as part of weakly adversarial learning. Different from previous GANs, including BEGAN, our method not only sets out to match data distributions but also attempts to match image reconstruction loss distributions. This is achieved by using a typical GAN objective combined with an additional equilibrium term. To provide a detailed description of our approach, we start by introducing the original GAN, followed by our proposed DED-GAN method.

A. GENERATIVE ADVERSARIAL NETWORK
A typical GAN model consists of two networks pitted against one another in a two-player game: a generative model, $G$, is trained to synthesise images resembling the real data distribution, and a discriminative model, $D$, is trained to distinguish the samples synthesised by $G$ from real ones in the training data. The generator generates unlabelled realistic samples from the latent variable model to improve the discriminative ability of the discriminator. To learn the generator's distribution $p_g$ over data $x$, we define a prior on input noise variables $p_z(z)$. The mapping $G(z; \theta_g)$ of $z$ into the data space is achieved by a neural network with parameters $\theta_g$, where $G$ is a differentiable function. A second neural network with parameters $\theta_d$ is defined by $D(x; \theta_d)$, which outputs a single scalar. $D(x)$ represents the probability that $x$ comes from the real data, $p_d$, rather than $p_g$. We train $D$ to maximise the probability of assigning the correct label to both training examples and samples from $G$. We simultaneously train $G$ to minimise $\log(1 - D(G(z)))$. In other words, the generator and discriminator are fighting against each other, which can be formulated as:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_d}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \tag{1}$$
where $z$ denotes a random noise, typically sampled from a Gaussian distribution, $p_z$. $G(z)$ denotes a sample synthesised by the generator and $p_d$ denotes the distribution of real data. It is proved in the original GAN [28] that this minimax game has a global optimum when the distribution $p_g$ of the synthetic samples converges to the distribution $p_d$ of the real samples. At the beginning of training, the samples generated by $G$ are extremely poor and thus they are rejected by $D$ with high confidence. $G$ and $D$ are trained to alternately optimise the following objectives:
$$\max_D V_D(D, G) = \mathbb{E}_{x \sim p_d}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
$$\min_G V_G(D, G) = \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
After several steps of the optimisation process, the generator and discriminator will reach the point at which neither can improve because $p_g = p_d$. The discriminator is then unable to differentiate between the two distributions, i.e. $D(x) = 1/2$.
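As a small numerical illustration of the value function described above, the following sketch estimates V(D, G) from lists of discriminator outputs. The function name `gan_value` and the sample values are hypothetical and only serve to show that an undecided discriminator, D(x) = 1/2, yields the value -log 4 known from the original GAN analysis [28].

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on synthesised samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# At the equilibrium p_g = p_d the discriminator outputs 1/2 everywhere.
equilibrium = gan_value([0.5] * 4, [0.5] * 4)

# A confident discriminator, as early in training (made-up outputs).
early = gan_value([0.9, 0.95], [0.05, 0.1])
```

The discriminator's value is higher when it separates real from synthesised samples confidently, and it falls to -log 4 once the two distributions can no longer be told apart.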

B. DUAL ENCODER-DECODER BASED GAN
Our DED-GAN explicitly disentangles face imaging factors to obtain an interpretable face representation for PIFR and face synthesis across poses. The backbone of DED-GAN consists of an encoder-decoder based generator and encoder-decoder based discriminator, as depicted in Fig. 1d. It learns the representation of a face by using the generator, where the encoded output of the generator is the identity-preserving representation. The representation is one part of the input to the decoder to synthesise various faces of the same subject with different attributes, i.e., by virtually rotating the facial pose code. We not only match the distribution of face images by using classical real/fake adversarial learning, but also the distributions of the reconstruction error of samples reconstructed from the representation by using pixel-wise adversarial learning. As numerous variations manifest in face images such as pose, illumination and expression influence face recognition even more than changes in identity, it is desirable to prevent the generator from generating different facial representations for the same person with different face poses. In this work, we focus on pose variations and disentangle the pose information as an explicit variation. This facilitates learning a truly discriminative face representation.

1) PROBLEM FORMULATION
Our method aims to train a generative adversarial model conditioned on the real face image $x$ and a specified pose code $c$. Each face image $x$ has a label $y = \{y_a, y_d, y_c\}$, where $y_a$, $y_d$ and $y_c$ represent the labels for real/fake, identity and pose, respectively. There are two tasks in our learning method: to learn a disentangled identity representation for PIFR and to synthesise faces across poses with different pose codes $c$.
Different from the discriminator in the original GAN, our discriminator can be seen as a multi-task CNN consisting of four components: $D = [D_a, D_d, D_c, D_r]$, where $D_a \in \mathbb{R}^{1}$ is for classical real/fake adversarial learning, $D_d \in \mathbb{R}^{N_d}$ is for identity classification with $N_d$ as the total number of subjects in the training set, $D_r \in \mathbb{R}^{N_c \times N_w \times N_h}$ is for face reconstruction and $D_c \in \mathbb{R}^{1}$ is for pose regression. For the pose regression task, we first obtain the pose coefficients of all the training images. To obtain the pose of an image, we use the MTCNN method to extract 5 facial landmarks for each face image [65]. Then we transform the face landmarks to the pose code using a statistical shape model [66]. Mathematically, we can express the face shape as a base shape $s_0$ plus a linear combination of $n$ shape eigenvectors $s_i$:
$$s = s_0 + \sum_{i=1}^{n} c_i s_i$$
where $s_0$ is the mean shape, $s_i$ is the $i$th shape eigenvector obtained by applying principal component analysis to all the training shapes and $c_i$ is the corresponding coefficient. In general, the first shape eigenvector controls the pose variations of the model, thus we use $c_1$ as the pose code $c$.
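The linear shape model above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the helper names `synthesise_shape` and `pose_code`, the toy two-landmark shapes, and the assumption that the pose coefficient is recovered by projecting the centred shape onto an orthonormal first eigenvector are all hypothetical. In the paper the shape would hold the five MTCNN landmarks, i.e. a 10-dimensional vector.

```python
def synthesise_shape(mean_shape, eigenvectors, coeffs):
    """Return s = s_0 + sum_i c_i * s_i for flattened landmark vectors."""
    shape = list(mean_shape)
    for c_i, s_i in zip(coeffs, eigenvectors):
        for j, value in enumerate(s_i):
            shape[j] += c_i * value
    return shape

def pose_code(shape, mean_shape, first_eigenvector):
    """Project the centred shape onto the first eigenvector to recover c_1.

    Assumes the eigenvectors are orthonormal, as produced by PCA.
    """
    return sum((s - m) * v
               for s, m, v in zip(shape, mean_shape, first_eigenvector))

# Toy 2-landmark example laid out as (x1, y1, x2, y2); all values are made up.
s0 = [0.0, 0.0, 1.0, 0.0]
s1 = [0.5, 0.5, 0.5, 0.5]      # unit-norm stand-in for the "pose" direction
s2 = [0.5, -0.5, 0.5, -0.5]    # unit-norm and orthogonal to s1
shape = synthesise_shape(s0, [s1, s2], [0.8, -0.2])
c = pose_code(shape, s0, s1)   # recovers the first coefficient
```

Because the eigenvectors are orthonormal, the projection returns exactly the coefficient that was used to build the shape, which is what makes a single scalar usable as a continuous pose label.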
The discriminator aims to classify the face image $x$ as real or fake, to maximise the gap between the reconstruction error of the real image and that of the synthetic image, and to estimate its identity and pose. Given an input image $x$, a random pose code $c$ and a random noise $z$, the generator $G$ generates a synthesised face image $G(x, c, z)$. The discriminator $D$ is trained with four objectives: an adversarial loss $L^{D}_{adv}$, an identity loss $L^{D}_{id}$, a pose loss $L^{D}_{pos}$ and a pixel-wise reconstruction loss $L^{D}_{pixel}$, where $k$ is a trade-off parameter in the pixel-wise loss that balances the distribution of the reconstruction error of real faces and that of synthetic faces. For clarity, we omit all subscripts in the expected value notation, as all random variables are sampled from their respective distributions $(x, y) \sim p_d(x, y)$, $z \sim p_z(z)$, $c \sim p_c(c)$. $D_d$ is used for identity classification. It should be noted that pose regression $D_c$ is used here rather than pose classification. The final objective for training $D$ is the weighted sum of all objectives:
$$L(D) = \lambda_a L^{D}_{adv} + \lambda_d L^{D}_{id} + \lambda_c L^{D}_{pos} + \lambda_r L^{D}_{pixel}$$
where $\lambda_a$, $\lambda_d$, $\lambda_c$ and $\lambda_r$ denote the weights of the four losses. The generator $G$ consists of an encoder $G_{enc}$ and a decoder $G_{dec}$, where $G_{enc}$ aims to learn an identity-preserving representation $f(x) = G_{enc}(x)$ from a face image $x$, $G_{dec}$ is tasked to synthesise a face image $G_{dec}(f(x), c, z)$ with identity $y_d$ and a target pose specified by $c$, and $z \in \mathbb{R}^{N_z}$ is a noise variable modelling other variations besides identity or pose. The pose code $c \in \mathbb{R}^{1}$ is of continuous value. The goal of $G$ is to fool $D$ into classifying $G(x, c, z)$ as the identity of the input $x$ and estimating the target pose, using the corresponding four objectives $L^{G}_{adv}$, $L^{G}_{id}$, $L^{G}_{pos}$ and $L^{G}_{pixel}$.

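Similarly, the final objective for training the generator $G$ is the weighted sum of its objectives:
$$L(G) = \mu_a L^{G}_{adv} + \mu_d L^{G}_{id} + \mu_c L^{G}_{pos} + \mu_r L^{G}_{pixel}$$
where $\mu_a$, $\mu_d$, $\mu_c$ and $\mu_r$ denote the weights of the adversarial, identity, pose and pixel-wise losses, respectively.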
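The weighted combination of the four task losses can be sketched as follows. The helper `weighted_objective` and the per-batch loss values are hypothetical; the weights are the ones reported in the implementation details (1, 1, 0.1 and 10 for the adversarial, identity, pose and pixel-wise terms, identical for the discriminator and the generator).

```python
def weighted_objective(losses, weights):
    """Combine per-task losses into one scalar: sum_k weights[k] * losses[k]."""
    missing = set(weights) - set(losses)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * losses[name] for name in weights)

# Weights reported in the implementation details; the same values are used
# for the discriminator (lambda_*) and the generator (mu_*).
WEIGHTS = {"adv": 1.0, "id": 1.0, "pos": 0.1, "pixel": 10.0}

# Made-up loss values for one mini-batch, for illustration only.
batch_losses = {"adv": 0.7, "id": 2.3, "pos": 0.5, "pixel": 0.04}
total = weighted_objective(batch_losses, WEIGHTS)
```

With these weights a small pixel-wise error is amplified tenfold while the pose term is damped, so the reconstruction task contributes on a scale comparable to the classification terms.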

2) PIXEL-WISE LOSS
While classical GANs try to match data distributions directly with L adv , our method additionally aims to match auto-encoder loss distributions using a pixel-wise loss L pixel based on Wasserstein distance. Firstly, we introduce the auto-encoder loss, and then we compute a lower bound to the Wasserstein distance between the auto-encoder loss distributions of real and generated samples.
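Let $L : \mathbb{R}^{N_x} \to \mathbb{R}^{+}$ denote the loss for training a pixel-wise auto-encoder, defined as:
$$L(x) = |x - D(x)|^{\eta}$$
where $D : \mathbb{R}^{N_x} \to \mathbb{R}^{N_x}$ is the auto-encoder, $\eta \in \{1, 2\}$ is the target norm, and $x \in \mathbb{R}^{N_x}$ is a sample of dimension $N_x$. Furthermore, let $\mu_{1,2}$ be two distributions of auto-encoder losses, whose respective means are $m_{1,2} \in \mathbb{R}$, and let $\Gamma(\mu_1, \mu_2)$ be the set of all couplings of $\mu_1$ and $\mu_2$. The Wasserstein distance can be expressed as:
$$W_1(\mu_1, \mu_2) = \inf_{\gamma \in \Gamma(\mu_1, \mu_2)} \mathbb{E}_{(x_1, x_2) \sim \gamma}[|x_1 - x_2|]$$
Using Jensen's inequality, we can derive a lower bound to $W_1(\mu_1, \mu_2)$:
$$\inf \mathbb{E}[|x_1 - x_2|] \geq \inf |\mathbb{E}[x_1 - x_2]| = |m_1 - m_2|$$
We design the discriminator to maximise $|m_1 - m_2|$ by forcing $m_1 \to 0$, $m_2 \to \infty$. Given the discriminator and generator parameters $\theta_D$ and $\theta_G$, each to be updated by minimising the losses $L^{D}_{pixel}$ and $L^{G}_{pixel}$, we express the optimisation problem in terms of a pixel-wise loss function:
$$L^{D}_{pixel} = L(x) - k_t L(G(x)) \quad \text{for } \theta_D$$
$$L^{G}_{pixel} = L(G(x)) \quad \text{for } \theta_G$$
$$k_{t+1} = k_t + \lambda_k (\beta L(x) - L(G(x))) \quad \text{for each training step } t$$
where $k_t$ controls how much emphasis is put on $L(G(x))$ during gradient descent and $\lambda_k$ is the learning rate for $k$. The diversity ratio $\beta$ is a hyper-parameter that balances $L(x)$ and $L(G(x))$.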
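A minimal sketch of the equilibrium mechanism described above is given below. The update rule is assumed to take the BEGAN form [33] that this section builds on, with the diversity ratio β in place of BEGAN's γ; the function names, the mean-over-pixels normalisation and the clipping of k to [0, 1] are illustrative choices, and the default values β = 0.9 and λ_k = 0.001 are the settings reported in the implementation details.

```python
def autoencoder_loss(x, reconstruction, eta=1):
    """Pixel-wise loss L(x) = mean |x - D(x)|^eta over flattened pixels."""
    return sum(abs(a - b) ** eta for a, b in zip(x, reconstruction)) / len(x)

def equilibrium_step(loss_real, loss_fake, k, beta=0.9, lambda_k=0.001):
    """One BEGAN-style update (assumed form, see the lead-in).

    loss_real: L(x), auto-encoder loss of a real image.
    loss_fake: L(G(x)), auto-encoder loss of a synthesised image.
    Returns (discriminator pixel loss, generator pixel loss, updated k).
    """
    d_loss = loss_real - k * loss_fake      # D: reconstruct real, not fake
    g_loss = loss_fake                      # G: make fakes easy to reconstruct
    k_next = k + lambda_k * (beta * loss_real - loss_fake)
    k_next = min(max(k_next, 0.0), 1.0)     # keep k in [0, 1]
    return d_loss, g_loss, k_next

# k is initialised to 0 and updated at every training step.
d_loss, g_loss, k1 = equilibrium_step(loss_real=0.2, loss_fake=0.1, k=0.0)
```

When the synthesised images are reconstructed more easily than β times the real ones, k grows and the discriminator is pushed harder to reject them; otherwise k shrinks and the discriminator concentrates on reconstructing real faces.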

IV. IMPLEMENTATION DETAILS
The proposed Dual Encoder-Decoder based GAN (DED-GAN) is composed of a generator $G$ and a discriminator $D$. Both are based on deep encoder-decoder networks. We follow DR-GAN in the design of $G$. The modified CASIA Net [67] is used as the backbone network. It consists of five convolution blocks, including one double-convolution block and four triple-convolution blocks, followed by an average pooling (AvePool) layer for feature extraction. The generator $G$ is composed of an encoder $G_{enc}$ and a decoder $G_{dec}$, i.e., $G = [G_{enc}, G_{dec}]$. Given a face image $x$, the encoder's output code $e = G_{enc}(x) \in \mathbb{R}^{N_e}$ from the AvePool layer is concatenated with a pose code $c \in \mathbb{R}^{N_c}$ and a noise $z \in \mathbb{R}^{N_z}$ to form $[e, c, z]$, which is used as the input of $G_{dec}$. $G_{dec}$ is a de-convolution neural network that transforms $[e, c, z]$ to a decoded face image, i.e., $\hat{x} = G_{dec}([e, c, z])$. $D_a$ and $D_r$ are used to force the distributions of both synthesised samples and their auto-encoder losses to match those of real samples. The discriminator $D$ is composed of an encoder $D_{enc}$ and a decoder $D_{dec}$, i.e., $D = [D_{enc}, D_{dec}]$. As with the generator, the backbone of the discriminator is an encoder-decoder network, where the face reconstruction output is $D_r$, aiming to increase the divergence of the auto-encoder loss distributions between real and synthesised samples. The code layer of the auto-encoder is followed by $D_a$, $D_c$ and $D_d$, where $D_a(x)$ is for real/fake classification, $D_c(x)$ is for pose regression and $D_d$ is for identity prediction. In Algorithm 1, we summarise the learning procedure of the proposed DED-GAN model. We use the Adam optimiser [68] for network training.
Algorithm 1 (fragment):
… $L^{G_t}_{id}(X) + \mu_c L^{G_t}_{pos}(X) + \mu_r L^{G_t}_{pixel}(X, Z, C)$ using equations (10)-(14).
7: Fix the discriminator parameters $\theta^{t}_{d}$ and compute the back-propagation error to optimise the generator: $\theta^{t}_{g} \leftarrow \mathrm{Adam}(\nabla_{\theta^{t}_{g}} L^{t}(G), \alpha)$.
8: end while
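The way the decoder input [e, c, z] is assembled can be sketched as a plain concatenation. The sizes 320 and 50 are the embedding and noise dimensions reported in the experimental settings, and the pose code is a single continuous value; the helper name `decoder_input`, the Gaussian stand-in for the encoder output, the Gaussian noise distribution and the pose value 0.25 are assumptions made for illustration.

```python
import random

def decoder_input(identity_code, pose_code, noise):
    """Concatenate [e, c, z] into the flat vector consumed by the decoder."""
    return list(identity_code) + [float(pose_code)] + list(noise)

rng = random.Random(0)
N_E, N_Z = 320, 50                               # sizes from the experimental settings
e = [rng.gauss(0.0, 1.0) for _ in range(N_E)]    # stand-in for G_enc(x)
z = [rng.gauss(0.0, 1.0) for _ in range(N_Z)]    # noise; Gaussian is an assumption
c = 0.25                                         # hypothetical continuous pose code
decoder_in = decoder_input(e, c, z)              # 320 + 1 + 50 = 371 values
```

Keeping the identity code fixed and sweeping only the scalar c is what lets the decoder render the same subject under different poses.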
All the experiments were performed with the following settings. All face images were aligned to a canonical view of 100 × 100 pixels. Randomly sampled regions of 96 × 96 pixels were cropped from each aligned face for data augmentation. The image intensity was linearly scaled to the range of [-1, 1]. All weights in the networks were initialised from a normal distribution with zero mean and a standard deviation of 0.02. We set the diversity ratio, β, to 0.9. k t ∈ [0, 1] controls how much emphasis is put on L(G(x)) during the network optimisation. We initialise k 0 = 0 and update k in each training step. λ k is the learning rate for k. We set λ k to 0.001 in our experiments. Based on numerous experiments, we define the trade-off between the respective components of the loss function by setting λ a = 1, λ d = 1, λ c = 0.1, λ r = 10, µ a = 1, µ d = 1, µ c = 0.1 and µ r = 10. All experiments were implemented in PyTorch and run on an NVIDIA GeForce GTX Titan Xp card with CUDA 8.0 and cuDNN 6.0.
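The preprocessing described above (random 96 × 96 crops from 100 × 100 aligned faces, followed by linear intensity scaling to [-1, 1]) can be sketched without any imaging library. The function names are hypothetical, images are represented as nested lists of a single channel, and the input intensity range of [0, 255] is an assumption, since the text does not state it.

```python
import random

def random_crop(image, size=96, rng=random):
    """Crop a random size x size window from an image given as a list of rows."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)      # randint is inclusive on both ends
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]

def scale_intensity(image):
    """Linearly map 8-bit intensities [0, 255] to [-1, 1] (assumed input range)."""
    return [[p / 127.5 - 1.0 for p in row] for row in image]

# Synthetic 100 x 100 single-channel "aligned face" with values in [0, 255].
aligned = [[(r + c) % 256 for c in range(100)] for r in range(100)]
patch = scale_intensity(random_crop(aligned, rng=random.Random(0)))
```

A 100 × 100 image admits five possible offsets along each axis, so the crop provides mild translation jitter while keeping the whole face in view.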

A. EXPERIMENTAL SETTINGS AND DATASETS
We evaluate DED-GAN qualitatively and quantitatively under both constrained and unconstrained scenarios for face synthesis across poses and PIFR. Our models were trained separately on the Multi-PIE [69] and CASIA [67] datasets. For the qualitative evaluation, we show visualised results of face synthesis on Multi-PIE, CASIA and CFP [70]. For the quantitative evaluation, we measure face recognition performance using the learned facial representations with a cosine distance metric on the Multi-PIE, CFP and LFW [71] datasets.
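Matching with a cosine metric, as used for the quantitative evaluation, can be sketched as follows. The function names, the toy three-dimensional features and the subject labels are hypothetical; in the experiments the features would be the learned identity representations.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def identify(probe, gallery):
    """Return the gallery identity whose feature is most similar to the probe."""
    return max(gallery, key=lambda name: cosine_similarity(probe, gallery[name]))

# Toy gallery with one feature per subject (made-up values).
gallery = {"subject_a": [1.0, 0.0, 0.0], "subject_b": [0.0, 1.0, 0.0]}
probe = [0.9, 0.1, 0.0]
match = identify(probe, gallery)
```

Because the cosine metric ignores feature magnitude, two representations of the same subject match as long as they point in a similar direction, regardless of their norms.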
The Multi-PIE database is the largest multi-view face recognition benchmark in the constrained scenario. It contains more than 750,000 images of 337 identities recorded over five months. Each identity has images captured under 15 poses and 20 illuminations. These images were captured in four sessions during different periods. Like the previous methods, we evaluate our algorithm on a subset of the Multi-PIE database, where each identity has images from all four sessions under nine poses with yaw angles from −60° to +60°. For a fair comparison, we follow the setting used in DR-GAN [18] and evaluate our method on Multi-PIE Setting 2. The first 200 subjects are used for training and the remaining 137 subjects are used for testing. Different from DR-GAN, in which the supervised pose information is used, we use MTCNN to extract five landmarks and then transform the landmarks to a pose label. In testing, one frontal view with neutral illumination is used as the gallery image and the other images are used as probes. Therefore, we have N d = 200 for identity classification, N p = 1 for pose regression, N a = 1 for real/fake classification and N r = 3 × 96 × 96 for colour image reconstruction. We set the dimension of the embedding feature and uncompressed noise to N f = 320 and N z = 50 respectively. The CASIA database offers 494,414 in-the-wild face images of 10,575 subjects. It is a widely used large-scale database for face recognition. We train our model on this dataset to evaluate its performance on a realistic dataset. We have N d = 10,575 and N p = 1. N f and N z are set as for Multi-PIE. We also evaluate the performance of our model in terms of the quality of synthesised face poses.
The CFP database contains 7,000 images of 500 subjects, where each subject has 10 frontal and 4 profile face images. The data are randomly organised into 10 splits, each containing an equal number of frontal-to-frontal and frontal-to-profile pairs, with 350 matching pairs and 350 non-matching pairs, respectively. We evaluate the face verification performance in terms of front-to-front and profile-to-front matching. We also evaluate our model on its ability to synthesise faces across pose variations.
The LFW database contains 13,233 face images of 5,749 identities. The images were obtained by trawling the internet followed by face centering, scaling, and cropping based on the bounding boxes provided by an automatic face detector. The LFW data have large in-the-wild variability, e.g., in-plane rotations, non-frontal poses, non-frontal illumination, varying expressions and so on. The verification set consists of 10 folders, each with 300 matching pairs and 300 non-matching pairs. We measure the face verification performance and compare it with existing methods.

B. ABLATION STUDY
Our discriminator is designed as a multi-task CNN with four components, namely $D_a$, $D_c$, $D_d$ and $D_r$, for real/fake classification, pose regression, identification and face reconstruction, respectively. While $D_d$ clearly plays a significant role in helping the model to preserve the face identity, it is instructive to understand the role of the remaining components. In this subsection, we investigate the effect of the four loss functions on the recognition performance. The results are presented in Tab. 1, which reports the recognition performance of partial variants of DED-GAN, each with one of the $D$ components removed. While the variant without the adversarial loss $D_a$ exhibits a slight performance drop, the models without the face reconstruction ($D_r$) and pose regression ($D_c$) losses are degraded more severely. When $D_c$ is removed, there is no pose label to supervise the face discrimination, which particularly affects profile faces. The average accuracy of the variant without pose estimation drops from 95.75% to 93.47%. This can be attributed to the pose information being entangled with identity in the feature representation.
Tab. 1 also presents the performance of our model without face reconstruction $D_r$. The average accuracy drops from 95.75% to 93.92%, which shows that face reconstruction is almost as important as pose estimation. This suggests that the encoder-decoder structured discriminator successfully balances the training of the two players in the GAN.
To gauge the impact of using pose regression, rather than pose classification, we train separate DED-GAN models using the respective formulations. The results show that the accuracy of the model based on pose classification is lower by about 1%. Thus, treating pose as a continuous variable and estimating it by regression helps to preserve more information about the pose.
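The difference between the two formulations can be illustrated with a minimal sketch. The squared-error loss, the 15° bin width and the function names below are assumptions made for illustration only; they are not taken from the DED-GAN implementation. The point is that discretising the yaw angle maps distinct poses to the same class label, whereas a regression target keeps them apart.

```python
import math


def pose_regression_loss(pred_yaw, true_yaw):
    """Squared error on a continuous yaw angle in degrees (assumed loss form)."""
    return (pred_yaw - true_yaw) ** 2


def yaw_to_class(yaw, bin_width=15.0):
    """Map a continuous yaw angle to the index of the nearest discrete bin.

    The bin width is a hypothetical choice for this sketch.
    """
    return math.floor(yaw / bin_width + 0.5)


# Two clearly different poses fall into the same discrete class ...
same_class = yaw_to_class(10.0) == yaw_to_class(20.0)
# ... while the regression loss still separates them.
gap = pose_regression_loss(10.0, 20.0)
```

Under these assumptions, a classifier receives no signal to distinguish a 10° face from a 20° face, while the regression loss penalises the 10° error directly.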
The pixel-wise loss helps to balance the generator and discriminator and speeds up the convergence of training. To evaluate whether it improves the convergence of DED-GAN, we compare the GAN loss with and without the reconstruction task. Fig. 2 shows that DED-GAN without the pixel-wise loss approaches convergence only after about 30 epochs, whereas DED-GAN with the pixel-wise loss reaches a balance between the generator and discriminator after about 20 epochs. The additional reconstruction task with the pixel-wise loss therefore appears to make the training of the generator and the discriminator faster and more stable. We also compare DED-GAN with and without the pixel-wise loss in terms of test accuracy and synthesised faces. As shown in Fig. 3, the test accuracy of DED-GAN with the pixel-wise loss stabilises after about 20 epochs of training, while that of DED-GAN without the pixel-wise loss stabilises at about 30 epochs. Fig. 4 shows the faces synthesised by DED-GAN with and without the pixel-wise loss every five epochs during training. The result also shows that

C. FACE SYNTHESIS
To verify the performance of our method in terms of the quality of face synthesis across poses, several experiments are conducted on the Multi-PIE, CASIA and CFP datasets. In the first experiment, we compare the faces synthesised at different poses by DR-GAN and by our method on Multi-PIE. The synthesised faces are verified on the test set of Setting 2. Hence, there is no overlap of subjects between the training and test datasets. Given a random input face, we generate synthesised faces within a pose range of ±60°. The experimental results are shown in Fig. 5. We can see that the pose estimation capability helps to generate faces across poses and successfully disentangles pose variation from the feature vector in both methods. However, the quality of the faces synthesised by our method appears to be better than that of the faces output by DR-GAN in terms of texture, shape and identity-preserving characteristics.
For an objective evaluation of the relative quality of faces generated by the two types of GANs, we use the Fréchet Inception Distance (FID) [72]. For a feature function $\phi$ (by default, the Inception network's convolutional feature), FID models $\phi(p_d)$ and $\phi(p_g)$ as Gaussian random variables with empirical means $\mu_d$, $\mu_g$ and empirical covariances $\Sigma_d$, $\Sigma_g$. FID is expressed as $\mathrm{FID}(p_d, p_g) = \|\mu_d - \mu_g\|^2 + \mathrm{Tr}\big(\Sigma_d + \Sigma_g - 2(\Sigma_d \Sigma_g)^{1/2}\big)$, which is the Fréchet distance between the two Gaussian distributions. Tab. 2 compares the FID scores of DR-GAN and DED-GAN. DED-GAN achieves a lower FID score than DR-GAN, which means that the faces synthesised by DED-GAN are more similar to real ones than those produced by DR-GAN.
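As a minimal sketch of how such a score can be computed, the function below evaluates the Fréchet distance between Gaussians fitted to two sets of feature vectors, using the standard form with a squared norm on the mean difference. It assumes that the features have already been extracted (the Inception network is not included here), and the function name and the eigenvalue-based evaluation of the trace term are choices made for this illustration rather than details of the evaluation code used in the paper.

```python
import numpy as np


def frechet_distance(feat_real, feat_fake):
    """Frechet distance between Gaussians fitted to two feature sets.

    feat_real, feat_fake: arrays of shape (n_samples, n_features),
    with n_features >= 2.
    """
    mu_d = feat_real.mean(axis=0)
    mu_g = feat_fake.mean(axis=0)
    sigma_d = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_fake, rowvar=False)

    # Tr((Sigma_d Sigma_g)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of Sigma_d Sigma_g. These are real and non-negative for
    # positive semi-definite covariances; small numerical noise is clipped.
    eigvals = np.linalg.eigvals(sigma_d @ sigma_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = np.sum((mu_d - mu_g) ** 2)
    cov_term = np.trace(sigma_d) + np.trace(sigma_g) - 2.0 * trace_sqrt
    return float(mean_term + cov_term)
```

Two identical feature sets give a distance of zero, and shifting one set by a constant vector changes only the mean term, which is a convenient sanity check before applying the function to real and synthesised face features.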
To further demonstrate the ability to disentangle the pose generative factor from other face attributes, we also evaluate the performance of our model on face synthesis across poses on two further uncontrolled datasets, CASIA and CFP. We use MTCNN to extract five facial landmarks for each face and then transform the landmarks to a pose label using a statistical shape model. The distribution of CASIA faces across poses is illustrated in Fig. 7, where the value zero denotes the frontal face. Note that, unlike previous methods, DED-GAN can rotate an input face to any pose controlled explicitly by the pose code. Hence, DED-GAN can synthesise both frontal and profile faces. Fig. 6 shows the pose manifold of faces generated by changing the value of the pose code. Every row shows faces of the same identity. The first column is the input face and the other columns show the manifold of the synthesised faces as the value of the pose code changes smoothly from -17 to 17. We can see that our model preserves the identity well as we change the pose code. It also shows that the pose variation is explicitly untangled from the other face attributes, including identity.
We also test the face frontalisation performance for unseen faces on the CFP dataset, as shown in Fig. 8. Every column shows the faces of the same identity. Given an input profile face, we generate frontal faces separately with DR-GAN and with our method. The top and bottom rows show the input profile faces and the paired real frontal faces, respectively. The second and third rows show the frontal faces synthesised by setting the pose code to zero. We can see that both methods can untangle the face representation from pose variation and generate frontal faces. However, the faces synthesised by our method appear better in terms of texture detail and in preserving the face identity.

D. FACE RECOGNITION
One motivation for disentangled face representation learning is to see whether the untangled representation helps to preserve the identity information and thus boosts the performance in face recognition. To verify this, we also report quantitative results obtained in PIFR experiments. We evaluate our method on Multi-PIE, CFP and LFW for identification and verification tasks. The features are extracted from $G_{enc}$ in all the experiments. The cosine distance between two representations is used for face recognition in the test step.
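The matching step can be sketched as follows. The feature vectors are assumed to be given as plain arrays (the encoder itself is not modelled), and the helper names `cosine_distance` and `identify` are hypothetical names introduced for this illustration.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance, 1 - cos(a, b), between two feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def identify(probe, gallery):
    """Return the index of the gallery feature closest to the probe."""
    distances = [cosine_distance(probe, g) for g in gallery]
    return int(np.argmin(distances))
```

For identification, a probe is assigned the identity of its nearest gallery feature; for verification, the distance of a pair is compared against a threshold. Because the cosine distance ignores vector magnitude, only the direction of the representation matters.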

1) FACE IDENTIFICATION ON THE MULTI-PIE DATABASE
In the first PIFR experiment, we evaluate the performance of DED-GAN on the Multi-PIE dataset. We compare our method with other state-of-the-art face recognition methods. Our model achieves the best accuracy in different pose categories, with the most significant improvement noted for the profile faces, as shown in Tab. 3. This shows that our method can remove the effects of the pose and retain the intrinsic face shape and structure information of identity.

2) FACE VERIFICATION ON THE CFP DATABASE
To further demonstrate the advantages of our method in PIFR, we evaluate it on an uncontrolled dataset. For the in-the-wild setting, we train our model on CASIA and test it on the CFP database. The experiments performed on the CFP dataset aim to compare the capacity of the face verification approaches across diverse poses. More specifically, the matching is performed between the frontal view (yaw angle < 10°) and the profile view (yaw angle > 60°). The evaluation reports the mean and standard deviation of accuracy, over 10 splits, for both the frontal-to-frontal and frontal-to-profile face verification settings. The verification results are shown in Tab. 4. Our method again yields better verification performance on both the frontal-frontal and frontal-profile matching sub-tasks. Thanks to the more stable training structure and the more detailed pose information injected into our method, DED-GAN achieves about a one percent performance improvement over DR-GAN.
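The reporting protocol can be sketched as below. The decision threshold, the use of the sample standard deviation and the function names are assumptions for illustration; how the threshold is chosen for each split is not specified here.

```python
from statistics import mean, stdev


def split_accuracy(distances, is_match, threshold):
    """Verification accuracy on one split.

    A pair is predicted as matching when its distance is below the
    threshold (a hypothetical, externally chosen value).
    """
    correct = sum((d < threshold) == m for d, m in zip(distances, is_match))
    return correct / len(distances)


def summarise(accuracies):
    """Mean and sample standard deviation of the per-split accuracies."""
    return mean(accuracies), stdev(accuracies)
```

In the CFP protocol, `split_accuracy` would be evaluated once per split on its 700 pairs, and `summarise` would then be applied to the 10 resulting accuracies for each of the two matching settings.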

3) FACE VERIFICATION ON THE LFW DATABASE
To evaluate the performance on the in-the-wild dataset further, we test the models described in the previous subsection on the LFW database. Tab. 5 shows the accuracy achieved by different methods. As expected, our method DED-GAN delivers the best accuracy, namely 97.52%, which is comparable with other state-of-the-art methods. Although DED-GAN is not trained on the LFW dataset, the untangled discriminative representation generalises to other datasets, including in-the-wild datasets.

VI. CONCLUSION
We propose a new GAN-based model (DED-GAN) for disentangled representation learning to address the challenging problems of pose-invariant face recognition and photo-realistic face synthesis across poses. To the best of our knowledge, this is the first time that a dual encoder-decoder structured GAN has been used to learn a disentangled face representation. The encoder-decoder structured generator is used for face rotation and for learning the disentangled face representation. The encoder-decoder structured discriminator is used for face reconstruction, for predicting identity and for estimating the pose. The encoder-decoder structured discriminator with the additional pixel-wise loss improves the training efficiency and stability of our GAN. A continuous pose encoding provides more detailed pose information and benefits the discriminative representation by untangling identity and pose. Extensive quantitative and qualitative experimental results show that our method is competitive with state-of-the-art approaches to PIFR and to face synthesis across poses. In the future, we plan to incorporate more discriminative information into the design of DED-GAN by extending the network to deal explicitly with other image generative factors, including illumination, expression, age and occlusion.

XIAOJUN WU (Member, IEEE) received the B.Sc. degree in mathematics from Nanjing Normal University, Nanjing, China, in 1991, and the M.S. and Ph.D. degrees in pattern recognition and intelligent systems from the Nanjing University of Science and Technology, Nanjing, in 1996 and 2002, respectively. He is currently a Professor of artificial intelligence and pattern recognition with Jiangnan University, Wuxi, China. His current research interests include pattern recognition, computer vision, fuzzy systems, neural networks, and intelligent systems.