Detailed Feature Guided Generative Adversarial Pose Reconstruction Network

Face frontalization is a critical and difficult task in face pose reconstruction. Previous studies used only simple posture information as guidance, such as pose codes and facial landmarks. To explore the guidance effect of profile faces, we propose detailed features, which provide much richer detailed information. In this paper, a Detailed Feature Guided Generative Adversarial Pose Reconstruction Network (DGPR) is proposed. First, the frontal pose code and the profile detailed features are fed into DGPR to generate the detailed features of the frontal face. Then, a second generator combines the frontal detailed features with the profile face to reconstruct the frontal face. In addition, we propose a conditional enhancement loss to strengthen the guiding role of detailed features, and a smoothing loss to reduce edge sharpness in the generated faces. Experimental results show that our method generates photorealistic frontal faces and outperforms state-of-the-art methods on M2FPA and CAS-PEAL. Specifically, DGPR improves face recognition accuracy under pose angles of ±60°, ±75°, and ±90° by 2%, 1%, and 6% respectively over the state-of-the-art methods on M2FPA, and raises the average rank-1 recognition rate on CAS-PEAL to 99.95%, an improvement of 0.05%. These results demonstrate the effectiveness of the detailed features and the corresponding modules.


I. INTRODUCTION
Profile faces are ubiquitous in real-world face recognition applications. Face frontalization converts them into frontal faces and improves face recognition accuracy significantly by preserving the rich identity information in the profile faces. Benefiting from deep generative models such as Generative Adversarial Networks (GANs) [1] and Variational Autoencoders (VAEs) [2], great progress has been achieved in face frontalization. However, the insufficient feature information of profile faces remains a main obstacle to frontalized face quality, especially under extreme pose angles (for example, 90°).
Previous researchers have made many outstanding contributions that greatly advanced face frontalization. Some works [3]-[5] built 3D models to reconstruct frontal faces: they reconstructed frontal-view faces from a 3D surface estimated from existing 2D profile faces. This type of method is time-consuming, hard to learn, and vulnerable to irregular surfaces. (The associate editor coordinating the review of this manuscript and approving it for publication was John See.)
Other studies (TP-GAN [6], CAPG-GAN [7]) solve this task with 2D-based methods: they learn a pose-invariant embedding to reconstruct frontal faces, with meticulously considered and finely designed model architectures. After encoding faces as high-level features in a latent space, multiple sub-modules process the features in different aspects to generate photorealistic frontal faces.
This type of method focuses on high-level features and neglects detailed information such as edges, textures, and corners, which contain rich information useful for improving the quality of frontalized faces. These methods use only simple pose information as guidance, applying one-hot vectors to store pose clues and facial landmark positions to direct the frontalization process. For example, CAPG-GAN [7] utilizes five facial landmarks (eyes, nose, and mouth), which cannot capture rich pose features in other areas of the face.
Moreover, these methods use common loss functions that do not consider the importance of detail smoothness, and therefore suffer from rough and unrealistic details in their results. The accompanying figure shows three groups of examples, each consisting of three columns: real faces, input profile faces, and generated frontal faces. The first and second groups illustrate 90° profile faces with the corresponding real and generated frontal faces; the third group shows 75°, 60°, and 45° profile faces with the corresponding generated results.
Based on these observations, we propose to explore more detailed information from profile faces, aiming to provide more effective guidance for face frontalization. Specifically, we extract profile sketches as detailed features, which preserve rich identity information for pose rotation. To eliminate the influence of rough detailed features, a smoothing loss is proposed to improve the smoothness of the generated detailed features. In addition, we present a conditional enhancement loss to strengthen the guidance effect of the detailed features on generating the target pose.
In this paper, the Detailed Feature Guided Generative Adversarial Pose Reconstruction Network (DGPR) is proposed, which includes two generators and two discriminators: a detailed feature convert generator G_m and a pose convert generator G_t. G_m generates frontal detailed features from profile faces, and G_t combines profile faces with the corresponding frontal detailed features to generate frontal faces. Experimental results show that DGPR reaches the state of the art in terms of rank-1 recognition rate. By taking the detailed features of profile faces into account, we also reconstruct photorealistic frontal faces from extreme poses such as ±90°, which carry much less information than smaller poses.
The contributions of this paper can be summarized as three-fold: (1) we are the first to utilize the rich sketch information of profile faces as guidance to generate frontal faces, which provides more details, and the resulting improvement is demonstrated. (2) We propose a dual-generator structure to combine profile faces and detailed features, which achieves excellent results. (3) A conditional enhancement loss and a smoothing loss are proposed to enhance pose consistency and edge naturalness. Even in the special case of extreme horizontal poses (such as ±90°) with little information, our model can still obtain clear frontal faces.
The rest of this paper is organized as follows: Section II describes related work, Section III details our proposed method, Section IV presents experimental results, and Section V summarizes this paper.

II. RELATED WORKS
Many researchers have proposed effective algorithms for face frontalization. Manifold learning methods [8], [9] improved feature representations for downstream tasks, but failed to capture data characteristics as robustly as deep-learning methods in specific tasks. Existing deep-learning methods can be divided into two classes. The first class is based on 3D models, which learn a 3D surface from 2D faces and then rotate and project it to the frontal view. Hassner et al. [3] used a single unmodified 3D reference surface to generate the frontal view. Yin et al. [4] used traditional rotation methods to estimate a 3D mesh and reconstruct frontal faces in the target pose. Cao et al. [10] proposed to generate frontal faces by estimating the 3D facial surface; they employed two pathway generators that learn to estimate the facial texture map and the UV map [11] respectively. Zhou et al. [12] generated 3D faces based on 3DDFA [13] and then rendered frontal faces. Zhao et al. [14] estimated the 3D surface based on 3DMM [15] and devised a two-pathway structure to handle global and local texture.
These methods rely on a high-precision model trained on a 3D fitting database, or on a strict and accurate 2D-to-3D coordinate conversion. Errors accumulate easily while projecting faces from 2D to 3D or from 3D to 2D. In addition, the surface smoothness of the final synthesized faces frequently degrades, which reduces the naturalness of the reconstructed face image.
The second class of methods generates frontal faces without the assistance of a 3D model. Many of them are based on Generative Adversarial Networks (GANs) [1]. DR-GAN [16] used pose coding to guide the generator and discriminator to decouple the feature representation of poses. Couple-Agent Pose-Guided GAN (CAPG-GAN) [7] used five facial landmark heatmaps as pose guidance, supplemented by a coupled discriminator to constrain the pose and structure distributions; however, its pose guidance was identity-irrelevant, which limited the diversity of the generated faces. GANimation [17] used action units to obtain the target face with attention masks.
Yin et al. [18] proposed DA-GAN, a dual-attention model that equips the generator with self-attention and enhances the discriminator with facial attention assisted by facial masks. Lu et al. [19] devised global and local discriminators to generate clear frontal faces. Duan and Zhang [20] proposed BoostGAN, a coarse-to-fine architecture that takes multi-occlusive profile faces as input, generates a frontal face for every profile, and finally boosts the frontal face from the multiple frontal candidates. Similarly, Banerjee et al. [21] also reconstructed frontal faces from occluded faces. Cao et al. [22] proposed LB-GAN, which consists of a Face Normalizer and a Face Editor: the Face Normalizer first generates frontal-view faces, and the Face Editor then rotates faces to the target pose. Liu et al. [23] proposed PPN-GAN, which generates frontal faces progressively: it starts by inferring an intermediate face with a small view difference from the profile face, and then increases the view difference step by step until the frontal view is reached.
Sagonas et al. [24] solved face frontalization with a statistical method. Yim et al. [25] proposed a frontal-view synthesis method with multi-task learning. Cole et al. [26] decomposed the face into a set of sparse landmark points and aligned texture maps, which were combined to generate frontal faces by image warping. Qian et al. proposed the Additive Focal Variational Auto-encoder (AF-VAE) [27], which combines a VAE with a GAN; they introduced a novel Gaussian-mixture hypothesis to improve the disentanglement of face content and obtained realistic frontal images. Kan et al. [28] proposed the Stacked Progressive Auto-encoder (SPAE), which iteratively transforms large poses into virtual smaller poses until the target pose is generated. Yang et al. [29] and Feng and Yuen [30] proposed similar methods based on virtual poses.
All of the above methods use only simple pose coding or landmark points as posture information; the detailed information of the profile face (such as contours and textures) has not been fully exploited.

III. METHOD

A. OVERVIEW
Recent research on face frontalization has paid little attention to the guidance available in profile faces. However, edges contain a great deal of information, such as the shape of the face, the facial structure, and even the identity. Therefore, making full use of the edges in the profile improves the quality of the reconstructed frontal faces. We believe that using sketches as detailed features improves the generator's perception of details.
Specifically, we first generate detailed features for all faces as preprocessing. The profile faces and their corresponding detailed features are the input of our model. We then employ the proposed generator to convert the profile detailed features into frontal ones. Next, we feed the profile face together with the corresponding frontal detailed features into a second generator, which finally generates the frontal face.
In the following sections, the detailed features are introduced first for better understanding. Then, we describe the architecture of the proposed DGPR. Finally, we introduce the proposed conditional enhancement loss and smoothing loss, along with the other losses we use.

B. DETAILED FEATURE
Posture information plays a vital role in reconstructing frontal faces from profile faces. Posture guidance with rich details enables the model to reconstruct a more realistic target face. Inspired by the study of Yi et al. [31], we employ their method to obtain a sketch image of each profile face.
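The sketch extractor of [31] is a learned model; as a rough, hypothetical stand-in that only illustrates what a "detailed feature" looks like, a normalized gradient-magnitude map already captures the edge structure of a face image:

```python
import numpy as np

def edge_sketch(img):
    """Rough stand-in for a learned sketch extractor: normalized
    gradient magnitude of a grayscale image (H, W) in [0, 1]."""
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # central difference, horizontal
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # central difference, vertical
    mag = np.sqrt(gx**2 + gy**2)
    return mag / (mag.max() + 1e-8)          # scale to [0, 1]
```

The real extractor produces much cleaner, stroke-like sketches; this crude version is only meant to show that the guidance is a one-to-one edge map rather than a shared pose code.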
Compared with one-hot coding, sketches contain almost all of the edge information of profile faces, which can guide the model to generate high-quality synthetic faces more efficiently. Unlike previous methods, which share the same coding for multiple profile persons at one pose, detailed features are one-to-one, which improves the similarity between synthesized and real frontal faces. Figure 2 shows the difference between the posture guidance we use and that of CAPG-GAN [7].

C. NETWORK STRUCTURE

DGPR can be divided into two parts: the first part is the detailed feature convert generator G_m, which generates a frontal detailed feature; the second part is the pose convert generator G_t, which takes the output of G_m and the profile face as inputs to generate the frontal face. Figure 3 shows the overall structure of DGPR based on detailed features.
We denote detailed features by m in this paper. x represents a face image at any pose in the dataset X, and we expect to obtain a synthetic face f corresponding to the target pose c_y. The generated f is expected to be as close as possible to the real frontal face. DGPR can be expressed as formulas (1) and (2):

m_f = G_m(m_x | c_y),  (1)
f = G_t(x | m_f),  (2)

where c_y is the one-hot coding of the target pose and m_x is the detailed feature of the profile face x. G_m reconstructs the detailed feature m_f, and f is the frontal face generated by G_t. In addition, we cycle the training stage by reversing input and output at each iteration: the obtained results are fed back into the model to reconstruct the original input, which increases the robustness of the model. A cycle loss is used to optimize the two generators and further improve the performance of our model.
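Formulas (1) and (2) describe a simple two-stage dataflow. The sketch below makes it concrete with placeholder generators (fixed random linear maps, purely illustrative; the real G_m and G_t are the convolutional networks described in the following subsections):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two generators: each is just a fixed
# random linear map here, only to make the two-stage dataflow concrete.
def G_m(m_x, c_y, W=rng.standard_normal((64 + 13, 64))):
    """Detailed feature convert generator: profile sketch + pose code -> frontal sketch."""
    return np.tanh(np.concatenate([m_x, c_y]) @ W)

def G_t(x, m_f, W=rng.standard_normal((64 + 64, 64))):
    """Pose convert generator: profile face + frontal sketch -> frontal face."""
    return np.tanh(np.concatenate([x, m_f]) @ W)

m_x = rng.standard_normal(64)   # profile detailed feature (flattened)
c_y = np.eye(13)[0]             # one-hot pose code (13 yaw poses on M2FPA)
x   = rng.standard_normal(64)   # profile face (flattened)

m_f = G_m(m_x, c_y)             # stage 1, formula (1)
f   = G_t(x, m_f)               # stage 2, formula (2)
```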

1) DETAILED FEATURE CONVERT GENERATOR G_m
In practice, the detailed features of the frontal face cannot be observed directly. Thus, we design the detailed feature convert generator G_m to convert profile detailed features into frontal ones. The structure of G_m is shown in Table 1. The residual blocks (ResBlocks) are composed of ReflectionPad2d and convolutional layers with 3 × 3 kernels. The normalization layers in the residual blocks adopt Adaptive Instance Normalization (AdaIN) [32], as shown in formula (3):

AdaIN(z) = σ_y · (z − μ(z)) / σ(z) + μ_y,  (3)

where μ(z) and σ(z) are the channel-wise mean and standard deviation of the feature map z, and μ_y and σ_y are the parameters predicted for the target pose.
The data distribution differs across poses. To let G_m adaptively learn these differences, we employ a Multi-Layer Perceptron (MLP) to learn the parameters μ and σ of the Adaptive Instance Normalization layers. The structure of the MLP is shown in Table 2, where N is the total number of parameters in Adaptive Instance Normalization.
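A minimal NumPy sketch of the AdaIN operation used in the ResBlocks, with mu_y and sigma_y standing in for the per-channel parameters the MLP would predict:

```python
import numpy as np

def adain(z, mu_y, sigma_y, eps=1e-5):
    """Adaptive Instance Normalization for one feature map z of shape (C, H, W):
    standardize each channel of z, then re-scale and shift with the
    predicted per-channel sigma_y and mu_y (one value per channel)."""
    mu_z = z.mean(axis=(1, 2), keepdims=True)
    sigma_z = z.std(axis=(1, 2), keepdims=True)
    z_norm = (z - mu_z) / (sigma_z + eps)
    return sigma_y[:, None, None] * z_norm + mu_y[:, None, None]
```

After AdaIN, each channel's statistics match the predicted target statistics, which is how the pose condition modulates the feature maps.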

2) POSE CONVERT GENERATOR G_t
The frontal detailed feature m_f generated by G_m is input to G_t together with the profile face x, and G_t outputs the final frontal face f. Here, x contains face identity and texture information, while the frontal detailed feature m_f provides posture and structure information. DGPR combines these two kinds of content to generate a photorealistic face.
The structure of G_t differs from G_m in the first layer: it adds an additional learnable parameter β for combining m_f and x as x + βm_f. After down-sampling, a self-attention mechanism is applied to force the model to focus more attention on effective features. In the following ResBlocks, G_t adopts Batch Normalization [33] instead of Adaptive Instance Normalization. The settings of the other layers are the same as in G_m.
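The first-layer fusion of G_t can be sketched as follows (in the network β is learned during training; here it is just a fixed scalar for illustration):

```python
import numpy as np

def fuse_inputs(x, m_f, beta):
    """First-layer input fusion of G_t: the profile face x and the frontal
    detailed feature m_f are combined as x + beta * m_f, where beta is a
    learnable scalar weighting the sketch guidance."""
    assert x.shape == m_f.shape
    return x + beta * m_f
```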

3) DISCRIMINATORS
The discriminator D_m tries to distinguish generated from real detailed features, while D_t tries to distinguish generated from real images. The adversarial training procedure reduces the difference between generated and real images. To make the generators perform better while refining the details of the generated images, both discriminators adopt the PatchGAN [34] architecture. The structures of the discriminators are shown in Table 3.

D. LOSS FUNCTIONS

1) ADVERSARIAL LOSS
Judging the authenticity of the results pushes the generator to produce frontal faces whose distribution is close to the real one. We use the modified adversarial loss proposed by WGAN-GP [35], [36], which adds a gradient penalty to the original adversarial loss. Specifically, the input of G_m is the profile detailed feature m_x and the frontal pose coding c_y, and the output is the frontal detailed feature m_f. The adversarial loss L_m can be expressed as (4):

L_m = E_{m_x ∼ P_{m_x}}[D_m(G_m(m_x | c_y))] − E[D_m(m)] + λ_1 E_{Ĩ ∼ P_Ĩ}[(||∇_Ĩ D_m(Ĩ)||_2 − 1)²],  (4)

where m is a real frontal detailed feature, P_{m_x} is the data distribution of profile detailed features, P_Ĩ is the distribution of random interpolates between real and generated samples, and λ_1 is the penalty coefficient.
G_t also uses the WGAN-GP adversarial loss L_t, shown in (5):

L_t = E_{x ∼ P_x}[D_t(G_t(x | m_f))] − E[D_t(y)] + λ_2 E_{õ ∼ P_õ}[(||∇_õ D_t(õ)||_2 − 1)²],  (5)

where y is a real frontal face, P_x is the data distribution of X, P_õ is the distribution of random interpolates, and λ_2 is the penalty coefficient.
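The WGAN-GP critic loss can be illustrated without autodiff by using a toy linear critic, whose input gradient is its weight vector everywhere (a didactic simplification; the real D_m and D_t are PatchGAN networks and the gradient penalty is computed by backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear critic D(v) = w . v: its gradient w.r.t. the input is w,
# so the gradient penalty can be computed in closed form.
w = rng.standard_normal(16)

def critic(v):
    return w @ v

def wgan_gp_loss(real, fake, lam=10.0):
    """Critic loss of WGAN-GP: E[D(fake)] - E[D(real)] plus a gradient
    penalty pushing the critic's input-gradient norm toward 1."""
    d_fake = np.mean([critic(v) for v in fake])
    d_real = np.mean([critic(v) for v in real])
    grad_norm = np.linalg.norm(w)   # gradient of the linear critic is w everywhere
    return d_fake - d_real + lam * (grad_norm - 1.0) ** 2
```

When real and fake batches coincide, only the penalty term remains, which shows why the gradient norm is driven toward 1 rather than 0.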

2) RECONSTRUCTION LOSS
Following the idea of cycle consistency, we adopt a reconstruction structure in which the final results are fed back into the model to reconstruct the inputs. The performance of both generators benefits from this cycle reconstruction loss, which we apply to both G_m and G_t. Formula (6) is the reconstruction loss for G_m, and formula (7) for G_t:

L_rec^m = E[||G_m(m_f | c_x) − m_x||_1],  (6)
L_rec^t = E[||G_t(f | m_x) − x||_1].  (7)
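The cycle-reconstruction term reduces to a plain mean absolute error between the re-generated input and the original input, e.g.:

```python
import numpy as np

def l1_recon(output, target):
    """Mean absolute error used as the cycle-reconstruction loss:
    the result fed back through the generator should recover the input."""
    return np.abs(output - target).mean()
```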

3) SMOOTHING LOSS
Initially, we trained G_m and G_t with the same procedure, but because the two generators process different image types, G_t trained well while G_m collapsed easily. Inspired by [17], the smoothing losses L_s and L̄_s reduce the sharpness of the detailed features and avoid collapse. L_s and L̄_s are formulated as equations (8) and (9) respectively:

L_s = λ_3 E_{m_f ∼ P_{m_f}} [ Σ_{i,j} ((M_{i+1,j} − M_{i,j})² + (M_{i,j+1} − M_{i,j})²) ],  (8)
L̄_s = λ_4 E [ Σ_{i,j} ((M̄_{i+1,j} − M̄_{i,j})² + (M̄_{i,j+1} − M̄_{i,j})²) ],  (9)

where M = G_m(m_x | c_y), M̄ = G_m(m_f | c_x), and P_{m_f} is the data distribution of generated frontal detailed features. λ_3 and λ_4 are the weights of the smoothing loss.
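Assuming the smoothing loss is the total-variation style penalty of [17] (our reading of equations (8) and (9)), its core term is:

```python
import numpy as np

def tv_smoothing(M):
    """Total-variation style smoothing penalty on a generated detailed
    feature map M of shape (H, W): sum of squared differences between
    vertically and horizontally adjacent pixels."""
    dh = (M[1:, :] - M[:-1, :]) ** 2   # vertical neighbors
    dw = (M[:, 1:] - M[:, :-1]) ** 2   # horizontal neighbors
    return dh.sum() + dw.sum()
```

A perfectly flat map incurs zero penalty, while sharp jumps between neighboring pixels are penalized quadratically, which discourages jagged sketch edges.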

4) CONDITIONAL ENHANCEMENT LOSS
To strengthen the constraint of the pose condition on the generated results, DGPR attaches a fully connected layer that classifies the pose of the features output by Conv41 in G_t. The cross-entropy losses computed there serve as the conditional enhancement losses L_c1 and L_c2. Each loss contains two terms: the first measures the difference between the generated and target poses, constraining the generated face to match the target pose; the second uses real images as a reference to learn pose judgment from the real distribution. For G_m, we wish to enhance the influence of the guidance information c_y; its conditional enhancement loss is:

L_c1 = −E[log P(c_y | m_f)] − E[log P(c_y | m)],  (10)

where P(c | ·) denotes the pose classifier's predicted probability and m is a real detailed feature. For G_t, we wish to enhance the influence of the detailed feature m_f; its conditional enhancement loss is:

L_c2 = −E[log P(c_y | f)] − E[log P(c_y | y)],  (11)

where y denotes a real frontal face.
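A sketch of the two cross-entropy terms, with raw classifier logits standing in for the pose classifier attached to Conv41 (the classifier itself is a hypothetical stand-in here):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())   # shift for numerical stability
    return e / e.sum()

def pose_ce(logits_gen, logits_real, target_idx):
    """Conditional enhancement loss sketch: cross entropy of the pose
    classifier on the generated sample plus the same term on the real
    sample, both against the target pose index."""
    p_gen = softmax(logits_gen)
    p_real = softmax(logits_real)
    return -np.log(p_gen[target_idx]) - np.log(p_real[target_idx])
```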

5) IDENTITY LOSS
To enhance the identity consistency of synthetic faces, we need to determine whether the profile face and the generated frontal face share the same identity. The identity loss L_{m−id} for G_m not only penalizes the differences between the profile and reconstructed frontal faces, but also forces G_m to preserve identity.
For G_t, the identity consistency loss L_{t−id} is adopted to maintain the identity of the generated images. In addition to the L1 loss, we also constrain identity consistency at the feature level. FaceNet [37], pre-trained on VGGFace [38], is employed to extract identity features; the mean squared error (MSE) between the test and registered features is then computed as L_{t−id}. All images input to FaceNet are scaled to 160 × 160 in practice. L_{t−id} is as follows:

L_{t−id} = ||F_f − F_x||²_2,

where F_f denotes the identity features of the generated frontal face and F_x those of the profile face.
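The feature-level term is a plain MSE between the two FaceNet embeddings (represented here simply as vectors):

```python
import numpy as np

def identity_loss(feat_gen, feat_real):
    """Feature-level identity consistency: mean squared error between the
    FaceNet embedding of the generated frontal face and that of the
    profile face."""
    return np.mean((feat_gen - feat_real) ** 2)
```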

6) MULTI-SCALE PIXEL-LEVEL LOSS
Inspired by CAPG-GAN [7], we apply a multi-scale pixel-level loss L_pix on f to constrain its content consistency, accelerate optimization, and reconstruct more detailed information:

L_pix = (1/S) Σ_{s=1}^{S} 1/(W_s × H_s × C) ||f_s − y_s||_1,

where S is the number of scales, W_s and H_s are the width and height of the image at scale s, C is the number of image channels, and f_s, y_s are the generated and real frontal faces at scale s.
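A NumPy sketch of the multi-scale pixel loss, using 2x2 average pooling as the (assumed) downsampling operator between scales:

```python
import numpy as np

def avg_pool2(img):
    """2x2 average pooling for an image of shape (H, W, C), H and W even."""
    H, W, C = img.shape
    return img.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def multiscale_l1(f, y, num_scales=3):
    """Multi-scale pixel loss: average of the per-pixel L1 distance between
    generated face f and ground truth y over progressively downsampled copies."""
    total = 0.0
    for _ in range(num_scales):
        total += np.abs(f - y).mean()   # = (1 / (W_s * H_s * C)) * ||f_s - y_s||_1
        f, y = avg_pool2(f), avg_pool2(y)
    return total / num_scales
```

Coarse scales penalize global structure while the finest scale penalizes per-pixel detail, which is what makes the loss both stabilizing and detail-preserving.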

7) TOTAL LOSS
The total loss for G_m is shown in equation (15):

L_{G_m} = L_m + λ_s L_s + λ_c1 L_c1 + λ_{m−id} L_{m−id} + L_rec^m,  (15)

where L_rec^m is the reconstruction loss of formula (6), and the loss for G_t in equation (16):

L_{G_t} = L_t + λ_c2 L_c2 + λ_{t−id} L_{t−id} + λ_pix L_pix + L_rec^t,  (16)

where L_rec^t is the reconstruction loss of formula (7), and λ_s, λ_c1, λ_c2, λ_{m−id}, λ_{t−id}, and λ_pix are hyperparameters that weight each loss term; their values are shown in Table 4. The final objective can be formulated as (17), where the generators try to minimize the objective and the discriminators try to maximize it:

min_{G_m, G_t} max_{D_m, D_t} L_{G_m} + L_{G_t}.  (17)

The algorithm flow of DGPR is illustrated in Algorithm 1.

Algorithm 1 DGPR
Require: input image x, detailed feature m_f, target pose one-hot coding c_y
Ensure: frontal face image f

IV. EXPERIMENTS
We implemented our method in PyTorch [39], and all experiments were conducted on a single GeForce GTX 2080 Ti GPU with 11 GB of memory. In the following sections, we introduce the datasets used in our experiments, present qualitative and quantitative analyses of DGPR against state-of-the-art methods, and report an ablation study of DGPR.

A. DATASETS
To demonstrate the effect of DGPR on face frontalization, we conduct experiments on M2FPA [40] and CAS-PEAL [41]. M2FPA, provided by the Institute of Automation, Chinese Academy of Sciences, is currently one of the most comprehensive multi-view public datasets available for facial pose analysis. It contains 397,544 images of 229 subjects, with 62 poses, 4 attribute changes, and 7 lighting changes. Each subject's 62 poses were captured simultaneously, including 13 yaw angles (−90° to 90°), 5 pitch angles (−30° to 45°), and 44 combined yaw-and-pitch angles. In our experiments, we only consider face frontalization across the 13 yaw angles and ignore pitch angles. Five types of glasses and three kinds of expressions improve diversity. Several previous methods, such as DR-GAN [16], TP-GAN [6], and CAPG-GAN [7], provide effective benchmarks for face frontalization and pose-invariant face recognition on M2FPA. For a fair comparison, we evaluate DGPR on M2FPA with the official train/test split.
CAS-PEAL [41] is a public Chinese face database with controlled pose, expression, accessory, and lighting variations. It contains 30,863 grayscale images of 1,040 subjects. Following previous research, we use images with poses 0°, ±15°, ±30°, and ±45°. We randomly split the dataset into train/validation/test sets in a 7:1:2 ratio.

B. IMPLEMENTATION DETAILS
All loss balance coefficients are listed in Table 4. The number of poses is 13 on M2FPA and 11 on CAS-PEAL.

C. EVALUATION METRICS
We evaluate DGPR quantitatively with face recognition performance and qualitatively with the generated frontal faces. Specifically, we use the rank-1 recognition rate as the face recognition metric. For a fair comparison, we follow the evaluation settings of previous studies. LightCNN [43] is used as the face feature extractor. Faces at different yaw angles (±15°, ±30°, ±45°, ±60°, ±75°, ±90°) are used as input. The rank-1 recognition rate is computed by averaging the two corresponding poses; for example, the rank-1 value at ±15° is the average of +15° and −15°.
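The ±pose averaging used for reporting can be sketched as (a small plain-Python helper, not part of the model):

```python
def rank1_by_yaw(per_pose_acc):
    """Average rank-1 accuracy over corresponding +/- yaw angles, e.g. the
    reported rate at 15 degrees is the mean of the +15 and -15 rates.
    per_pose_acc maps a signed yaw angle to its rank-1 accuracy."""
    out = {}
    for angle in sorted({abs(a) for a in per_pose_acc}):
        vals = [v for a, v in per_pose_acc.items() if abs(a) == angle]
        out[angle] = sum(vals) / len(vals)
    return out
```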

D. QUALITATIVE ANALYSIS
The experimental results on M2FPA are illustrated in Figure 4, and the results on CAS-PEAL in Figure 5. As shown in Figure 4, DGPR generates detailed features that are similar to those of real frontal faces. Facial characteristics such as glasses, bangs, and eyebrows are also generated accurately. In addition, DGPR adapts to changes in pose and generates realistic frontal faces. In Figure 5, the generated faces share similar characteristics with the input profile faces. These results demonstrate the effectiveness of our proposed model.
To show that DGPR preserves a person's identity across diverse poses, faces of the same person from different poses are fed into DGPR; the comparison of the generated frontal faces is shown in Figure 6. DGPR reconstructs photorealistic synthetic faces and preserves identity consistency well when the pose is close to 0°. When the pose exceeds 60°, the identity consistency between reconstructed and real images decreases slightly compared with smaller poses. Although the facial contour is slightly deformed, the shape of the face remains normal and the identity maintains high consistency. Table 5 shows the rank-1 recognition rates of the generated frontal faces on M2FPA under poses ±15°, ±30°, ±45°, ±60°, ±75°, and ±90°. The rank-1 recognition rate is improved at every pose; at ±60°, ±75°, and ±90°, it increases by 2%, 1%, and 6% respectively compared with the state-of-the-art methods. This demonstrates that DGPR effectively exploits detailed features and enriches reconstructed frontal faces even from large poses.

E. QUANTITATIVE ANALYSIS
On CAS-PEAL, we report results in Table 6, comparing our model with the state-of-the-art methods CR-GAN [44], TP-GAN [6], and DA-GAN [18]. Our method outperforms them, improving the average rank-1 recognition rate by 0.05%. To further demonstrate the effectiveness of our model, we report results under poses ±22° and ±67°, which exist in the dataset but were not reported in previous work: the rank-1 recognition rates are 100% and 98.54% respectively. To demonstrate the convergence of our model, we plot the rank-1 recognition rate at different training iterations on CAS-PEAL in Figure 7. Four lines under poses 0°, ±15°, ±30°, and ±45° are plotted, together with the average rank-1 accuracy as a dotted line. The model converges after about 7,000 training iterations; the fluctuation of the curves reflects the instability of GAN training. Furthermore, the Fréchet Inception Distance (FID) [45] and Inception Score (IS) [35] at different training iterations on CAS-PEAL are shown in Figure 8. Lower FID is better, while higher IS is better (the maximum IS value here is 2). Both FID and IS stabilize at around 7,000 iterations.

F. ABLATION STUDY

1) VALIDITY OF DETAILED FEATURES
To analyze the effect of different posture guidance on the generation procedure, we compare the frontal faces generated by DGPR using two kinds of posture guidance: 68 facial landmarks and our detailed features. The quantitative comparison results are shown in Table 7.
Using detailed features as guidance significantly improves the resolution and identity of the generated frontal faces. Among all posture guidances, the model with detailed features generates better frontal faces, and its rank-1 recognition rate outperforms the model with 68 landmarks. Moreover, the rank-1 recognition rate is greatly improved at ±90°, which confirms that detailed features are effective.

2) VALIDITY OF DIFFERENT COMPONENTS
To evaluate the effect of the different modules in DGPR, each module is removed in turn and the profile-to-frontal reconstruction experiment is repeated. "w/o Pose Dis" removes the conditional enhancement loss and the corresponding pose classifier; "w/o Draw" removes the detailed feature convert generator G_m and reconstructs frontal faces without detailed features; "w/o FaceNet" removes the feature-level identity constraint, including the identity consistency loss and the embeddings extracted by the pre-trained FaceNet.
As shown in Figure 9, removing the detailed feature guidance or the posture constraints causes more obvious facial deformation, which grows as the pose angle increases. As shown in Table 8, this deformation lowers the rank-1 recognition rate of the frontal images reconstructed by DGPR.
As shown in Table 9, the Fréchet Inception Distance (FID) [45] and Inception Score (IS) [35] of the reconstructed frontal faces improve after removing the identity feature constraint, while the rank-1 recognition rate is significantly higher with the feature-level identity constraint. This suggests that the identity constraint has a slight negative impact on image quality, but significantly improves the identity consistency between reconstructed and real images. We therefore adjust the coefficient λ_{m−id} of the identity feature consistency loss L_{m−id} to balance photorealism against face recognition accuracy. The experiments show that each component of DGPR contributes, and removing any of them degrades performance.

V. CONCLUSION
This paper introduced the Detailed Feature Guided Generative Adversarial Pose Reconstruction Network (DGPR), which explores the guidance effect of profile faces for generating frontal faces. DGPR utilizes face sketches as detailed features, which store rich pose clues and detailed information and guide the generators toward realistic frontal faces. DGPR contains a dual-generator structure, the detailed feature convert generator and the pose convert generator, which takes advantage of both profile faces and detailed features. In addition, we proposed a smoothing loss to reduce edge sharpness in generated faces and a conditional enhancement loss to constrain pose consistency. The experimental results show that our method outperforms previous methods in face recognition rate on the M2FPA and CAS-PEAL datasets. In particular, our model achieves a large improvement for face frontalization from extreme poses, which demonstrates its effectiveness.