Unpaired Domain Transfer for Data Augment in Face Recognition

Face recognition is one of the most active topics in computer vision and pattern recognition. Deep learning-based recognition models already surpass human recognition ability on open datasets, but they still cannot fully undertake identity recognition in real scenarios without human assistance. In this paper, we analyze two main obstacles: the domain gap and the shortage of training data. We propose the unpaired Domain Transfer Generative Adversarial Network (DT-GAN) to relieve both. We improve the GAN baseline to bridge the domain gap among datasets: by learning the mapping between the source domain and the target domain, the model generates images that conform to the style of the target domain. At the same time, the generator can synthesize faces with an arbitrary viewpoint. The model is trained with a combination of style transfer loss, identity loss, and pose loss, which ensures successful domain transfer and data augment. We conduct experiments to verify the effectiveness and soundness of DT-GAN. The experimental results demonstrate that recognition performance is substantially boosted after domain transfer and data augment.


I. INTRODUCTION
Face recognition aims to figure out the identity of a specific face. Because of its essential applications in the security and surveillance system, face recognition has been drawing lots of attention from both academia and industry. The performance of face recognition has been significantly boosted because of the development of deep learning.
Unlike face verification, face recognition is still not widely deployed in surveillance systems, and obstacles remain that hinder its application. One of these open issues is the lack of suitable training data. The training set is the basis for training a deep learning model, and the quantity and quality of training data significantly affect the robustness and generalization ability of the model. The sources of training data mainly include self-established datasets and existing public datasets.
Building a new dataset and annotating person IDs obviously consumes human resources and time. Generally speaking, a self-established dataset contains few labeled samples at the beginning and is then enlarged by data augment. Conventional data augment techniques usually refer to rotation, reflection (flipping), zooming in/out, shifting, scaling up/down, cropping, translation, random erasing, and adding noise. Through such transformations of the training data, a network with stronger generalization ability can be obtained that adapts better to the application scene. However, these methods have an intrinsic drawback: the generated images have the same features as the original photos, so the distribution of the dataset stays monotonous.
The associate editor coordinating the review of this manuscript and approving it for publication was Inês Domingues.
On the other hand, samples of existing public datasets differ from the data collected in real scenarios. Commonly used face datasets usually lack rich face types, and each identity has limited samples, mainly frontal face images captured in constrained environments at high resolution. However, in real scenarios, faces are captured by surveillance cameras under unconstrained environments. These faces have various styles, e.g., complicated backgrounds and different illumination, resolution, expressions, and poses. Accordingly, deep learning-based recognition models trained with these existing datasets fail to cope with these challenges. This phenomenon, i.e., the domain gap between different face datasets, results in a severe performance drop (see Section IV.D). The domain gap is widespread in representation learning. Research indicates that domain adaptation, whose mechanism is closer to human intelligence, is an effective way to bridge the domain gap, and progress in this field is vital to the whole research area of computer vision. The domain gap is common in person re-identification, where it has drawn the attention of researchers, whereas little research has been conducted in the field of face recognition. This does not mean that the domain gap is not severe in face recognition; on the contrary, face recognition also suffers a serious performance drop due to the domain gap.
FIGURE 1. The overview of DT-GAN. $G_{XY}$ takes a real image $x$ from the source domain as input and outputs a synthesized image $G_{XY}(x)$, while $D_Y$ judges whether it is real. Identity classifier $C$ gives the identity of both the real and the synthesized face. Pose estimator $P$ controls the viewpoint of the synthesized face. To avoid mode collapse, $G_{YX}$ translates the synthesized face back. The reversion loss together with the adversarial losses encourages $G_{YX}(G_{XY}(x)) \approx x$.
In summary, establishing a dataset for each surveillance scenario consumes manpower and time in practical applications, while the domain gap between public datasets and the real scene seriously degrades performance. Both options have shortcomings. We therefore explore a solution to the shortage of applicable training data for face recognition models deployed in surveillance systems.
Aiming to facilitate research towards applications in realistic scenarios, and inspired by the image generation capability of Generative Adversarial Networks (GANs) [1], we propose the unpaired Domain Transfer GAN (DT-GAN). It translates a face from the source domain into a synthesized face that has the same style as the samples in the target domain while preserving identity and presenting an arbitrary viewpoint, all in the absence of any paired training examples. DT-GAN consists of four parts: Generator, Discriminator, identity Classifier, and Pose estimator. More specifically, two Generators, $G_{XY}$ and $G_{YX}$, represent the mapping from source domain $X$ to target domain $Y$ and the reverse mapping from $Y$ to $X$, respectively. Correspondingly, two Discriminators, $D_X$ and $D_Y$, determine whether a face is real or fake in the source domain $X$ or the target domain $Y$. $G$ and $D$ compete with each other until the discriminator can no longer tell real from fake. The identity Classifier measures the distance between the feature vectors of the real and synthesized faces and minimizes this distance to keep the identity. The Pose estimator assists in controlling the viewpoint of the synthesized face. By generating faces with different viewpoints, the size of the target domain is significantly enlarged, so a model trained with the target domain is less prone to over-fitting. Fig. 1 summarizes the whole flow of the system.
Our contributions can be summarized in three aspects: (1) we propose an unpaired domain transfer model, DT-GAN, which learns the mapping from the complete source domain to an incomplete target domain with an adversarial learning method to narrow the domain gap; (2) DT-GAN can generate face images with various viewpoints while preserving identity for data augment, resulting in better face recognition performance; (3) we conduct experiments and compare against several prior methods. The baseline trained with the enriched and transferred domains performs better than the baseline trained with other domains. The results demonstrate that DT-GAN has the potential to relieve expensive data annotation on new datasets and to make it easier to train face recognition models for real scenarios.

II. RELATED WORK
Our work is closely related to image generation in face recognition and adversarial-based domain adaptation. This section briefly summarizes the application of GANs in synthesizing face images and in domain adaptation.

B. FACE GENERATION
FaceID-GAN [19] adopts a three-player GAN, which treats a face identity classifier as the third player, competing with the generator by distinguishing the identities of the real and synthesized faces. The contribution of FaceID-GAN is a novel deep generative adversarial network that generates faces of arbitrary viewpoints while preserving identity. Zhao et al. [20] propose a Dual-Agent Generative Adversarial Network (DA-GAN), which improves the realism of a face simulator's output using unlabeled real faces while preserving the identity information during the realism refinement. Similar to FaceID-GAN, DA-GAN introduces an extra discriminator as an identity classifier to preserve identity information. In contrast to FaceID-GAN, DA-GAN introduces a simulator to extract the face RoI, perform face/background segmentation, localize landmark points, and produce synthetic faces with arbitrary poses, which are then fed to DA-GAN for realism refinement. This method has a complete but complex pipeline for unconstrained face recognition.

C. ADVERSARIAL-BASED DOMAIN ADAPTATION
In machine learning, if the training data is a biased sample, standard classifiers will not perform well. This kind of domain gap widely exists in practical applications. To cope with the domain gap, a wide variety of approaches have been proposed, including importance weighting, subspace mapping, domain-invariant spaces, and feature augmentation. Moreover, with the rise of GANs, adversarial-based domain adaptation has drawn attention from researchers. These approaches can be divided into two categories according to whether paired data is needed.

1) PAIRED IMAGE TRANSFER
This kind of approach requires paired training data and shares the same setting, i.e., mapping pixels to pixels. It goes back at least to Hertzmann et al.'s Image Analogies [21], which employs a non-parametric texture model on a single input-output training image pair. More recent approaches, like cGAN (pix2pix) [8], aim to solve the problem of translating an input image into a corresponding output image. Since the release of pix2pix, it has become clear that GANs can serve as a general-purpose solution for image-to-image translation.

2) UNPAIRED IMAGE TRANSFER
pix2pix has shown its ability to translate images at the pixel level, but pairing samples is time-consuming and labor-intensive, and in most cases paired samples do not exist. Researchers have therefore turned to unpaired image transfer. CycleGAN [18] and DualGAN [12] are outstanding works that solve general image translation problems without paired training data. Both introduce an extra loss, cycle consistency loss and reconstruction loss respectively, to guarantee the effectiveness of transfer in both directions. Different from DualGAN and CycleGAN, CoGAN [6] learns a joint distribution over images from two domains. By sharing the weight parameters corresponding to high-level semantics in both the generative and discriminative networks, CoGAN can force the two GANs to interpret these image semantics in the same way. PT-GAN [22] is designed to bridge the domain gap in person re-identification. Its notable contribution is that it does not require person identity labels on the target domain, because it adopts PSPNet to segment the person body region and combines the extracted body region with the fixed background of the target domain. Experimental results show that PT-GAN generates images of substantially higher quality while preserving identity.
Our work is inspired by the image generation capability of GANs and applies a GAN to domain transfer and data augment. CycleGAN is a general solution for unpaired image transfer; its qualitative results cover several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, and photo enhancement. We therefore adopt the objective of CycleGAN together with the identity loss and the pose loss to achieve identity preservation and data augment in addition to domain transfer.

III. UNPAIRED DOMAIN TRANSFER GAN
To make face recognition better suited to unconstrained environments, we take GAN as the basic framework and improve it. The generator and the discriminator compete with each other until the generator learns the mapping between domains and the discriminator cannot tell whether its input is real or fake for the target domain. However, during generation, the features of the synthesized face may be biased compared with the input, in which case the recognition system would fail. An identity classifier is therefore added to the conventional GAN to keep the feature vector of the synthesized face close to the feature distribution of the real face. Meanwhile, we introduce a pose estimator to ensure that the synthesized face has a specific viewpoint.
Assume arbitrary samples $x_i$, $i = 1, \ldots, N$, from the source domain $X$ and $y_j$, $j = 1, \ldots, M$, from the target domain $Y$. We denote the data distributions as $x \sim p_{data}(X)$ and $y \sim p_{data}(Y)$. The superscript represents the status of an image, e.g., $x^r$ represents a real image in the source dataset and $x^s$ represents a synthesized image.

A. STYLE TRANSFER
To address the performance drop caused by the domain gap, we propose the model DT-GAN to bridge the gap between the source domain and the target domain. Its objective has the same structure as the objective in CycleGAN [18], and a reversion loss is applied to constrain the mapping between domains.
VOLUME 8, 2020
The generator and the discriminator are trained from scratch, and the objective consists of three parts: the adversarial losses of the two transfer directions and the reversion loss.
The main task of the generator is to synthesize images that can fool the discriminator, so the most intuitive measurement is to compare the synthesized image with the paired real image pixel by pixel. However, there are no paired images in the source domain and the target domain. The conventional adversarial loss [1] is therefore used to train both the generator and the discriminator instead of a pixel-wise loss. For the transfer direction from the source domain to the target domain, $G_{XY}$ aims to minimize the adversarial objective against an adversary $D_Y$ that tries to maximize it, i.e., the loss is optimized as $\min_{G_{XY}} \max_{D_Y}$. By the symmetry of GAN, the adversarial loss for the other direction is optimized as $\min_{G_{YX}} \max_{D_X}$. However, with only the adversarial loss and unpaired datasets, the network could learn a trivial mapping that maps a set of input images to any random permutation of images in the target domain. Thus, it cannot guarantee that the learned function captures the intended mapping between domains. Inspired by dual learning in machine translation [23], the reversion loss is introduced to reconstruct the original image from the synthesized one, i.e., to encourage $G_{YX}(G_{XY}(x)) \approx x$ and $G_{XY}(G_{YX}(y)) \approx y$.
By minimizing the reversion loss, we train the reversion mappings, $G_{XY} \circ G_{YX}$ and $G_{YX} \circ G_{XY}$, and make the reconstructed image look similar to the original. Together with the adversarial losses, the reversion loss allows the discriminator and the generator to be fully trained.
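For reference, the style transfer objective described in this subsection can be written out as follows. This is a sketch under the assumption that DT-GAN keeps the CycleGAN [18] form it is said to share; in particular, the $\ell_1$ norm in the reversion term and the placement of the weight $\lambda$ on that term are carried over from CycleGAN and are not confirmed by the text.

```latex
\begin{align}
\mathcal{L}(G_{XY}, G_{YX}, D_X, D_Y)
  &= \mathcal{L}_{GAN}(G_{XY}, D_Y, X, Y)
   + \mathcal{L}_{GAN}(G_{YX}, D_X, Y, X)
   + \lambda \, \mathcal{L}_{rev}(G_{XY}, G_{YX}) \\
\mathcal{L}_{GAN}(G_{XY}, D_Y, X, Y)
  &= \mathbb{E}_{y \sim p_{data}(Y)} \left[ \log D_Y(y) \right]
   + \mathbb{E}_{x \sim p_{data}(X)} \left[ \log \left( 1 - D_Y(G_{XY}(x)) \right) \right] \\
\mathcal{L}_{rev}(G_{XY}, G_{YX})
  &= \mathbb{E}_{x \sim p_{data}(X)} \left[ \lVert G_{YX}(G_{XY}(x)) - x \rVert_1 \right]
   + \mathbb{E}_{y \sim p_{data}(Y)} \left[ \lVert G_{XY}(G_{YX}(y)) - y \rVert_1 \right]
\end{align}
```

The adversarial loss for the direction $Y \to X$ follows from the second line by exchanging the roles of $X$ and $Y$.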

B. IDENTITY RESERVING
Considering style transfer alone may result in ambiguous features in the transferred faces and a changed identity label. One of the most direct and effective solutions is to take advantage of an identity classifier. The classifier is pre-trained with LFW and fine-tuned during face generation. The feature vector of the synthesized face is denoted $f^s$, and the objective [29] is defined with respect to $c_i$, the feature vector of the clustering center of the real faces of class $i$. During training, $G$ is adjusted to push the synthesized face closer to the center. The main advantage of this objective is that it makes samples with the same identity compact while separating samples with different identities. Fig. 2 visualizes the superiority of the identity loss and Fig. 3 gives the detailed training process of the identity classifier.
FIGURE 2. Identity loss. The left shows a situation in which the distance between the real and the generated face is taken as the identity loss. The generated face is closer to another class in Euclidean distance, although the distance between the real and the generated face is acceptable. Thus the classes overlap and classification fails. The right shows the objective of this paper. Best viewed in color: blue denotes real data and red denotes generated data. Points with the same shape have the same identity.
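The idea of pulling synthesized features towards the class center can be illustrated with a small sketch. The exact form of the identity objective is not reproduced here, so the code assumes a center-style loss, namely half the squared Euclidean distance between $f^s$ and the center $c_i$ of its class, averaged over the batch. The function names `class_centers` and `identity_loss` are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def class_centers(real_features: Sequence[Vector],
                  labels: Sequence[int]) -> Dict[int, List[float]]:
    """Mean feature vector of the real faces of each identity (the centers c_i)."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for feature, label in zip(real_features, labels):
        acc = sums.setdefault(label, [0.0] * len(feature))
        for k, value in enumerate(feature):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def identity_loss(synth_features: Sequence[Vector],
                  labels: Sequence[int],
                  centers: Dict[int, List[float]]) -> float:
    """Assumed form: batch mean of 0.5 * ||f_s - c_i||^2 (not confirmed by the text)."""
    total = 0.0
    for feature, label in zip(synth_features, labels):
        center = centers[label]
        total += 0.5 * sum((a - b) ** 2 for a, b in zip(feature, center))
    return total / len(synth_features)
```

Measuring the distance to the class center rather than to the single real input is what distinguishes the right-hand case of Fig. 2 from the left-hand one.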

C. DATA AUGMENT
Another obstacle is the shortage of training data, which may lead to over-fitting. To enrich the synthesized faces in the target domain, we adopt a pose estimator to generate faces with different viewpoints for data augment. The pose estimator is based on OpenCV and Dlib. First, we define a 3D face model with six key points, namely the nose tip, the chin, the left corner of the left eye, the right corner of the right eye, the left corner of the mouth, and the right corner of the mouth. Dlib offers a model that detects 68 key points, from which we index the six key points for the 3D face model. Then the solvePnP function of OpenCV calculates the rotation vector from the 3D face model and the detected key points, and the rotation vector is converted into three Euler angles: pitch, yaw, and roll. In our study, we would like to generate faces with arbitrary yaw angles and unaltered pitch and roll. Assume the target yaw angle is $p_y$ and the yaw angle of the synthesized face is $p^s_y$. The pitch $p^s_p$ and roll $p^s_r$ angles are fixed to zero degrees to simplify the generation process. The pose objective is then expressed in terms of the difference between $p^s_y$ and $p_y$.
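The step from a rotation vector to pitch, yaw, and roll can be sketched without any external package. The code below is an illustration only: it applies the Rodrigues formula to a rotation vector of the kind solvePnP returns, and then decomposes the rotation matrix under the assumed convention $R = R_z(\text{roll}) R_y(\text{yaw}) R_x(\text{pitch})$ in camera coordinates. The convention, the helper names, and the absolute-difference `yaw_error` are assumptions, not details given in the text.

```python
import math
from typing import List, Sequence, Tuple


def rodrigues(rvec: Sequence[float]) -> List[List[float]]:
    """Rotation vector (axis * angle, radians) to a 3x3 rotation matrix."""
    theta = math.sqrt(sum(v * v for v in rvec))
    if theta == 0.0:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    x, y, z = (v / theta for v in rvec)
    # Skew-symmetric cross-product matrix of the unit rotation axis.
    k = [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]
    k2 = [[sum(k[i][m] * k[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    s, c = math.sin(theta), math.cos(theta)
    return [[(1.0 if i == j else 0.0) + s * k[i][j] + (1.0 - c) * k2[i][j]
             for j in range(3)] for i in range(3)]


def euler_angles_deg(r: Sequence[Sequence[float]]) -> Tuple[float, float, float]:
    """(pitch, yaw, roll) in degrees, assuming R = Rz(roll) @ Ry(yaw) @ Rx(pitch)."""
    pitch = math.atan2(r[2][1], r[2][2])
    yaw = math.atan2(-r[2][0], math.hypot(r[0][0], r[1][0]))
    roll = math.atan2(r[1][0], r[0][0])
    return math.degrees(pitch), math.degrees(yaw), math.degrees(roll)


def yaw_error(p_s_y: float, p_y: float) -> float:
    """Assumed pose penalty: absolute yaw difference in degrees."""
    return abs(p_s_y - p_y)
```

For example, `euler_angles_deg(rodrigues([0.0, math.radians(30), 0.0]))` describes a pure 30 degree yaw, so `yaw_error` against a target of 45 degrees is 15.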

IV. EXPERIMENTAL RESULTS
The experiments in this paper are conducted in two aspects. The first evaluates the effectiveness of DT-GAN in both domain transfer and data augment. The second examines the performance improvement of the face recognition model trained with the transferred and augmented dataset.

A. IMPLEMENTATION DETAILS
1) NETWORKS
We adopt U-Net [24] as the generator, and Fig. 4 shows the details of the architecture. For the discriminator networks, we use PatchGAN [25]. The difference between a PatchGAN and a regular GAN discriminator is that the regular discriminator maps an $M \times M$ image to a single scalar output, which signifies "real" or "fake". In contrast, the PatchGAN maps an $M \times M$ image to an $N \times N$ array of outputs $X$, where each $X_{ij}$ indicates whether patch $ij$ in the image is "real" or "fake". PatchGAN encourages a more realistic output. The generator and discriminator are trained from scratch with a learning rate of 0.0001, while the classifier is a pre-trained model based on VGG16 [26] and is fine-tuned during training. For all the experiments, we set $\lambda = 10$ in (1) and $\beta = 20$ in (6). We use the Adam solver with a batch size of 10. Weights are initialized from a Gaussian distribution $\mathcal{N}(0, 0.01)$.
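To make the $M \times M$ to $N \times N$ mapping concrete, the following sketch computes the output grid size and the receptive field of one patch. The layer layout is an assumption: it is the common 70 x 70 PatchGAN from pix2pix (five 4 x 4 convolutions with strides 2, 2, 2, 1, 1 and padding 1), which the text does not confirm for DT-GAN. The names `PATCHGAN_LAYERS`, `patch_grid`, and `receptive_field` are hypothetical.

```python
from typing import List, Tuple

# Assumed layout (kernel, stride, padding) of the 70x70 PatchGAN from pix2pix.
PATCHGAN_LAYERS: List[Tuple[int, int, int]] = [
    (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 1), (4, 1, 1),
]


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of one convolution layer."""
    return (size + 2 * padding - kernel) // stride + 1


def patch_grid(m: int, layers: List[Tuple[int, int, int]] = PATCHGAN_LAYERS) -> int:
    """Side length N of the N x N output array for an M x M input."""
    for kernel, stride, padding in layers:
        m = conv_out(m, kernel, stride, padding)
    return m


def receptive_field(layers: List[Tuple[int, int, int]] = PATCHGAN_LAYERS) -> int:
    """Side length of the input patch seen by one output element."""
    field, jump = 1, 1
    for kernel, stride, _ in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field
```

Under this assumed layout, a 160 x 160 face crop yields an 18 x 18 array of decisions, each covering a 70 x 70 patch of the input.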

2) LOSS FUNCTIONS
We discussed the theoretical feasibility of the approach above. Nevertheless, there is no direct connection between the generator and the identity classifier or the pose estimator, so training is hard to converge when the generator is trained only with the adversarial loss. In practice, the loss functions are therefore modified, where $\alpha_1$, $\alpha_2$, $\alpha_3$ are weight parameters between the components, $D(\cdot)$ denotes the output of the PatchGAN, $E$ is the matrix whose elements are all 1, and $\mathrm{sum}(\cdot)$ calculates the sum of the matrix elements. For the synthesized face, $G$ competes with $D$ by producing high-quality faces that minimize $L_G$. $G$ is also trained to reduce the distance between the identity feature $f^s$ and the clustering center of the expected class, and to minimize the difference between the viewpoint of the synthesized face and the expected angle, so that identity is preserved while the pose changes.
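How the three components could be combined for the generator is sketched below. This is purely illustrative: the modified loss functions are not reproduced in this text, so the adversarial term is assumed to be a least-squares penalty $\mathrm{sum}((D(G(x)) - E)^2)$ over the PatchGAN output, and the pairing of $\alpha_1$, $\alpha_2$, $\alpha_3$ with the adversarial, identity, and pose terms is likewise an assumption. All function names are hypothetical.

```python
from typing import Sequence


def adversarial_term(d_out: Sequence[Sequence[float]]) -> float:
    """Assumed form: sum((D(G(x)) - E)^2), with E the all-ones matrix."""
    return sum((v - 1.0) ** 2 for row in d_out for v in row)


def generator_loss(d_out: Sequence[Sequence[float]],
                   identity_distance: float,
                   pose_difference: float,
                   alpha1: float = 1.0,
                   alpha2: float = 1.0,
                   alpha3: float = 1.0) -> float:
    """Weighted sum of adversarial, identity, and pose components (assumed pairing)."""
    return (alpha1 * adversarial_term(d_out)
            + alpha2 * identity_distance
            + alpha3 * pose_difference)
```

The point of the sketch is only that the identity and pose terms enter the generator's own loss, which gives the generator the direct training signal that the adversarial loss alone does not provide.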

3) DATASETS
The datasets involved in this paper can be divided into two types, i.e., source domains and target domains. A source domain is gathered under random conditions with different styles, poses, and illumination. In contrast, a target domain is collected by specific equipment and represents one scene in real scenarios. The source domains involved in our experiments are LFW, CASIA WebFace [17], CASIA 3D Face [30], and a self-established dataset. LFW is one of the most widely used public benchmarks for face verification, also known as pair matching. CASIA WebFace is a face image set collected semi-automatically from the Internet, containing 10,575 subjects and 494,414 images. The size of this dataset ranks second in the literature, smaller only than the private dataset of Facebook (SCF). CASIA 3D Face is a 3D face database collected by a non-contact 3D digitizer under a constrained environment, consisting of 4624 scans of 123 persons. When building the database, the authors considered not only single variations of pose, expression, and illumination but also combined variations of expressions. In all the self-established datasets, samples are captured by cameras, and a face detector then extracts the face region at a size of 160 × 160. The total number of identities in the source domain is 78, and each identity has more than ten bounding boxes. The target domains are all self-established and are also captured by surveillance cameras. The images in target domains No. 1 to No. 4 are collected by indoor cameras, which have relatively stable illumination, while the images in target domains No. 5 to No. 8 are gathered by outdoor cameras. This setting is consistent with the needs of real applications. Table 1 lists the statistics of all the datasets in detail.

B. PERFORMANCE ON STYLE TRANSFER AND IDENTITY RESERVING
The domain transfer is tested between two datasets that have different conditions, e.g., illumination, resolution, and location of the camera. Each generator represents a relationship across domains. The transferred images preserve most features of the face while fitting the style of the target domain. With the assistance of domain transfer, the source domain can be transferred to other application scenarios without setting up a dataset from the very beginning. In this part, we transfer the face from the source domain to match the style of the target domain and keep the identity. The viewpoint of the transferred face is the same as the original. The transferred face is then reconstructed to assess the performance of the reconstruction. Fig. 5 gives some visual experimental results of one domain transfer. Both the source domain and the target domain are self-established. The original images are samples from the source domain, the transferred images are the output of the transfer generator $G_{XY}$, and the reconstructed images are the output of the reconstruction generator $G_{YX}$. To assess the similarity between faces, we employ a FaceNet [27] trained on LFW, which is entirely independent of this work and thus makes the results convincing. This model is also applied in the remaining experiments. FaceNet extracts the feature vectors, and the distance between them is calculated. The number on the top-left represents the similarity to the original.
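The comparison of two feature vectors can be illustrated as follows. The sketch assumes L2-normalized embeddings compared by Euclidean distance, with cosine similarity as one possible similarity score; how the distance is mapped to the similarity number shown in the figures is not stated, so that mapping is an assumption, and the function names are hypothetical. No real FaceNet API is used here; the inputs are plain lists of floats.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def embedding_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between the L2-normalized embeddings."""
    na, nb = l2_normalize(a), l2_normalize(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(na, nb)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, used here as an assumed similarity score."""
    na, nb = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(na, nb))
```

For unit vectors the two quantities are tied together by $d^2 = 2 - 2\cos$, so either one ranks face pairs in the same order.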
In Fig. 5, it is not easy to see the style difference between the source domain and the target domain due to the small size and low resolution of the images, so Fig. 6 gives the histograms of the images. The greyscale distribution of the reconstructed image is more similar to that of the original than the distribution of the transferred image is, which verifies the effectiveness of the transfer and reconstruction generators.
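The histogram comparison can also be quantified with a short sketch. The text compares the histograms visually; the histogram intersection used below is an assumed similarity measure added for illustration, and the function names are hypothetical. Pixels are taken as a flat sequence of 8-bit greyscale values.

```python
from typing import List, Sequence


def gray_histogram(pixels: Sequence[int], bins: int = 256) -> List[float]:
    """Normalized histogram of 8-bit greyscale values (0..255)."""
    if not pixels:
        raise ValueError("empty image")
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1
    return [c / len(pixels) for c in counts]


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Overlap of two normalized histograms: 1.0 if identical, 0.0 if disjoint."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

With such a measure, the observation above corresponds to the intersection between the original and the reconstructed image being higher than that between the original and the transferred image.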
We also evaluate DT-GAN on the open datasets. Fig. 7 shows some transferred results from LFW to target domain 2. Fig. 8 shows some transferred results from CASIA WebFace to target domain 5. Fig. 9 shows [...]
The DT-GAN model is proposed to achieve domain transfer from a large dataset to a small dataset, so we compare the performance of style transfer among CycleGAN [18], DualGAN [22], DiscoGAN [28], CoGAN [6], and DT-GAN under this condition. Fig. 10 gives the results of the comparison. The scores are the average similarity of identity between the original and transferred [...]
Faces in CASIA 3D Face have various poses, including frontal views and diversified yaw, pitch, and roll. We compare the faces generated by DT-GAN with the real samples from the CASIA 3D Face dataset under one domain transfer in Fig. 12. The input of the generator is the frontal image in CASIA 3D Face, and the output is generated faces with seven different viewpoints. Fig. 13 gives the scores, which indicate the average similarity of identity between the original face and [...]
[...] of face recognition under the unconstrained environment of the surveillance system. So in this part, we verify the effectiveness of domain transfer and data augment in improving the performance of face recognition. The recognition baseline is also FaceNet. The model is trained separately with the open datasets LFW, CASIA WebFace, and CASIA 3D Face, the self-established source domain, 30% of the target domain, the transferred datasets, and the transferred datasets with data augment, and is tested on the rest of the corresponding target domain. The mAP and Rank-1 are adopted for evaluation. Table 2 compares the recognition performance of FaceNet trained with different training sets. For all the testing sets, the model trained with the open dataset LFW performs worst.
The recognition almost fails, with mAP ranging from 32.6% to 40.2%, which implies a substantial domain gap between LFW and the target domains. The models trained with CASIA WebFace and CASIA 3D Face show better performance on the testing sets, but they are still far from the practical demands of an intelligent surveillance system. The reason lies in the quantity and quality of the training data. In LFW, some identities have only one image, so the trained model suffers from over-fitting. Although the number of samples in CASIA WebFace is large, its examples are acquired from websites, and most faces are high-resolution frontal images with make-up. The photos of CASIA 3D Face are gathered under a constrained environment that does not fit the real scenario. Due to the limited number of samples, the model trained with the target domain does not show better performance either. However, with domain transfer and data augment, the mAP and Rank-1 are both boosted visibly. Thus we can conclude that it is valuable and necessary to conduct domain transfer and data augment in real scenarios.
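For readers unfamiliar with the two metrics, the following sketch shows how Rank-1 and mAP are commonly computed from ranked retrieval results. It assumes the standard retrieval definitions, since the evaluation protocol is not spelled out in the text; each query is represented by a list of booleans marking whether the gallery item at each rank shares the query's identity. The function names are hypothetical.

```python
from typing import Sequence, Tuple


def average_precision(ranked_matches: Sequence[bool]) -> float:
    """Mean of precision@k taken at every rank k that holds a correct match."""
    hits = 0
    total = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def rank1_and_map(all_rankings: Sequence[Sequence[bool]]) -> Tuple[float, float]:
    """Rank-1 accuracy and mean average precision over all queries."""
    n = len(all_rankings)
    rank1 = sum(1.0 for r in all_rankings if r and r[0]) / n
    mean_ap = sum(average_precision(r) for r in all_rankings) / n
    return rank1, mean_ap
```

Rank-1 only checks the top result of each query, whereas mAP rewards placing all correct matches early, which is why the two numbers can differ noticeably for the same model.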

V. CONCLUSION AND DISCUSSION
In this paper, we aim to solve two problems of face recognition in real scenarios, primarily video surveillance: the shortage of training data and the domain gap. With GAN as the baseline, we propose an original framework, DT-GAN, to learn the mapping between very different domains without any explicit paired labels. An extra identity loss is employed to preserve identity, and pose estimation is added to the baseline to assist face generation. The experimental results indicate that the domain gap narrows and that the samples are enriched seven times, resulting in improved face recognition performance. Further study is needed, including a study of the verisimilitude of the generated faces and more comparative experiments on face generation.
Furthermore, 3D face recognition has become popular in practical applications. Training, verification, and testing require large numbers of 3D face images, but capturing 3D images of faces is difficult. One way to solve this problem is neural network-based 3D face reconstruction, which reconstructs a 3D face from 2D faces. DT-GAN can generate faces with an arbitrary viewpoint, and the generated images can be used to reconstruct the 3D face. Our work is therefore also valuable for 3D face reconstruction.