Graph Convolutional Adversarial Network for Human Body Pose and Mesh Estimation

This paper studies the reconstruction of human body shape and pose from a single-view image. While most current work attempts to regress the parameters of a human body model such as the Skinned Multi-Person Linear model (SMPL) or the Hand Model with Articulated and Non-rigid Deformations (MANO), these parametric approaches underperform compared to non-parametric ones. Because the mapping to a parametric space discards spatial relationships present in the input image, parametric approaches can hardly reconstruct the human body precisely; moreover, regressing rotation parameters is a complex task. We therefore introduce a novel graph convolutional neural network (Graph CNN)-based framework for estimating a non-parametric mesh model. Our key innovation is that the proposed model is trained in a generative adversarial manner. First, the Graph CNN exploits the mesh topology to capture the integral information of the full 3D human shape and then generates a smoother, higher-quality human mesh model. Second, the discriminator in our network acts as a supervisor that judges whether a human shape and pose are real. The generator is thereby encouraged to generate human body meshes that lie close to the manifold of the real human mesh distribution. Extensive experimental results demonstrate the effectiveness of the proposed framework: compared with state-of-the-art methods, our method achieves better performance in human shape and pose estimation.


I. INTRODUCTION
Reconstructing 3D geometry from a 2D image has made a dramatic leap, particularly for human body shape reconstruction. Human body shape reconstruction plays a necessary role in many scenarios such as virtual reality games, movie special effects, product customization, and augmented reality. Traditionally, a 3D human body or object model is attained with a binocular camera or a 3D scanner. However, these approaches require expensive devices and work only when the object is not occluded, so they are difficult to use widely in real scenarios. Recently, deep learning has made great progress in many applications, and many methods for 3D shape reconstruction have been proposed. By taking advantage of neural networks, these methods represent 3D shapes as point clouds [1]-[3], voxels [4], [5], parametric models [6], or mesh models [7].
However, human body reconstruction is not well addressed in the monocular case because of the wide range of imaging conditions and the lack of depth information. Most previous approaches use a parametric model for human pose estimation, e.g., the 3D SMPL model [6] or MANO [8]. Given a single image of a person, they estimate the parameters of the human body model that conform to the human pose in the image. Although recent parameter-based methods have achieved significant improvements, they still have two major shortcomings. First, the particular parametric space has strong limitations: because SMPL models only a bare human body, it cannot represent details such as hair, facial expressions, and clothes. Second, the mapping from a high-dimensional image space to a particular parametric space discards the spatial relationships of the pixels in the input image. These shortcomings make training difficult and, as a result, reduce reconstruction accuracy, as addressed by Moon et al. [9] and Choi et al. [10]. Therefore, SMPL parameters might not be an appropriate regression target.

(The associate editor coordinating the review of this manuscript and approving it for publication was Juan A. Lara. VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)

Unlike previous work, we adopt a human body mesh model for reconstructing human shape and pose: our regression target is a mesh with n vertices, where 3D coordinates define the location of each vertex. Our goal is to properly reconstruct the complete 3D geometry of the human body from monocular images. In this paper, we propose a novel end-to-end approach to reconstruct the 3D human body mesh model. The proposed framework contains an encoder, a generator, and a discriminator. The feature vector extracted by the encoder from the input image is attached to the template mesh vertices. Then, the graph convolutional neural network (Graph CNN) in the generator carries out the deformation of the mesh.
In the last layer, the Graph CNN reconstructs the complete 3D geometry of the human body without SMPL parameters. Both the generated 3D model and the ground-truth human body model are sent to the discriminator, which is responsible for determining whether a 3D mesh model comes from the generator or not. Our key contribution is to generate 3D human body models in a generative adversarial manner: our framework leverages not only traditional heuristic criteria but also an adversarial criterion. In our generative adversarial network, the generator is encouraged to output meshes closer to the manifold of real human bodies, which helps it learn the real data distribution and avoid outputting people with abnormal body shapes. To improve the performance of the discriminator, we introduce graph convolution into it. This architecture explicitly encodes the mesh structure and utilizes the spatial information of the human mesh. Together, these components produce competitive reconstruction results compared with state-of-the-art methods.
The major contributions of our work are summarized as follows: • We propose an end-to-end framework based on a Graph CNN, which predicts the 3D location of each mesh vertex directly. This avoids computing the keypoint loss indirectly and removes the need for SMPL parameters.
• We integrate graph convolution into the discriminator, which better exploits the graph topology. Extensive experimental results demonstrate that this is effective and necessary for 3D human reconstruction.
• Experimental results show that our approach outperforms previous state-of-the-art results for human pose and shape estimation on various publicly available datasets.

II. RELATED WORK
In recent years, 3D human reconstruction has been an active topic in artificial intelligence. There are many ways to represent a reconstructed human body, including point clouds [11], simplistic body skeletons [12], [13], posture parameters [14], [15], and mesh models [16], [17]; these methods have different regression goals. To organize the related work, we divide previous human reconstruction approaches into two categories: direct parametric regression and non-parametric regression. Our proposed method belongs to non-parametric regression.

A. DIRECT PARAMETRIC REGRESSION
The SMPL model, proposed by Loper et al. [6], uses pose and shape parameters to represent different human body shapes and poses. Since then, the vast majority of 3D human body estimation works have reconstructed human shape by regressing the SMPL model parameters.
Bogo et al. [14] minimize the loss between the predicted and ground-truth keypoint locations to infer joint locations and SMPL model parameters. Tan et al. [18] design an encoder-decoder network that uses corresponding image and ground-truth silhouette pairs instead of corresponding shape and pose parameter pairs. Madadi et al. [19] take a deep neural network as an encoder and the SMPL model as a decoder; the denoising autoencoder can recover a 3D pose from structured error. Although they use different methods or models, the above works all take SMPL model parameters as the regression target. In the SMPL model, a human body is expressed by 3D rotation parameters, which are difficult to regress precisely [20] because of their discontinuities and periodicity.

B. NON-PARAMETRIC REGRESSION AND GRAPH CNN
Varol et al. [21] use two networks to predict the 2D pose and the 2D body part segmentation respectively, and combine the two subnetworks into a final network that infers a voxelized 3D body. Convolutional layers are widely used in image processing and have achieved excellent results; however, standard convolutions cannot be applied directly to mesh vertices. Kipf et al. [22] proposed an efficient variant of convolutional neural networks that operates directly on graphs. Verma et al. [23] then use a novel graph-convolution operator to connect filter weights with graph neighborhoods. Litany et al. [24] design a variational autoencoder with graph convolutional operations, which first encodes a partial shape into a latent vector and then reconstructs the complete shape from it. Unlike previous Graph CNNs, Wang et al. [25] use a graph that is not fixed; the connectivity changes between layers, which better captures the local geometric features of point clouds.
Wang et al. [7] propose a Graph CNN-based architecture that generates 3D shapes from an initial ellipsoid. However, the model cannot reconstruct the fine details of complex objects well, and object mesh reconstruction remains an unsolved problem.
Recently, non-parametric regression methods based on graph convolution have also been proposed for human shape reconstruction. Kolotouros et al. [26] regress 3D mesh vertices with a Graph CNN to recover human body shape and posture. Xie et al. [16] extract anthropometric measurements from the original image and regress these parameters to reconstruct a more accurate human body model; since these measurements can only be extracted from a standard A-pose, the applications remain limited. Choi et al. [10] propose Pose2Mesh, which reconstructs a 3D human body mesh from the 2D human pose; the network adds vertices through upsampling and recovers a high-resolution mesh model. The Graph CNN allows explicit encoding of the graph structure, which avoids dependence on the SMPL parameters.

C. GENERATIVE ADVERSARIAL NETWORK
Goodfellow et al. [27] proposed the Generative Adversarial Network (GAN), comprising a generator and a discriminator: through their confrontation, the generated data tends toward the distribution of the real dataset. Brock et al. [28] build a GAN based on scaling; the architecture is deeper than other models yet needs only about half as many parameters, and it generates high-resolution images so realistic that the human eye cannot distinguish them. Shaham et al. [29] first proposed an unconditional generative model trained on a single natural image, which uses a pyramid of fully convolutional GANs to generate images of arbitrary size and aspect ratio; the model maintains the overall structure and texture of the original image while exhibiting clear variability.
For human shape reconstruction, Kanazawa et al. [15] proposed an end-to-end recovery model that also contains a discriminator acting as weak supervision; the model minimizes the joint reprojection loss and the adversarial loss to regress the SMPL model parameters. Alldieck et al. [30] use a PatchGAN discriminator to enhance realism; even without paired accurate 3D poses, it can recover details including hair and clothes. We propose a capable generative model for human body estimation that benefits from the topology of the mesh representation.

III. APPROACH
To reconstruct a full geometric human body model from the input image, we introduce our proposed non-parametric approach for human body estimation. First, we briefly present the model architecture. Then, we introduce the graph convolution layer, a vital component of both the generator and the discriminator. Finally, the loss functions are described in detail.
A. MODEL
Fig. 1 shows an overview of the proposed framework. The first part of our pipeline is an encoder that extracts features from the input RGB image. In our approach, a ResNet-50 convolutional network pretrained on the ImageNet classification task [31] is adopted as the encoder. By taking advantage of transfer learning, a 2048-D image feature vector is extracted and serves as the input of the generator. The template human mesh, with the extracted features attached, is sent to the Graph CNN, which infers the 3D coordinates of the mesh vertices in the generator. The output mesh of the generator is our estimation result and is labeled fake; ground-truth SMPL parameters are used to recover the real human body mesh, labeled real. The last part of our framework is a discriminator. Both the real and the fake mesh are sent to the discriminator, which contains a graph convolutional network and a fully connected layer to estimate the probability that a mesh came from the real human dataset. Since the discriminator encourages the generator to output a smooth, SMPL-like surface, we do not need an additional smoothness loss. In short, the generator is responsible for generating the human shape from the image, and the discriminator for determining whether the input human body mesh comes from the generator or the real dataset. In addition, a Multi-Layer Perceptron (MLP) regresses the SMPL shape and pose parameters from the regressed human mesh, so that our network can serve applications that require SMPL parameters.
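The data flow above can be sketched as a shape walk-through. Everything below is illustrative: the vertex count N, the single-matrix "generator", and the MLP weight matrix are stand-ins for the learned networks, not the actual architecture; only the 2048-D feature and the 10-D shape / 3K-D pose outputs come from the paper (K = 24 joints is the standard SMPL assumption).

```python
import numpy as np

rng = np.random.default_rng(0)

N = 1723                                  # template-mesh vertex count (assumed)
feat = rng.standard_normal(2048)          # ResNet-50 image feature (2048-D)
template = rng.standard_normal((N, 3))    # template mesh vertices

# Attach the global image feature to every template vertex:
gen_in = np.concatenate([template, np.tile(feat, (N, 1))], axis=1)

# Stand-in for the Graph CNN generator: per-vertex features -> 3D coordinates.
W_gen = rng.standard_normal((gen_in.shape[1], 3)) * 0.01
V_pred = gen_in @ W_gen                   # predicted mesh, shape (N, 3)

# Stand-in for the MLP head: mesh -> SMPL shape/pose parameters.
W_mlp = rng.standard_normal((N * 3, 10 + 72)) * 0.01
params = V_pred.reshape(-1) @ W_mlp
beta, theta = params[:10], params[10:]    # shape (10,) and pose (3K,)
```

The predicted mesh would go to the discriminator, and (beta, theta) serve downstream applications that require SMPL parameters.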
There are two main functions of the discriminator. First, it prevents the generator from outputting human meshes with abnormal body shapes, because it can distinguish human meshes by body shape. Second, under heavy occlusion of body parts, it encourages the generator to output reasonable human shapes and limb positions.
Kanazawa et al. [15] take the SMPL parameters as the input of their discriminator. In contrast, the input of ours is a 3D mesh model. This allows the discriminator to use the topology of the graph to capture the complete information of the human body rather than bare SMPL parameters. We believe the human mesh is a better representation; the experimental results in Section IV-C also support this.

B. GRAPH CNN
The 3D coordinates of the mesh vertices are predicted by the Graph CNN. A 3D mesh can be defined as an undirected graph G = {V, E, X}, where V is the set of mesh vertices with 3D coordinates, E is the set of edges, and X is the set of per-vertex feature vectors.
The graph convolution layer proposed by Kipf et al. [22] is defined as

Y = ÃXW, (1)

where A ∈ R^(N×N) is the adjacency matrix of the undirected graph, Ã is the row-normalized matrix of A, X ∈ R^(N×k) is the input feature matrix, and W ∈ R^(k×l) is the trainable weight matrix. To speed up training, we add residual connections and Group Normalization [32] to the graph convolution layer; they are combined into a Residual Graph Convolution Block based on the residual block of [31]. The details of a residual block are shown in Fig. 2. The generator and the discriminator both consist of a series of residual blocks. Inspired by Kolotouros et al. [26], we integrate graph convolution layers in both the generator and the discriminator. In the generator, the Graph CNN takes mesh vertices and feature vectors as input; the graph convolution layers are equivalent to deforming the mesh, and the goal is to estimate the 3D coordinates of each vertex. In the discriminator, the Graph CNN encourages the generator to output 3D human shapes closer to the real dataset.
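As a concrete sketch of Eq. (1), the layer reduces to two matrix products once Ã is formed. The 4-vertex path graph below is purely illustrative:

```python
import numpy as np

def graph_conv(X, A, W):
    """One graph convolution layer, Y = A_tilde X W (Eq. 1).

    A is the adjacency matrix with self-loops added; row-normalizing it
    makes each output vertex an average over itself and its neighbours,
    followed by the shared linear map W.
    """
    A_tilde = A / A.sum(axis=1, keepdims=True)   # row-normalize A
    return A_tilde @ X @ W

# Tiny 4-vertex path graph (v0-v1-v2-v3) with self-loops:
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)

X = np.ones((4, 2))                  # constant per-vertex features
Y = graph_conv(X, A, np.eye(2))      # averaging a constant leaves it unchanged
```

In the actual blocks, this layer would be wrapped with a residual connection and Group Normalization, roughly X + GN(graph_conv(X, A, W)) when the input and output widths match.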

C. RECONSTRUCTION AND ADVERSARIAL LOSS
We assume that V is the set of ground-truth vertices generated from the SMPL parameters and V̂ ∈ R^(N×3) is the predicted 3D mesh vertices. Because the human body mesh is inferred directly, we can minimize the L1 loss between predicted and corresponding ground-truth mesh vertices, i.e.,

L_3D = ||V̂ − V||_1. (2)

The Graph CNN in the generator outputs 3D mesh vertices directly. We then use the MLP to regress the SMPL parameters from the inferred mesh, including shape β ∈ R^10 and pose θ ∈ R^(3K). We apply an L2 loss on this regression task, i.e.,

L_smpl = ||β̂ − β||²₂ + ||θ̂ − θ||²₂, (3)

where β̂ and θ̂ are the predicted parameters and β and θ are the ground-truth shape and pose. In this way, the human body is also represented in the parametric space.
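A minimal sketch of the two reconstruction terms follows; the exact reduction (mean over vertices here, rather than a plain sum) is an assumption, since the paper only specifies the norms:

```python
import numpy as np

def l3d_loss(V_pred, V_gt):
    """Per-vertex L1 loss of Eq. (2); mean-over-vertices reduction assumed."""
    return np.abs(V_pred - V_gt).sum(axis=1).mean()

def lsmpl_loss(beta_pred, beta_gt, theta_pred, theta_gt):
    """L2 loss on the regressed SMPL parameters, Eq. (3)."""
    return (np.sum((beta_pred - beta_gt) ** 2)
            + np.sum((theta_pred - theta_gt) ** 2))

V_gt = np.zeros((5, 3))
V_pred = np.ones((5, 3))          # every coordinate off by 1
l3d = l3d_loss(V_pred, V_gt)      # |1|+|1|+|1| per vertex, averaged -> 3.0

lsmpl = lsmpl_loss(np.ones(10), np.zeros(10),
                   np.zeros(72), np.zeros(72))   # 10 * 1^2 + 0 -> 10.0
```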
Using the predicted 3D mesh vertices, we can obtain the 2D and 3D keypoint locations X̂ of the joints. Some datasets provide paired ground-truth keypoint locations; let X denote the ground-truth keypoints. To align the model with the image, we align the predicted joint locations with the ground-truth keypoints. This supervision loss is added to the loss function and is defined as

L_joints = ||X̂ − X||_1. (4)

Let L_recon denote the reconstruction loss of the generator; it includes the 3D mesh vertex loss, the SMPL parameter loss, and the joint loss.
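In SMPL-style pipelines, joints are typically read off the mesh as a fixed sparse linear combination of vertices, X̂ = J V̂; the tiny regressor J below is hypothetical, as is the mean reduction in the loss:

```python
import numpy as np

def joints_from_mesh(V_pred, J):
    """Hypothetical joint readout: each joint is a fixed linear
    combination of mesh vertices (rows of J sum to 1)."""
    return J @ V_pred                      # (num_joints, 3)

def ljoints_loss(X_pred, X_gt):
    """L1 joint supervision in the spirit of Eq. (4)."""
    return np.abs(X_pred - X_gt).mean()

V = np.array([[0., 0., 0.],
              [2., 0., 0.],
              [0., 2., 0.],
              [0., 0., 2.]])
J = np.array([[0.5, 0.5, 0.0, 0.0],       # a "joint" midway between v0 and v1
              [0.0, 0.0, 0.5, 0.5]])      # another midway between v2 and v3
X_hat = joints_from_mesh(V, J)            # [[1,0,0], [0,1,1]]
```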
L_recon = L_3D + L_smpl + L_joints. (5)

The distance between the real and fake data distributions is measured by the least-squares formulation [33]. Since our framework is trained in a generative adversarial manner, we introduce an adversarial loss to improve the realism of the generated human model. The adversarial loss functions for the model are

L_D = E_{V∼P_data}[(D(V) − 1)²] + E_I[D(G(I))²], L_G = E_I[(D(G(I)) − 1)²], (6)

where I is the input image, and P_G and P_data are the generated human model distribution and the real data distribution, respectively. The adversarial loss makes the generator learn the real data distribution.
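Under the least-squares formulation [33], the two adversarial objectives reduce to the following, where the scores are discriminator outputs on batches of meshes (the function names are illustrative):

```python
import numpy as np

def d_loss(real_scores, fake_scores):
    """Least-squares discriminator loss: push real meshes toward 1, fakes toward 0."""
    return np.mean((real_scores - 1.0) ** 2) + np.mean(fake_scores ** 2)

def g_loss(fake_scores):
    """Least-squares generator loss: make D score generated meshes as real (1)."""
    return np.mean((fake_scores - 1.0) ** 2)

# A perfect discriminator scores real 1 and fake 0, so its loss is 0:
perfect_d = d_loss(np.array([1.0, 1.0]), np.array([0.0, 0.0]))
# A generator that fully fools D incurs no adversarial loss:
fooled_g = g_loss(np.array([1.0, 1.0]))
```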
In traditional GAN applications, the generator generates clear images from random noise. Training a GAN is unstable because it attempts to find a Nash equilibrium in a high-dimensional space; moreover, mode collapse is a serious issue during GAN training, as addressed in Bau et al. [34]. These problems make generative adversarial networks difficult to train. They are mitigated in our GAN because the generator minimizes the reconstruction loss at the same time as the adversarial loss.
Finally, the overall loss function is

L = L_recon + λL_G, (7)

where λ is a hyper-parameter that weights the adversarial loss.

IV. EXPERIMENTS

A. IMPLEMENTATION DETAILS
Datasets: Human3.6M [36] and UP-3D [35] are the training datasets of the proposed approach. Human3.6M is a pose dataset collected indoors, which includes common activities such as walking, exercising, and smoking. UP-3D includes 8515 natural images of humans with rich annotations and correct limb rotation, selected according to successful fits. Both provide SMPL parameters and 3D joint keypoints. We evaluate our proposed model on Human3.6M. The ablation study is run on MPII [37] and LSP [38]. LSP is a dataset of images with 2D pose annotations; each image is annotated with 14 joint locations, and we use its test set as our validation set. MPII is a human pose dataset containing 25K images of 40K people with annotated body joints. We report model results under two common protocols (P1 and P2, as defined in [39]).
We alternate training the generator and the discriminator: we optimize the generator to minimize L_recon and L_G, and then optimize the discriminator to minimize L_D. In our architecture, the generator includes five Residual Graph Convolution Blocks and the discriminator three. We observed that removing residual blocks decreases reconstruction accuracy, while adding more consumes additional resources without improving reconstruction or classification accuracy. For training, the learning rates of the generator and the discriminator are both set to 3 × 10⁻⁴. The hyper-parameter λ is set to 0.1, selected through experiments on the validation set (LSP). Except for the encoder pretrained on ImageNet [40], all other network components (generator, MLP, discriminator) are trained from scratch. We use the Adam solver [41] and train for 50 epochs with mixed data from Human3.6M and UP-3D.
Our proposed framework is trained using PyTorch on a PC with an Intel i9-9900K at 3.6 GHz, 16 GB of memory, and an NVIDIA GeForce RTX 2080 GPU with 8 GB of GPU memory.

B. EVALUATION
We evaluate our method both qualitatively and quantitatively on a variety of datasets.

1) QUANTITATIVE EVALUATION
To quantify the accuracy of the generated human body mesh, we use two metrics: (i) the mean per joint position error (MPJPE) and (ii) the reconstruction error, as defined in [42]. First, we compare our method to two well-performing baselines, namely the HMR method of [15] and the recent approach of Kolotouros et al. [26], which also reconstruct body shape from a single image. For a fair comparison, all methods are trained only on Human3.6M. The detailed results on Human3.6M Protocol 1 and Protocol 2 are shown in Table 1; we observe that the proposed approach yields performance improvements over the baselines. Second, we evaluate all methods on Human3.6M Protocol 2 and report the errors in Table 2. In this experiment, our approach is trained on Human3.6M and UP-3D while the other methods are trained on different combinations of datasets; here our method also outperforms the other baselines.

2) QUALITATIVE EVALUATION
We present a qualitative comparison between our proposed approach and HMR [15] in Fig. 3. We regard HMR as a suitable comparison target because it is also trained in a generative adversarial manner. As the figure shows, our approach provides more accurate limb positions than HMR. However, both methods struggle to reconstruct a precise human body model in complex scenes with occlusions.

C. ABLATION STUDY ON DISCRIMINATOR
Kolotouros et al. [26] demonstrate that graph convolution layers are beneficial compared to fully connected layers in the generator. Our contribution is the insight that the discrimination task on mesh vertices is simplified when graph convolution layers are used in the discriminator. In this subsection, we focus on the effectiveness of graph convolution in the discriminator. The performance of the discriminator plays a vital role during GAN training: if the discriminator cannot distinguish between real and generated samples, the generator cannot learn reliable gradients, whereas a better discriminator provides the generator with higher-quality gradients. We design a series of experiments to show that graph convolutions are effective in the discriminator.
We fix the generator parameters and train only the discriminator on the MPII dataset. For every ten batches of samples, nine are used for training and the remaining one for testing. We compare the discriminator loss of the Graph CNN architecture with that of a Multi-Layer Perceptron (MLP) with Batch Normalization. We use a batch size of 16 and the Adam optimizer, with learning rates of 3 × 10⁻⁴ for the Graph CNN and 1 × 10⁻³ for the MLP. The results are shown in Fig. 4: compared with the MLP, the graph convolutions significantly increase the training speed of the discriminator, achieve a lower loss, and train more stably. To further examine the effectiveness of graph convolution in human body reconstruction, we replace the Graph CNN in the discriminator with an MLP (Mesh + MLP). In another comparative experiment, an MLP regresses SMPL parameters from the predicted 3D mesh, and the predicted and ground-truth SMPL parameters are sent to a discriminator with the MLP architecture (SMPL + MLP). Table 3 shows the classification accuracy of the three architectures after training, and Table 4 the reconstruction results. The architecture using the MLP is basically on par with Kolotouros et al. [26] in reconstruction accuracy, and compared with our original architecture, the alternatives clearly underperform. The results indicate that integrating graph convolutions into the discriminator is useful and that the Graph CNN benefits from the mesh structure.

V. CONCLUSION
In this paper, we introduce an end-to-end generative adversarial framework for human pose and shape estimation from a monocular RGB image. Compared with directly regressing model parameters, graph convolution exploits the topology of the mesh. Extensive experiments demonstrate that our approach outperforms the relevant baselines.
However, our method still has several limitations. The recovered 3D mesh models miss details such as clothes, hair, and facial expressions, although in theory these details can be represented in a non-parametric mesh model. Another problem is that when the mesh has too many vertices, the computational requirements become prohibitive, which prevents us from recovering a higher-resolution model. Future work can focus on these problems.