Pose-Forecasting Aided Human Video Prediction With Graph Convolutional Networks

Human video prediction is still a challenging problem due to the uncertainty of future actions and the complexity of frame details. Recent methods tackle this problem in two steps: first, forecast future human poses from the initial ones; then, generate realistic frames conditioned on the predicted poses. Following this framework, we propose a novel Graph Convolutional Network (GCN) based pose predictor that comprehensively models human body joints and forecasts their positions holistically, together with a stacked generative model with a temporal discriminator that iteratively refines the quality of the generated videos. The GCN-based pose predictor fully considers the relationships among body joints and produces more plausible pose predictions. With the guidance of the predicted poses, a temporal discriminator encodes temporal information into future frame generation to achieve high-quality results. Furthermore, stacked residual refinement generators make the results more realistic. Extensive experiments on benchmark datasets demonstrate that the proposed method produces better predictions than state-of-the-art methods and achieves up to 15% improvement in PSNR.


I. INTRODUCTION
Video prediction is receiving increasing research attention, but it remains challenging due to complex video contexts consisting of both high-level semantics, such as walking and running, and low-level pixel information [1]. For human action videos, the human pose carries much semantic information, which can intuitively be leveraged for accurate video prediction. Many methods use the already generated frames to make further predictions in a recursive paradigm [2]-[9], but the quality of the generated videos degrades over time due to accumulated pixel-level errors. Recent works [1], [10]-[13] adopt a hierarchical framework that disentangles the prediction of high-level semantics from that of low-level pixels, which avoids pixel-level error accumulation and effectively improves the quality of generated videos. However, most of them do not consider the structural relationships among semantic components in the videos. In addition, other techniques are also used to improve the visual quality of generated results, such as integrating temporal information to obtain higher-quality videos [14]-[18] and performing refinement for better results [19], [20]. Meanwhile, several works [21]-[24] aim to generate new realistic scenes rather than predict the real future.

(The associate editor coordinating the review of this manuscript and approving it for publication was Jiachen Yang.)
In this work, we aim to forecast and generate high-quality future human videos from a single frame and a sequence of poses, which can be viewed as a combined problem of future pose forecasting and conditioned video generation. To solve this problem, we fully exploit the relations among human joints and leverage them for better human motion dynamics modeling and forecasting. We build a hierarchical framework containing two novel components for better pose prediction and video generation, as is shown in Figure 1.
The pose prediction module deploys a graph convolutional network (GCN) [25], which is well suited to exploring relations among human body joints through intrinsic graph-based information propagation in representation learning. In particular, the graph takes body joints as nodes and models the relations of these nodes from a global view. For the video generation component, we introduce a stacked residual refinement module consisting of a sequence of generators that iteratively refine future frames over their residuals with respect to the previous stage to achieve better quality. Solving the problem in the residual domain focuses the model on motions and reduces the difficulty. Moreover, we use a multi-stage refinement process which progressively improves the quality. For each generator, we utilize Generative Adversarial Networks (GANs) and integrate a temporal discriminator which encodes temporal information. By distinguishing generated video clips from real ones, the discriminator teaches the generator to produce more temporally fluent results.

FIGURE 1. Framework illustration of our proposed video prediction model. It consists of a pose prediction module and a stacked residual refinement module. The former deploys LSTM and GCN to predict future poses by exploring temporal context and body joint relations, and the output poses are fed into the latter for pose-based video generation. The stacked residual module performs video refinement iteratively based on residual learning. Each stage generates a residual map to refine the final result conditioned on the predicted pose and video frames from the previous stage. The final stage integrates all the residual refinements and outputs the generated video.
Our contributions are summarized as follows: 1) We apply GCN to pose prediction. We find that the GCN achieves accurate prediction by effectively exploiting the relations of human body joints, particularly in the presence of distracting factors such as self-occlusion. 2) We propose to encode temporal information into the generation of each single frame with a temporal discriminator to obtain more temporally fluent results. 3) We design a stacked residual refinement framework that iteratively and significantly improves the realism of the generated videos.

II. RELATED WORK
Traditional video prediction approaches follow a recursive prediction process at the pixel level [7], [9], and their performance often deteriorates due to pixel-level error accumulation. Recently proposed hierarchical prediction methods can avoid pixel-level error accumulation by separating high-level structure (e.g., face landmarks, pose) prediction from structure-analogy video generation. Villegas et al. [1] first made high-level structure predictions, and then generated future videos by visual-structure analogy. Yang et al. [13] proposed a PSGAN for pose prediction and an SCGAN for video generation. Wang et al. [26] aimed at converting an input semantic video, such as videos of human poses or segmentation masks, to an output photorealistic video, and Wang et al. [24] extended it to synthesize videos of previously unseen subjects or scenes by leveraging a few example images of the target at test time. Existing hierarchical prediction works are rather limited and still struggle to generate accurate pose predictions and realistic videos. For pose prediction, recent methods [1], [13], [19] do not explore the relations of joints within the human pose. In the Graph Convolutional Network (GCN) [25], relations among components of a structure are explored via intrinsic graph-based information propagation, which achieves state-of-the-art performance in many domains such as text classification [27], [28], machine translation [29], [30] and image recognition [31]. Some works [32], [33] build visual classification models upon GCN to fuse semantic embeddings and categorical relationships. We apply GCN to encoding the internal associations of the joints to make better predictions in videos.
Video generation is a challenging problem. GAN [34] based generative models, benefiting from the development of computing power, have achieved great success in image generation [35]-[41], but video generation is much harder for GANs due to richer content and the extra temporal dimension. Many works [14], [16], [18], [42] directly apply 3D convolutions to encode temporal information. Ji et al. [14] applied 3D convolutional networks to human action recognition. Vondrick et al. [18] used 3D convolutional networks for video generation. Besides 3D convolution, a coarse-to-fine strategy has also been applied for more realistic results [19], [43]-[45]. Ma et al. [43] used an L1 loss to generate a coarse result, and then used GANs to refine it. Zhao et al. [19] utilized image GANs to generate a coarse result, followed by 3D convolutional GANs for refinement. However, 3D convolutions often lead to fixed-size outputs, which means low scalability, and these coarse-to-fine strategies are usually limited to two stages. We formulate the temporal information through a temporal discriminator in a flexible way. Besides, a stacked residual refinement scheme is devised to recursively refine the target and obtain more realistic results.
III. PROPOSED METHOD
A. PROBLEM FORMULATION
Let x_{1:m} = {x_1, x_2, . . . , x_m} be the sequence from the first to the m-th frame of a human action video, and let p_{1:m} = {p_1, p_2, . . . , p_m} be the sequence of corresponding poses for these frames. Here each pose p_i ∈ R^{2×L} is represented by the 2D coordinates of L human body joints. We aim to solve the problem of predicting the n future video frames x_{m+1}, . . . , x_{m+n} based on p_{1:m} and a random frame x_s from x_{1:m}.
We solve this problem through two phases: we first predict the future poses p_{m+1} to p_{m+n} from p_{1:m} in x-y coordinates and then generate the video frames x_{m+1}, . . . , x_{m+n} based on the forecasted poses, which are converted into heatmaps. Through this two-step strategy, we alleviate the difficulties by disentangling pose modeling from frame appearance modeling and leveraging the pose information as effective guidance for frame generation. Figure 1 provides an architecture overview of our proposed model. The model consists of a pose prediction module and a stacked residual refinement module. The pose prediction module deploys a GCN to make better predictions, since the GCN defines different relations between joints in a graph. For example, hands and shoulders contain more useful information about elbows than knees do, which can guide more efficient information communication with the GCN. The temporal discriminator takes a sequence of generated frames as a whole to increase temporal fluency. Stacked refinements further optimize the results in the residual domain in an iterative way.

B. GRAPH CONVOLUTIONAL NETWORK BASED POSE PREDICTION
Pose prediction is a sequence-to-sequence learning problem, which requires predicting the future poses p_{m+1:m+n} from the observed past ones p_{1:m}. One natural model choice for this problem is the LSTM-based RNN, which has proved very effective in similar problems [1]. However, we observe that the LSTM-RNN does not perform well in the presence of common distracting factors for pose prediction, e.g., self-occlusion and unusual viewpoints. This introduces errors and degrades the quality of relatively far future poses as the errors accumulate. Considering that human poses are well structured, we therefore deploy graph convolutional networks (GCNs) to fully exploit the relations among body joints and address these issues.
The pose prediction module in our proposed model consists of an encoder, an LSTM and a GCN. A linear encoder is applied to convert the pose motion dynamics into hidden states. The LSTM encodes the hidden states and predicts the future poses; the GCN explores the relations among body joints to produce a better pose prediction.
The encoder and LSTM predictor encode p_{1:m} in chronological order, encoding p_t at time t, t ∈ [1, m], into hidden states:

e_t = σ(W_e p_t + b_e),  (h^0_t, c_t) = LSTM(e_t, h^0_{t−1}, c_{t−1}),  (1)

where σ(·) denotes an activation function and ReLU(·) is adopted here. e_t ∈ R^e represents the hidden state of p_t. h^0_t ∈ R^{d_0} represents the observed dynamics up to time t. c_t is the memory cell that retains information from the history of pose inputs. The encoder encodes the poses p_t into hidden states e_t. The LSTM receives the past m encoded poses e_{1:m} and recurrently predicts the future n hidden states h^0_{m+1}, . . . , h^0_{m+n}. Based on the hidden states h^0_t, t ∈ [m+1, m+n], a GCN layer is applied to predict the future poses p_{m+1:m+n}. More specifically, after obtaining h^0_t from the LSTM, a fully connected layer first converts the hidden state h^0_t, which contains holistic features, into a vector h^1_t ∈ R^{d_1}, and a nonparametric reshape operation converts it into a matrix h^2_t ∈ R^{L×d_2}, satisfying d_1 = L × d_2, with h^2_t representing the d_2-dimensional features of each of the L body joints. Then, a GCN layer is integrated as follows:

h^3_t = σ(D̃^{−1/2} Ã D̃^{−1/2} h^2_t W_1),  (2)

where Ã = A + I_N ∈ R^{L×L} is the augmented adjacency matrix of the undirected graph G over the L body joints, with added self-connections. Here I_N is the identity matrix, D̃_ii = Σ_j Ã_ij, and W_* are trainable weight matrices. We build the graph G to capture body joint relations and the human body structure as follows: for each element a_ij ∈ A, we set a_ij = 1 if the corresponding joints i and j are adjacent or symmetric to each other based on human kinematic priors [46], and a_ij = 0 otherwise.
Finally, another GCN layer converts h^3_t into the predicted poses p̂_t, t ∈ [m+1, m+n], where we apply tanh as the activation function and the other parameters are analogous to Eq. (2). The predictor is trained with the standard mean squared error (MSE) loss over the predicted n poses:

L_pose = (1/n) Σ_{t=m+1}^{m+n} ||p̂_t − p_t||^2_2.  (3)
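As an illustration of the graph propagation step above, the normalized-adjacency GCN layer can be sketched in NumPy. This is a minimal sketch: the 3-joint chain, feature size and identity weights are illustrative choices of ours, not the paper's configuration.

```python
import numpy as np

def normalized_adjacency(A):
    """Compute D̃^{-1/2} Ã D̃^{-1/2}, adding self-connections to A."""
    A_tilde = A + np.eye(A.shape[0])          # Ã = A + I_N
    d = A_tilde.sum(axis=1)                   # D̃_ii = Σ_j Ã_ij
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_layer(H, A_hat, W, act=lambda x: np.maximum(x, 0.0)):
    """One GCN layer σ(Â H W); H is (L joints × d features), ReLU activation."""
    return act(A_hat @ H @ W)

# Toy skeleton: 3 joints in a chain (e.g. shoulder-elbow-wrist),
# adjacency set to 1 for connected joints as in the text.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
A_hat = normalized_adjacency(A)
H = np.ones((3, 4))        # d_2 = 4 features per joint (illustrative)
W = np.eye(4)              # identity weights for illustration
H_out = gcn_layer(H, A_hat, W)
```

Because Â mixes each joint's features with those of its neighbors, every row of `H_out` is a weighted average over the joint's graph neighborhood, which is exactly how information about adjacent and symmetric joints propagates.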

C. POSE AIDED VIDEO GENERATION
With the above predicted poses, the video generation module predicts the video frames with realistic contents compatible with the predicted poses. This is achieved by a stacked refinement architecture that generates videos through multi-stage residual learning. Within each generation stage, the generation component predicts a plausible video frame in a residual form.
Concretely, the generation module has an encoder-decoder architecture. In the first generation stage, a past frame with its pose and a future pose are taken as input, and the output is the corresponding future frame. To match each keypoint to a position on the human body, we encode the poses as heatmaps following [43]: each heatmap is filled with 1 within a radius of 2 pixels around the corresponding keypoint and 0 elsewhere.
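The heatmap encoding described above can be sketched as follows; the image size and keypoint coordinates in the example are illustrative, not from the paper.

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, height, width, radius=2):
    """Encode L (x, y) keypoints as L binary heatmaps: 1 within `radius`
    pixels of the keypoint, 0 elsewhere, following the scheme in the text."""
    heatmaps = np.zeros((len(keypoints), height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]      # pixel coordinate grids
    for i, (kx, ky) in enumerate(keypoints):
        mask = (xs - kx) ** 2 + (ys - ky) ** 2 <= radius ** 2
        heatmaps[i][mask] = 1.0
    return heatmaps

# Two illustrative keypoints on a 16x16 frame.
maps = keypoints_to_heatmaps([(8, 8), (3, 12)], height=16, width=16)
```

The resulting L-channel tensor can then be concatenated with the frame along the channel axis before being fed to the generator's encoder.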
The generator learns a residual map to reduce learning redundancy:

x̂_t = x_s + G(x_s, g(p_s), g(p_t)),  (4)

where x_s and p_s are the past frame and its pose, x̂_t and p_t are the generated future frame and its pose, g(·) is the function that converts x-y coordinates into heatmaps, and G(·) is the generator to be learned. Synthesizing a whole frame is a difficult task; modeling the residual map simplifies the problem and enables the generator to focus on motions.
Particularly, U-Net [47] is employed as the backbone of the generator, where skip-connections link features at both semantic and texture levels, as shown in Figure 1. With such a backbone, our model concatenates the frame and heatmaps as input to the same encoder to fuse features at the two levels. Semantic features provide instructions to generate frames with roughly correct poses, and texture features help improve the details of the generated frames.
To improve the quality and realism of generated video frames, we propose a temporal discriminator to supervise and train the generation model, which effectively encodes temporal information (e.g., temporal consistency and smoothness) into the video generation to make generated frames more realistic.
We develop and adopt the following four losses to stabilize the training process and obtain realistic results. We use a sparse reconstruction loss to compare the generated frame x̂_t with the target frame x_t, for which the ℓ_1 distance in image space is adopted:

L_rec = ||x̂_t − x_t||_1.  (5)

The reconstruction loss guides the generator to make a rough prediction that reflects most details of the target and stabilizes the training process. We find that using only the ℓ_1 loss leads to reasonable but blurry results, which is also observed in other works [35]. We thus add a perception loss in feature space to exploit the feature differences and obtain more reasonable results, where the VGG19 network is chosen for feature extraction [48], [49]. The perception loss is defined as

L_per = ||VGG(x̂_t) − VGG(x_t)||_1,  (6)

where VGG(·) outputs the pretrained VGGNet [50] features at different layers of the network. Following common practice in image and video generation [35], [43], [49], we also adopt the following adversarial spatial loss with a conditional image discriminator. The conditional image discriminator D_I ensures that the output frame resembles the target one w.r.t. the given pose, and G is our generator introduced in Equation (4). The objective of the conditional adversarial loss [51] is expressed as

min_{D_I} (1/2) E[(D_I(x_t, x_s, g(p_t)) − b)^2] + (1/2) E[(D_I(x̂_t, x_s, g(p_t)) − a)^2],  (7)

where we assign b = 1 and c = 1 to denote real samples and a = 0 to denote fake samples. x_s and x_t denote the source frame and the ground truth target frame, respectively; p_s and p_t are the corresponding poses; x̂_t represents the generated target frame. Specifically, the generator loss for D_I is

min_G (1/2) E[(D_I(x̂_t, x_s, g(p_t)) − c)^2].  (8)

A real video not only looks real in every frame but is also continuous and fluent in content. We argue that temporal information makes a video look fluent; generating realistic individual images is not enough. We utilize a temporal discriminator to encode temporal information into frame generation, aiming at temporally fluent videos. The temporal GAN is trained by

min_{D_T} (1/2) E[(D_T(x_{t−d:t}) − b)^2] + (1/2) E[(D_T(x_{t−d:t−1}, x̂_t) − a)^2],  (9)

with the same notations as Equation (7), where d is the number of previous frames the discriminator looks back.
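The adversarial objectives above follow the least-squares GAN form implied by the labels a, b and c. A schematic NumPy version is given below; the discriminator outputs are stand-in arrays, not a real network, and the function names are ours.

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake, a=0.0, b=1.0):
    """Discriminator loss: push outputs on real samples toward b
    and outputs on fake samples toward a (least-squares GAN form)."""
    return 0.5 * np.mean((d_real - b) ** 2) + 0.5 * np.mean((d_fake - a) ** 2)

def lsgan_g_loss(d_fake, c=1.0):
    """Generator loss: push discriminator outputs on fakes toward c."""
    return 0.5 * np.mean((d_fake - c) ** 2)

# A perfect discriminator (real -> 1, fake -> 0) incurs zero loss,
# while the generator it is not fooled by incurs the maximum 0.5.
d_loss = lsgan_d_loss(np.ones(4), np.zeros(4))
g_loss = lsgan_g_loss(np.zeros(4))
```

The same two functions serve both the image discriminator D_I and the temporal discriminator D_T; only the inputs differ (a single conditioned frame versus a short clip of consecutive frames).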
The generator loss for D_T is

min_G (1/2) E[(D_T(x_{t−d:t−1}, x̂_t) − c)^2].  (10)

The purpose of this discriminator is to make newly generated frames look real in context: by judging the previous frames together with the present frame, it pushes the generator toward temporally fluent results.
To sum up, our final loss is a linear combination of the aforementioned losses:

L = λ_rec L_rec + λ_per L_per + λ_{D_I} L^{adv}_{D_I} + λ_{D_T} L^{adv}_{D_T}.  (11)

Frames generated with a single stage as introduced above usually suffer from slight artifacts and blurriness. We therefore introduce a stacked residual refinement module, in which generators are integrated in a stacked paradigm to refine the target recursively.
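With the weights reported in the implementation section (λ_rec = 100, λ_per = 10, λ_{D_I} = λ_{D_T} = 1), the final objective is a simple weighted sum; a minimal sketch, with placeholder scalar loss values:

```python
def total_loss(l_rec, l_per, l_adv_img, l_adv_temp,
               lam_rec=100.0, lam_per=10.0, lam_di=1.0, lam_dt=1.0):
    """Linear combination of the four generator losses, with the
    default weights taken from the implementation section."""
    return (lam_rec * l_rec + lam_per * l_per
            + lam_di * l_adv_img + lam_dt * l_adv_temp)

# Illustrative loss values: 100*0.01 + 10*0.1 + 0.5 + 0.5 = 3.0
loss = total_loss(0.01, 0.1, 0.5, 0.5)
```

The large λ_rec keeps the ℓ_1 term dominant early in training, which is consistent with its role of stabilizing optimization before the adversarial terms take effect.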
For a k-stage network, there are k repeated generators. The first generator is the same as the one described in Section III-C; it is denoted as G_1 and produces a residual map Δ_1. At the end of the first stage we obtain x̂^(1)_t = x_s + Δ_1. The generator for stage 2, denoted as G_2, has the same structure as G_1 but a different input: it takes not only all the inputs of G_1, including x_s, p_s and p_t, but also Δ_1, the result of G_1. The output of G_2, denoted as Δ_2, is a refinement of Δ_1, so the results of both stages are summed to form a new residual map Δ̃_2 = Δ_1 + Δ_2. The result of the second stage is x̂^(2)_t = x_s + Δ̃_2, and later stages proceed analogously.

Algorithm 1 Stacked Residual Refinement Module
1: procedure Refine(x_s, p_s, p_t, k)            ▷ k-stage refinement
2:   Δ̃_0 ← 0
3:   for i = 1, . . . , k do                     ▷ every loop defines a refinement stage
4:     Δ_i ← G_i(x_s, g(p_s), g(p_t), Δ̃_{i−1})
5:     Δ̃_i ← Δ̃_{i−1} + Δ_i
6:     x̂^(i)_t ← x_s + Δ̃_i
7:     Train G_i
8:   end for
9:   return x̂^(k)_t
10: end procedure
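The stage-by-stage accumulation of residual maps can be sketched in Python. This is a schematic version: the generators here are placeholder functions, not the paper's U-Net, and the toy example simply moves the frame halfway toward a target at each stage.

```python
import numpy as np

def stacked_refine(x_s, inputs, generators):
    """k-stage residual refinement: each stage-i generator predicts a
    residual that is accumulated, and the frame estimate is x_s plus
    the accumulated residual map."""
    delta_sum = np.zeros_like(x_s)
    x_hat = x_s
    for G in generators:                   # one loop iteration per stage
        delta = G(x_s, inputs, delta_sum)  # residual conditioned on previous result
        delta_sum = delta_sum + delta      # accumulate: Δ̃_i = Δ̃_{i-1} + Δ_i
        x_hat = x_s + delta_sum            # x̂_t^(i) = x_s + Δ̃_i
    return x_hat

# Two toy "generators" that each close half of the remaining gap
# between the current estimate and a target frame.
target = np.full((4, 4), 1.0)
G = lambda x_s, inp, acc: 0.5 * (target - (x_s + acc))
x_hat = stacked_refine(np.zeros((4, 4)), None, [G, G])
```

With k = 2 stages the toy estimate reaches 0.75 of the target, illustrating how each additional stage shrinks the remaining residual error.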

IV. EXPERIMENTS

A. DATASETS
Penn Action [52] contains 2,326 video sequences of 15 different actions. Humans in its video sequences are annotated with 13 joints. We follow the official train split and the action selection in [1], [19], using the actions baseball pitch, baseball swing, clean and jerk, golf swing, jumping jacks, jump rope, tennis forehand and tennis serve.
Human3.6M [53] contains 3.6 million 3D human poses and corresponding images of 11 professional actors in 17 scenarios. Humans are annotated with 32 joints in both 2D and 3D coordinates. Since several of these joints are close to each other, we select 17 joints as the pose information: pelvis, right hip, right knee, right ankle, left hip, left knee, left ankle, spine, neck, head, head top, left shoulder, left elbow, left wrist, right shoulder, right elbow and right wrist.
UCF101 [54] contains 101 action classes, over 13k clips and 27 hours of video data. In our experiments, we choose the data from two action classes, pull-ups and Taichi. We run Openpose [55] on the frames and take the results as pseudo-labels.

B. IMPLEMENTATION
Following the problem setting in [19], the predictor sees 10 past frames and predicts 32 frames. For the generators, the traditional U-Net backbone [47] consists of 4 downsampling layers and corresponding upsampling layers. First, the input is encoded by a convolutional layer into a feature map with 128 channels. The downsampling layers are implemented as convolutional layers with kernel size 3 and stride 2; correspondingly, the upsampling layers are implemented as transposed convolutional layers with kernel size 3 and stride 2. BatchNorm and ReLU are used after every convolutional and transposed convolutional layer. The Adam optimizer [56] is used for backpropagation. The hyperparameters are set to λ_rec = 100, λ_per = 10, λ_{D_I} = 1 and λ_{D_T} = 1 in all experiments without further search. In the temporal discriminator, d is set to 1 to look back one frame. For a fair comparison, we set the number of stacked residual refinement stages to k = 2, since most algorithms define a coarse network and a refinement network. The model is trained stage by stage, with parameters fixed after the corresponding training stage. The number of training epochs is set to 10 in all experiments.
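As a sanity check on the layer configuration, each stride-2, kernel-3 convolution halves the spatial size when padding is 1 (the padding value is our assumption; it is the usual choice that gives exact halving). A small helper makes the arithmetic explicit:

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Standard convolution output-size formula:
    out = floor((in + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def downsample_sizes(size, num_layers=4):
    """Spatial sizes produced by the 4 downsampling layers of the U-Net."""
    sizes = [size]
    for _ in range(num_layers):
        sizes.append(conv_out_size(sizes[-1]))
    return sizes

sizes = downsample_sizes(128)  # 128 -> 64 -> 32 -> 16 -> 8
```

The transposed convolutions on the decoder side mirror these sizes in reverse, which is what lets the U-Net skip-connections concatenate encoder and decoder features of matching resolution.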

C. COMPARISON WITH BASELINES
Comparisons with state-of-the-art methods on the prediction of eight actions from Penn Action in terms of PSNR are shown in Figure 2. The legend in the first subfigure (Baseball Pitch) lists the compared methods in different colors, with the models used to generate prior poses given in brackets. As the curves show, our method with ground truth poses attains a higher PSNR than the other methods, indicating stable generation; with predicted poses, it also outperforms the others. These results validate the superiority of our method over the state-of-the-art baselines. Table 1 compares the proposed method with Villegas et al. [1] using the PSNR and SSIM metrics, averaged over all 32 predicted frames. In both the ground truth and the predicted pose settings, our method achieves better quantitative results; in particular, in the ground truth setting our algorithm outperforms the baseline by 15% in PSNR. Some examples of generated frames are shown in Figure 3. With ground truth poses available, the baseline successfully preserves the red shirt in each frame, but the actor's head in the fourth frame is posed inconsistently with adjacent frames; comparatively, our algorithm gives a more temporally fluent result. In the predicted pose setting, the actor's hand and elbow look distorted in the baseline's frame, while the arm is normal in our result. The artifact is possibly caused by conflicting predictions of the elbow and wrist, which the GCN in our method can avoid.
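PSNR as used here is the standard peak signal-to-noise ratio; for reference, a minimal implementation, assuming 8-bit pixel values in [0, 255]:

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """PSNR in dB: 10 * log10(MAX^2 / MSE); higher is better."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# A uniform error of one gray level gives MSE = 1, i.e. roughly 48.13 dB.
value = psnr(np.full((8, 8), 100.0), np.full((8, 8), 101.0))
```

Because PSNR is a monotone function of per-pixel MSE, it rewards pixel-accurate reconstruction; SSIM, the second metric in Table 1, complements it by measuring structural similarity.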
Experiments on pull-ups and Taichi from UCF101 are also conducted. Table 2 shows the results compared with vid2vid [26]. Since vid2vid does not include a prediction part, we only implement the ground truth pose setting for it. The quantitative results show that our method outperforms the baseline in both pose settings. Figure 4 shows the qualitative results for pull-ups. The frames generated by vid2vid have unstable scenes, and the first frame collapses. Since our method learns the results in the residual domain, it produces much more stable scenes in both pose settings.

D. COMPONENT STUDY 1) POSE PREDICTION
The pose prediction results of the LSTM and our LSTM-GCN in terms of average PCK@0.2 on both datasets are given in Table 3. It can be seen that the GCN helps encode the relations between joints, and better pose prediction contributes to better video prediction.
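PCK@0.2 counts a predicted joint as correct when it lies within 0.2 of a reference scale (commonly the person's torso or bounding-box size) of the ground truth. A minimal sketch, with the reference scale passed as an explicit argument since the exact normalization is dataset-dependent:

```python
import numpy as np

def pck(pred, gt, scale, alpha=0.2):
    """Fraction of joints whose predicted position lies within
    alpha * scale of the ground truth; pred and gt are (L, 2) arrays."""
    dists = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dists <= alpha * scale))

# Three illustrative joints; with scale 10 the threshold is 2.0,
# so the joints with errors 1 and 0 count as correct, the error-3 joint does not.
gt = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
pred = gt + np.array([[1.0, 0.0], [0.0, 3.0], [0.0, 0.0]])
score = pck(pred, gt, scale=10.0)
```

Averaging this per-frame score over all predicted frames and test sequences gives the table entries.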

2) VIDEO GENERATION
We conduct the following experiments to test the effects of the temporal discriminator and refinement components in our method: (1) Ours w/o Temp: the temporal discriminator is removed to verify the contribution of temporal information. (2) Ours w/o Refine: only the results from stage 1 are evaluated to analyze the influence of stacked refinement. Table 5 shows that stacked residual refinement improves the final results in the different settings, with higher video quality obtained as the number of stages increases.

V. CONCLUSION
This paper presented a novel method for human video prediction that separates the task into pose prediction and pose-based video generation. A GCN explores the relations among joints to make better pose predictions. For video generation, the GAN paradigm is followed, and a temporal discriminator encodes temporal information to produce more realistic results. Meanwhile, stacked residual refinement of the generators further improves the final performance in a recursive way. Our method achieved higher performance than state-of-the-art methods on the Human3.6M, Penn Action and UCF101 datasets. In the future, we aim to explore more relationships between poses and generated frames to further improve accuracy.