Semi-Supervised 3D Human Pose Estimation by Jointly Considering Temporal and Multiview Information

Three-dimensional human pose estimation is usually conducted in a supervised manner. However, because collecting labeled 3D skeletons is expensive and time-consuming, semi-supervised methods that need far less labeled 3D data are in high demand. Some semi-supervised learning methods consider information from consecutive video frames, or from frames simultaneously captured from multiple viewpoints, but only independently. In this article, we propose to jointly consider temporal information and multiview information in a unified adversarial learning framework. Given a 2D skeleton, a pose generator network is developed to estimate the corresponding 3D skeleton, and a camera network is developed to estimate camera parameters. The estimated 3D skeleton is evaluated by a critic network to examine whether it is a plausible 3D human pose. Based on the estimated camera parameters, the estimated 3D skeleton can be re-projected into a 2D skeleton, which should be similar to the input 2D skeleton. Re-projection and adversarial learning together enable self-supervision. We design the architectures of these networks to take 2D skeletons from multiple viewpoints in temporally consecutive frames. By jointly considering the two types of information, we verify that performance can be substantially improved.


I. INTRODUCTION
Three-dimensional human pose estimation from monocular images has been actively studied in recent years. It is a fundamental step for many advanced research topics, such as human behavior analysis and simulation in virtual reality. Following the success of deep neural networks and the availability of 3D pose datasets [9], the performance of 3D human pose estimation has become impressive in recent years [6], [13], [15], [17], [20], [22]. Most existing methods attempt to learn a mapping function between monocular images and 3D skeletons in a supervised manner. Given a monocular image, one way to achieve 3D skeleton detection is to formulate it as a regression problem: from the image, a heat map representing the positions of joints is estimated, and then a 3D skeleton is estimated based on this information. Another way is to detect 2D skeletons first, and then estimate 3D skeletons from the 2D ones. Either way, labeled 3D skeletons are needed to learn the mapping function. However, manually labeling or correcting 3D skeletons is laborious, making labeled 3D skeletons scarce. In addition, a model learnt in a supervised manner may not generalize to unseen motions and camera positions.
The associate editor coordinating the review of this manuscript and approving it for publication was Li He.
Because of the scarcity of 3D labeled data, some methods have been proposed to find the mapping function in a weakly supervised or a semi-supervised manner. Only a small amount of labeled data are required to guide the initial learning, and self-learning schemes are designed to improve the initial model. For example, in [22], a pre-detected 2D skeleton is input to a pose generator network to generate the corresponding 3D skeleton, and a camera network is developed to estimate camera parameters. The generated 3D skeleton is then re-projected back to a 2D skeleton according to the estimated camera parameters, and the projected one should be similar to the input 2D skeleton.
In our work, we develop semi-supervised 3D human pose estimation by modifying the weakly supervised method proposed in [22]. More importantly, we jointly consider rich information in both temporal and spatial domains. Separately estimating 3D skeletons for each video frame usually yields jiggling. Jointly taking information of consecutive video frames into the estimation model may improve detection robustness. Pavllo et al. [17] proposed a temporal convolution model to take 2D skeleton sequences as input to estimate 3D human poses. To utilize rich information from the spatial domain, Chen et al. [6] proposed to jointly consider videos captured from two different views. 2D skeletons are first detected from each view. The 2D skeleton from the first view is then synthesized by a network into the second view. The synthesized skeleton should be similar to that originally from the second view. This representation constraint guides network learning, and this network is viewed to be able to describe geometry representation of 3D skeletons.
We integrate the ideas proposed in [17] and [6], and make significant modifications to join temporal information and multiview information into a unified semi-supervised framework, which is trained based on adversarial learning. Although each separate module has been proposed before, how performance can be improved by jointly taking them in a unified model was not investigated before, and this is the main contribution of this article. We will mainly evaluate the proposed framework based on the Human3.6M dataset [9], and verify that significant performance improvement can be obtained.
The rest of this article is organized as follows. Section II presents a literature survey on 3D human pose estimation. Section III first briefly describes the semi-supervised adversarial learning framework, and then provides details of how we consider temporal and multiview information. Section IV presents comprehensive experimental studies, and Section V concludes the article.

II. RELATED WORKS
A. CONVENTIONAL 3D HUMAN POSE ESTIMATION
Inferring 3D skeletons from 2D projections can be dated back to the work of Lee et al. [11] in 1985. They used lengths of bones and binary decision trees to construct a human pose. Jiang [10] used joint correspondence and searched for optimal 3D human pose from a large amount of data to solve the pose estimation problem. Another way to compile knowledge of 3D human pose is building sparse combinations of features representing human poses [2], [5], [18], [23], [26], [27]. Some methods estimated 3D human pose based on features like shape context [14], silhouettes [1], scale-invariant feature transform (SIFT) descriptors [4], and histogram of gradients (HOG) [21].

B. DEEP-BASED 3D HUMAN POSE ESTIMATION
With the availability of large collections like Human3.6M [9] and the effectiveness of deep learning, researchers have recently developed deep learning methods to estimate 3D human pose. Some works [15], [16], [25] train a network in an end-to-end manner, i.e., they take an image as input and estimate the 3D human pose directly. Such methods often yield limited results due to variations in brightness, color, or texture. Martinez et al. [13] divided the problem into two parts, i.e., detecting the 2D skeleton from the image first, and then inferring a 3D skeleton from the 2D skeleton. In this way, a simple linear model can be developed to achieve promising results. Our work follows this two-stage scheme.
Most previous works focused on estimating 3D human pose from a single frame. Recently, researchers have been trying to consider temporal information in the video to obtain more reliable predictions and reduce the influence of noise. Pavllo et al. [17] presented a simple and efficient method for 3D skeleton estimation in videos based on dilated temporal convolutions on a sequence of 2D skeletons.
In addition to using temporal information, if the action was captured from multiple views, information from different views may be complementary. Chen et al. [6] extracted features from two different views, and transformed the features from one view into another. If features from different views are transformed well, these features can be used to estimate better 3D human pose.
Supervised learning methods require a large amount of 2D images paired with 3D skeleton labels. However, labeling 3D data is laborious, and thus annotated 3D data are scarce. Therefore, some works [12], [17], [19] have been proposed to achieve 3D human pose estimation in a weakly supervised or semi-supervised way. Wandt et al. [22] proposed the idea of re-projection and adversarial learning, which enable the model to be effectively trained even if only weakly labeled 3D data are available.
In this work, we would like to jointly consider temporal information and multiview information mainly based on the framework proposed in [22], but train the framework based on the semi-supervised scheme. We develop this unified framework to take temporal and spatial factors together, and verify effectiveness of the proposed method.

III. PROPOSED METHOD
We first briefly introduce the reprojection network (RepNet) proposed in [22], and then present details of how to integrate multiple video frames and multiple views into a unified framework.

A. REPROJECTION NETWORK (RepNet)
RepNet was designed to take 2D skeletons as input and to estimate the corresponding 3D skeletons in a weakly supervised manner. Figure 1 illustrates the idea of RepNet. Given a 2D skeleton $S = (x_1, y_1, \ldots, x_n, y_n) \in \mathbb{R}^{2 \times n}$, where $x_i$ and $y_i$ are the xy coordinates of the $i$th joint, a pose generator network $G$ is developed to estimate the corresponding 3D skeleton $G(S) = T$. The 3D skeleton $T$ is represented by $(x_1, y_1, z_1, \ldots, x_n, y_n, z_n) \in \mathbb{R}^{3 \times n}$, where $x_i$, $y_i$, and $z_i$ are the xyz coordinates of the $i$th joint. Conventional supervised methods compare the estimated 3D skeleton with the ground truth to guide training of the pose generator network. However, the 3D ground truth is scarce. To resolve this issue, three components were proposed in [22].
FIGURE 1. Illustration of the re-projection network proposed in [22].
VOLUME 8, 2020
First, whether the estimated 3D skeleton is plausible is examined by a critic network $C$. This critic network is designed to measure the difference between the distribution of estimated 3D skeletons and the distribution of real 3D skeletons. To train this critic network, the Wasserstein loss function $L_{crt}$ defined in [3] is used.
Second, the camera parameters $K$ estimated by the camera network should describe a weak perspective projection. In [22], this property is used to design the camera loss $L_{cam}$, which guides training of the camera network. Conceptually, the loss is calculated as the Frobenius norm of the difference between the normalized $KK^T$ and the identity matrix [22].
Third, based on the estimated camera parameters $K$, the estimated 3D skeleton $T$ can be re-projected to a 2D skeleton $S' = KT$, which should be similar to the given input $S$. To measure the difference, the Frobenius norm of $S - S'$ is calculated as the reprojection loss $L_{rep}$.
Overall, the three losses $L_{crt}$, $L_{cam}$, and $L_{rep}$ are linearly combined as $L_{overall} = L_{crt} + L_{cam} + L_{rep}$ to guide training of the entire network. For implementation details of RepNet, please refer to [22].
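The camera and reprojection terms described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the implementation of [22]: $K$ is taken to be a $2 \times 3$ matrix, $S$ a $2 \times n$ matrix, and $T$ a $3 \times n$ matrix; the trace-based normalization of $KK^T$ is our reading of "normalized" and is an assumption; the critic term is passed in as a precomputed scalar `l_crt`; and all function names are hypothetical.

```python
import numpy as np

def camera_loss(K):
    """Frobenius distance between the normalized K K^T and the 2x2 identity.

    Assumption: normalization by 2 / trace(K K^T), so the loss is
    invariant to the overall scale of K.
    """
    KKt = K @ K.T
    return float(np.linalg.norm(2.0 / np.trace(KKt) * KKt - np.eye(2), ord="fro"))

def reprojection_loss(S, K, T):
    """Frobenius norm between the input 2D skeleton S and the re-projection K T."""
    return float(np.linalg.norm(S - K @ T, ord="fro"))

def overall_loss(l_crt, S, K, T):
    """Unweighted sum of the three terms, as in L_overall = L_crt + L_cam + L_rep."""
    return l_crt + camera_loss(K) + reprojection_loss(S, K, T)

# Example: an orthographic camera re-projects T exactly, so both terms vanish.
K = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
T = np.arange(48, dtype=float).reshape(3, 16)
S = K @ T
print(overall_loss(0.0, S, K, T))
```

With this normalization, any scaled orthographic camera gives a zero camera loss, which is consistent with the weak perspective property mentioned above.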

B. CONSIDERING TEMPORAL INFORMATION
Separately predicting 3D skeletons for each video frame usually causes jittery results. Inspired by [17], we jointly consider the 2D skeletons $S_1, S_2, \ldots, S_M$ in $M$ consecutive frames, and train a pose generator network $G$ to predict the 3D skeleton $T_{M/2}$ in the $(M/2)$-th frame. Figure 2 shows the architecture of the pose generator network and the camera network that jointly consider 2D skeletons at multiple frames. The building blocks are residual blocks [7] consisting of convolutional layers with leaky ReLU as the activation function. The input consists of the 2D keypoints in $M$ consecutive video frames. The 2D skeleton in each frame consists of 16 keypoints, and can be represented as a 32-dimensional vector. Therefore, the 2D skeletons in $M$ frames form an $M \times 32$ matrix. A convolutional layer denoted by $2J$, $3d1$, and 256, for example, means that the input channel is $2J$, the convolution kernel is 3 × 3 with dilation 1, and the output channel is 256.
The first half of Figure 2 extracts features from the given 2D skeletons. The second half consists of two branches, one for pose generation and the other for camera parameter estimation. The pose generator branch outputs a 48-dimensional result representing the coordinates of the sixteen 3D keypoints in the $(M/2)$-th frame. The camera branch outputs a 6-dimensional result representing the camera parameters, based on which 3D skeletons can be re-projected back into 2D skeletons.
To guide model training, the three losses $L_{crt}$, $L_{cam}$, and $L_{rep}$ are calculated and linearly combined in the same way as described in Sec. III-A. We jointly consider information from $S_1$ to $S_M$, and predict the 3D skeleton corresponding to the middle of the input sequence, i.e., $T_{M/2}$ in the $(M/2)$-th frame. To calculate the losses, the ground truths corresponding to the $(M/2)$-th frame are used. Training data are sliding windows of $M$ consecutive 2D skeletons, with stride 1. When the frame to be estimated is near the beginning or the end of the video, we pad the sequence with the skeleton of the first frame or the last frame so that the number of skeletons reaches $M$.
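The windowing and padding scheme just described can be sketched as follows. This is a minimal illustration, assuming that $M$ is odd (as with the 27-frame and 243-frame settings used later), that each skeleton is already flattened to a 32-dimensional vector, and that the target frame sits at 0-based index `M // 2` of each window; the function name `sliding_windows` is hypothetical.

```python
import numpy as np

def sliding_windows(skeletons, M):
    """Build one length-M window per frame, centered on that frame.

    skeletons: array of shape (N, 32), one flattened 2D skeleton per frame.
    Returns an array of shape (N, M, 32). Frames near the start or end of the
    video are padded by repeating the first or last skeleton.
    """
    skeletons = np.asarray(skeletons)
    half = M // 2
    # mode="edge" repeats the first/last row, matching the padding rule above.
    padded = np.pad(skeletons, ((half, M - 1 - half), (0, 0)), mode="edge")
    return np.stack([padded[i:i + M] for i in range(len(skeletons))])

# Toy example: 5 frames whose entries equal the frame index, window length 3.
video = np.repeat(np.arange(5, dtype=float)[:, None], 32, axis=1)
windows = sliding_windows(video, 3)
print(windows.shape)
print(windows[0, :, 0], windows[4, :, 0])
```

Each window is paired with the 3D ground truth of its center frame, so a video of $N$ frames yields $N$ training samples at stride 1.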

C. CONSIDERING MULTIVIEW INFORMATION
The Human3.6M dataset was collected by simultaneously capturing the same individual's action from multiple viewpoints. Jointly considering multiple views enables us to include richer information for 3D pose estimation. In addition to temporal information, we attempt to include multiview information in the unified framework, as shown in Figure 3.
Taking two views as an example, the inputs are two 2D skeleton sequences $S$ [...] critic network $C$. Similarly, we estimate the camera parameters $K^{(1)}$ and $K^{(2)}$ for the two viewpoints with the camera network, and calculate the camera losses $L_{cam}^{(1)}$ and $L_{cam}^{(2)}$, respectively.
Importantly, for the 2D skeleton sequence $S_i^{(1)}$ at the first viewpoint, the corresponding pose generator network estimates the 3D skeleton $T_i^{(1)}$ at the predefined reference viewpoint $V$. For the 2D skeleton sequence $S_i^{(2)}$ at the second viewpoint, the corresponding pose generator network estimates the 3D skeleton $T_i^{(2)}$, also at the reference viewpoint $V$. Therefore, regardless of the viewpoint from which the 2D skeletons come, the estimated 3D skeletons are at the same viewpoint. This can be seen from the middle part of Figure 3. The camera parameters $K^{(1)}$ and $K^{(2)}$ estimated by the two camera networks can thus be used to project the estimated 3D skeletons from viewpoint $V$ to the first viewpoint and the second viewpoint, respectively.
We also empirically tried different weights for combining the losses, but the experimental results were similar or worse.

IV. EVALUATION
A. DATASET AND EXPERIMENTAL SETTINGS
We mainly perform experiments on the Human3.6M dataset [9]. Videos in this dataset were acquired by recording 15 types of actions performed by 5 female and 6 male subjects from 4 different viewpoints. In a controlled environment, 4 fixed-position digital video cameras were used to simultaneously capture the subject's action from the 4 corners of a rectangular room, and 10 motion cameras were mounted on the walls to capture the signals from small reflective markers attached to the subject's body. By tracking and calibrating these signals, the 3D coordinates of body joints are labeled. Of the 11 subjects, seven are annotated with 3D poses. Figure 4 shows some screenshots of videos in the Human3.6M dataset captured from the four viewpoints. Overall, this dataset contains 3.6 million video frames for 11 subjects from four different viewpoints. Following the settings of many previous works [22], we evaluate on a 17-joint human skeleton. The joint at the hip is always set as the origin, so the human pose estimation network predicts the coordinates of the remaining 16 joints relative to the hip. Following the experimental settings in [19], we train and evaluate the proposed network in a semi-supervised manner. At the training stage, the 2D skeletons of the five subjects S1, S5, S6, S7, and S8 are used, together with the 3D skeletons of S1 only. This design simulates the situation in which 2D labeled data are available but far fewer 3D labeled data are available for training. At the test stage, the 2D skeletons of subjects S9 and S11 are taken as inputs to estimate 3D skeletons.
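The hip-centered representation described above can be illustrated with a short sketch. It assumes a pose stored as a $17 \times 3$ array and, purely for illustration, that the hip is stored at index 0; the actual joint ordering is not specified in the text, and the function name `root_relative` is hypothetical.

```python
import numpy as np

def root_relative(pose, hip_index=0):
    """Express a 17-joint 3D pose relative to the hip joint.

    pose: array of shape (17, 3).
    hip_index: position of the hip in the joint list (assumed to be 0 here).
    Returns an array of shape (16, 3): the remaining joints, hip-centered.
    """
    pose = np.asarray(pose, dtype=float)
    centered = pose - pose[hip_index]          # hip becomes the origin
    return np.delete(centered, hip_index, axis=0)

pose = np.arange(51, dtype=float).reshape(17, 3)
target = root_relative(pose)
print(target.shape, target.reshape(-1).shape)
```

Flattening the 16 remaining joints gives the 48-dimensional vector that the pose generator branch outputs.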
The camera matrix contains rotational and scaling components. To avoid ambiguities between the camera and 3D pose rotation, all the rotational and scaling components from the 3D poses are removed. We align every 3D pose to a template pose via Procrustes alignment [8], as shown in the middle part of Figure 3.
Two metrics are used to measure performance. The first one is the mean per-joint position error (MPJPE) in millimeters [9], which is the mean Euclidean distance between predicted joint positions and ground-truth joint positions. The second is the error after alignment with the ground truth in translation, rotation, and scale (P-MPJPE) [6], [13], [17].
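The two metrics can be sketched as below. This is a minimal NumPy illustration rather than the evaluation code used in the experiments: poses are assumed to be arrays of shape (joints, 3) in millimeters, the alignment is a standard SVD-based similarity Procrustes fit, and the function names are hypothetical.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth joints."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def procrustes_align(pred, gt):
    """Align pred to gt with the best translation, rotation, and scale."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    U, s, Vt = np.linalg.svd(P.T @ G)
    # Flip the last singular direction if needed so R is a proper rotation.
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    scale = (s * np.diag(D)).sum() / (P ** 2).sum()
    return scale * P @ R + mu_g

def p_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of the prediction to the ground truth."""
    return mpjpe(procrustes_align(pred, gt), gt)

# A prediction that differs from gt only by scale, rotation, and translation
# has a large MPJPE but a P-MPJPE of (numerically) zero.
rng = np.random.default_rng(0)
gt = rng.normal(size=(16, 3))
c, s_ = np.cos(0.7), np.sin(0.7)
Rz = np.array([[c, -s_, 0.0], [s_, c, 0.0], [0.0, 0.0, 1.0]])
pred = 2.0 * gt @ Rz + np.array([1.0, -2.0, 0.5])
print(mpjpe(pred, gt), p_mpjpe(pred, gt))
```

Because P-MPJPE removes global translation, rotation, and scale, it is never larger than MPJPE for the same prediction.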

B. COMPARING WITH SUPERVISED METHODS
We first compare the proposed semi-supervised method with supervised methods. To train the supervised models, all 2D and 3D data of S1, S5, S6, S7, and S8 are used. Specifically, a total of 1,688,984 (1.68 million) frames, and thus 1.68 million labeled 3D skeletons, are used in the supervised learning methods. Like other works on supervised learning, we use 2D data at the full frame rate of 50 fps. According to [17], when considering temporal information in the full frame rate setting, the length of input fragments is 243 frames. Table 1 compares the proposed method with state-of-the-art supervised methods, in terms of MPJPE values averaged over the 15 actions in the Human3.6M dataset. The baseline RepNet achieves 82.4 MPJPE, which is inferior to all supervised methods. This is not surprising: RepNet was originally proposed to be trained in a weakly supervised manner, and the estimated 3D skeleton is not explicitly compared with the 3D ground truth. Based on the RepNet architecture, we obtain performance improvement when multiview information (denoted as "RepNet+M") or temporal information (denoted as "RepNet+T") is considered. Jointly considering temporal and multiview information significantly improves RepNet from 82.4 MPJPE to 56.9 MPJPE (based on the one-tailed Student's t-test, the p value is 8.8e-05). Compared with the supervised methods shown in the first half of Table 1, RepNet with the designed improvements achieves encouraging results that are competitive with [13] and [16]. Figure 5 shows some sample results on the Human3.6M dataset when the two types of information are jointly considered (RepNet+T+M).

C. COMPARING WITH SEMI-SUPERVISED METHODS
We next compare the proposed method with current semi-supervised methods. To train the semi-supervised model, the 2D data of S1, S5, S6, S7, and S8 are used, together with the 3D data of S1 only. Specifically, only the 3D data in 271,436 frames from S1 are used as 3D labels, which is far fewer than the 1.68 million mentioned in Sec. IV-B. The main goal of this comparison is to show that, even without most of the 3D labeled data, the semi-supervised model can still achieve competitive performance. At this stage of comparison, all 2D and 3D data are downsampled to 10 fps according to the setting mentioned in [19]. Because the data are downsampled, we modify the network architecture proposed in [17] and integrate it into the RepNet framework. The pose generator network and the camera network are designed to take 27 frames as inputs, as shown in Figure 6.
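The two input lengths mentioned in this article, 243 frames at the full frame rate and 27 frames after downsampling, are consistent with the receptive field of a stack of dilated temporal convolutions. The sketch below computes that receptive field; it assumes kernel size 3 along time and dilations that triple from layer to layer, in the spirit of the design in [17]. The specific dilation lists are assumptions for illustration and are not read from Figure 6.

```python
def receptive_field(kernel_size, dilations):
    """Number of input frames seen by one output of stacked dilated convolutions."""
    rf = 1
    for d in dilations:
        # Each layer widens the receptive field by (kernel_size - 1) * dilation.
        rf += (kernel_size - 1) * d
    return rf

full_rate = receptive_field(3, [1, 3, 9, 27, 81])  # full frame rate setting
downsampled = receptive_field(3, [1, 3, 9])         # downsampled setting
print(full_rate, downsampled)
```

Under these assumptions, dropping the two layers with the largest dilations shrinks the temporal window from 243 frames to 27 frames.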

1) EFFECTIVENESS OF ADVERSARIAL LEARNING
The idea of weak supervision shown in Figure 1 is jointly accomplished by the critic network and the re-projection layer. One may think that re-projection alone already enables weak supervision. To verify the effectiveness of the critic network, we intentionally remove its influence by removing the loss $L_{crt}$ during training, and observe how performance changes. Table 2 shows the performance of the framework without the critic network, in terms of MPJPE. As can be seen, the performance becomes very poor without the critic network. This verifies the necessity of the critic network and the effectiveness of adversarial learning.
2) COMPARING WITH SOTA
Table 3 compares the proposed method with state-of-the-art semi-supervised methods, in terms of MPJPE. In our method, when we only consider multiview information, "RepNet+M" outperforms the baseline RepNet. But if we only consider temporal information, the performance drops. This may be because the amount of data is reduced and the network structure is too simple, which makes it relatively more difficult to train the generator and discriminator well. When both temporal and multiview information are considered ("RepNet+T+M"), the performance is better than when only one factor is considered.
In order to make full use of 3D data for training, we further make the following settings. We especially pick out the data belonging to S1 in each mini batch, calculate the mean square error between the estimated 3D pose and the corresponding ground truth, and use this loss to fine-tune the pose generator network once. The setting ''RepNet+T+M+S1-supervised'' shows that, with this finetuning, the obtained results largely outperform existing works.
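The loss used for this S1-supervised fine-tuning step can be sketched as a masked mean square error over a mini batch. This is an illustration under stated assumptions: 3D poses are flattened to 48-dimensional vectors, each sample carries a subject label such as "S1", and the function name `s1_supervised_mse` is hypothetical. How the resulting loss is back-propagated to the pose generator is not shown.

```python
import numpy as np

def s1_supervised_mse(pred_3d, gt_3d, subject_ids, labeled_subject="S1"):
    """Mean square error computed only on samples of the labeled subject.

    pred_3d, gt_3d: arrays of shape (B, 48).
    subject_ids: length-B sequence of subject names, one per sample.
    Returns 0.0 when the mini batch contains no sample of the labeled subject.
    """
    mask = np.asarray(subject_ids) == labeled_subject
    if not mask.any():
        return 0.0
    diff = np.asarray(pred_3d)[mask] - np.asarray(gt_3d)[mask]
    return float(np.mean(diff ** 2))

pred = np.zeros((3, 48))
gt = np.stack([np.full(48, 1.0), np.full(48, 100.0), np.full(48, 3.0)])
print(s1_supervised_mse(pred, gt, ["S1", "S5", "S1"]))
```

The S5 sample is ignored even though its ground truth is present in the toy arrays, which mirrors the setting where only the 3D labels of S1 may be used.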
The inputs of the proposed method and of RepNet are 2D skeletons, so the quality of the input 2D skeletons directly influences the final performance. To verify this influence, we input ground-truth 2D skeletons and estimate the corresponding 3D ones. Table 4 shows the performance difference between approaches with and without ground-truth 2D skeletons as inputs, in terms of MPJPE. The row "RepNet+T+M+S1-supervised (w. 2DGT)" shows that, by taking ground-truth 2D skeletons as inputs, the MPJPE value decreases substantially from 61.2 to 54.8. This shows the importance of the quality of the input 2D skeletons.
Table 5 compares state-of-the-art semi-supervised methods in terms of P-MPJPE. Again, the proposed method outperforms existing methods. The row "RepNet+T+M+S1-supervised (w. 2DGT)" shows that, by taking ground-truth 2D skeletons as inputs, the P-MPJPE value decreases from 50.3 to 45.6. Compared with the difference shown in Table 4, the performance difference in terms of P-MPJPE is slightly smaller than that in terms of MPJPE. We attribute this to the Procrustes alignment.

V. CONCLUSION
We have presented a semi-supervised 3D human pose estimation framework that jointly takes temporally consecutive frames captured from multiple viewpoints. Based on a limited amount of 3D labeled data, this network predicts 3D skeletons from the given 2D skeletons, and estimates camera parameters as well. Based on the estimated camera parameters, the predicted 3D skeletons are re-projected into 2D ones, which should be similar to the input 2D skeletons. In this manner, a semi-supervised 3D pose estimation network is constructed. In this article, we substantially enhance this network by jointly considering temporal information and multiview information. Through comprehensive evaluation, we show that the proposed network achieves state-of-the-art performance compared with other semi-supervised methods.
Occlusion has always been one of the most challenging problems in 3D human pose estimation. Currently, we take temporally consecutive frames captured from multiple viewpoints as richer resources from which to extract more representative features. In the future, we would like to design occlusion-aware mechanisms and take fuller advantage of temporal information and multiview information. In addition, the same action may be performed differently by different individuals due to cultural differences or gender. We may be able to leverage embedding matching like [24] to learn individual-invariant representations.