PosePropagationNet: Towards Accurate and Efficient Pose Estimation in Videos

We rethink the contradiction between accuracy and efficiency in the field of video pose estimation. Previous methods typically exploit large networks to pursue superior pose estimation results. However, those methods can hardly meet the low-latency requirement of real-time applications because of their computationally expensive nature. We present a novel architecture, PosePropagationNet (PPN), to generate poses across video frames accurately and efficiently. Instead of extracting temporal cues or knowledge to enforce geometric consistency, as most previous methods do, we explicitly propagate the well-estimated pose from the preceding frame to the current frame through a pose propagation mechanism, endowing lightweight networks with the capability of performing accurate pose estimation in videos. Experiments on two large-scale benchmarks for video pose estimation show that our method significantly outperforms previous state-of-the-art methods in both accuracy and efficiency. Compared with the previous best method, our two representative configurations, PPN-Stable and PPN-Swift, achieve 2.5× and 6× FLOPs reduction respectively, as well as significant accuracy improvement.


I. INTRODUCTION
Video pose estimation aims at localizing human body joints across video frames. It can be applied in many areas, such as human-computer interaction, computer animation and video surveillance. Most of the research works on pose estimation focus on the single-image level, while less attention has been paid to video-based pose estimation mainly because of the limited number of large-scale annotated datasets. Compared with image-based pose estimation, video-based pose estimation is more challenging due to several inevitable troublesome factors, including motion blur, perspective change and scale variation.
Previous methods for video pose estimation mostly rely on large networks to produce high-quality image representations, facilitating body joint localization at the pixel level. Temporal cues are additionally extracted and leveraged to ensure temporal dependency, improving preliminary pose estimation results. As shown in Fig. 1(a), LSTM units are employed to transfer temporal knowledge as hidden states. Besides, optical flow is also widely exploited [1]-[3] as a strong temporal cue. Although these methods demonstrate applaudable experimental performance, most of them are computationally expensive, preventing them from meeting the low-latency requirement of real-time applications such as real-time surveillance and autonomous driving.

(The associate editor coordinating the review of this manuscript and approving it for publication was Shuhan Shen.)
Lightweight networks are weak at producing satisfying single-image pose estimation results because of their relatively low representational capacity when no supplementary information is provided. However, in the video domain, consecutive frames share great geometric consistency, which makes it possible for lightweight networks to perform accurate pose estimation if temporal knowledge can be transferred across frames to provide guidance. As shown in Fig. 1(b), temporal knowledge is distilled and transferred in the form of pose kernels, providing guidance for lightweight networks in joint localization. Based on this understanding, we take the efficiency problem into consideration and propose a novel architecture, PosePropagationNet (PPN), to enhance the capability of lightweight networks in the field of video pose estimation.
[Fig. 1: (a) Pipeline based on LSTM units [4]. (b) Pipeline of the Dynamic Kernel Distillation (DKD) network [5]. (c) The proposed pipeline, which takes advantage of the pose propagation mechanism, allowing lightweight networks to perform high-quality pose estimation in videos; we provide two representative configurations, PPN-Stable and PPN-Swift. (d) Comparison of the accuracy and computational efficiency of different methods, evaluated on the Penn Action Dataset with the PCK-torso metric; floating-point operations (FLOPs) are used to measure computational efficiency. Detailed numerical results are shown in Table 5.]

The pipeline of our proposed end-to-end trainable PPN is shown in Fig. 1(c). Instead of generating temporal cues or knowledge in a learnable form, we directly propagate the pose estimated from the previous frame to the subsequent frame as explicit temporal guidance. The subsequent pose can be generated by transforming the previous pose according to the joint motion offsets between the two frames. We implement the pose propagation mechanism illustrated above with a specially designed module, the Pose Propagation Unit (PPU). As such, the process of localizing body joints is converted into pose propagation across frames, which is a less challenging task for lightweight networks. Compared with LSTM units [4] and pose kernels [5], our PPU carries explicit temporal guidance in a more computationally compact way, leading to a dramatic FLOPs reduction while achieving significantly higher accuracy, as shown in Fig. 1(d). We evaluate our method on two widely used video pose estimation benchmarks, the Penn Action Dataset [6] and the Sub-JHMDB Dataset [7], obtaining state-of-the-art performance in both accuracy and efficiency.
Contributions of our work can be summarized as follows: 1) We propose a novel architecture, PosePropagationNet, for video pose estimation. Geometric consistency is guaranteed through pose propagation, enabling the model to generate accurate and consistent pose estimation results in videos and to achieve state-of-the-art accuracy on two major benchmarks. 2) Benefiting from the pose propagation mechanism we present, the lightweight networks employed in PPN can perform pose estimation accurately and efficiently in videos. Significant FLOPs reduction over previous state-of-the-art methods allows our PPN to meet the low-latency requirement of real-time applications.

II. RELATED WORK
A. HUMAN POSE ESTIMATION IN IMAGES
Early research works on image-based single-person pose estimation are mostly based on pictorial structures [8]-[11], which model the human body as a tree-structured graph. However, those methods naturally lack the ability to deal with complex occlusions. Most recent works take advantage of deep Convolutional Neural Networks (CNNs) and follow a regression fashion: regressing joint coordinates [12] or regressing joint heatmaps [13]-[17]. These CNN-based methods either employ multistage architectures [13], [15] to recursively refine estimation results, or build strong backbones [14], [16] to efficiently extract high-level image representations, in order to achieve competitive performance on popular benchmarks [18], [19].

B. HUMAN POSE ESTIMATION IN VIDEOS
Video pose estimation has attracted less attention than image-based pose estimation, mainly because of the limited number of large-scale benchmarks in the video domain. Existing research works focus on extracting temporal cues, such as optical flow [1]-[3], [20], to help refine framewise estimation results generated by large networks. Song et al. [1] propose a deep spatio-temporal network, namely Thin-Slicing, which aligns joint heatmaps across frames based on dense optical flow computation. Recurrent architectures are exploited in [4], [21] to transfer temporal information in the form of hidden states; a large network is typically required to serve as the image encoder, producing high-level image representations. 3D CNNs are investigated in [22] to capture temporal dependency, facilitating multi-person pose estimation in videos. Nie et al. [5] propose a method that distills pose kernels and thus simplifies joint localization into a matching problem. We take the efficiency problem into consideration and explicitly propagate poses across frames as temporal guidance, allowing lightweight networks to perform accurate pose estimation in videos.

III. METHODOLOGY
As shown in Fig. 2(a), we build our PosePropagationNet (PPN) as a streamlined architecture so that consecutive frames within a temporal range can be processed in a single feed-forward pass. In the following, we first introduce the overall pipeline of our network and then go through the details of each component.

A. OVERALL PIPELINE OF PosePropagationNet
Given a video sequence that contains T consecutive frames F = {I_t}_{t=1}^{T}, where I_t ∈ ℝ^{H×W×3} denotes the frame at time step t, our proposed PPN generates a set of joint heatmaps {h_t}_{t=1}^{T}, where h_t ∈ ℝ^{(H/S)×(W/S)×K} denotes the estimated joint heatmaps for frame I_t. We use H and W to denote the height and width of the frames, and S and K to denote the total stride of the network and the number of joints, respectively. For frame I_t, the lightweight BodyNet takes charge of generating preliminary joint heatmaps ĥ_t ∈ ℝ^{(H/S)×(W/S)×K}. Afterwards, together with the joint heatmaps h_{t−1} from the previous frame, ĥ_t is fed into the Pose Propagation Unit (PPU), which propagates the previous pose to the current time step according to the joint motion offsets between the two frames, outputting the propagated joint heatmaps h̃_t. The final joint heatmaps h_t for frame I_t are computed by combining ĥ_t and h̃_t with elementwise addition. Since there is no predecessor for the first frame I_1, we additionally design a HeadNet, generally much larger than BodyNet, to generate reliable initial joint heatmaps ĥ_1. To reduce the parameter count of our network, the BodyNets and PPUs at all time steps follow the weight-sharing principle. The loss is computed on the produced final joint heatmaps h_t across all frames. Note that the first frame appears twice in the feed-forward process, so both sets of joint heatmaps for the first frame, ĥ_1 and h_1, are involved in the loss computation. Given the ground-truth joint heatmaps g_t for frame I_t, the loss is defined as the Mean Squared Error MSE(·) shown in Eq. 1:

L = MSE(ĥ_1, g_1) + Σ_{t=1}^{T} MSE(h_t, g_t).    (1)
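The feed-forward pipeline above can be sketched as a short loop. This is a minimal illustration, not the paper's implementation: `headnet`, `bodynet` and `ppu` are stand-in stubs (the real PPU estimates motion offsets and warps heatmaps with deformable convolutions), and random arrays replace learned outputs.

```python
import numpy as np

K, HS, WS = 13, 64, 64          # joints, stride-reduced heatmap height/width

def headnet(frame):             # stand-in for the large initial-pose network
    return np.random.rand(HS, WS, K)

def bodynet(frame):             # stand-in for the lightweight per-frame network
    return np.random.rand(HS, WS, K)

def ppu(h_prev, h_hat):         # stand-in for the Pose Propagation Unit
    # a real PPU estimates joint motion offsets from (h_prev, h_hat) and
    # warps h_prev via deformable convolution; here it simply passes through
    return h_prev

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def ppn_forward(frames, gts):
    h_hat_1 = headnet(frames[0])       # reliable initial heatmaps for I_1
    loss = mse(h_hat_1, gts[0])        # h_hat_1 also joins the loss (Eq. 1)
    h_prev = h_hat_1
    for t, frame in enumerate(frames):
        h_hat = bodynet(frame)         # preliminary heatmaps
        h_tilde = ppu(h_prev, h_hat)   # propagated heatmaps
        h_final = h_hat + h_tilde      # skip connection, elementwise addition
        loss += mse(h_final, gts[t])
        h_prev = h_final               # iterate to the next time step
    return loss

frames = [np.zeros((256, 256, 3)) for _ in range(5)]
gts = [np.zeros((HS, WS, K)) for _ in range(5)]
print(ppn_forward(frames, gts))
```

The loop mirrors the weight-sharing design: the same `bodynet` and `ppu` are reused at every time step, while `headnet` runs only once.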

B. FROM PoseWarper TO POSE PROPAGATION UNIT
We draw the inspiration for designing the PPU from PoseWarper, which is proposed by Bertasius et al. [23] to solve the problem of pose estimation in sparsely-annotated video datasets. Specifically, the relationship between two adjacent frames with opposite annotation status (one labeled, one unlabeled) is investigated. PoseWarper builds that relationship by estimating the joint motion offsets between the two frames and performing pose estimation on the unlabeled frame by transforming the labeled pose according to the estimated offsets. We recognize the capability of PoseWarper to transfer a labeled pose to adjacent unlabeled frames and build our PPU on the basis of the PoseWarper architecture, along with several significant modifications. In the following, we first mathematically formulate the pipeline of PoseWarper and then introduce the modifications we make. Given a labeled frame I_t and an unlabeled frame I_{t+1}, PoseWarper is trained to estimate the pose of I_{t+1} by transferring the labeled pose of I_t. Firstly, I_{t+1} is fed into an image-based pose estimation network, outputting preliminary joint heatmaps ĥ_{t+1}. On the other side, the joint heatmaps h_t for frame I_t can be obtained from the ground truth. Afterwards, the difference between h_t and ĥ_{t+1} is computed as ψ_{t,t+1} = h_t − ĥ_{t+1} and fed into a stack of convolution blocks Φ(·). The output feature maps are then fed into a set of convolution layers C^{(d)}(·) with different dilation rates d to generate a set of offset tensors, namely o^{(d)}_{t,t+1} = C^{(d)}(Φ(ψ_{t,t+1})), d ∈ D, where o^{(d)}_{t,t+1} denotes the estimated joint motion offset tensor between time steps t and t+1 with dilation rate d, and D is an ensemble of different dilation rate values. Finally, those offset tensors are used to transform the joint heatmaps h_t via deformable convolution layers [24]. Conventionally, the preliminary joint heatmaps ĥ_{t+1} can be viewed as the final pose estimation result for I_{t+1} in a single-image manner, regardless of temporal dependency.
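The core of the propagation step is warping the previous frame's heatmaps by estimated motion offsets. The sketch below illustrates only that warping idea with a nearest-neighbor gather; the actual PoseWarper/PPU learns the offsets with dilated convolutions and applies deformable convolution, neither of which is reproduced here, and `warp_heatmap` is our hypothetical name.

```python
import numpy as np

def warp_heatmap(h_prev, offsets):
    """Warp the previous frame's heatmap by per-pixel motion offsets.

    h_prev:  (H, W) heatmap from the labeled/previous frame
    offsets: (H, W, 2) integer (dy, dx) motion estimates
    A nearest-neighbor gather stands in for the learned deformable
    convolution used by PoseWarper and the PPU.
    """
    H, W = h_prev.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_y = np.clip(ys - offsets[..., 0], 0, H - 1)
    src_x = np.clip(xs - offsets[..., 1], 0, W - 1)
    return h_prev[src_y, src_x]

# a single activated joint at (10, 10), uniformly moving down-right by (2, 3)
h_prev = np.zeros((32, 32))
h_prev[10, 10] = 1.0
offsets = np.tile(np.array([2, 3]), (32, 32, 1))
h_tilde = warp_heatmap(h_prev, offsets)
print(np.unravel_index(h_tilde.argmax(), h_tilde.shape))  # peak moves to (12, 13)
```

With a uniform offset of (2, 3) the single activation is translated from (10, 10) to (12, 13), which is exactly the kind of motion-conditioned transformation the offset tensors encode.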
In PoseWarper, geometric consistency is taken into consideration in the process of pose transfer and transformation, leading to a better pose estimation result for I_{t+1}. We design our PPU on the basis of PoseWarper. As shown in Fig. 2(b), the PPU takes the estimated joint heatmaps from the previous frame and the preliminary joint heatmaps of the current frame generated by BodyNet as inputs, producing the propagated joint heatmaps h̃_{t+1} based on the pose propagation mechanism illustrated above. The modifications we make are threefold: 1) We unify the pose propagation path between the training and evaluation phases. As shown in Fig. 3(a), frames are sparsely annotated for PoseWarper. During the training phase, poses are transferred from unlabeled frames to the labeled frame to meet supervision. On the contrary, during the evaluation phase, the labeled pose is reversely transferred to unlabeled frames to perform dense pose estimation. In our architecture, the training path illustrated above fails to fit our HeadNet-BodyNet configuration, as it is somewhat counterintuitive to refine a better pose by transforming a worse one, which intrinsically increases the difficulty of training. One possible pipeline for unifying the training and evaluation paths is shown in Fig. 3(b). Benefiting from densely-annotated datasets, poses can be propagated along that path to meet supervision at each time step during the training phase.
2) We modify the cascade scheme across frames. The pipeline shown in Fig. 3(b) propagates the high-quality pose from time step t to several subsequent time steps respectively. It perfectly corresponds to our HeadNet-BodyNet configuration, as the high-quality pose remains undamaged throughout the propagation process. However, this pipeline can be expected to perform poorly on long-range video sequences, since poses from temporally distant frames can hardly provide any useful guidance to the current frame in videos containing complicated human motions. We instead design our pose propagation path by building a connection between each neighboring frame pair, as shown in Fig. 3(c). In such a pipeline, poses are iteratively propagated from the previous frame to the current frame, ensuring the validity of the information flow. Therefore, our method is expected to be more scalable and capable of dealing with video sequences of different frame ranges, meeting various requirements in real applications.
3) Instead of treating the propagated joint heatmaps h̃_t as the final joint heatmaps for I_t, we further fuse them with the preliminary joint heatmaps ĥ_t via a skip connection. In our architecture, HeadNet, BodyNet and the PPU can be trained simultaneously. The propagated joint heatmaps h̃_t are generated by transforming the joint heatmaps h_{t−1} from the previous frame via deformable convolution. The preliminary joint heatmaps ĥ_t generated by BodyNet effectively vanish in that course and are thus not directly involved in the loss computation, which prevents BodyNet from receiving sufficient training. To solve this problem, we perform an identity mapping of ĥ_t and combine it with the propagated joint heatmaps h̃_t via elementwise addition. In this fashion, the preliminary joint heatmaps ĥ_t are explicitly involved in the loss computation, facilitating the effective training of BodyNet.
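The effect of this skip connection can be seen with two synthetic heatmaps: a propagated map with the right peak but many noisy false positives, and a preliminary map that is clean but diffuse. This is an illustrative toy, not the paper's data; shapes and values are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# propagated heatmap h_tilde: correct peak at (12, 13) plus noisy responses
h_tilde = 0.4 * rng.random((32, 32))
h_tilde[12, 13] += 0.6

# preliminary heatmap h_hat: clean but diffuse Gaussian blob around the joint
ys, xs = np.mgrid[0:32, 0:32]
h_hat = np.exp(-((ys - 12) ** 2 + (xs - 13) ** 2) / (2 * 4.0 ** 2))

# skip connection: identity mapping of h_hat + elementwise addition
h_final = h_hat + h_tilde
print(np.unravel_index(h_final.argmax(), h_final.shape))  # (12, 13)
```

Neither map alone is reliable (the noisy one mislocalizes under argmax perturbations, the diffuse one is imprecise), but their sum concentrates the response at the true joint location, matching the behavior visualized in Fig. 4.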

C. HeadNet AND BodyNet
We employ two pose estimation networks for different time steps, namely HeadNet and BodyNet. HeadNet is responsible for performing pose estimation on the first frame. Generally speaking, the quality of the initial pose decides the overall level of the pose estimation results for that video sequence. Therefore, a large network is typically employed as HeadNet to guarantee high performance. Afterwards, BodyNet takes charge of generating the preliminary pose for each frame. Since the pose propagation mechanism brings geometric knowledge from the previous frame to the current frame, BodyNet can be much more lightweight.

IV. EXPERIMENTS
A. DATASETS
1) PENN ACTION DATASET
Penn Action Dataset [6] contains 2326 video sequences of 15 different actions, where 1258 clips are used for training and 1068 clips for testing. The number of frames varies among video sequences. The 2D locations and visibility of 13 body joints are annotated for each frame, covering the head, shoulders, elbows, wrists, hips, knees and ankles. During testing, only visible joints are involved in the evaluation.
2) SUB-JHMDB DATASET
JHMDB [7] is another dataset for video-based pose estimation. For fair comparison with previous works, only a subset of JHMDB, named Sub-JHMDB, is used in our experiments. Sub-JHMDB consists of 316 video clips with 11200 frames in total. In Sub-JHMDB, only complete human bodies are involved, and 15 body joints are annotated for each human instance. There are three split schemes for Sub-JHMDB, and the ratio of training to testing samples is roughly 3:1. Following previous works [1], [4], [5], we train and evaluate our method separately on each split and report the average result over the three splits.

B. IMPLEMENTATION DETAILS
1) DATA AUGMENTATION
We perform data augmentation following previous works [4], [5], including random scaling ([0.8, 1.4]), random rotation ([−40°, 40°]) and random flipping. On account of the sequential input, the transformation remains consistent across frames within a video sequence. All frames are cropped based on the center and scale of the person instance and padded to a fixed size (256 × 256) as input.
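The sequence-consistency constraint amounts to sampling one set of augmentation parameters per clip and reusing it for every frame. A minimal sketch, with the parameter ranges taken from the paragraph above; the geometric application of the transform (scaling, rotating, flipping the pixels) is omitted, and the function names are ours.

```python
import random

def sample_sequence_transform(rng):
    """Sample one augmentation per video sequence:
    scale in [0.8, 1.4], rotation in [-40, 40] degrees, random flip."""
    return {
        "scale": rng.uniform(0.8, 1.4),
        "rotation": rng.uniform(-40.0, 40.0),
        "flip": rng.random() < 0.5,
    }

def augment_sequence(frames, rng):
    # sample once, then attach the SAME parameters to every frame so the
    # transformation stays consistent across the whole sequence
    params = sample_sequence_transform(rng)
    return [(frame, params) for frame in frames]

seq = augment_sequence(["I1", "I2", "I3"], random.Random(0))
```

Sampling per frame instead would break the geometric consistency that the pose propagation mechanism relies on.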

2) EXPERIMENT SETTINGS
Following previous works [4], [5], we pretrain all image-based pose estimation networks exploited in our experiments on the MPII dataset [18]. The frame range T of each sample is set to 5. Deconvolution layers used in our experiments follow the settings in [16]. The Adam optimizer [25] is adopted with a weight decay of 10^{-5}, and the learning rate is decreased linearly from 10^{-4} to 0. We set the batch size to 8 and train our network for 300k iterations. During the evaluation phase, seven scales {0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4} are used for multi-scale inference.
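The linear learning-rate schedule described above is simple to write down explicitly. A sketch under the stated settings (10^-4 decayed to 0 over 300k iterations); `linear_lr` is our hypothetical name.

```python
def linear_lr(step, total_steps=300_000, base_lr=1e-4):
    """Decay the learning rate linearly from base_lr at step 0
    down to 0 at the final training iteration."""
    return base_lr * (1.0 - step / total_steps)

print(linear_lr(0), linear_lr(150_000), linear_lr(300_000))
```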

3) EVALUATION METRICS
We adopt the PCK metric proposed in [11] to evaluate our pose estimation results. In PCK, a joint is considered correctly localized if it falls within a predefined distance α·L of the ground truth, where α is a controlling coefficient, conventionally set to 0.2, and L is a reference length. In [1], [4], L is set to max(H, W), where H and W denote the height and width of the bounding box of the person instance. However, since the person bounding box is large, this metric has been considered too loose to differentiate between methods. Following [5], [26], we additionally adopt the torso diameter as the reference length L, defined as the distance between the left shoulder and right hip of the ground-truth skeleton [26]. To avoid ambiguity, we term these two metrics PCK-body and PCK-torso, respectively.
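The metric definition above can be sketched directly: the two variants differ only in the reference length passed in (max(H, W) of the person box for PCK-body, torso diameter for PCK-torso). The function name and the toy coordinates below are illustrative, not from the paper.

```python
import numpy as np

def pck(pred, gt, visible, ref_len, alpha=0.2):
    """Fraction of visible joints within alpha * ref_len of the ground truth.

    pred, gt: (K, 2) joint coordinates; visible: (K,) boolean mask.
    ref_len is max(H, W) of the person box for PCK-body, or the
    left-shoulder-to-right-hip torso diameter for PCK-torso.
    """
    dists = np.linalg.norm(pred - gt, axis=1)
    correct = dists[visible] <= alpha * ref_len
    return float(correct.mean())

gt = np.array([[10.0, 10.0], [50.0, 50.0], [90.0, 10.0]])
pred = gt + np.array([[3.0, 0.0], [0.0, 25.0], [1.0, 1.0]])
visible = np.array([True, True, True])
print(pck(pred, gt, visible, ref_len=100.0))  # threshold 20 px -> 2/3 correct
```

Shrinking `ref_len` from the box size to the torso diameter tightens the threshold, which is why PCK-torso separates methods that PCK-body cannot.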

C. ABLATION STUDIES
We perform ablation studies to verify the effectiveness of our proposed PPN from two aspects. On the one hand, PPN can largely improve the performance of existing imagebased pose estimation networks in video domain by introducing pose propagation mechanism. On the other hand, PPN endows lightweight networks with the capability of performing accurate pose estimation by explicitly propagating highquality pose generated from a large network forward across frames.
Firstly, we investigate one of the state-of-the-art image-based pose estimation networks, Simple Baseline [16], which follows a high-to-low-to-high pipeline: it first extracts high-level low-resolution feature maps with the ResNet family [27] and then raises the resolution back to a decent level via 2-strided deconvolution layers. Specifically, we vary the backbone of the Simple Baseline models among ResNet-x, x ∈ {18, 34, 50, 101}, and evaluate each configuration on the Penn Action Dataset in the single-image framewise manner as baselines, denoted as Framewise (ResNet-x) in Table 1. Following the original settings in [16], three 2-strided 4 × 4 deconvolution layers are appended to recover resolution. For comparison, we adopt Simple Baseline models as both HeadNet and BodyNet in our PPN. We use PPN (ResNet-x) in Table 1 to denote our proposed network with ResNet-x as the backbone of HeadNet and BodyNet.
It can be observed from Table 1 that PPN improves single-image framewise pose estimation results by a large margin. The improvement is more obvious on the PCK-torso metric, since results on PCK-body tend to be somewhat saturated. By introducing the temporal pose propagation mechanism, PPN lifts the pose estimation accuracy by 0.90% on the PCK-body metric and 2.28% on the PCK-torso metric on average. The performance of PPN with a relatively small backbone, ResNet-18, even significantly surpasses the single-image framewise results with a larger backbone, ResNet-34 (92.1% versus 90.4% on PCK-torso). These results convincingly verify the effectiveness of our proposed Pose Propagation Unit in providing temporal guidance to refine single-image framewise pose estimation results.
Furthermore, we investigate the potential of lightweight networks to perform accurate pose estimation in videos by enforcing the pose propagation mechanism. From Table 1, we can see that lightweight networks alone are weak at producing satisfying pose estimation results. For example, Framewise (ResNet-18) achieves merely 88.7% accuracy on PCK-torso. We note that the deconvolution operation can be especially computationally intensive when applied to feature maps with large spatial size during the upsampling phase. Taking Framewise (ResNet-18-w-Deconv) in Table 2 as the baseline, we implement an ablation study to further reduce network computational intensity while enhancing network capability. On the one hand, instead of using expensive deconvolution layers, we investigate the Dense Upsampling Convolution (DUC) layer proposed in [28] to implement 2× upsampling. As shown in the 1st and 2nd rows of Table 2, by replacing deconvolution layers with DUC layers, we achieve over 2× FLOPs reduction with a minor accuracy decrease. On the other hand, in order to introduce the pose propagation mechanism, we adopt the state-of-the-art architecture on the MPII benchmark [18], HRNet-W48 [14], as our HeadNet to generate a high-quality initial pose for better performance; this configuration is denoted as PPN (ResNet-18-w-DUC) in Table 2. It can be observed from Table 2 that despite its weak performance at the single-image level, the capability of the lightweight ResNet-18-w-DUC network in the video domain is dramatically boosted by propagating the high-quality pose generated by the strong HeadNet across frames.
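DUC replaces a learned deconvolution with an ordinary convolution that produces C·r² channels, followed by a fixed channel-to-space rearrangement (the PixelShuffle operation). The sketch below shows only that rearrangement step in numpy; the preceding convolution is omitted, the channel ordering convention is one plausible choice, and `duc_upsample` is our name.

```python
import numpy as np

def duc_upsample(feat, r=2):
    """Dense Upsampling Convolution rearrangement (PixelShuffle-style):
    (H, W, C*r*r) -> (H*r, W*r, C). The convolution that produces the
    C*r*r channels is omitted in this sketch."""
    H, W, Cr2 = feat.shape
    C = Cr2 // (r * r)
    x = feat.reshape(H, W, r, r, C)
    x = x.transpose(0, 2, 1, 3, 4)   # interleave the r*r sub-pixel grids
    return x.reshape(H * r, W * r, C)

feat = np.arange(2 * 2 * 4, dtype=float).reshape(2, 2, 4)  # C=1, r=2
out = duc_upsample(feat)
print(out.shape)  # (4, 4, 1)
```

Because the upsampling itself is a free reshuffle, all the FLOPs sit in the preceding convolution at low resolution, which is why swapping deconvolution for DUC cuts cost so sharply.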
To further verify the capability of PPN to enable lightweight networks to perform accurate pose estimation in videos, we experiment with an even smaller backbone for BodyNet, MobileNet-V2 [29]. The effectiveness of the MobileNet family has been broadly evaluated in the fields of image classification, object detection and semantic segmentation. As shown in Table 2, we use Framewise (MobileNet-V2-w-Deconv) to denote single-image pose estimation with MobileNet-V2 as the backbone and deconvolution layers as the upsampling unit. Likewise, we replace the deconvolution layers with DUC layers to perform single-image framewise pose estimation, denoted as Framewise (MobileNet-V2-w-DUC). A significant FLOPs reduction can be observed in that setting, down to no more than 0.6G. We then treat this tiny network as BodyNet and employ HRNet-W48 as HeadNet to constitute our PPN, denoted as PPN (MobileNet-V2-w-DUC). Compared with PPN (ResNet-18-w-DUC), a dramatic FLOPs reduction can be witnessed while high performance is still maintained. By comparing the 3rd and 4th rows, as well as the 7th and 8th rows, of Table 2, we demonstrate the necessity of the skip connection (SC) in PPN that fuses the propagated joint heatmaps with the preliminary joint heatmaps. Additionally, we visualize the joint heatmaps of the left hip generated by BodyNet (2nd row), the PPU (3rd row) and their combination via the skip connection (4th row) in Fig. 4. It can be observed that the propagated joint heatmap of the left hip is noisy, with plenty of false-positive points with relatively high response values. The preliminary joint heatmap generated by the lightweight BodyNet is relatively clean, but its high responses fall in a large region around the precise location of the left hip. With the skip connection, the final joint heatmap is clean, with high responses compactly aggregated.
Finally, we verify the scalability of our method in adaptively performing pose estimation for video sequences with different frame ranges T. To better simulate real application scenes, we directly apply our representative model, PPN (ResNet-18-w-DUC) trained with T = 5, to testing samples with different frame ranges T ∈ {1, 2, 5, 10, 15}. As shown in Table 3, without being specially trained, our network still maintains competitive performance over long frame ranges. Based on the experimental results shown above, we adopt PPN (ResNet-18-w-DUC) and PPN (MobileNet-V2-w-DUC) as the two major configurations in our experiments, capable of generating poses across video frames accurately and efficiently, and term them PPN-Stable and PPN-Swift respectively for simplicity.

D. COMPARISON WITH STATE-OF-THE-ART METHODS
To verify the superiority of our method, we compare our PPN with the previous state of the art, the Dynamic Kernel Distillation (DKD) network proposed in [5], under the same settings. Specifically, in DKD, a large pose initializer is designed to generate the initial pose, and the following frame encoders for feature extraction are much smaller, which is fairly similar to our HeadNet-BodyNet configuration. Modified from Simple Baseline [16] models, the pose initializer and frame encoder used in DKD both follow a high-to-low-to-high pipeline, where the ResNet family is exploited to encode image representations and two 2-strided 4 × 4 deconvolution layers are appended to perform upsampling. Therefore, the total stride of the pose initializer and frame encoder in DKD is larger than that of the configurations in Table 1, which use three deconvolution layers. The FLOPs and the evaluation result on PCK-torso of each configuration are reported.
It can be observed from Table 4 that our PPN significantly outperforms DKD on the stricter PCK-torso metric, with 1.13% accuracy improvement on average. Especially for the localization of shoulder and wrist joints, PPN achieves 1.63% and 1.90% accuracy improvement on average, respectively. Moreover, compared with the pose kernels employed in DKD to transfer temporal knowledge, our PPU propagates well-estimated poses across frames to provide temporal guidance in a more compact manner (0.20G versus 0.99G additional FLOPs over the baselines). The superiority of our method is thus verified from the perspectives of both accuracy and efficiency.
In addition, we compare our two representative configurations, PPN-Stable and PPN-Swift, with previous state-of-the-art methods for video pose estimation on the Penn Action Dataset, as shown in Table 5. Our method significantly outperforms all previous state-of-the-art methods in both accuracy and efficiency. In terms of accuracy, PPN-Stable achieves 1.0% improvement on PCK-body and 1.3% improvement on PCK-torso over the previous best method. Our tiny configuration, PPN-Swift, also produces better results than the state of the art, achieving 0.7% improvement on PCK-body and 0.9% improvement on PCK-torso over the previous best method. Moreover, our method diminishes computational complexity by a large margin compared with the state of the art. Compared with the LSTM Pose Machines proposed by Luo et al. [4], PPN reduces FLOPs by over an order of magnitude (3.87G/1.39G versus 70.98G). Compared with the previous best method [5], our two configurations, PPN-Stable and PPN-Swift, achieve 2.5× and 6× FLOPs reduction respectively. We visualize the comparison of accuracy and efficiency between our method and the above two state-of-the-art methods in Fig. 1(d), demonstrating the great superiority of our method. Table 6 shows the comparison on the Sub-JHMDB Dataset between our method and the previous state of the art. The scale of person instances in the Sub-JHMDB Dataset is generally smaller than in the Penn Action Dataset, which makes it more challenging to generate accurate pose estimation results on Sub-JHMDB. Compared with the previous best method [5], our two configurations, PPN-Stable and PPN-Swift, achieve 2.4% and 1.9% accuracy improvement on the PCK-body metric, and 3.9% and 3.0% accuracy improvement on the PCK-torso metric.

E. QUALITATIVE RESULTS
We provide qualitative results on randomly selected frames from the Penn Action Dataset and the Sub-JHMDB Dataset to demonstrate the capability of our PPN. As shown in Fig. 5, PPN robustly produces accurate pose estimation results against several troublesome factors, such as motion blur (the 3rd row of Fig. 5(b)), scale change (the 4th row of Fig. 5(b)) and articulated occlusion (the 3rd and 4th rows of Fig. 5(a), the 1st and 2nd rows of Fig. 5(b)). Besides, frames with crowded backgrounds can be handled effectively, as shown in the 1st row of Fig. 5(a). Moreover, the person scale, viewpoint and illumination vary among frames, reflecting the great robustness of our proposed PPN.

V. CONCLUSIONS
In this paper, we propose a novel architecture, PosePropagationNet, for video pose estimation. We implement the pose propagation mechanism via the Pose Propagation Unit in PPN, allowing well-estimated poses to be propagated across frames as the most explicit form of temporal guidance. Benefiting from the pose propagation mechanism, lightweight networks gain the capability of performing accurate pose estimation in videos. Our experiments on two large-scale benchmarks, the Penn Action Dataset and the Sub-JHMDB Dataset, show that our method significantly outperforms previous state-of-the-art methods in both accuracy and efficiency. Our two representative configurations, PPN-Stable and PPN-Swift, achieve 2.5× and 6× FLOPs reduction respectively over the previous best method, as well as significant accuracy improvement.
YU LIU received the bachelor's degree from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2017. He is currently pursuing the master's degree with Tsinghua University. His research interests include machine learning and action recognition.