
Co-Evolution of Pose and Mesh for 3D Human Body Estimation from Video


Abstract:

Despite significant progress in single image-based 3D human mesh recovery, accurately and smoothly recovering 3D human motion from a video remains challenging. Existing video-based methods generally recover human mesh by estimating the complex pose and shape parameters from coupled image features, whose high complexity and low representation ability often result in inconsistent pose motion and limited shape patterns. To alleviate this issue, we introduce 3D pose as the intermediary and propose a Pose and Mesh Co-Evolution network (PMCE) that decouples this task into two parts: 1) video-based 3D human pose estimation and 2) mesh vertex regression from the estimated 3D pose and temporal image feature. Specifically, we propose a two-stream encoder that estimates the mid-frame 3D pose and extracts a temporal image feature from the input image sequence. In addition, we design a co-evolution decoder that performs pose and mesh interactions via image-guided Adaptive Layer Normalization (AdaLN) to make the pose and mesh fit the human body shape. Extensive experiments demonstrate that the proposed PMCE outperforms previous state-of-the-art methods in both per-frame accuracy and temporal consistency on three benchmark datasets: 3DPW, Human3.6M, and MPI-INF-3DHP. Our code is available at https://github.com/kasvii/PMCE.
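The image-guided AdaLN mentioned above admits a compact illustration. The following is a minimal PyTorch sketch, assuming the standard adaptive-normalization formulation in which the per-channel scale and shift of layer normalization are regressed from a conditioning vector (here, the temporal image feature); the class name, dimensions, and token counts are our own illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AdaLN(nn.Module):
    """Adaptive Layer Normalization conditioned on an image feature.

    Sketch of the idea: instead of fixed learned affine parameters,
    the per-channel scale (gamma) and shift (beta) are regressed from
    a conditioning vector, so normalization of the pose/mesh tokens
    adapts to shape cues in the image feature. Names and sizes are
    illustrative assumptions, not the authors' implementation.
    """

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        # Plain LayerNorm without its own affine parameters.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Regress gamma and beta from the conditioning feature.
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (batch, num_tokens, dim), e.g. joint and vertex tokens
        # cond: (batch, cond_dim),        e.g. the temporal image feature
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        # Broadcast the conditioning over the token dimension.
        return self.norm(x) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)


if __name__ == "__main__":
    layer = AdaLN(dim=256, cond_dim=512)
    tokens = torch.randn(2, 17 + 431, 256)  # joint + vertex tokens (counts illustrative)
    img_feat = torch.randn(2, 512)
    print(layer(tokens, img_feat).shape)    # torch.Size([2, 448, 256])
```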
Date of Conference: 01-06 October 2023
Date Added to IEEE Xplore: 15 January 2024
Conference Location: Paris, France

1. Introduction

Recovering 3D human mesh from an image or a video is an essential yet challenging task for many applications, such as human-robot interaction, virtual reality, and motion analysis. The challenges arise from 2D-to-3D ambiguity, cluttered backgrounds, and occlusions. Recently, many methods [8],[13],[16],[19],[22],[33] have been proposed to recover the 3D human mesh from a single image; they can generally be categorized into RGB-based methods and pose-based methods. RGB-based methods predict the human mesh end-to-end from image pixels, typically estimating the pose and shape parameters of a parametric human model (e.g., SMPL [27]) to generate the 3D human mesh. However, the representation ability of the parametric model is constrained by its limited pose and shape space [18],[19]. To overcome this limitation, non-parametric approaches predict the 3D coordinates of mesh vertices directly, generally using Graph Convolutional Networks (GCNs) [8],[42] or Transformers [5],[24],[51] to capture the relations among vertices. In contrast, pose-based methods use 2D pose detectors [4],[36] as a front end and recover the human mesh from the detected 2D poses. With the significant advances in 2D pose detection, pose-based methods have become increasingly robust and lightweight, making them popular for real-world applications [51].
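To make the parametric/non-parametric distinction concrete, the sketch below contrasts the two output heads. The dimensions follow SMPL conventions (72 pose parameters, 10 shape parameters, 6890 mesh vertices), but the single linear layers, feature size, and class names are illustrative simplifications, not any specific published architecture; in practice the non-parametric route uses GCNs or Transformers over vertices rather than one linear map.

```python
import torch
import torch.nn as nn


class ParametricHead(nn.Module):
    """Regress SMPL pose/shape parameters; a fixed SMPL layer (not shown)
    would decode them into a mesh. Expressiveness is bounded by SMPL's
    pose and shape space."""

    def __init__(self, feat_dim: int = 2048):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 72 + 10)  # theta (pose) + beta (shape)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.fc(feat)  # (batch, 82), fed to an SMPL layer downstream


class NonParametricHead(nn.Module):
    """Directly regress 3D coordinates for every mesh vertex, bypassing
    the parametric model's constrained space."""

    def __init__(self, feat_dim: int = 2048, num_vertices: int = 6890):
        super().__init__()
        self.num_vertices = num_vertices
        self.fc = nn.Linear(feat_dim, num_vertices * 3)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.fc(feat).view(-1, self.num_vertices, 3)


if __name__ == "__main__":
    feat = torch.randn(4, 2048)               # e.g. pooled CNN image feature
    print(ParametricHead()(feat).shape)       # torch.Size([4, 82])
    print(NonParametricHead()(feat).shape)    # torch.Size([4, 6890, 3])
```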
