1. Introduction
Recovering a 3D human mesh from an image or a video is an essential yet challenging task for many applications, such as human-robot interaction, virtual reality, and motion analysis. The challenges arise from the 2D-to-3D ambiguity, cluttered backgrounds, and occlusions.

Recently, many methods [8], [13], [16], [19], [22], [33] have been proposed to recover the 3D human mesh from a single image. They can generally be categorized into RGB-based methods and pose-based methods. RGB-based methods predict the human mesh end-to-end from image pixels, typically by regressing the pose and shape parameters of a parametric human model (e.g., SMPL [27]) to generate the 3D human mesh. However, the representation ability of the parametric model is constrained by its limited pose and shape space [18], [19]. To overcome this limitation, non-parametric approaches directly predict the 3D coordinates of the mesh vertices, generally using Graph Convolutional Networks (GCNs) [8], [42] or Transformers [5], [24], [51] to capture the relations among vertices. In contrast, pose-based methods leverage 2D pose detectors [4], [36] as a front end and recover the human mesh from the detected 2D poses. With the significant advances in 2D pose detection, pose-based methods have become increasingly robust and lightweight, making them popular for real-world applications [51].
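To make the parametric/non-parametric distinction concrete, the sketch below shows how a parametric model such as SMPL maps low-dimensional pose and shape parameters to mesh vertices; this is a minimal illustration, not the pipeline of any cited method, and it assumes the third-party `smplx` Python package and a locally downloaded SMPL model file (the `"models/"` path is hypothetical).

```python
# Minimal sketch: a parametric body model maps pose/shape parameters to a 3D mesh.
# Assumes the third-party `smplx` package and a local SMPL model file.
import torch
import smplx

# Load a neutral SMPL body model; "models/" is a hypothetical local path.
body_model = smplx.create("models/", model_type="smpl", gender="neutral")

batch_size = 1
betas = torch.zeros(batch_size, 10)          # shape coefficients
body_pose = torch.zeros(batch_size, 69)      # 23 body joints x 3 axis-angle values
global_orient = torch.zeros(batch_size, 3)   # root orientation

# Forward pass: the model returns the full mesh (6890 vertices for SMPL)
# together with 3D joints regressed from those vertices.
output = body_model(betas=betas, body_pose=body_pose,
                    global_orient=global_orient, return_verts=True)
vertices = output.vertices   # (batch_size, 6890, 3) mesh vertices
joints = output.joints       # (batch_size, J, 3) regressed 3D joints

print(vertices.shape, joints.shape)
```

Non-parametric approaches instead regress the per-vertex coordinates (the `vertices` tensor above) directly from the image or the detected 2D pose, avoiding the constraint of the model's pose and shape space at the cost of a much higher-dimensional output.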