I. Introduction
Hands play a central role in human interaction with the environment, from physical contact and grasping to everyday communication via hand gestures. Learning 3D hand reconstruction is a prerequisite for many computer vision applications such as augmented reality [1], sign language translation [2], [3], and human-computer interaction [4], [5], [6]. However, due to the diversity of hand configurations and interactions with the environment, 3D hand reconstruction remains a challenging problem, especially when the task relies on monocular data as input.