Light and Fast Hand Pose Estimation From Spatial-Decomposed Latent Heatmap

We present a light and efficient approach, named Latent Fusion network, for fast and accurate hand pose estimation from a single depth image. Our method innovatively decomposes 3D joint regression into 2D plane localization and 1D axis estimation from different spatial perspectives. We design multiple latent heatmap regression branches to predict the hand pose separately and a fusion network to output the final result. Experiments on three public hand pose datasets (ICVL, NYU, MSRA) demonstrate that our system achieves state-of-the-art accuracy. Moreover, our method outperforms all top-ranked approaches by a large margin in both inference speed (nearly a thousand frames per second) and model size (less than 10 MB).


I. INTRODUCTION
3D hand pose estimation is a primary technique in human-computer interaction and an essential research topic in the computer vision community [1]. With the wide availability of depth sensors such as Intel RealSense [2] and Microsoft Kinect [3], depth-based hand pose estimation has attracted much research attention [4]-[8]. Despite the great improvement achieved in this field, accurate and robust estimation in real time and at low cost remains challenging. On the one hand, the flexibility of the articulated hand and severe self-occlusion make localizing hand joints in 3D space quite difficult. On the other hand, proposed methods should be highly efficient and lightweight to suit real-world applications.
Recent years have witnessed the success of convolutional neural networks (CNNs) in hand pose estimation. CNN-based methods [9]-[18] have an advantage in efficiency and robustness. Given an input depth image, traditional CNN-based methods simply feed it into a 2D CNN, while recent studies convert it into 3D voxels beforehand and use a 3D CNN to better exploit spatial information. Though they greatly improve estimation accuracy, 3D CNN-based methods suffer from large memory requirements and low running speed.
The associate editor coordinating the review of this manuscript and approving it for publication was Fanbiao Li.

Joint coordinates and spatial heatmaps are two commonly used output representations for CNN-based methods. Coordinate regression-based methods [10], [12]-[14], [17], [19] use fully connected layers to output target joint coordinates directly. Heatmap regression-based methods [11], [15], [20] produce a probability heatmap for each joint in Gaussian distribution, whose peak is positioned at the ground-truth joint location and whose standard deviation σ is manually assigned, representing per-pixel or per-voxel likelihood. Heatmap regression-based methods outperform coordinate regression-based methods in estimation accuracy [8], but extracting joint locations from heatmaps via the non-differentiable argmax operation is time-consuming. Furthermore, an explicitly assigned distribution with a fixed σ is not ideal for all joints. Therefore, these defects must be addressed before applying heatmap regression to depth-based hand pose estimation.
In this work, we propose a novel spatial-decomposed latent heatmap regression method with a single 2D depth image as input and a 2D CNN architecture. To tackle the aforementioned problems of heatmap regression, we introduce latent heatmaps in our network. The automatically learnt heatmaps are fully differentiable, and only a multiply-accumulate operation is needed to extract joint coordinates. To better alleviate the self-occlusion problem and boost estimation performance, we extend our method to multiple regression branches with different spatial decompositions. The final output is the weighted average of the branches' predictions via a fusion network. Experiments show that our Latent Fusion network has the best overall performance (estimation accuracy, inference speed and model size) on three challenging hand pose datasets (ICVL, NYU, MSRA). Our method outperforms all previous approaches on the ICVL dataset and achieves state-of-the-art accuracy on the NYU and MSRA datasets. Moreover, our method can run at nearly a thousand frames per second during testing on a single GPU, 8 times faster than the current best result [18] among top-ranked approaches. We also maintain the lightest model size, which is less than 10 MB.
Our contributions are summarized below.
1) We present a light and fast Latent Fusion network for depth-based hand pose estimation. The network takes a 2D depth image as input and adopts latent heatmap regression to localize 3D hand joints.
2) We design a multi-branch network to regress hand joints from different spatial decompositions. The proposed architecture can better excavate 3D information and alleviate the self-occlusion problem.
3) We propose a novel pose fusion strategy to combine different branches' predictions into the most accurate final output, which can significantly boost the estimation performance.
The remainder of this paper is organized as follows. Section II reviews related work. Section III introduces the details of our proposed Latent Fusion network. Comprehensive experiments and ablation study are provided in Section IV. Section V gives a conclusion of this paper.

II. RELATED WORK
A. DEPTH-BASED HAND POSE ESTIMATION
Hand pose estimation from a single depth image can be generally categorized into three classes: generative methods, discriminative methods and hybrid methods. Generative methods [21]-[25] use pre-defined hand models to fit input depth images, optimizing the model parameters with an energy function. Although generative methods leverage kinematic constraints, accumulated estimation error and a tedious optimization process make them impractical for real-world applications. Discriminative methods learn optimal hand joint positions directly. They are data-driven and the most popular approach to hand pose estimation. Hybrid methods [26]-[30] combine generative and discriminative methods, but still suffer from the same problems as generative methods. Earlier discriminative methods use random forests [31]-[33] to estimate hand joint locations. They have gradually been replaced by CNN-based methods with more powerful feature extractors. CNN-based methods can be divided into two groups: 2D-based and 3D-based. 2D-based CNN methods [10], [12], [14], [17] treat the depth image as a single-channel image, while 3D-based CNN methods transform the depth image into voxels [13], [15], [20] or point clouds [19], [34]-[36].
Since voxelization for a 3D CNN and point-cloud sampling are time-consuming, we adopt a 2D CNN in our method without complicated data preprocessing.

B. HEATMAP REGRESSION
Heatmap regression benefits from a stronger supervision signal and a fully convolutional network (FCN) architecture. The 2D joint heatmap for hand pose estimation was first proposed in [9] and greatly improved by later works. Moon et al. [15] adopted 3D heatmaps for joint regression. Wan et al. [16] utilized both 2D and 3D heatmaps and obtained the output by the mean-shift algorithm. However, 2D heatmaps are inadequate for 3D pose estimation, while 3D heatmaps require 3D input. Mixed representations like [16] can't be trained in an end-to-end manner. To this end, we cast the hand joint space into multiple heatmap groups under different spatial decompositions to better excavate 3D information and regress hand joint coordinates.
Another challenge for heatmap regression is the selection of σ. If the selected σ is small, the supervision signal will be sparse, and the situation can even degrade to coordinate regression. If the selected σ is large, joint locations will be inaccurate as candidate regions are large. To solve this issue, Wu et al. [37] employed a dense guidance map with geodesic approximation, but it brought arduous computation. Iqbal et al. [38] proposed latent heatmap regression for RGB image input, which is somewhat inaccurate for depth images with severe self-occlusion. Inspired by this work [38], we combine the ideas of latent heatmap regression and spatial decomposition to boost the accuracy of depth-based hand pose estimation.

C. ENSEMBLE AND FUSION
Ensemble and fusion techniques have been widely used in hand pose estimation. Moon et al. [15] applied epoch ensembling to average pose predictions from models at different training stages. Guo et al. [12], [17] proposed region ensembling to extract the most representative feature regions and fuse them for holistic regression. Ge et al. [11] projected the point cloud of the depth image onto different views as multiple inputs and obtained the final fusion result by maximum a posteriori estimation. Different from all previous works, our network consists of multiple separate regression branches with a single depth image input. We further design a tiny fusion network that directly outputs the weighted average over the branches' predictions by assigning each prediction a normalized fusion weight.

III. LATENT FUSION NETWORK
The whole architecture of the Latent Fusion network is illustrated in Fig. 1. The Latent Fusion network takes a 2D depth image as input and outputs the corresponding hand joint locations. The main backbone of our network is the hourglass module [39], which consists of recursive sampling blocks and residual blocks. The input depth image first goes through a feature extraction network to generate sharing features as input for the different regression branches. Each branch conducts latent heatmap regression from a different spatial decomposition separately. The final pose is the weighted average of all branches' predictions via the fusion network.

VOLUME 8, 2020

FIGURE 1. Overview of our proposed network architecture. Our model consists of three subnetworks: the feature extraction network, the multi-branch latent heatmap regression network and the pose fusion network. The feature extraction network takes a single 2D depth image as input and generates sharing low-level features for the following latent heatmap regression branches. Each branch estimates 3D hand joint coordinates from a group of latent heatmaps with a different spatial decomposition. The final output is the weighted average of all three branches' predictions via the fusion network. The whole framework is trained in an end-to-end manner by introducing a separate loss function for each branch and the fusion output.
We describe details in the following order. III-A introduces latent heatmap regression. III-B extends it to multi-branch prediction with spatial decomposition. III-C describes the fusion network. III-D presents the loss function. Implementation details are discussed in III-E.

A. LATENT HEATMAP REGRESSION
The goal of 3D hand pose estimation is to learn a mapping from the input depth image I to the hand joint locations J = {p_j}_{j=1}^{M}, where p_j is the 3D location of joint j and M is the total number of joints. Instead of using explicit heatmap regression with a fixed σ to generate a Gaussian distribution, we adopt the latent heatmap regression proposed in [38] to learn an optimal distribution automatically without constraints. To legalize the learnt distribution, spatial softmax normalization is applied on the latent heatmap to enforce that it sums strictly to one. Since we use latent heatmaps in 2D form, normalization is applied on each joint j's 2D heatmap plane as follows:

H*_j(u, v) = exp(H_j(u, v)) / Σ_{u', v'} exp(H_j(u', v'))    (1)

where u and v are pixel locations in the 2D plane, and H_j and H*_j are the latent 2D heatmaps before and after spatial normalization.
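As an illustration, the normalization in equation 1 can be sketched in a few lines of NumPy (the function name is ours, and the actual implementation operates on batched PyTorch tensors):

```python
import numpy as np

def spatial_softmax(heatmap):
    """Normalize a latent 2D heatmap so it sums to one (equation 1).

    heatmap: (H, W) array of unconstrained scores H_j(u, v).
    Returns H*_j: same shape, non-negative, summing exactly to one.
    """
    # Subtracting the max before exponentiating improves numerical
    # stability and leaves the normalized result unchanged.
    e = np.exp(heatmap - heatmap.max())
    return e / e.sum()
```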
Since latent 2D heatmaps only provide per-pixel likelihood and depth information is needed for localizing hand joints in 3D space, we output latent 1D heatmaps for depth estimation along with the latent 2D heatmaps. Each element of the latent 1D heatmap D_j is a predicted depth value at the corresponding position of the normalized latent 2D heatmap H*_j. We can obtain the 3D coordinates of joint j via H*_j and D_j by the following equations:

u_j = Σ_{u,v} u · H*_j(u, v),  v_j = Σ_{u,v} v · H*_j(u, v),  d_j = Σ_{u,v} D_j(u, v) · H*_j(u, v)    (2)

where u and v are the corresponding normalized pixel coordinates within [−1, 1]. The pixel location (u_j, v_j) is the centroid of the normalized latent 2D heatmap H*_j, while the depth estimate d_j is the weighted average of the latent 1D heatmap D_j under the normalized latent 2D heatmap H*_j. The detailed architecture of latent heatmap regression is shown in Fig. 2. Our network outputs 2M feature maps simultaneously: M latent 2D heatmaps and M latent 1D heatmaps. Therefore, after applying spatial softmax normalization on the latent 2D heatmaps, by using equation 2 and a multiply-accumulate operation we can obtain the joint coordinates {p_j}_{j=1}^{M} easily.

FIGURE 2. The detailed architecture of latent heatmap regression. As 3D joint regression is decomposed into 2D plane localization and 1D depth estimation, the network outputs a latent 2D heatmap and a latent 1D heatmap for each joint separately. After applying spatial softmax normalization on the latent 2D heatmaps, the output pose prediction can be obtained by equation 2.
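The extraction in equation 2 amounts to a soft-argmax over the 2D heatmap plus a heatmap-weighted depth average. A minimal NumPy sketch (array shapes and names are illustrative; the real network works on batched tensors):

```python
import numpy as np

def joints_from_heatmaps(h2d_norm, d1d):
    """Recover (u_j, v_j, d_j) per joint via equation 2.

    h2d_norm: (M, H, W) spatially normalized latent 2D heatmaps H*_j.
    d1d:      (M, H, W) latent depth maps D_j (a depth value per pixel).
    Returns an (M, 3) array of (u, v, d), with u and v in [-1, 1].
    """
    M, H, W = h2d_norm.shape
    # Normalized pixel coordinate grids in [-1, 1].
    us = np.linspace(-1.0, 1.0, W)
    vs = np.linspace(-1.0, 1.0, H)
    u = (h2d_norm * us[None, None, :]).sum(axis=(1, 2))  # heatmap centroid, x
    v = (h2d_norm * vs[None, :, None]).sum(axis=(1, 2))  # heatmap centroid, y
    d = (h2d_norm * d1d).sum(axis=(1, 2))                # weighted depth
    return np.stack([u, v, d], axis=1)
```

Because every step is a multiply-accumulate, the extraction is fully differentiable, unlike an argmax over an explicit heatmap.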

B. SPATIAL DECOMPOSITION
Section III-A introduces single-branch latent heatmap regression, in which the normalized latent 2D heatmap H*_j provides per-pixel likelihood in the XY-plane while the latent 1D heatmap predicts the corresponding depth value along the Z-axis. In this way, 3D joint regression is decomposed into 2D plane localization (XY-plane) and 1D axis estimation (Z-axis). Since single-branch regression is insufficient for accurate hand pose estimation in some extreme cases, a spatial decomposition strategy can learn richer 3D representations and better alleviate the self-occlusion problem. Specifically, we design three separate latent heatmap regression branches and cyclically decompose the X, Y and Z axes into 2D plane localization and 1D axis estimation inside each branch. Hence the multi-branch regression network contains three different spatial decompositions (XY-plane and Z-axis, XZ-plane and Y-axis, YZ-plane and X-axis). Each regression branch has the same architecture and receives the common sharing input features, as shown in Fig. 3. Besides, considering that the spread of the latent 2D heatmaps for the same joint should be identical no matter how the spatial perspective changes, we introduce a group of sharing parameters ω as a cross-branch constraint for different joints. The learnable parameters ω control the spread of the heatmaps and are updated synchronously among all branches. Thus, the spatial normalization for joint j in equation 1 can be reformulated as follows:

H*_j(u, v) = exp(ω_j H_j(u, v)) / Σ_{u', v'} exp(ω_j H_j(u', v'))    (3)

where ω_j is the sharing parameter of joint j in all three branches. All other operations inside each branch remain the same as in section III-A.
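Equation 3 is a spatial softmax with a learnable per-joint temperature: larger ω_j sharpens the distribution, smaller ω_j flattens it. A NumPy sketch (illustrative only; in the real network ω_j is a learnable parameter shared across the three branches):

```python
import numpy as np

def spatial_softmax_shared(heatmap, omega_j):
    """Spatial softmax with a per-joint spread parameter (equation 3).

    heatmap: (H, W) latent scores H_j(u, v) from one branch.
    omega_j: positive scalar shared by all branches for joint j.
    """
    # Max-subtraction keeps the exponentials stable; the ratio is
    # unchanged for positive omega_j.
    e = np.exp(omega_j * (heatmap - heatmap.max()))
    return e / e.sum()
```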

C. FUSION NETWORK
As each branch gives an independent estimate {p_j}_{j=1}^{M}, the final output should be an aggregation of all predictions. Therefore, we assign a fusion weight to each branch's prediction. The fusion weights are produced by the fusion network, which receives transformed input from the sharing features and latent heatmaps, as illustrated in Fig. 4. Inside the network, we use two convolutional layers and two max pooling layers to reduce the spatial size of the feature maps. To generate a 1×1 fusion weight for each prediction, we employ global average pooling [40] and a sigmoid activation function at the end of the network. The final fusion result for joint j is obtained as follows:

p_j = Σ_{i=1}^{3} μ^i p_j^i    (4)

where p_j^i denotes the pose prediction from branch i and μ^i represents the corresponding normalized fusion weight.
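Equation 4 reduces to a sigmoid, a normalization and a weighted sum. A simplified NumPy sketch (here one scalar weight per branch; in the actual network the fusion weights are predicted from features):

```python
import numpy as np

def fuse_predictions(branch_preds, raw_weights):
    """Weighted average of per-branch poses (equation 4).

    branch_preds: (3, M, 3) predicted joints from the three branches.
    raw_weights:  (3,) pre-activation outputs of the fusion network.
    """
    w = 1.0 / (1.0 + np.exp(-np.asarray(raw_weights, dtype=float)))  # sigmoid
    mu = w / w.sum()  # normalize so the weights sum to one
    return (mu[:, None, None] * branch_preds).sum(axis=0)
```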

D. LOSS FUNCTION
The loss function for branch i is defined as follows:

L_i = L_sml1(J_i, Ĵ)    (5)

where J_i = {p_j^i}_{j=1}^{M} denotes the predicted hand joint coordinates and Ĵ denotes the ground-truth labels. L_sml1 is the smooth L1 loss proposed in [44]. The loss function L_f for the final fusion result is defined in a similar way.
The total loss function for the entire network is:

L = Σ_{i=1}^{3} L_i + L_f    (6)
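The loss can be sketched as follows (a NumPy approximation of the smooth L1 loss of [44] with β = 1; the real implementation would use PyTorch's built-in version):

```python
import numpy as np

def smooth_l1(pred, gt, beta=1.0):
    """Smooth L1 loss [44]: quadratic for small errors, linear for large."""
    diff = np.abs(np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float))
    per_elem = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return per_elem.mean()

def total_loss(branch_preds, fused_pred, gt):
    """L = sum_i L_i + L_f: one smooth L1 term per branch plus one for
    the fusion output (equation 6)."""
    return sum(smooth_l1(p, gt) for p in branch_preds) + smooth_l1(fused_pred, gt)
```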

E. IMPLEMENTATION DETAILS
Our network is implemented in PyTorch [45]. The detailed architectures of the feature extraction network, hourglass network and fusion network can be found in Table 6 in the appendix. A GeForce GTX 1080 Ti GPU is used for training and testing.

2) PARAMETER SETTINGS
We choose 32 × 32 as the input and output resolution of the hourglass module and fix the number of feature channels (128) in each block. All network weights are initialized from a zero-mean Gaussian distribution with σ = 0.001. The learnable parameters ω that control the spread of the latent heatmaps are initialized to 1.

3) TRAINING AND TESTING
We use the Adam [46] optimizer to train the network for 100 epochs with a batch size of 32. The learning rate is set to 0.001 and divided by 10 every 30 epochs. During testing, our network achieves 946 fps on a single GPU.
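The step-decay schedule above can be written as a one-line helper (illustrative function, not part of the released code; PyTorch's StepLR achieves the same effect):

```python
def learning_rate(epoch, base_lr=1e-3, step=30, factor=10):
    """Step decay: start at 1e-3 and divide by 10 every 30 epochs."""
    return base_lr / factor ** (epoch // step)
```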

IV. EXPERIMENTS
A. DATASETS AND EVALUATION METRICS
1) ICVL DATASET
ICVL dataset [32] contains 330K frames for training and 1596 samples for testing. Each frame is labeled with 16 joints, including 3 joints (Root, Middle, Tip) per finger and 1 joint for the palm.
2) NYU DATASET
The NYU dataset [9] contains 72757 frames for training and 8252 samples for testing. Since one subject in the test set doesn't appear in the training set and a large range of hand poses is covered, the dataset is quite challenging and far from saturated. Following the protocol of previous works, we use 14 joints from the frontal view out of the 36 annotated joints for evaluation.

3) MSRA DATASET
MSRA dataset [33] contains 76K frames from 9 different subjects. The leave-one-subject-out cross-validation strategy is utilized for evaluation. Each depth image is labeled with 21 joints, including 4 joints (MCP, PIP, DIP, TIP) per finger and 1 joint for the palm.

4) EVALUATION METRICS
We adopt the two most commonly used evaluation metrics: mean joint error and success rate. The former is the average Euclidean distance between predicted and annotated joint coordinates, while the latter is the proportion of test frames in which all joint errors fall below a given threshold.
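Both metrics follow directly from these definitions, as in this NumPy sketch (array shapes and names are illustrative; errors are in the same units as the coordinates, typically mm):

```python
import numpy as np

def mean_joint_error(pred, gt):
    """Average Euclidean distance over all joints and frames.
    pred, gt: (N, M, 3) arrays of joint coordinates."""
    return np.linalg.norm(pred - gt, axis=2).mean()

def success_rate(pred, gt, threshold):
    """Fraction of frames whose *worst* joint error is below threshold."""
    worst = np.linalg.norm(pred - gt, axis=2).max(axis=1)
    return (worst < threshold).mean()
```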

B. COMPARISON WITH STATE-OF-THE-ARTS
To evaluate the overall performance (estimation accuracy, inference speed and model size), we compare estimation accuracy against most state-of-the-art methods on ICVL [32], NYU [9] and MSRA [33] datasets. Among those top-ranked approaches on three datasets, we further compare the inference speed and model size to demonstrate the superiority of our method.
On the NYU dataset, we compare with [10], [12]-[19], [28], [30], [34], [35], [41]-[43]. As can be seen from Fig. 6 and Table 1b, our method outperforms most state-of-the-art methods and is on par with the rest. When the error threshold is less than 8 mm, our method achieves the best performance among all evaluated methods. However, as the maximum allowed error threshold grows, our method's performance drops slightly, which can be attributed to the simple hand segmentation strategy used in III-E.1. As the NYU dataset [9] is captured by a structured-light camera and contains many invalid pixels, naive depth thresholding and a fixed-size cube can't filter the noisy background completely. Other methods like [15] design additional networks to refine hand localization and extract hand regions.

On the MSRA dataset, we compare the performance of our method with [11], [13]-[19], [33]-[35], [42]. Results are shown in Fig. 7 and Table 1c. As illustrated, we obtain results comparable with [15], [18], [19], [35]. Though inferior to [16], our method has an absolute edge in inference speed and model size.
Qualitative results for ICVL, NYU and MSRA datasets are shown in Fig. 9, Fig. 10, Fig. 11 respectively. As can be seen, our method can well capture complex hand structures with different joint annotations and effectively alleviate the self-occlusion problem.
In addition, our model has only 2.000M parameters: 0.510M for the feature extraction network, 1.419M for the multi-branch latent heatmap regression network and 0.071M for the pose fusion network. The model size of our method is 7.9 MB. We compare the model size against [15], [16], [18], [34], [35]. As shown in Table 3, our method ranks first and is light enough for real-world applications. The detailed runtime and model parameter profiles can be seen in Table 4.

C. ABLATION STUDY
In this section, we conduct ablation experiments on the NYU dataset to analyze the impact of different module designs in the Latent Fusion network. We incrementally introduce five baselines for comparison:
1) Explicit Heatmap (B1). We adopt single-branch explicit heatmap regression. The target heatmap is a 2D Gaussian centred at the ground-truth joint position with fixed σ = 1.5.
2) Latent Single YZ&X (B2). We adopt single-branch latent heatmap regression with YZ-plane localization and X-axis estimation.
3) Latent Single XZ&Y (B3). We adopt single-branch latent heatmap regression with XZ-plane localization and Y-axis estimation.
4) Latent Single XY&Z (B4). We adopt single-branch latent heatmap regression with XY-plane localization and Z-axis estimation.
5) Latent Ensemble (B5). We adopt spatial-decomposed multi-branch latent heatmap regression, as discussed in section III-B, but substitute the ensemble technique for the fusion strategy. The final result is obtained by averaging all branches' predictions.
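For reference, the explicit Gaussian target used in baseline B1 can be generated as follows (a NumPy sketch; the function name and pixel-grid convention are illustrative):

```python
import numpy as np

def gaussian_target(height, width, cu, cv, sigma=1.5):
    """Explicit heatmap target for baseline B1: a 2D Gaussian centred
    at the ground-truth pixel (cu, cv) with fixed sigma = 1.5."""
    us, vs = np.meshgrid(np.arange(width), np.arange(height))
    return np.exp(-((us - cu) ** 2 + (vs - cv) ** 2) / (2.0 * sigma ** 2))
```

With σ = 1.5, the supervision signal is concentrated in a few pixels around the ground truth, which is exactly the sparsity issue latent heatmaps avoid.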
For a fair comparison, we use a two-stack hourglass module in baselines 1-4 to ensure a similar number of parameters to the remaining ones. All other network settings remain the same. As shown in Table 5, latent heatmap regression reduces the mean joint error by a large margin compared with the explicit one. Among the three single latent heatmap regression branches, the spatial decomposition of XY-plane and Z-axis has the lowest mean error, since it best leverages the spatial geometry of the input depth image. However, combining it with the two additional branches through either the ensemble or the fusion strategy significantly boosts the estimation accuracy. The multi-branch network design can better excavate spatial correlations inside the depth image and alleviate the self-occlusion problem. Moreover, the fusion strategy is superior to the ensemble one and contributes a further 0.3 mm of accuracy on the NYU dataset. As the ensemble technique treats all branches with equal importance, inaccurate results in any of the three branches can degrade the estimation performance. Hence the fusion strategy is more suitable for the spatial-decomposed multi-branch network and outputs the most accurate hand poses.

TABLE 6. The detailed architecture of the feature extraction network, hourglass network and fusion network. The abbreviations N, K, S and P stand for output channels, kernel size, stride and padding respectively.
Visualizations of the pose predictions from the different regression branches as well as the fusion output can be seen in Fig. 8. As illustrated, the regression branch with XY-plane localization and Z-axis estimation has a high weight when there is no occlusion, while the other two branches receive more weight as the occlusion in the input depth image becomes severe.

V. CONCLUSION
In this paper, we propose a novel Latent Fusion network for depth-based hand pose estimation. Our method utilizes spatial-decomposed latent heatmaps to estimate hand joint coordinates inside multiple regression branches. We also design a fusion network that aggregates all branches' predictions to output the final hand pose. Experimental results show that our method achieves the best overall performance (estimation accuracy, inference speed and model size) among state-of-the-art approaches. Not only does our method achieve top accuracy on three public hand pose datasets, it also runs supremely fast at 946 fps while maintaining the lightest model size of 7.9 MB. Possible future work includes hand pose estimation during interaction with objects as well as joint estimation of hand pose and shape in the wild.

APPENDIX
The detailed architecture of feature extraction network, hourglass network and fusion network can be seen in Table 6.