Embedding Gesture Prior to Joint Shape Optimization Based Real-Time 3D Hand Tracking

In this paper, we present a novel approach for real-time 3D hand tracking from a set of depth images. In each frame, our approach initializes the hand pose with a learned model and then jointly optimizes the hand pose and shape. For pose initialization, we propose a gesture classification and root location network (GCRL), which captures the meaningful topological structure of the hand to estimate its gesture and root location. With per-frame initialization, our approach can rapidly recover from tracking failures. For optimization, unlike most existing methods, which use a fixed-size hand model or manual calibration, we propose a hand gesture-guided optimization strategy that estimates pose and shape iteratively, which makes the tracking results more accurate. Experiments on three challenging datasets show that our approach achieves accuracy similar to state-of-the-art approaches while running on low computational resources (without a GPU).

Moreover, most of these approaches rely on high computational resources or are unable to achieve real-time performance.
Based on the above considerations, we propose a novel approach that embeds hand gesture prior to joint shape optimization to accomplish such a task.
Compared to previous works, our proposed approach makes several contributions. First, we propose a GCRL network for pose initialization, which is more robust and efficient than directly regressing the key-points or performing per-pixel hand part classification. The results of the GCRL network are used as an initial solution for iterative optimization and prevent tracking loss caused by self-occlusion or fast hand motion. Additionally, based on the GCRL network, we design a progressive method that constructs the initial solution from the inferred gesture and the previous hand shape. This strategy increases the stability of the tracking result. Second, we propose a gesture-guided optimization to estimate the pose and shape of the input hand. During model fitting, we only optimize the visible bones of the hand (which depend on the current hand gesture). This can reduce the computation required to jointly track pose and shape. Finally, we combine the GCRL network and the gesture-guided optimization into a single pipeline (as shown in Figure 2). We conduct a set of experiments on three public datasets to evaluate our approach.
The remainder of this paper is structured as follows. In Section 2, we discuss related work on hand tracking. In Section 3, we describe the GCRL network, the hand pose/shape model used in our work, and the real-time tracking approach. In Section 4, we analyze the performance of the hand tracking approach and provide comparisons to state-of-the-art approaches. Finally, we conclude the paper with a brief discussion in Section 5.
Discriminative methods [48] mostly learn a function from a large quantity of training data and directly map features to the corresponding hand poses. Ge et al. [12] fed a set of hand parts rendered from different views into a 2D CNN for pose regression. Zhou et al. [50] embedded the joint forward kinematic process into a regression network to improve the accuracy. In [28], Oberweger et al. compared several different CNN architectures and chose the best one to estimate hand joint locations. These methods attempt to estimate the hand joint configuration directly and usually do not use temporal information or other priors.
FIGURE 2. Overview of our proposed approach. Given a depth image, we first crop the hand region and feed it into our trained GCRL network; the GCRL network estimates the gesture and root location of the hand (the global position of the hand). Additionally, we extract the 3D point cloud and 2D silhouette from the original image. Then, starting from the hand initialization, we perform gesture-guided pose/shape optimization that incorporates both data fitting and prior limits to ensure accurate and robust hand tracking.
Generative methods synthesize observations that are compared to the inputs. Then, through optimization, the hand pose that most accurately matches the observation is identified. Oikonomidis et al. [29] fit a hand model (spheres/cylinders) to the depth image with particle swarm optimization (PSO). Qian et al. [31] used an ICP-PSO method to find the pose that matches the observed hand point cloud. Sharp et al. [34] adopted a mesh model for tracking in a render-and-compute framework and achieved efficient performance. Melax et al. [24] used rigid body dynamics (RBD) to fit the input depth image and achieved robust results. Generative methods usually focus on more efficient adaptive hand models, model representations, and optimization strategies. They are more computationally complex than discriminative approaches and are prone to local minima due to fast hand motion and the complex articulated structure of the hand.
These two classes of methods have overlapped somewhat in recent years. Hybrid methods combine discriminative and generative methods to improve the robustness and precision of model fitting with per-frame pose reinitialization. Tompson et al. [43] first estimated the 2D joint locations with a heatmap and then refined the alignment with PSO-based model fitting. Another type of hybrid algorithm replaces the correspondences found by the iterative closest point (ICP) method with the result of a per-pixel classifier (a random forest (RF) or a convolutional neural network (CNN)). Sridhar et al. [35] optimized the hand pose using a detection-guided strategy. They obtained correspondences through a per-pixel random forest classifier. These point-to-model correspondences can guide the fitting process.
In a modern motion tracking system, model calibration plays a core role. Unlike for face or body models, the many self-occlusions of the fingers, the heavy noise of depth sensors, and the globally unconstrained pose make hand shape calibration harder. Current work on hand calibration is mainly divided into offline and online calibration.
Offline calibration. Taylor et al. [40] generated hand models from input depth frames while the user needs to rotate his fingers manually. Tan et al. [39] personalized a hand model from a set of depth measurements using a trained shape basis and achieved a robust result. Remelli et al. [32] calibrated a hand sphere-mesh in a low dimensional shape-space with a multi-stage optimization.
Online calibration. de La Gorce et al. [7] tracked the hand from a set of RGB images using a model whose hand shape is preset in the first frame. Makris and Argyros [23] proposed a model-based approach to jointly solve the pose and shape estimation problem in a tracking system. They fit a set of frames with a fixed-size cylindrical model of a hand; the method runs at 30 fps, but the render-and-compute framework limits future improvement. Tkach et al. [42] used an online optimization algorithm that jointly estimates pose and shape in each frame, and it is more robust and precise than other algorithms. It leverages the estimated shape parameter confidence and builds a tracking sphere model. However, their approach needs to run on high computing resources (GPUs).
Compared to offline methods, online algorithms can offer immediate feedback to the user and have the potential to adapt dynamically to hands with different shapes. More importantly, they can improve the accuracy of hand tracking systems.
In recent years, most works in this direction [12], [48], [50] have been based on end-to-end networks. However, existing deep learning-based approaches have the following problems. Most of them require expensive computing resources (GPUs) to achieve real-time performance, which limits their scope of use. Existing networks improve generalization to hands with different shapes only by normalizing the input data or adding training data. Moreover, in some virtual/augmented reality scenarios, such as interacting with deformable objects, the application requires more details of the user's hand, such as the hand-to-object contact position on the hand, and estimating only the hand joint locations is not enough.

III. OUR APPROACH
A. OVERVIEW
In this paper, we denote the pose θ and shape β jointly as [θ, β]. The goal of our hand pose and shape tracking is to find a hand model with parameters [θ, β] that best matches the input depth image, subject to [θ, β] being plausible and realistic. We propose a regularized hand gesture-guided optimization that balances data fitting against suitable prior limits to solve this problem. Our data fitting term contains both 3D point-to-model and 2D silhouette constraints. The 3D point-to-model constraint ensures that each depth point is close to the tracked model; it pulls the hand model toward the depth points. The 2D silhouette constraint pushes the tracked model inside the contour of the depth image. Additionally, we adopt a set of hand priors to ensure that the recovered θ respects the constraints of a real hand. Together, these data and prior terms define the optimization problem that we aim to solve (Equation 1).
Recently, learning-based initialization has been widely used in hand tracking. These methods attempt to guess a coarse θ_init that lies near the global minimum over θ. Using θ_init as the starting 'seed' avoids getting trapped in the many local minima and improves convergence speed. We adopt a similar strategy: our approach comprises per-frame initialization and model fitting. We first pre-process the input depth image (including cropping the hand region of interest (ROI), noise filtering, and 2D silhouette extraction). The hand ROI is then fed into a trained GCRL network to estimate the hand gesture and hand root location, thus providing an 'initialization'. Finally, starting with the initial pose, we optimize x = [θ_init, β] until it is consistent with both the data fitting terms and the prior limits. The following sections describe these components. We begin with our GCRL network and pose/shape model, and then describe the definition of our energy function and our optimization strategy.

B. INITIALIZATION
Because the objective function is highly non-convex, existing approaches use a reinitializer such as multi-layer random forests [20], convolutional neural networks [43], or fingertip detection [31] to search for a single good pose solution. In [34], Sharp et al. observed that it is hard to predict a single good pose solution, and designed a two-layer tree-based hand pose reinitializer that predicts a distribution over poses. The first layer focuses exclusively on predicting the global hand rotation, and the second layer is trained to infer finger rotations and other elements. In [40], Taylor et al. used a retrieval forest to search for the four postures closest to the true pose, and this strategy achieved robust results. Inspired by this approach, we believe that the result of the reinitializer should remain stable over a sequence. Furthermore, hand gestures can be treated as labels that represent a set of discrete predetermined posture vectors [8], [19].
Based on these considerations, we split the pose reinitialization problem into two sub-tasks, gesture classification and root position location, both of which are solved by the proposed GCRL network.
The root location stream regresses a per-pixel likelihood heatmap for the hand root, so that we can effectively estimate the global position of the hand. It has an encoder and a corresponding decoder, followed by a pixel-wise likelihood heatmap layer, as shown in Figure 4. First, we use three convolution (CONV) layers, each followed by a max-pooling layer, to down-sample the input, and four convolution layers to capture the low-level image features. Then, we perform two unpooling operations between convolutions to up-sample the depth features to a heatmap of the hand root. The unpooling process balances computational time and accuracy. The per-pixel likelihood of the heatmap is computed from R*_c and R_c, the ground-truth and estimated heatmaps for the hand root, respectively.
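As an illustration of how a root heatmap can be turned into a global hand position, the following minimal sketch takes the highest-likelihood pixel and back-projects it with a pinhole camera model. The argmax read-out, the function names, and the intrinsics fx, fy, cx, cy are assumptions made for this example; the text does not specify how the location is extracted from the heatmap.

```python
def heatmap_argmax(heatmap):
    """Return (row, col) of the highest-likelihood pixel in a 2D heatmap."""
    best, best_val = (0, 0), heatmap[0][0]
    for r, row in enumerate(heatmap):
        for c, val in enumerate(row):
            if val > best_val:
                best_val, best = val, (r, c)
    return best


def back_project(row, col, depth_mm, fx, fy, cx, cy):
    """Pinhole back-projection of a pixel at a given depth to a 3D point (mm).

    fx, fy, cx, cy are hypothetical camera intrinsics.
    """
    x = (col - cx) * depth_mm / fx
    y = (row - cy) * depth_mm / fy
    return (x, y, depth_mm)


# Toy 3x3 heatmap and a constant depth map (values are made up).
heatmap = [[0.0, 0.1, 0.0],
           [0.2, 0.9, 0.3],
           [0.0, 0.1, 0.0]]
depth = [[500.0] * 3 for _ in range(3)]

r, c = heatmap_argmax(heatmap)
root = back_project(r, c, depth[r][c], fx=100.0, fy=100.0, cx=1.0, cy=1.0)
```

In this toy case the peak sits at the principal point, so the recovered root lies on the optical axis at the measured depth.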
The gesture classification stream identifies the category of the gesture. In our experiments, we classify the gesture using intermediate features of the hand. The classification branch has five additional convolution layers and two fully connected layers, and outputs the probabilities of 17 classes. As shown in Figure 3, each gesture corresponds to a hand pose vector θ_gesture. From θ_gesture and the global pose (the orientation is the direction of the point cloud, and the position is estimated by the root location sub-network), we can restore the current initial hand pose θ_init.
The architecture of the GCRL network is shown in Figure 4. We train the GCRL network by minimizing the total training loss. The GCRL network provides an initial θ_init for the subsequent optimization phase. In addition, it is used to guide the optimization: based on the predicted gesture, we optimize only the visible bones (which usually have a high-confidence β). This reduces the number of optimized parameters.
There are two common types of initialization. The first directly estimates the joint locations and uses inverse kinematics (IK) to calculate the joint rotation angles. The second performs per-pixel hand part segmentation and then guides the model fitting with this semantic information. The former may not generalize well to different hand shapes, while the latter is slow and violates the inherent topological structure of the hand. When tracking is lost, our GCRL network first infers the hand gesture; we then construct a new initial seed from the hand shape of the last tracked frame and the inferred initial posture. This strategy makes our GCRL network robust to various hand shapes and reduces inference time.

C. POSE AND SHAPE MODEL
We use a 3D mesh of M triangles and N vertices to represent the human hand. The model is parameterized by both pose θ and shape β, which deform the N-vertex triangular mesh through a hierarchical skeleton J, as illustrated in Figure 5. To articulate complex hand motion, we use a standard kinematic skeleton that defines a hierarchy of joints and the transformations between them. We parameterize the hand pose as θ ∈ R^26 (six parameters for global translation and orientation, and four for each finger), and the hand shape is encoded via scalar length parameters β ∈ R^20. Given a vector [θ, β] consisting of pose θ and shape β, the surface of the mesh model (vertices and their normals) can be computed by a standard computer graphics technique called linear blend skinning (LBS). We refer the reader to [18] for details. Figure 6 shows several tracking templates used in recent model-based real-time hand tracking methods. The models in [31] and [35] cannot represent the details of the hand. The model in [24] has problems representing the thumb. [32] used a sphere-mesh model to estimate the hand pose efficiently; however, its number of parameters is too large.
To balance accuracy and computational time, we use the LBS model in our experiments.
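For readers unfamiliar with linear blend skinning, the sketch below shows the basic operation under simplifying assumptions: each bone is given directly as a rigid transform (R, t), and every vertex is the weight-blended result of applying those transforms. The function names and the toy data are illustrative only; the hand model used in the paper, with its kinematic chain and shape parameters, is more involved.

```python
POSE_DIM = 6 + 4 * 5   # six global parameters plus four per finger, as in the text
SHAPE_DIM = 20         # scalar bone-length parameters, as in the text


def apply_rigid(R, t, v):
    """Apply rotation matrix R (3x3 nested lists) and translation t to point v."""
    return [sum(R[i][k] * v[k] for k in range(3)) + t[i] for i in range(3)]


def linear_blend_skinning(vertices, weights, bones):
    """Deform rest-pose vertices by blending per-bone rigid transforms.

    vertices: list of [x, y, z] rest-pose positions
    weights:  weights[i][j] is the influence of bone j on vertex i (rows sum to 1)
    bones:    list of (R, t) transforms, one per bone
    """
    deformed = []
    for v, w in zip(vertices, weights):
        p = [0.0, 0.0, 0.0]
        for w_j, (R, t) in zip(w, bones):
            q = apply_rigid(R, t, v)
            p = [p[i] + w_j * q[i] for i in range(3)]
        deformed.append(p)
    return deformed


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bones = [(IDENTITY, [0.0, 0.0, 0.0]),   # bone 0 stays in place
         (IDENTITY, [2.0, 0.0, 0.0])]   # bone 1 is shifted along x

# One vertex bound half-and-half to both bones, one bound only to bone 0.
skinned = linear_blend_skinning(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.5, 0.5], [1.0, 0.0]],
    bones,
)
```

The half-weighted vertex moves by half of the second bone's translation, which is the blending behavior that lets the mesh deform smoothly around joints.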

D. ENERGY FUNCTION
The objective energy function plays an important role in modern hand tracking systems, and a key design choice is the set of energy terms. In this section, we describe each term in our optimization. We first introduce the 2D/3D data fitting terms and the computation of correspondences between the input depth and the model. Then, we discuss the prior terms and their benefits in terms of tracking accuracy and robustness.
The human hand consists of a set of links (bones) connected by joints. The joints are of two types, rotational or translational. A rotational joint is described by a rotation axis and a rotation angle, and a translational joint is parameterized by a direction vector and the length of the link (bone). In this paper, we denote the set of rotation angles (hand pose) as [θ_1, θ_2, ..., θ_n] and the lengths of the links (hand shape) as [β_1, β_2, ..., β_m].
To compute Equation 1, we first introduce the skeleton Jacobian, which was first proposed in [3]. The skeleton Jacobian J_skel(t) is a [3 × n] matrix, where n is the number of pose parameters θ; it represents the DOFs in the kinematic chain that affect each 3D point t.
As shown in Figure 7, t_i is a depth point and s_i is the corresponding point on the surface of the hand mesh. We compute J_skel(t)_{i,j} by manual differentiation. For the j-th joint, let θ_j be its angle of rotation, p_j its position, w_j the LBS weight of p_j, and v_j the vector pointing along its current axis of rotation (see [38]). The corresponding entry in the skeleton Jacobian matrix for joint j affecting the i-th surface point s_i is
$$\frac{\partial s_i}{\partial \theta_j} = w_j \, v_j \times (s_i - p_j).$$
If the i-th surface point is not affected by the j-th joint, then
$$\frac{\partial s_i}{\partial \theta_j} = 0.$$
To jointly optimize the hand shape β, we propose a bone Jacobian matrix. The j-th column of J_bone(t) contains the linearization of the j-th bone about t. Each entry J_bone(t)_{i,j} is
$$\frac{\partial s_i}{\partial \beta_j} = w_j \, v_j \tag{5}$$
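The Jacobian entries can be sketched in a few lines of code. The example below assumes the standard rotational-joint derivative w_j v_j × (s_i − p_j) for the pose entry and uses Equation 5 for the shape entry; the function names are hypothetical, and a zero weight stands in for "point not affected by this joint".

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def skeleton_jacobian_entry(s_i, p_j, v_j, w_j):
    """d s_i / d theta_j for a rotational joint (assumed standard form).

    s_i: surface point, p_j: joint position, v_j: rotation axis,
    w_j: LBS weight; w_j == 0 means the joint does not affect the point.
    """
    if w_j == 0.0:
        return [0.0, 0.0, 0.0]
    lever = [s_i[k] - p_j[k] for k in range(3)]
    return [w_j * c for c in cross(v_j, lever)]


def bone_jacobian_entry(v_j, w_j):
    """d s_i / d beta_j as in Equation 5: the weighted bone direction."""
    return [w_j * c for c in v_j]


# A point one unit along x, rotating about the z axis through the origin,
# moves along +y for a small positive rotation.
d_theta = skeleton_jacobian_entry([1.0, 0.0, 0.0], [0.0, 0.0, 0.0],
                                  [0.0, 0.0, 1.0], 1.0)
d_beta = bone_jacobian_entry([0.0, 0.0, 1.0], 0.5)
unaffected = skeleton_jacobian_entry([1.0, 0.0, 0.0], [0.0, 0.0, 0.0],
                                     [0.0, 0.0, 1.0], 0.0)
```

Stacking these 3-vectors column by column, one column per pose or shape parameter, gives the 3 × n and 3 × m blocks used when the energies are linearized.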

1) DATA ENERGIES
a: 3D POINTS ALIGNMENT
Our E_3D term computes the corresponding point y ∈ H(θ, β) on the surface model for each sensor point t. Unlike traditional ICP, we only compute correspondences between depth points and model points on the front-facing part of H(θ, β). In our experiments, for computational efficiency, we set the view ray of the camera model to n = [0, 0, 1]^T. We linearize the point cloud alignment energy (pose and shape fitting). In this linearized energy, ||·|| denotes the L2 norm, n is the surface normal at y, and d is the distance between y and t.
FIGURE 6. Several tracking templates used in recent model-based real-time hand tracking methods. Images courtesy of [24], [31], [32], [35], [38].

b: 2D SILHOUETTE ALIGNMENT
The human hand is highly articulated, and fast motion may cause self-occlusions during tracking. The 3D alignment energy alone does not constrain the occluded parts. For this reason, the term E_2D aligns the 2D silhouette of our hand model (we project the front-facing points to obtain the 2D silhouette) with the 2D silhouette of the input hand. In the E_2D energy term, x is the 3D location of a rendered silhouette point p, and n is the 2D normal at the sensor silhouette location q.
Here, we compute the 2D correspondences with a 2D distance transform, and we refer the reader to Appendix B in [38] for more detail on J_persp(y).

2) PRIOR ENERGIES
Considering only the data fitting easily leads to unrealistic hand poses. In reality, the motion of the hand is more constrained. We regularize our optimization with finger collision, joint rotation, and temporal priors to ensure that the result is plausible. Each of these terms plays an important role in the stability of our objective function.

a: JOINT ROTATION LIMIT
To discourage incorrectly tracked postures, we adopt a joint rotation constraint and encode this prior as an energy term in which θ_max is the vector of maximal joint angles and θ_min is the vector of minimal joint angles. ω_1 is set to one if θ_i < θ_i^min and to zero otherwise; ω_2 is equal to one if θ_i > θ_i^max and to zero otherwise. For θ_min and θ_max, we use the values experimentally determined by [5]. Figure 8 shows an example of the joint rotation constraint.
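The role of the two indicator weights can be illustrated with a short sketch. It assumes a quadratic penalty on the amount by which each angle leaves its interval; the quadratic form and the function name are assumptions of this example, while ω_1 and ω_2 follow the definitions in the text.

```python
def joint_limit_energy(theta, theta_min, theta_max):
    """Penalty on joint angles that leave [theta_min, theta_max].

    Assumes a squared violation per joint; omega_1 and omega_2 switch the
    lower-bound and upper-bound penalties on and off as described in the text.
    """
    energy = 0.0
    for t, lo, hi in zip(theta, theta_min, theta_max):
        omega_1 = 1.0 if t < lo else 0.0   # angle below its minimum
        omega_2 = 1.0 if t > hi else 0.0   # angle above its maximum
        energy += omega_1 * (lo - t) ** 2 + omega_2 * (t - hi) ** 2
    return energy


# Three joints (radians, made-up limits): one inside its range,
# one 0.2 below the minimum, one 0.5 above the maximum.
e_limit = joint_limit_energy([0.5, -0.2, 2.0],
                             [0.0, 0.0, 0.0],
                             [1.0, 1.0, 1.5])
```

Only the two violating joints contribute, so a pose that stays inside all limits incurs no penalty and the prior does not bias valid poses.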

b: COLLISION LIMIT
To prevent our model from taking on anatomically incorrect configurations, e.g., collisions between the fingers and the palm, we approximate the fingers and palm of our hand model using a set of S spheres. Using spheres instead of triangles can reduce the cost of collision detection. Figure 9 shows an example of the collision limit. In the linearized collision energy, x_i and x_j are the end-points of the shortest distance between the collision spheres S_i and S_j, n_i is the surface normal at x_i, and d is the distance between x_i and x_j. χ(i, j) is an indicator function that evaluates to one if spheres S_i and S_j are colliding, and to zero otherwise.
FIGURE 9. Collision limit.
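The reason spheres make collision detection cheap is that the test reduces to comparing a center distance with a sum of radii. The sketch below shows this test and the resulting penetration depth; the function name and the returned tuple are choices made for this example, not the paper's interface.

```python
import math


def sphere_collision(center_i, radius_i, center_j, radius_j):
    """Return (colliding, penetration_depth) for two spheres.

    `colliding` plays the role of the indicator chi(i, j): it is True when the
    spheres overlap. The penetration depth is zero for separated spheres.
    """
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(center_i, center_j)))
    penetration = radius_i + radius_j - dist
    return penetration > 0.0, max(penetration, 0.0)


# Two unit spheres whose centers are 1.5 apart overlap by 0.5.
overlapping = sphere_collision((0.0, 0.0, 0.0), 1.0, (1.5, 0.0, 0.0), 1.0)
# Two unit spheres whose centers are 3.0 apart do not touch.
separated = sphere_collision((0.0, 0.0, 0.0), 1.0, (3.0, 0.0, 0.0), 1.0)
```

Each pair costs one distance evaluation, which is much less work than testing pairs of mesh triangles.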

c: TEMPORAL LIMIT
We use the temporal limit provided by [38] to smooth the tracking result between the current and previous frames. The purpose of this term is to keep the pose of the current frame near the previous pose. In the temporal prior, k_i^pre is the location of the i-th joint in the previously optimized frame, and K is the set of current joint locations.
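A minimal sketch of such a temporal term, assuming it sums squared distances between each current joint location and its location in the previous frame, is given below. The squared-distance form and the function name are assumptions for illustration; see [38] for the term actually used.

```python
def temporal_energy(joints, prev_joints):
    """Sum of squared distances between current and previous joint locations."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(k, k_pre))
        for k, k_pre in zip(joints, prev_joints)
    )


# Two joints: the first moved 1 unit along z, the second 2 units along y.
e_temporal = temporal_energy(
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 2.0, 0.0]],
)
```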

E. OPTIMIZATION
We treat the optimization of energy over [θ, β] as a nonlinear least squares problem, and solve it with the Levenberg-Marquardt approach. We adopt Taylor expansion to iteratively approximate the energy terms in Equation-1 and solve the linear system to obtain the update δθ and δβ at each iteration.
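To make the Levenberg-Marquardt update concrete, the following toy example fits one rotation angle and one bone length so that the tip of a single planar bone reaches a target point. It is a sketch under strong simplifications (two parameters, one 2D residual, a simple accept/reject damping rule), not the tracker's solver; all names are hypothetical.

```python
import math


def tip(theta, beta):
    """Tip of a planar bone of length beta rotated by theta about the origin."""
    return (beta * math.cos(theta), beta * math.sin(theta))


def energy(theta, beta, target):
    x, y = tip(theta, beta)
    return (x - target[0]) ** 2 + (y - target[1]) ** 2


def lm_fit(target, theta, beta, iters=50, lam=1e-3):
    """Levenberg-Marquardt on [theta, beta] with a 2x2 normal-equation solve."""
    for _ in range(iters):
        x, y = tip(theta, beta)
        r = (x - target[0], y - target[1])
        c, s = math.cos(theta), math.sin(theta)
        j_theta = (-beta * s, beta * c)   # d tip / d theta (pose column)
        j_beta = (c, s)                   # d tip / d beta  (shape column)
        # Damped normal equations: (J^T J + lam * I) delta = -J^T r
        a = j_theta[0] ** 2 + j_theta[1] ** 2 + lam
        b = j_theta[0] * j_beta[0] + j_theta[1] * j_beta[1]
        d = j_beta[0] ** 2 + j_beta[1] ** 2 + lam
        g0 = -(j_theta[0] * r[0] + j_theta[1] * r[1])
        g1 = -(j_beta[0] * r[0] + j_beta[1] * r[1])
        det = a * d - b * b
        d_theta = (d * g0 - b * g1) / det
        d_beta = (a * g1 - b * g0) / det
        # Accept the step only if it lowers the energy; adapt the damping.
        if energy(theta + d_theta, beta + d_beta, target) < energy(theta, beta, target):
            theta, beta = theta + d_theta, beta + d_beta
            lam *= 0.1
        else:
            lam *= 10.0
    return theta, beta


target = (0.0, 2.0)
theta_fit, beta_fit = lm_fit(target, theta=1.0, beta=1.5)
fitted_tip = tip(theta_fit, beta_fit)
```

The two Jacobian columns mirror the skeleton and bone Jacobians: one describes how the point moves when a joint rotates, the other how it moves when a bone length changes.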
To speed up convergence, we propose a gesture-guided optimization strategy. We first perform pose estimation to align the hand to the points (pose fitting stage). Then, we optimize the hand shape alone (shape fitting stage). Finally, we jointly optimize the shape and pose of the hand (full fitting stage). During shape fitting, we only need to optimize the β associated with visible joints. For example, an extended finger means that the corresponding β can be optimized with higher confidence. Figure 10 shows our multi-stage optimization.
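The staged schedule can be sketched as a small generator that reports, for every iteration, which parameters are free. The iteration counts are those reported in the experiments; the gesture-to-finger table, the even split of the 20 shape parameters over five fingers, and the use of the visibility mask in the joint stage are assumptions made purely for illustration.

```python
# Iteration counts per stage, as reported in the runtime description.
STAGES = (("pose", 20), ("shape", 5), ("joint", 5))

FINGERS = ("thumb", "index", "middle", "ring", "pinky")
LENGTHS_PER_FINGER = 4  # assumption: 20 shape parameters split evenly

# Hypothetical table of which fingers are extended (visible) per gesture.
EXTENDED = {
    "fist": (),
    "point": ("index",),
    "open": FINGERS,
}


def shape_mask(gesture):
    """Return 20 booleans marking which bone lengths may be optimized."""
    mask = []
    for finger in FINGERS:
        mask.extend([finger in EXTENDED[gesture]] * LENGTHS_PER_FINGER)
    return mask


def schedule(gesture):
    """Yield (stage, optimize_pose, active_shape_mask) for every iteration."""
    visible = shape_mask(gesture)
    frozen = [False] * len(visible)
    for stage, iters in STAGES:
        for _ in range(iters):
            if stage == "pose":
                yield stage, True, frozen      # shape held fixed
            elif stage == "shape":
                yield stage, False, visible    # pose held fixed
            else:
                yield stage, True, visible     # pose and visible shape together


plan = list(schedule("point"))
```

For a pointing gesture, only the index finger's lengths are free during shape fitting, so the shape sub-problem shrinks from 20 unknowns to 4.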

IV. EXPERIMENTS
A. PRELIMINARIES
We extract the region of interest (ROI) of the single hand using a method similar to that presented in [27]. We also employ data augmentation to improve the generalization of the network. Specifically, we randomly rotate the image about the z axis within the range [−15°, 15°] and scale it by a factor in [0.8, 1.2]. We train and evaluate our networks on a PC with an Intel Core i7 6700K, 32 GB of RAM, and an Nvidia 1080-Ti GPU. The network models are implemented with the Caffe framework [17]. When training the networks, we use the Adam optimizer with learning rate 0.005, batch size 32, and weight decay 0.0005. At runtime, pre-processing takes 2 ms, the GCRL network runs in 5 ms per frame, and the optimization stage costs 30 ms (20 iterations for pose fitting, 5 iterations for shape fitting, and 5 iterations for joint optimization). During each iteration, a pose update costs nearly 1.0 ms and a shape update costs 0.8 ms. This translates to 30 to 35 frames per second (the initialization network is used only when the previous frame has a large tracking error).
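As a quick check of the frame-rate figure, the per-stage timings reported above can be added up directly. The sketch uses only the reported totals (2 ms, 5 ms, 30 ms); the variable names are illustrative.

```python
PREPROCESS_MS = 2.0      # reported pre-processing time
NETWORK_MS = 5.0         # reported GCRL network time per frame
OPTIMIZATION_MS = 30.0   # reported optimization time (20 + 5 + 5 iterations)

# Frames on which the initialization network is skipped.
frame_ms_without_net = PREPROCESS_MS + OPTIMIZATION_MS
fps_without_net = 1000.0 / frame_ms_without_net

# Frames on which the network runs (large tracking error in the previous frame).
frame_ms_with_net = PREPROCESS_MS + NETWORK_MS + OPTIMIZATION_MS
fps_with_net = 1000.0 / frame_ms_with_net
```

Skipping the network gives 32 ms per frame, about 31 frames per second, which falls inside the reported range; frames that do run the network take 37 ms, about 27 frames per second.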
On MSRA-2014, following the most commonly used metrics in the literature, we choose the average Euclidean distance between the 3D joint location and the ground truth to compare our approach with other state-of-the-art approaches [29], [31].
We choose MSRA-2015 to evaluate the performance of our gesture classifier and root location network. For the classification task, we select the mean classification accuracy to evaluate our gesture classifier. For the root location task, we evaluate the hand location performance using a 3D distance error between the root and ground truth.
We also use NYU [43] to compare our approach with several state-of-the-art methods. We use two metrics in this work to evaluate performance. The first is the mean Euclidean distance error for each joint across all test frames. The second is the worst-case accuracy, which is the fraction of test frames in which all estimated joints have a Euclidean error below an error tolerance.
FIGURE 10. We first initialize the pose with the result provided by the GCRL network, then optimize for a coarse pose (pose fitting stage); after pose fitting, we optimize the hand shape, and finally perform a refinement in the full-dimensional space (both pose and shape). The red area represents the model in front of the depth data, the blue area represents the model behind the depth data, and the white area represents the model near the depth data (less than 5 mm).
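The two metrics can be written down compactly. The sketch below computes the mean per-joint Euclidean error over all frames and the worst-case accuracy for a given tolerance; the function names and the toy data are illustrative.

```python
import math


def joint_errors(pred, gt):
    """Euclidean error of every joint in one frame."""
    return [math.dist(p, g) for p, g in zip(pred, gt)]


def mean_joint_error(pred_frames, gt_frames):
    """Mean Euclidean error over all joints and all frames."""
    errors = [e for p, g in zip(pred_frames, gt_frames) for e in joint_errors(p, g)]
    return sum(errors) / len(errors)


def worst_case_accuracy(pred_frames, gt_frames, tolerance_mm):
    """Fraction of frames whose largest joint error is below the tolerance."""
    good = sum(
        1 for p, g in zip(pred_frames, gt_frames)
        if max(joint_errors(p, g)) < tolerance_mm
    )
    return good / len(pred_frames)


# Two frames with two joints each (mm); one joint in frame 0 is off by 5 mm.
gt = [[(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
      [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]]
pred = [[(3.0, 4.0, 0.0), (10.0, 0.0, 0.0)],
        [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]]

mean_err = mean_joint_error(pred, gt)
acc_4mm = worst_case_accuracy(pred, gt, 4.0)
acc_10mm = worst_case_accuracy(pred, gt, 10.0)
```

The worst-case metric is stricter than the mean: a single badly placed joint fails the whole frame, which is why the two numbers can diverge for the same tracker.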

C. COMPARISON TO STATE-OF-THE-ART
1) RESULTS ON MSRA-2014
MSRA-2014 consists of six subjects; each subject has a different hand shape and is annotated with the 3D positions of 21 joints. We compare our approach to two state-of-the-art model-based approaches, Forth [29] and Qian [31]. Forth [29] uses PSO to fit a fixed-size hand model to the points. Qian [31] proposed a hand tracking system using ICP-PSO based model fitting. For a fair comparison, we use the version of our approach without the GCRL network as the baseline. Additionally, we evaluate a version with a fixed-size hand model (without shape fitting) as a further baseline to validate the effectiveness of our joint optimization.
As Figure 11 illustrates, our approach outperforms Forth [29], which uses a fixed-size hand model. We also improve by 1 mm over our variant without shape fitting. The results show that our joint optimization strategy can improve the accuracy of hand tracking. The mean error distance over all joints for our strategy is 8.6 mm, which is 0.5 mm smaller than the result of [31] and 9.3 mm smaller than the result of [29]. Some qualitative results of our approach on the MSRA-2014 dataset are illustrated in Figure 12. Our approach still obtains reasonable results for complex hand poses.
FIGURE 11. Comparison with Forth [29] and Qian [31]. We use the results published in [31].
In summary, we can draw the following conclusions: (i) local gradient descent is more precise and faster than the global search. (ii) jointly optimizing pose and shape will improve the robustness and accuracy of tracking.
Moreover, we list the runtime of several model-based approaches (results from the papers) in Table 1. Compared with previous approaches such as [29] and [31], our approach improves both speed (see Table 1) and accuracy (see Figure 11) while being able to recover the hand shape. We consume fewer computing resources (CPU only) than [38] and [42]. Compared with [10] and [38], our joint tracking strategy performs the shape estimation online.
Baselines: we create several baselines to validate the effectiveness of our network.
1) RF-C: classify the input depth image using a random forest; this baseline estimates the gesture of the input.
2) RF-R: directly regress the global hand root position using a random forest.
3) CNN-R: directly regress the global hand root position using a network similar to the baseline in [28].
For RF-C and RF-R, we choose the pixel difference features, and the maximum depth of trees is set to 20.
We compare the accuracy of our gesture classifier with several baselines on MSRA-2015. For all approaches, we use subjects 1 to 8 for training and the 9th subject for testing. For the gesture classification task, our classifier achieves a mean accuracy of 93.8% over the 17 different gestures, whereas RF-C only achieves 85.2%. For the root location task, our root location sub-network is considerably more accurate than RF-R and CNN-R. As shown in Table 2, the mean error distance for the root position of our approach is 8.5 mm, which is 2.3 mm smaller than the result of RF-R and 0.9 mm smaller than that of CNN-R. This performance gain shows that our network can capture more complex hand structure.
shape as w/o GCRL, and use the GCRL network but only optimize the hand pose using a fixed-size hand mesh as w/o shape. Figure 13 shows the mean error results of our approach compared with these approaches. The results show that our approach is slightly superior to [12], [27], [28], [50], and is comparable to [6], [44]. Our approach achieves a mean joint error of 11.87 mm, which is approximately 4.1 mm smaller than [27], 5.1 mm smaller than [50], and nearly 9 mm smaller than [28]. The accuracy of our approach is similar to that of [6] and [44], both of which rely heavily on a GPU to achieve real-time performance.
Compared with w/o GCRL (mean error 15.37 mm), the results show the effectiveness of our GCRL network. In some cases, self-occlusion and noise cause the w/o GCRL variant (mean error 13.69 mm) to lose tracking. The comparison with w/o shape also indicates the success of our joint optimization strategy.

V. DISCUSSION AND CONCLUSION
The current implementation of our approach works well for the majority of poses, but reconstruction is hard when the hand is severely occluded. In addition, we find that viewpoint variations of the camera seriously influence the tracking result. When the hand is in the 'fist' state, although our GCRL network provides an initialization, the occluded part of the hand is lost if the palm is not facing the camera. Additionally, from this camera angle, although the GCRL network sets an initial pose corresponding to the estimated hand gesture at the predicted hand center, serious edge noise and heavy self-occlusion often cause matching failure. In principle, after using the GCRL network and jointly optimizing the shape and pose of the hand, our approach fails badly only in extreme views and under severe self-occlusion.
Another contribution of this paper is the demonstration that the initialization network and the joint optimization strategy not only contribute to the accuracy shown above but also allow us to run this approach on a low-computation device. Three variables determine the computational cost of the fitting procedure: (i) the number of data points used in the data term; (ii) the number of iterations we perform for each starting point; and (iii) the initial result provided by the GCRL network.
In recent years, some works [14], [25], [41] have focused on tracking two interacting hands. They all adopted the strategy of 'left/right-hand segmentation + pose/shape optimization'. This strategy makes the problem feasible. An end-to-end network has not yet emerged due to the lack of a large quantity of effective training data. Therefore, model-based optimization methods will play an important role in tracking two interacting hands in the future.
In this paper, we propose a novel approach for hand tracking that consists of deep-based pose initialization and gesture-guided pose/shape optimization. The GCRL network captures a meaningful hand structure to estimate gesture and hand root location, thus providing a robust initial pose. Starting from the estimated pose, we jointly estimate pose and shape. By integrating the deep-based initialization and optimizing the parameters of shape selectively, our approach results in faster convergence and increased robustness. Extensive experiments on three datasets demonstrate the effectiveness of our proposed approach.